Writing a Grammar with ChatGPT
If you’ve seen my previous 3 posts, you know what this one is going to cover… After seeing ChatGPT solve the clone detection problem, generate a tokenizer in Python, and perform tokenization from a few examples and informal explanations, this morning I wanted to check if it was capable of generating an entire grammar for a formal language just from examples and informal explanations.
I would say that I am astonished, but I have exhausted my astonishment store. Here’s this morning’s conversation. This was held in the same session as yesterday’s conversation.
lopes@uci.edu>>> You mentioned ANTLR. Are you familiar with the ANTLR notation for grammars?
Yes, I am familiar with ANTLR notation for grammars. ANTLR (Another Tool for Language Recognition) is a powerful parser generator that takes a formal grammar as input and generates a parser for that grammar, which can then be used to parse input strings and generate an abstract syntax tree (AST). The notation for grammars in ANTLR is similar to the notation for context-free grammars, but it also includes some additional features such as embedded actions and predicates.
A simple example of an ANTLR grammar for a calculator is the following :
grammar Calculator;
expr: INT # int
| expr op=('*'|'/') expr # mulDiv
| expr op=('+'|'-') expr # addSub
;
op: '+'|'-'|'*'|'/';
INT : [0-9]+;
WS : [ \t\r\n]+ -> skip;
This grammar defines a single rule for an expression (expr), which can be either an integer or an expression followed by an operator (op) followed by another expression. The operator can be one of +, -, *, or /.
ANTLR notation uses a combination of context-free grammar and a set of predefined tokens. It also allows for the definition of lexer rules to handle whitespaces, comments and other elements.
Please let me know if you have any specific questions about ANTLR or if you need further clarification on how it works.
lopes@uci.edu>>> I will want you to generate a grammar in ANTLR notation for a language that I’m about to describe to you. Let me first give you concrete examples of pieces of that language.
– Booleans. There are only two Boolean values, specifically: TT and FF
– Numbers. Here are two examples of numbers: 9876, 3.956
– Strings. Here are examples of strings: “Hello world!”, “”, “123”
– Null value. This represents no value; its literal representation is VOID
– Expressions.
— Arithmetic. Here are examples of these: a + b, x-y, foo*bar, num / den, -something
— Comparison. Here are examples of these: something < other, something <= other, something > other, something >= other
— Logic. Here are examples of these: !something, something AND something, something OR something
— Note that expressions can compose. We can use ( ) to group stuff. For example: (a+b)/4. The precedence rules are the same as in C.
– Statements. Let me give you some examples of statements:
— PRINT “hello world”;
—
{
PRINT “hello”;
PRINT “world”;
}
— something == other;
— 1 + 3 / 6;
— Note that an expression followed by a semicolon (;) is a statement
– Variables. Variables are statements that start with the VAR keyword. Here are some examples:
— VAR s = “here I am”;
— VAR s;
— VAR x, y, z;
– Control Flow.
— Conditionals. Here is one example:
IF (something) {
PRINT “Here”;
}
ELSE {
PRINT “There”;
}
— While loop. Here is one example:
WHILE (a < b) {
PRINT a;
a = a + 2;
}
— For loop. Here is one example:
FOR (VAR a = 0; a <= 10; a = a + 1) {
PRINT a;
}
Alright, this is enough for now. Please generate the ANTLR grammar for this kind of language. If you have questions, ask.
Sure, here is an ANTLR grammar that should parse the language you've described:
grammar MyLang;
program: statement* EOF;
statement: PRINT statement_string ';'
| VAR variable_list ';'
| variable_assignment ';'
| expression ';'
| block_statement
| if_statement
| while_statement
| for_statement
;
block_statement: '{' statement* '}';
if_statement: IF '(' expression ')' block_statement (ELSE block_statement)?;
while_statement: WHILE '(' expression ')' block_statement;
for_statement: FOR '(' variable_assignment ';' expression ';' variable_assignment ')' block_statement;
variable_assignment: variable_name ('=' expression)?;
variable_list: variable_name (',' variable_name)*;
variable_name: IDENTIFIER;
expression: logic_expression
| comparison_expression
| arithmetic_expression
| unary_expression
| value
| '(' expression ')'
;
arithmetic_expression: value (('+' | '-' | '*' | '/' | '%') value)*;
comparison_expression: value (('<' | '>' | '<=' | '>=' | '==' | '!=') value)*;
logic_expression: unary_expression (('AND' | 'OR') unary_expression)*;
unary_expression: ('!' | '-')* value;
statement_string: STRING;
value: number
| BOOLEAN
| variable_name
| VOID
;
number: INT
| FLOAT
;
IDENTIFIER: [a-zA-Z][a-zA-Z0-9_]*;
INT: [0-9]+;
FLOAT: [0-9]+ '.' [0-9]+;
BOOLEAN: 'TT' | 'FF';
STRING: '"' (~["\r\n] | '\"')* '"';
VOID: 'VOID';
WS: [ \t\r\n]+ -> skip;
This grammar defines a top-level rule called "program" that consists of zero or more statements, followed by the end of the file (EOF). Each statement can be one of several types, such as a PRINT statement