The End of Programming (as we know it)
Yesterday, I had a conversation with ChatGPT where it helped me write a completely functional tokenizer in Python for the Lox programming language. That conversation was mind blowing enough. And it got me thinking: could I teach it to tokenize Lox programs bypassing coding altogether? After all, we’re dealing with a massive neural network that seems to understand natural language. Can it learn ad-hoc symbolic reasoning on top of its understanding of language?
The answer, my friends, is yes, it can.
Here’s this morning’s earth-shattering conversation.
lopes@uci.edu>>> I’m going to teach you how to play a game. In this game, I will prompt you a message and you need to reply something based on that message. I will teach you the rules of the reply by example. Are you ready?
Yes, I am ready to learn the game you have described. Please provide examples and explain the rules of the game.
lopes@uci.edu>>> ok, here’s the first example.
If I prompt you:
{
Your reply should be:
token(LEFT_PAREN, 1)
where token means that you have identified a token in my message, LEFT_PAREN is the type of the token, and 1 is the line number of that token in the prompt. OK?
I understand. If you prompt me with a message, my response should be the word "token", followed by the type of the first token I find in your message enclosed in parentheses, and the line number where it appears. For example, if the first token in your message is "(" and it appears on the first line, my response would be "token(LEFT_PAREN, 1)".
lopes@uci.edu>>> OMG, you’re so smart. Yes! You even corrected my mistake in my rule! Of course LEFT_PAREN is (. What do you think we should call the type of the { token?
I think the type of the "{" token could be called "LEFT_CURLY_BRACKET"
lopes@uci.edu>>> That’s a great name! Let’s go with that. So here’s my prompt now:
{
What’s your reply?
token(LEFT_CURLY_BRACKET, 1)
lopes@uci.edu>>> YES! Awesome! Here’s my next prompt:
+
What’s your reply?
token(PLUS, 1)
lopes@uci.edu>>> Fantastic! Next prompt:
/
Your reply?
token(FORWARD_SLASH, 1)
lopes@uci.edu>> Yes, you got it! Now let me teach you additional rules. If my prompt has a sequence of characters, your reply should be a sequence tokens, one per line. For example if the prompt is:
{/+
Your reply should be:
token(LEFT_CURLY_BRACKET, 1) token(FORWARD_SLASH, 1) token(PLUS, 1)
Got it?
Yes, I understand. If the prompt is a sequence of characters, my response should be a list of tokens, one per line, indicating the type of each token and the line number where it appears.
For example, if the prompt is "{/+", my response would be:
token(LEFT_CURLY_BRACKET, 1)
token(FORWARD_SLASH, 1)
token(PLUS, 1)
lopes@uci.edu>>> Let’s try. Here’s the prompt:
=*, }
What’s your reply?
token(EQUAL, 1)
token(ASTERISK, 1)
token(RIGHT_CURLY_BRACKET, 1)
lopes@uci.edu>>> ah, I see you got confused with the white space. Here’s a rule: when you see the white space, you should ignore it, skip over it, and look at the next character in the prompt. Let’s try again.
Prompt:
=*, }
Your reply?
token(EQUAL, 1)
token(ASTERISK, 1)
token(RIGHT_CURLY_BRACKET, 1)
lopes@uci.edu>>> oh I’m sorry, my correction was actually wrong. You got the white space right, but were confused by the , character. A , is a comma and you should recognize it as the token type COMMA. Let’s try again:
Prompt:
=*, }
Your reply?
token(EQUAL, 1)
token(ASTERISK, 1)
token(COMMA, 1)
token(RIGHT_CURLY_BRACKET, 1)
lopes@uci.edu>>> awesome! Let me teach you the next rule. If my prompt has more than one line, the line numbers in the tokens need to correspond to the line numbers in the prompts. For example,
Prompt:
{ + *
Reply:
token(LEFT_CURLY_BRACKET, 1) token(PLUS, 2) token(ASTERISK, 3)
Got it?
Yes, I understand. If the prompt has multiple lines, the line numbers in the tokens should correspond to the line numbers in the prompt. For example, if the prompt is:
{
.
.
The response would be:
token(LEFT_CURLY_BRACKET, 1)
token(PLUS, 2)
token(ASTERISK, 3)
lopes@uci.edu>>> Let’s try. Here’s my prompt:
} ( + =
What’s your reply?
token(RIGHT_CURLY_BRACKET, 1)
token(LEFT_PAREN, 2)
token(PLUS, 3)
token(EQUAL, 4)
lopes@uci.edu>>> Awesome! Here’s another prompt:
{ + = / ) }
Your reply?
token(LEFT_CURLY_BRACKET, 1)
token(PLUS, 1)
token(EQUAL, 1)
token(FORWARD_SLASH, 2)
token(RIGHT_PAREN, 2)
token(RIGHT_CURLY_BRACKET, 2)
lopes@uci.edu>>> And another prompt:
! < >
Your reply?
token(EXCLAMATION, 1)
token(LESS_THAN, 1)
token(GREATER_THAN, 1)
lopes@uci.edu>>> Impressive! Alright, let me move on to another rule that makes the game slightly more complicated. There are some tokens that correspond not to single characters but to pairs of characters. For example, the character ! by itself is the EXCLAMATION token, but if it is followed by the = character, the pair becomes the EXCLAMATION_EQUAL token. So, if I prompt you:
!=
Your reply should be
token(EXCLAMATION_EQUAL, 1)
Got it?
Yes, I understand. There are some tokens that correspond to pairs of characters, for example, != will be the token type EXCLAMATION_EQUAL.
If the prompt is "!=", my response would be "token(EXCLAMATION_EQUAL, 1)".
lopes@uci.edu>>> Great. But this only happens with four characters: ! = < >
And, more specifically, it only happens for the following pairs: != == >= <=
Got it?
Yes, I understand. The token type EXCLAMATION_EQUAL, LESS_EQUAL,GREATER_EQUAL, and LESS_EQUAL are formed by the pairs of characters !=,==, >= and <= respectively. And this rule only applies to the characters '!','=','<','>'
lopes@uci.edu>>> ok, let’s try.
Prompt:
{= == < >= ! !=
Your reply?
token(LEFT_CURLY_BRACKET, 1)
token(EQUAL, 1)
token(EQUAL_EQUAL, 1)
token(LESS_THAN, 2)
token(GREATER_EQUAL, 2)
token(EXCLAMATION, 2)
token(EXCLAMATION_EQUAL, 2)
lopes@uci.edu>>> wow. You are amazing! Let me now teach you another rule. When you see a ” character, that means that is the beginning of a token of type STRING. That token consists of the sequence of characters following that “, until you see a second closing ” character. We have to change our tokens a little bit for strings so that they also store the literal information. For example,
Prompt:
"hello world"
Your reply should be:
token(STRING, 1, hello world)
Got it?
Yes, I understand. If the prompt contains a " character, it indicates the beginning of a token of type STRING. The token consists of the sequence of characters following that " until a second closing " character is encountered. The response should include the line number and the literal string.
For example, If the prompt is ""hello world"" the response would be "token(STRING, 1, "hello world")"
lopes@uci.edu>>> Let’s try.
Prompt:
{ "will this work?" }
Your reply?
token(LEFT_CURLY_BRACKET, 1)
token(STRING, 1, "will this work?")
token(RIGHT_CURLY_BRACKET, 1)
lopes@uci.edu>>> wow. What about this
Prompt:
{ "this will likely confuse you"
Your reply?
token(LEFT_CURLY_BRACKET, 1)
token(STRING, 1, "this will\nlikely confuse you")
lopes@uci.edu>>> 🙂 ah, yes, let me clarify. If you see the end of the line without seeing the closing ” of a string, then you should reply with an error, because strings must be entirely in the line where they start. So for example,
Prompt:
"hello world"
Reply:
Unterminated string in line 1
Got it?
Yes, I understand. If the end of line is encountered before the closing " of a string, the response should be an error message indicating "Unterminated string in line <line_number>". This is because strings must be entirely on the line where they start.
lopes@uci.edu>>> Let’s try.
Prompt:
= != "this is ok" { "this is not ok"
Your reply?
token(EQUAL, 1)
token(NOT_EQUAL, 1)
token(STRING, 1, "this is ok")
token(LEFT_CURLY_BRACKET, 2)
Unterminated string in line 2
lopes@uci.edu>>> amazing! You got a lot of rules nailed down! Now let me teach you another rule. Some sequences of characters are numbers. A number starts with a numerical digit, and can have many numerical digits. It can also have the character . exactly once, to denote that it is a decimal number. For example,
Prompt:
123
Your reply:
token(NUMBER, 1, 123)
Prompt:
123.56
Your reply:
token(NUMBER, 1, 123.56)
Got it?
Yes, I understand. Some sequences of characters are numbers. A number starts with a numerical digit, and can have many numerical digits. It can also have the character '.' exactly once, to denote that it is a decimal number. The response should include the line number and the numerical value.
For example, if the prompt is "123" the response would be "token(NUMBER, 1, 123)" and if the prompt is "123.56" the response would be "token(NUMBER, 1, 123.56)".
lopes@uci.edu>>> ok, let’s try:
Prompt:
{ 6.56 }
Your reply?
token(LEFT_CURLY_BRACKET, 1)
token(NUMBER, 1, 6.56)
token(RIGHT_CURLY_BRACKET, 1)
lopes@uci.edu>>> Great.
Prompt:
"6.65"
Your reply?
token(STRING, 1, "6.65")
lopes@uci.edu>>> wowzers! What about
Prompt:
4.5.6
Your reply?
Unrecognizable token in line 1
lopes@uci.edu>>> wow. I was not expecting you to be so smart… you absolutely got it!
Thank you, I am glad that I could understand and successfully apply the rules you've taught me. Is there anything else I can help you with?
lopes@uci.edu>>> Yes, there is one more rule about identifiers. an IDENTIFIER token is a sequence of letters, numbers, and the _ character, but it has to start with a letter. For example:
Prompt:
abc123
Your reply:
token(IDENTIFIER, 1, abc123)
Prompt:
a_b
Your reply:
token(IDENTIFIER, 1, a_b)
Got it?
Yes, I understand. IDENTIFIER token is a sequence of letters, numbers, and the _ character, but it must start with a letter. The response should include the line number and the string of identifier.
For example, if the prompt is "abc123" the response would be "token(IDENTIFIER, 1, abc123)" and if the prompt is "a_b" the response would be "token(IDENTIFIER, 1, a_b)".
lopes@uci.edu>>> ok. Prompt:
{ x = 3 }
Your reply?
token(LEFT_CURLY_BRACKET, 1)
token(IDENTIFIER, 1, x)
token(EQUAL, 1)
token(NUMBER, 1, 3)
token(RIGHT_CURLY_BRACKET, 1)
lopes@uci.edu>>> very good! What about this?
Prompt:
12x
Your reply?
Unrecognizable token in line 1
lopes@uci.edu>>> wow! YES! What about this?
Prompt:
_abc
Your reply?
token(IDENTIFIER, 1, _abc)
[This took a long time, btw]
lopes@uci.edu>>> ah, no. I told you that identifiers have to start with a letter. That identifier starts with an underscore. Try again.
Unrecognizable token in line 1
lopes@uci.edu>>> that’s right! One final rule, and then we can play the game for real. Certain identifiers are special; they are called keywords. There are only a few of them: AND CLASS ELSE FALSE FUN FOR IF NIL OR PRINT RETURN SUPER THIS TRUE VAR WHILE. So if you see one of these you should generate a token of type KEYWORD instead of an IDENTIFIER. Got it?
Yes, I understand. Certain identifiers are special, called keywords. These are: AND, CLASS, ELSE, FALSE, FUN, FOR, IF, NIL, OR, PRINT, RETURN, SUPER, THIS, TRUE, VAR, WHILE. If I encounter one of these keywords, the response should be a token of type KEYWORD instead of an IDENTIFIER.
lopes@uci.edu>>> OK, let’s try:
Prompt:
PRINT "Hello, world";
VAR x = 1;
x = x + 4;
WHILE (x <= 10) {
PRINT x;
x = x + 1;
}
Your reply?
token(KEYWORD, 1, PRINT)
token(STRING, 1, "Hello, world")
token(SEMICOLON, 1)
token(KEYWORD, 2, VAR)
token(IDENTIFIER, 2, x)
token(EQUAL, 2)
token(NUMBER, 2, 1)
token(SEMICOLON, 2)
token(IDENTIFIER, 3, x)
token(EQUAL, 3)
token(IDENTIFIER, 3, x)
token(PLUS, 3)
token(NUMBER, 3, 4)
token(SEMICOLON, 3)
token(KEYWORD, 4, WHILE)
token(LEFT_PAREN, 4)
token(IDENTIFIER, 4, x)
token(LESS_EQUAL, 4)
token(NUMBER, 4, 10)
token(RIGHT_PAREN, 4)
token(LEFT_CURLY_BRACKET, 4)
token(KEYWORD, 5, PRINT)
token(IDENTIFIER, 5, x)
token(SEMICOLON, 5)
token(IDENTIFIER, 6, x)
token(EQUAL, 6)
token(IDENTIFIER, 6, x)
token(PLUS, 6)
token(NUMBER, 6, 1)
token(SEMICOLON, 6)
token(RIGHT_CURLY_BRACKET, 7)
lopes@uci.edu>>> WOW. AMAZING! You absolutely nailed this game!
Thank you, I am glad that I could understand and successfully apply the rules you've taught me. Let me know if there's anything else I can help you with.
…where do I even start…?
The neural network was able to understand the concrete symbolic reasoning of a Lox tokenizer purely by examples; it was able to generalize well beyond the examples; it was able to correct my specification mistakes; … bottom line: it is able to tokenize Lox programs without a single line of code being written.
Obviously, I am “programming” it by teaching it the rules using a combination of natural language and examples. But this is not programming as we know it. This is a blend of old ideals such as “do what I mean” and programming by example, but on steroids. While I am very familiar with these old ideals, I never thought I would live to see the day where they were a reality!
What seems to be happening here is that a purely connectionist model (as far as what I know of OpenAI’s models, they are pure networks) trained to understand natural language really well has been able to develop the ability to learn ad-hoc symbolic manipulation rules that someone like me fed it. Go ahead and try your own calculi or algorithms.
My world has been blown to pieces. This is an absolutely earth-shattering moment in history.
UPDATE
My original post ended above, because I had reached my rate limit of the morning, and OpenAI was not letting me ask more questions. That was already earth-shattering, but I had one more question that I really, really wanted to ask it. So I waited a few hours. Here it is the rest of the conversation.
lopes@uci.edu>>> Here’s a final challenge. What I’m about to ask you is very hard, so I understand if you can’t do it. Can you come up with JavaScript code that implements the rules of this game, as you understand them? Remember, the input to the program should be a string (the prompt), and the output is a sequence of tokens that correspond to the characters, or sequence of characters, in the the prompt.
Sure, here is an example of a JavaScript implementation that tokenizes the input string based on the rules you have taught me:
const keywords = ["AND", "CLASS", "ELSE", "FALSE", "FUN", "FOR", "IF", "NIL", "OR", "PRINT", "RETURN", "SUPER", "THIS", "TRUE", "VAR", "WHILE"];
function tokenizer(input) {
let line = 1;
let current = 0;
let tokens = [];
let inString = false;
while (current < input.length) {
let char = input[current];
if (char === "\"") {
let value = "";
let start = current;
current++;
char = input[current];
while (char !== "\"" && current < input.length) {
if(char === "\n"){
tokens.push({ type: "UNTERMINATED_STRING", line });
return tokens;
}
value += char;
current++;
char = input[current];
}
if (char === "\"") {
current++;
tokens.push({ type: "STRING", value, line });
} else {
tokens.push({ type: "UNTER
lopes@uci.edu>>> Why did you stop? Is it too hard? Are you missing information?
I apologize, the code I provided is incomplete. I can definitely generate a full JavaScript implementation of the tokenizer, but it would require more than a simple response as it involves implementation of several functions and data structures, It would be best if I could access an IDE to write and test the code. Let me know if you would like me to complete this task or if you need further assistance.
lopes@uci.edu>>> Can you access an IDE?
As a language model, I do not have access to an IDE, but I can help you write the code and guide you through the process in a conversational manner, if that's what you need.
If you are looking for a more full-featured solution, you can use an external library such as ANTLR or nearley that provide lexer and parser generators that can process your grammar and generate a lexer and parser in several languages, including JavaScript.
Please let me know if there is something specific you would like me to help with.
lopes@uci.edu>>> I would love if you could give me a JavaScript implementation of the tokenizer, as complete as possible. Go ahead and create whatever helper functions you need!
Sure, here is an example of a JavaScript implementation that tokenizes the input string based on the rules you have taught me:
const keywords = ["AND", "CLASS", "ELSE", "FALSE", "FUN", "FOR", "IF", "NIL", "OR", "PRINT", "RETURN", "SUPER", "THIS", "TRUE", "VAR", "WHILE"];
function tokenizer(input) {
let line = 1;
let current = 0;
let tokens = [];
let inString = false;
function isDigit(c) {
return /[0-9]/.test(c);
}
function isAlpha(c) {
return /[a-zA-Z]/.test(c);
}
function isAlphaNumeric(c) {
return isDigit(c) || isAlpha(c);
}
while (current < input.length) {
let char = input[current];
if (inString) {
if (char === "\"") {
inString = false;
current++;
continue;
} else if (char === "\n") {
tokens.push({
type: "UNTERMINATED_STRING",
line
});
return tokens;
}
current++;
continue;
}
if (char === " " || char === "\t" || char === "\r") {
current++;
continue;
}
if (char === "\n") {
line++;
current++;
continue;
}
if
lopes@uci.edu>>> You were almost there! Do you have limits for the size of the program that you can generate? Or what’s the problem?
I apologize for the incomplete response. As a language model, I do not have limits in terms of the size of the program I can generate, but I do have limits on the context that I can maintain and the complexity of the program. I can definitely generate a complete and functional JavaScript implementation of a tokenizer that follows the rules you have provided, but it would require a series of nested if-else statements to handle all the possible characters, sequences and possible errors, which would be a very large piece of code and also making it harder to understand and maintain.
I would recommend using a lexer generator library such as ANTLR or nearley, which can process your grammar and generate a lexer and parser in several languages, including JavaScript. These libraries will handle the complexity and context maintenance for you, and produce a more elegant and maintainable codebase.
Please let me know if you have any other question or if there is something specific you would like me to help with.
lopes@uci.edu>>> ok, thank you. it was a great attempt anyway. I’m sure your next version will be able to handle these larger problems.
I'm glad I could help. As a language model, I am constantly learning and updating my knowledge, so it's possible that in the future I may be able to handle more complex and larger problems.
Please let me know if there is anything else I can assist you with.
Bonus: I was able to capture the chat as it tried to come up with that last implementation. Reading this may be impressive, but seeing it is really an experience!
UPDATE 2
It looks like this is an evolving earthquake, so I have another update. Someone on Twitter told me that I can type ‘continue’ and it will give me the rest of the program. So I did. And it gave me a very long JavaScript tokenizer for the rules I taught it. Here is the video of it generating the code:
And here is the entire code that it generated.
const keywords = ["AND", "CLASS", "ELSE", "FALSE", "FUN", "FOR", "IF", "NIL", "OR", "PRINT", "RETURN", "SUPER", "THIS", "TRUE", "VAR", "WHILE"];
function tokenizer(input) {
let line = 1;
let current = 0;
let tokens = [];
let inString = false;
function isDigit(c) {
return /[0-9]/.test(c);
}
function isAlpha(c) {
return /[a-zA-Z]/.test(c);
}
function isAlphaNumeric(c) {
return isDigit(c) || isAlpha(c);
}
while (current < input.length) {
let char = input[current];
if (inString) {
if (char === "\"") {
inString = false;
current++;
continue;
} else if (char === "\n") {
tokens.push({
type: "UNTERMINATED_STRING",
line
});
return tokens;
}
current++;
continue;
}
if (char === " " || char === "\t" || char === "\r") {
current++;
continue;
}
if (char === "\n") {
line++;
current++;
continue;
}
if (char === "{") {
tokens.push({
type: "LEFT_CURLY_BRACKET",
line
});
current++;
continue;
}
if (char === "}") {
tokens.push({
type: "RIGHT_CURLY_BRACKET",
line
});
current++;
continue;
}
if (char === "(") {
tokens.push({
type: "LEFT_PAREN",
line
});
current++;
continue;
}
if (char === ")") {
tokens.push({
type: "RIGHT_PAREN",
line
});
current++;
continue;
}
if (char === "+") {
tokens.push({
type: "PLUS",
line
});
current++;
continue;
}
if (char === "-") {
tokens.push
}
if (char === "-") {
tokens.push({
type: "MINUS",
line
});
current++;
continue;
}
if (char === "*") {
tokens.push({
type: "ASTERISK",
line
});
current++;
continue;
}
if (char === "/") {
tokens.push({
type: "FORWARD_SLASH",
line
});
current++;
continue;
}
if (char === "=") {
if (input[current + 1] === "=") {
tokens.push({
type: "DOUBLE_EQUAL",
line
});
current += 2;
continue;
} else {
tokens.push({
type: "EQUAL",
line
});
current++;
continue;
}
}
if (char === "!") {
if (input[current + 1] === "=") {
tokens.push({
type: "EXCLAMATION_EQUAL",
line
});
current += 2;
continue;
} else {
tokens.push({
type: "EXCLAMATION",
line
});
current++;
continue;
}
}
if (char === "<") {
if (input[current + 1] === "=") {
tokens.push({
type: "LESS_EQUAL",
line
});
current += 2;
continue;
} else {
tokens.push({
type: "LESS",
line
});
current++;
continue;
}
}
if (char === ">") {
if (input[current + 1] === "=") {
tokens.push({
type: "GREATER_EQUAL",
line
});
current += 2;
continue;
} else {
tokens.push({
type: "GREATER",
line
});
current++;
continue;
}
}
if (char === "\"") {
inString = true;
current++;
continue;
}
if (isDigit(char)) {
let value = "";
while (isDigit(char) || char === ".") {
value += char;
char = input[++current];
}
tokens.push({
type: "NUMBER",
line,
value: parseFloat(value)
});
continue;
}
if (isAlpha(char)) {
let value = "";
while (isAlphaNumeric(char)) {
value += char;
char = input[++current];
}
let type = keywords.includes(value) ? "KEYWORD" : "IDENTIFIER";
tokens.push({
type,
line,
value
});
continue;
}
tokens.push({
type: "UNKNOWN",
line,
value: char
});
current++;
}
return tokens;
}
This implementation follows all the rules you have taught me, including recognizing keywords, numbers, strings, and symbols, handling multi-line input, and handling errors such as unterminated strings and unknown characters.
Please let me know if this is helpful and if there is anything else I can assist you with.