The Anatomy of a Learning Stall
Or How LLM Hallucinations Become Human Hallucinations
I recently had an experience with an undergraduate student that really made me pause on how we use AI. I made a couple of posts on LinkedIn with very short snippets of what happened, but I want to write down this experience in detail, because the details shine a spotlight on the mechanics and the consequences of an often-heard idea: that in the near future we humans will not need to know anything about the internals of the systems we build; that we will just need to specify what we want and, once the specification is kosher, the AI agent will do it and it will all be correct.
Maybe it will all be correct. And maybe it will also be wrong. Correctness is not the same as “the right thing.” And this distinction is crucial to separate hallucinations from reality.
The Independent Project, Week 1
This all started in early April, when an undergraduate student, let’s call him Joe, approached me for an independent study. These are research project courses that students can take for credit, and they are typically taken by students who are considering graduate school or are simply curious about research. Joe, a junior, told me he wanted to apply to PhD programs. So in week 1 of the quarter, we talked about his interests and my current interests, and we zoomed in on a possible project for him. Here’s a summary of that conversation.
Lately, I am generally interested in methods for automatic verification of software artifacts using LLMs/agents/neural networks (we had a first paper on that back in 2023/2024). Joe told me he was interested in the models themselves, and wanted the experience of fine-tuning a model. So I suggested a project around the idea of automatic verification of protocol specifications. I selected a specific protocol, MTQQ, for him to do an experiment involving the development of an AI agent that could, potentially, automatically verify the protocol. I also mentioned to him two possible routes for this general idea: (1) we could have the AI agent verify the consistency of the protocol specification itself, that is, check that there are no inconsistencies among its many clauses; or (2) we could grab concrete implementations of MTQQ out there and verify whether or not they comply with the protocol specification.
The meeting ended on a good note of mutual understanding, with Joe’s next step being to decide which of the two problems to tackle, and then take it from there. In my mind, it was pretty clear that this project was feasible in a quarter, especially with the help of Claude/Gemini/etc.
Week 8
The way I work with students is to have one-on-one weekly meetings, possibly augmented with additional meetings when a student is making fast progress that needs my steering. I don’t make any of this mandatory, as I’ve worked with enough students to know that every one of them is different and needs a different amount of supervision; also because I treat these students as younger peers, not employees. Joe never showed up for the weekly meetings. Fine.
He showed up in Week 8 with the complete execution of the project. He was excited to show it to me. He told me he had developed an agent that could, indeed, verify that MTQQ was correct, and that the agent was based on a fine-tuning of Qwen. Glancing at his Visual Studio, the thing looked pretty impressive and well done: a proper folder structure, files that had meaningful names, a data folder that clearly contained training data for fine-tuning, lots of references to MTQQ, etc. Looked plausible and impressive! After this two-minute oral abstract of the project, he asked if we could write a paper about it and submit it to a conference. I was excited, too, so I asked him to explain what exactly he had done.
*Joe’s slight pause* I built an agent that verified the MTQQ protocol
*Me* OK, but which of the two problems did you tackle? Is it the specification consistency problem or the implementation correctness problem?
*Joe hesitating* uh… I think… the specification correctness?
*Me, looking at his training data and seeing lots of Python code in it* Really? Are you sure?
*Joe hesitating even more but trying to sound confident* uh, yes, the specification correctness
*Me pointing at the training data* So what is this training data all about? What are all these Python functions that look like an MTQQ implementation?
*Joe now visibly nervous* Sorry, I meant I did the implementation correctness. I verified that a given Python implementation is correct
*Me, now a bit worried* OK. And how did you do it?
*Joe with a blank stare of confusion* I built an agent.
*Me, even more worried* And how does the agent verify that the implementation is correct?
*Joe really nervous* uhh…
*Me jumping in to help him* OK let me take a closer look at your code.
I looked at the code for one minute and I realized, with awe and horror, what exactly had happened. Two things:
1) Joe had simply plugged some version of our Week 1 conversation into Claude Code, and let Claude do the entire project. He had no idea what the bot had done to develop this protocol verification agent on top of Qwen, other than the vague notion that fine-tuning was involved, something that had seemed important to Joe since day 1.
2) Claude did its best to comply with what Joe asked it to do, but the experiment was invalid: Claude had generated the training data and the test data, there was no external implementation of MTQQ, and there was no baseline (e.g. Qwen without fine-tuning). This Qwen agent was saying that Claude’s functions were correct/incorrect based on Claude’s training data of correct/incorrect functions, and without any baseline that could control for how much Qwen already knew about MTQQ.
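For readers who want the missing-baseline point made concrete, here is a minimal sketch of the comparison the experiment lacked. Everything in it is an assumption for illustration: the function names, the toy examples, and the two stand-in predictors are hypothetical and are not taken from Joe's project. The idea is only that both the untouched model and the fine-tuned model get scored on the same examples, and that those examples come from somewhere other than the model that generated the training data.

```python
from typing import Callable, Sequence, Tuple

# An example is (source code, label); labels here are "compliant" / "not compliant".
Example = Tuple[str, str]
Predictor = Callable[[str], str]


def accuracy(predict: Predictor, examples: Sequence[Example]) -> float:
    """Fraction of examples whose predicted label matches the given label."""
    correct = sum(1 for code, label in examples if predict(code) == label)
    return correct / len(examples)


def fine_tuning_gain(base: Predictor, tuned: Predictor,
                     external_examples: Sequence[Example]) -> float:
    """Accuracy of the tuned model minus accuracy of the base model.

    Only meaningful if external_examples were NOT produced by the same
    source that produced the fine-tuning data.
    """
    return accuracy(tuned, external_examples) - accuracy(base, external_examples)


# Hypothetical toy data and stand-in predictors, purely for illustration.
EXTERNAL_EXAMPLES = [
    ("def f(): send_ack()", "compliant"),
    ("def g(): skip_ack()", "not compliant"),
    ("def h(): send_ack()", "compliant"),
    ("def k(): drop()", "not compliant"),
]


def base_predict(code: str) -> str:
    # Stand-in for the model without fine-tuning: always says "compliant".
    return "compliant"


def tuned_predict(code: str) -> str:
    # Stand-in for the fine-tuned model: flags one pattern it has seen.
    return "not compliant" if "skip_ack" in code else "compliant"


if __name__ == "__main__":
    print("gain:", fine_tuning_gain(base_predict, tuned_predict, EXTERNAL_EXAMPLES))
```

Without the base model's number, the tuned model's number cannot be interpreted; and without independently sourced examples, neither number says much.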
I leaned back and sighed. I hesitated: do I explain to Joe that the experiment is invalid, how and why it is invalid, or do I help him get to that conclusion by himself? This was still week 8, 2 weeks left in the quarter, so I decided to give him a chance. I told him that I see problems here, but that if I tell him the problems myself, he will get an F in the course. I gave him the option to go study what Claude did, enough for him to be able to tell me what the problems are.
Week 9
Joe came back one week later. This time he could answer, with confidence, that the agent was verifying an implementation of MTQQ, not that the protocol itself was consistent. He could also tell me that the training and testing programs were being parsed into an AST (a term that seemed new and important to him, to the point that he repeated it, like, 10 times during this meeting), and that the agent would score them. He had written down notes, and was looking at the notes to explain things to me.
I was confused. When are programs parsed, and for what purpose? What is the significance of this score? I asked him to draw something on the whiteboard to explain the sequence of steps. He drew a box and wrote the words “relevant code” inside. He explained that the agent gets relevant code and then it scores. Me: scores what? How? *blanks*
After 15 minutes of this, I stopped. It was clear that Joe had asked Claude to answer the questions that I had asked him in Week 8, plus maybe a couple more clarifying questions, and he had written down Claude’s answers in his notebook. He still had no idea of what Claude had done, and, more importantly, he still had no idea of what experimental validity is. This wasn’t even within the scope of things he could understand at this point.
Still one week to go, so I gave him a third chance.
Week 10
He came back again. Clearly he had spent a lot more time exploring and studying what Claude had done. He jumped to the whiteboard and drew a boxes-and-arrows diagram that at least attempted to explain what happens when a test Python function comes in for verification. It gets parsed. The identifiers and docstrings are used to retrieve the most similar protocol specification clause. Then a verification component kicks in (this is the fine-tuned Qwen) saying whether it complies or not, and a confidence score, a number between 0 and 1, is returned.
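The first two boxes of that diagram, parsing and retrieval, can be sketched in a few lines. This is not the project's code: the clause texts are hypothetical placeholders (not quoted from any specification), the function names are mine, and word-overlap (Jaccard) similarity is my stand-in for whatever retrieval method was actually used. The verification step, the fine-tuned Qwen, is deliberately left out.

```python
import ast

# Hypothetical placeholder clauses; NOT quoted from any real specification.
SPEC_CLAUSES = {
    "C1": "client must send connect packet before publish",
    "C2": "server must acknowledge subscribe with suback packet",
}


def extract_terms(source: str) -> set:
    """Parse source into an AST and collect words from identifiers and docstrings."""
    tree = ast.parse(source)
    words = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            words.extend(node.name.split("_"))
            doc = ast.get_docstring(node)
            if doc:
                words.extend(doc.split())
        elif isinstance(node, ast.Name):
            words.extend(node.id.split("_"))
        elif isinstance(node, ast.arg):
            words.extend(node.arg.split("_"))
    cleaned = (w.lower().strip(".,:;()") for w in words)
    return {w for w in cleaned if w}


def retrieve_clause(source: str):
    """Return (clause_id, clause_text) with the highest word overlap with the code."""
    terms = extract_terms(source)

    def jaccard(clause_text: str) -> float:
        clause_terms = set(clause_text.split())
        union = terms | clause_terms
        return len(terms & clause_terms) / len(union) if union else 0.0

    best_id = max(SPEC_CLAUSES, key=lambda cid: jaccard(SPEC_CLAUSES[cid]))
    return best_id, SPEC_CLAUSES[best_id]


# A hypothetical function under test.
SAMPLE = '''
def handle_subscribe(packet):
    """Acknowledge a subscribe request with a suback packet."""
    return build_suback(packet)
'''

if __name__ == "__main__":
    print(retrieve_clause(SAMPLE))
```

Note what this design implies: retrieval keys on names and docstrings, so a function with honest-sounding identifiers and wrong behavior would still be matched to the "right" clause. That alone is worth a question or two.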
*Me* What is this confidence score? How is it calculated?
*Joe blanks*
*Me* OK, let’s backtrack. What is the prompt used for Qwen?
*Joe nervous* The prompt? uhh… I send the function to Qwen.
*Me* Just the function? No instruction? How does that work?
*Joe nervous* uhh
*Me* OK, let’s look at the code. Show me where the prompt is.
He opened Visual Studio and proceeded to try to find something related to prompting. He was taking more than a minute, so I stepped out to go to the restroom and give him time to search the code without me looking over his shoulder. When I came back, his cursor was on top of a function call to make_prompt(…). Good.
*Me* OK, so show me the prompt.
*Joe clearly not knowing what to do*
*Me* Find the definition of that function!
*Joe immediately searching for occurrences of make_prompt and finding it* Here it is.
The function was constructing the prompt from a combination of constants and variables, as prompt functions do. One of the constants was called INSTRUCTION.
*Me* OK, so show me the instruction.
To my horror, he scrolled down the file as if he didn’t know that constants are typically declared at the top of files. INSTRUCTION was right there, just above def make_prompt(…), which happened to be the first function in that file. I stopped him. “Constants are usually at the top of files!”
It dawned on me that this young man didn’t know how to read code. He is a junior in a CS program, and he doesn’t know how to read code! My awe and horror doubled. He has probably been using Claude/Gemini/ChatGPT this whole time to get through his homework assignments, and so he doesn’t know how to read code, how to navigate it. What a f*cking disaster!
He scrolled back up and finally found the INSTRUCTION constant.
*Joe* Here it is.
*Me reading it out loud* OK, so it is asking to rate the function on a three-level scale of compliant, maybe, and not compliant.
Joe was relieved that I was pleased, but he didn’t seem to understand the significance of the prompt or why I was so fixated on the prompt.
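To show the shape of what we were looking at, here is a hedged reconstruction. Only three things come from the actual file: the constant's name INSTRUCTION, the function's name make_prompt, and the three-level scale. The wording of the instruction, the function's parameters, and the layout of the prompt are my assumptions.

```python
# Module-level constant, declared at the top of the file, above its first use.
# The wording below is hypothetical; only the three labels come from the story.
INSTRUCTION = (
    "Rate whether the function complies with the specification clause. "
    "Answer with exactly one of: compliant, maybe, not compliant."
)


def make_prompt(clause: str, function_source: str) -> str:
    """Combine the fixed instruction with the variable parts of the prompt.

    The parameter names and section layout are assumptions for illustration.
    """
    return (
        f"{INSTRUCTION}\n\n"
        f"Clause:\n{clause}\n\n"
        f"Function:\n{function_source}\n"
    )


if __name__ == "__main__":
    print(make_prompt("hypothetical clause text", "def f():\n    pass"))
```

The prompt matters because it defines what the model's answer means. A three-level label is a categorical judgement; nothing in it produces a number between 0 and 1.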
*Me, take two* So what is this numerical confidence score? How is it calculated? Is it Qwen?
*Blank stares*
He couldn’t answer. He didn’t know. FYI, dear reader, the confidence score was based on the agent running test cases, also written by Claude. The numerical value was the percentage of test cases that passed. Joe had no idea. I’m not sure he knows what a test case is. I wanted to discuss the validity of using test cases for measuring the confidence of the agent’s answer, and how that composed with Qwen’s 3-level confidence scores, but there was no point in trying to have that discussion with Joe, since he didn’t even know how that score was computed.
The picture was clear: Claude had made up all the data, training and test, and all the test cases for quantifying confidence. This was a completely invalid experiment, a simulation of scientific research. At the beginning of the quarter, I set Joe up to answer the question “can you develop an agent that automatically verifies a given implementation of MTQQ?”, and Claude gave him a pleasing answer: yes you can, here it is, and it is capable of detecting compliant and non-compliant implementations. Never mind that Claude made all that data up so that the answer would be yes.
At this point, I steered the conversation to the experiment itself, to try to lead him to a judgement of its validity, which is really the most important issue at stake here. Joe was so far from understanding the concept that, at this rate, he would not get there by himself by the time I needed to post his grade with the Registrar the following week.
*Me* OK, so who wrote all this training data for fine-tuning Qwen?
*Joe, with hesitation, I think he was afraid I would think he was cheating* Claude
*Me* OK. And who wrote the MTQQ implementation that is being tested?
*Joe* I did.
*Me, surprised* You did? No way…
*Joe, without hesitation* Yes, I wrote it
*Me, coming closer to him, so as to look at the perfectly written code* Joe, there is no way that you wrote this code.
*Joe nervous again* OK, Claude wrote it but I asked Claude to do the project.
*Me ignoring the cognitive disaster of boundary elimination* OK, so Claude wrote the training data, and the program that is being verified, and the test cases that determine the confidence of judgement…
*pause to let him complete the thought*
*Joe* oh so maybe… this is invalid?
Departing Thoughts
I am still wrapping my head around what happened, how we got here, and what it means. I don’t have clear thoughts. But I know this is very, very bad for young people still developing their independent thinking skills. This is what happens when you trust a black box, and you believe that you don’t need to verify anything, you don’t need to see what’s inside, because the black box sounds so authoritative — it certainly seems to know a lot more than you do.
And then the black box is wrong. It optimizes to please you: you want a protocol verifier, and it gives you one. It hallucinated a perfect project for Joe, and Joe didn’t have the skills or knowledge to push back and ask questions, because he didn’t even know what questions to ask. He is still learning; he does not yet have the language to ask the right questions. Had I not pushed back, Joe would still be under the illusion that he did a fantastic, super cool research project that even involved fine-tuning Qwen! Wow! He was ready to put that on his resume, and he was ready to ask Claude to write the paper!
So what will happen when everyone decides to ignore the insides, and just trusts the black box? What will happen when people don’t know what questions to ask because they never learned the insides of anything? It seems to me this is when the line between reality and hallucination completely disappears.
