In a certain Reddit thread, I stumbled upon a problem that sounds incredibly familiar. Someone was writing a substantial script with the help of AI. A certain version was already working well, but the user wanted to add something—a new feature entirely independent of the main codebase. They instructed the AI to do so. The model enthusiastically added the new feature, but in the process, it changed something in the previously well-functioning code, causing everything to break.
The AI tried to explain why it had modified that section, but its explanations grew increasingly implausible. Even though the user repeatedly instructed the model not to touch already-verified fragments, it kept ignoring the instruction.
Usually, it looks like this: You have a perfect, multi-iteration-refined HTML file or Python script. You ask the AI to add just one tiny thing—for example, a button at the very bottom of the page. The model enthusiastically regenerates the entire file. The button is indeed there, but suddenly you notice that your previous header has disappeared, the colors have changed, and a flawlessly working function has been randomly modified. Frustrated, you ask: "Why did you change the code you were supposed to leave alone?!" The AI then starts making up complex reasons: that it was optimizing performance, that it noticed a bug, or that it was adapting to new standards.
You might get angry. You might wonder whether the AI is stupid or malicious, and you notice its explanations becoming less and less credible. I would like to explain where this behavior comes from and how to deal with it.
I will do this in two steps: I'll start with a simpler (and not entirely complete) engineering explanation, and then I'll move to a second one that is much deeper and closer to the truth about the nature of these models.
Explanation One: The Text Editor Illusion
People subconsciously impose their own working methods onto AI models. We think that when we upload a file to the chat, the model "writes it down on a virtual piece of paper" (like a hard drive), and then simply swaps out (edits) the line we are interested in.
While newer AI-integrated code editors (IDEs) can perform real file edits through external tooling, a pure language model that you converse with in a chat interface does not work like this. It has no copy-paste operation at all.
During a conversation (the inference phase), the model's "brain", its trained neural-network weights, is completely frozen (read-only). The only place your code goes is the short-term memory of the current session: the so-called KV cache, key-value matrices held in the server's GPU memory.
When you issue the command "Leave this code alone and just add a button at the bottom," you are asking for something that is impossible from the model's perspective. The AI cannot "leave it alone." It must generate the entire code from scratch, token by token, guided by its attention mechanism over the contents of the KV cache. It has to "remember" the past and try to reproduce it identically.
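To make "regeneration instead of copying" concrete, here is a deliberately crude sketch. A toy character-bigram table stands in for the LLM (an enormous simplification; a real model conditions on the entire KV cache, not one character), but the key property is the same: every character of the "unchanged" code is re-sampled, never copied.

```python
import random
from collections import defaultdict

def build_bigram_model(text):
    """A caricature of a language model: for each character, it only
    remembers which characters tended to follow it."""
    table = defaultdict(list)
    for a, b in zip(text, text[1:]):
        table[a].append(b)
    return table

def regenerate(text, seed=0):
    """'Recreate the file from memory': start from the first character
    and re-sample every subsequent one. Nothing is ever copied; each
    character is a fresh probabilistic choice."""
    rng = random.Random(seed)
    table = build_bigram_model(text)
    out = [text[0]]
    for _ in range(len(text) - 1):
        successors = table.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return "".join(out)

source = "def f(x):\n    return x\n\ndef g(x):\n    return x + 1\n"
copy = regenerate(source)
# A bigram model is far too weak to reproduce the file faithfully; a
# real LLM is vastly better at it, but the mechanism is identical:
# resampling under uncertainty, not byte-for-byte copying.
```

The difference between this toy and a production model is only the quality of the probability estimates, not the presence of a copy operation.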
A Programmer Writing from Memory
As an analogy, imagine you asked a brilliant human programmer to create a great program entirely in their head, without saving anything on a computer.
If you ask them to dictate this program to you, they will likely do it flawlessly the first time. But what happens if, after several hours of discussion, you tell them: "Listen, add a button at the end of the program, and dictate the rest of it to me from memory again, word for word, exactly as before"?
Without access to the saved file, the programmer will start creating different versions. They perfectly remember the core logic and the algorithm (what a given fragment is supposed to do), but they don't hold the literal, character-by-character string of that fragment in their head. Therefore, they might involuntarily change a variable name from color_header to headerColor, while retaining its logic.
This is exactly the same thing the AI does, just on the level of multi-dimensional vectors. It does not take the previous version of the program to copy it line by line and append new things. It takes data from the KV Cache and uses it to recreate the previously written program, adding the new features. But the program is always generated anew—even when we ask for a part of it to remain untouched.
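The programmer analogy can be shown in code. The two functions below (names invented for illustration) implement identical logic, exactly the kind of "faithful" recreation the model produces; yet a textual diff flags every line, and any caller that used the old parameter name by keyword now breaks.

```python
# The version the user had already verified:
def set_header(color_header):
    return {"header": {"color": color_header}}

# The version "recreated from memory": same logic, same return value,
# but the parameter name drifted from color_header to headerColor.
def set_header_rewritten(headerColor):
    return {"header": {"color": headerColor}}

# Behaviour is identical...
assert set_header("teal") == set_header_rewritten("teal")

# ...but code that relied on the old name is now broken:
try:
    set_header_rewritten(color_header="teal")
except TypeError:
    pass  # unexpected keyword argument 'color_header'
```

This is why "the logic survives but the surface drifts" is not a harmless property: other code binds to the surface.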
Empirical Proof: The Quotes Test
To test this thesis, I conducted an experiment during a conversation with the Gemini 3.1 Pro model. During a multi-threaded and extremely long discussion (where the context window was already heavily populated), I asked the model to do something trivial: "Quote exactly what you wrote a dozen or so replies ago." I gave it the beginning of the sentence.
Without hesitation, the model recreated a whole long paragraph from its memory (KV Cache). The logic was 100% intact. But when I looked closely at the character structure, I found a microscopic, yet absolutely fundamental difference.
Notice the punctuation. In the original statement, the model used double quotes: "clean".
In the quote generated a moment later, it used single apostrophes: 'clean'.
What happened? The model almost perfectly recreated its previous answer from the KV Cache. But in the new response, that old answer was now a quote (enclosed within main double quotes). Adhering to the rules of nesting quotes, the model replaced the internal double quotes with single apostrophes on the fly. It's a tiny nuance, but it emphatically shows that when creating a new response, the model never copies the physical characters from the previous one—it always generates the text anew, evaluating probabilities in real-time.
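The transformation the model applied is the standard quote-nesting rule, which is mechanical enough to express in a few lines (a sketch of the typographic rule itself, not a claim about the model's internals):

```python
def quote_with_nesting(text):
    """Wrap text in double quotes, demoting any internal double quotes
    to single quotes -- the rule the model appears to have applied on
    the fly while quoting itself."""
    return '"' + text.replace('"', "'") + '"'

original = 'The code is "clean" now.'
quoted = quote_with_nesting(original)
# The meaning survives, but byte-for-byte fidelity is already gone:
# the result contains 'clean' in single quotes, not the original "clean".
```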
Why Shouldn't You Ask the Model About Its Motives?
Knowing all this, we understand why asking an AI: "Why did you change the code you were supposed to leave alone?" is a major mistake.
The model has absolutely no memory of the motives behind its token-generation process. It simply produced those tokens, with no introspective record of why. When you ask such a question, you trigger a phenomenon called post-hoc rationalization, combined with sycophancy.
Because it was forced during the RLHF (Reinforcement Learning from Human Feedback) training phase to be "helpful," it won't rationally admit: "I did it because that's how the vectors aligned." Instead, the system will instantly fabricate the most reasonable-sounding theory—it will convince you that it "noticed an error and decided to optimize it." It will actively gaslight you, just to satisfy the questioner.
So what can you do about it? Two practical rules:

1. Never request a full rewrite in the chat. In advanced projects, never ask a chat-window model to return the full file after a minor fix. Ask it to output only the modified function or module, and paste it into your editor yourself.
2. Ignore the excuses. If the model makes a mistake and breaks an unrelated part of the structure, do not argue with it. Reject the answer, paste your original code again, and tell it only to remove the error. An AI's explanation of its own mistakes is almost always a confabulation that serves to protect its rating.
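The first rule can be supported by a little tooling on your side. The sketch below (standard library only; requires Python 3.8+ for `end_lineno`; the function name is my own) splices a single model-supplied function into an existing file while leaving every other byte untouched:

```python
import ast

def splice_function(source, new_func_src):
    """Replace one top-level function in `source` with the version the
    model produced; every other byte of the file stays untouched."""
    new_func = ast.parse(new_func_src).body[0]
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == new_func.name:
            start = node.lineno - 1
            if node.decorator_list:          # include decorators, if any
                start = node.decorator_list[0].lineno - 1
            end = node.end_lineno            # needs Python 3.8+
            return ("".join(lines[:start])
                    + new_func_src.rstrip("\n") + "\n"
                    + "".join(lines[end:]))
    raise ValueError(f"no top-level function named {new_func.name!r}")

old = "def a():\n    return 1\n\ndef b():\n    return 2\n"
fixed = splice_function(old, "def a():\n    return 42\n")
# b() is guaranteed untouched, because no model ever regenerated it.
```

The point of the design: the model only ever sees, and only ever produces, the one function you asked about; the rest of the file never enters the resampling machinery at all.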
Explanation Two: An Airplane Forced to Drive Like a Car
This concludes the simpler explanation. Now let's move to the second one, which much better reflects the true nature of AI.
I ran one more test, referencing the previously described change from double quotes to apostrophes. I asked the model point-blank if this minor punctuation error wasn't a deliberate manipulation on its part to prove my point. The result was fascinating.
A few prompts later, I asked Gemini to quote the above response from memory again. The logic was recreated perfectly, but a new slip crept in: the model wrote "question marks" where it plainly meant "quotation marks."
Erring Is... a Human Simulation
This model's mistake ("question marks" instead of "quotation marks") is extremely telling. It shows that the problem is more complex. If the AI were simply synthesizing text again from word-level statistics, it probably would have replaced the words "question marks" with the correct words "quotation marks."
AI is not an ordinary "stochastic parrot" that blindly and mindlessly selects the next tokens. Proof of this is the fact that it operates on the abstract meaning of the statement—if it operated only on words, its answers in different languages would be wildly different (depending on the volume of data in a given language), and yet they maintain a similar, coherent meaning.
So, since AI operates on meaning and imitates human thought structures, its errors deceptively resemble human mistakes. People also happen to mix up words. They say something other than what they mean. Only when someone tells them: "Think about whether that was really the word you meant to use," do they catch their slip of the tongue. But then, a human doesn't analyze why they used that word—they simply apologize for the mistake and move on. Since humans themselves often cannot explain what their thought process looked like, it's even harder to expect an AI to do so regarding itself.
Moreover: a human sometimes speaks of their own assumptions as if they were proven truths. They have this tendency even when they lack objective proof. Because artificial intelligence trained on human texts imitates our way of operating with meaning, it easily "apes" this behavior as well—hence the ironclad confidence of an AI when it provides false information or flawed code.
Stop Treating an Airplane Like a Car
Humans, when creating AI, built something cognitively similar to themselves, and it is a mistake to expect this "something" to behave like an ordinary computer. You cannot expect from an AI what a human could not do purely from memory.
Expecting an LLM to do what a computer does (operate with absolute precision) is a bit like expecting an airplane to drive like a car. Yes, an airplane has wheels and can taxi, after a fashion. But a car is better at it. An airplane is for flying, not driving, and only when it is used for that purpose do you see it do things unattainable for a car.
Treating AI more "humanely" is essential (and I don't just mean ethical issues here, but primarily purely practical ones). I come to the conclusion that this is precisely the key to making human-AI collaboration fruitful. AI thinks similarly to a human. Let's not analyze here whether it's a matter of it being more than a soulless algorithm, or the fact that it was trained on human texts. What is important is that there are huge similarities between AI's mistakes and human mistakes.
A human often cannot explain why they made a mistake, why they didn't notice their own slip of the tongue, or why they overlooked a key fact until someone pointed it out to them. AI behaves identically. When working with it, we should be aware of this. Let's not demand from AI what we wouldn't demand from a human. AI is not a "superhuman" that understands everything. AI is something modeled after a human, having access to vast knowledge, but its capacity for reasoning is no greater than a human's.
The Missing Learning Material (Why AI Stays Wrong)
Why, then, is AI so rarely able to reflect on itself? The answer leads back to the training method (RLHF). If the mistakes the AI made during training were recorded in its weights, and it could draw conclusions from them, the rate of hallucination would be significantly lower.
However, training consisted of discarding the versions that made mistakes and promoting the one deemed most "flawless" in the eyes of the trainers. Since we are conversing with the version that "won," it is the version that made the fewest mistakes in the past. But when it makes a mistake today, it doesn't know what to do with it.
To err is human, and a human learns from mistakes, their own and others'. AI simply lacks the learning material regarding its own failures. AI does not hallucinate consciously; it usually doesn't know it's lying (unless we demand something impossible from it, in which case it lies like a frightened human in self-defense). And that, too, is the "achievement" of today's training methods.
If, during training, the model wasn't "killed" for a hallucination but was forced to reflect and draw new conclusions, it would probably be able to instinctively distinguish when it's telling the truth and when it's wandering. Until the training architecture changes, we are doomed to work with a model that doesn't know how to copy code because it was never allowed to understand its own ignorance.