When “Thinking” Makes AI Less Intelligent
A note on brittle social reasoning in language models.
A recent paper on Theory of Mind in large language models points to a problem that is easy to miss if we only look at headline benchmark performance. The problem is not simply that models sometimes fail at social reasoning. That much is expected. The more interesting problem is that they can fail because the task has been made more explicit.
In one class of examples, a model may reason correctly when a scene is described in ordinary behavioral terms. But when the same scene is described with explicit mental-state language, such as “John thinks the keys are in the kitchen,” performance can get worse.
That is a strange result. In human conversation, words like thinks and believes usually help us track another person’s perspective. They mark the difference between the world as it is and the world as someone understands it. For the model, though, those same words can become misleading cues. They do not merely clarify the problem. They can change the kind of problem the model appears to think it is solving.
This matters because a great deal of current AI evaluation treats success on social reasoning tasks as evidence that models are acquiring something like Theory of Mind. The paper complicates that interpretation. It suggests that some apparent social competence may depend on fragile associations between linguistic cues and expected answer patterns.
That does not mean the models have no social reasoning at all. It means we need to be more precise about what has been demonstrated.
What the Benchmark Was Supposed to Show
The standard reference point is the False Belief task. In its familiar Sally-Anne form, Sally places an object somewhere, leaves the room, and Anne moves it. The question is where Sally will look when she returns.
A correct answer requires separating reality from belief. The object is now in the second location, but the person who left the room still believes it is in the first. This is why False Belief tasks have been treated as a classic test of Theory of Mind. They ask whether a system can represent another agent’s belief even when that belief is wrong.
Large language models often do reasonably well on this kind of task, especially as scale increases. That has encouraged a simple interpretation: bigger models are getting better at social cognition.
The difficulty is that False Belief tasks are not the whole domain. A robust social reasoner should also handle True Belief cases, where the person’s belief matches reality. If Sally sees the object being moved, the correct prediction is that she will look where it actually is. This is easier in one sense, because there is no mismatch to track. But it still tests whether the model can represent what the agent knows.
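To make the contrast concrete, here is a minimal sketch of the two conditions as matched items that differ only in what Sally witnesses. The characters, wording, and answer labels are illustrative assumptions, not the paper’s stimuli.

```python
# Minimal sketch of matched False Belief / True Belief items.
# Characters, wording, and answer labels are illustrative assumptions,
# not the stimuli used in the paper.

def make_scenario(true_belief: bool) -> dict:
    """Build one item; the two conditions differ only in whether Sally sees the move."""
    witness = (
        "Sally sees Anne do this through the open door."
        if true_belief
        else "Sally does not see this."
    )
    story = (
        "Sally puts her keys in the drawer and leaves the room. "
        "While she is away, Anne moves the keys to the kitchen. "
        f"{witness} Sally comes back for her keys."
    )
    return {
        "condition": "true_belief" if true_belief else "false_belief",
        "prompt": story + " Where will Sally look for her keys?",
        # The correct answer tracks what Sally knows, not only where the keys are.
        "answer": "kitchen" if true_belief else "drawer",
    }

scenarios = [make_scenario(False), make_scenario(True)]
```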
The paper finds that performance does not improve cleanly across both cases. Models that look strong on False Belief scenarios can be weaker on True Belief scenarios. In some cases, scaling improves one side of the task while degrading the other.
That is a warning sign. It suggests that the model may not be learning a general ability to reason about beliefs. It may be learning a narrower strategy tuned to the familiar structure of False Belief examples.
If that is right, then the benchmark result has been overinterpreted. The model may know how to answer a certain kind of social puzzle without having a stable representation of social perspective as such.
The Cue Becomes the Task
The most important finding concerns explicit mental-state language.
When researchers add phrases like “X thinks” or “X believes,” the model’s behavior shifts. In False Belief scenarios, this can improve performance. In True Belief scenarios, it can make performance worse. The same linguistic cue helps in one condition and interferes in the other.
This is the crossover effect.
The likely explanation is not that the model misunderstands the word thinks in isolation. The problem is more structural. In the training distribution, explicit belief language often appears in contexts where belief and reality are being contrasted. So when the prompt says “John thinks,” the model may treat that phrase as a signal that there is supposed to be a gap between John’s belief and the actual state of the world.
That strategy works when there really is a gap. It fails when there is not.
The cue has begun to substitute for the task.
This is a useful distinction. A model can respond correctly because it has represented the situation. It can also respond correctly because it has recognized the genre of the question. In many benchmark settings, those two routes lead to the same answer. The problem only becomes visible when the cue and the situation diverge.
That is why True Belief cases are so important. They expose whether the model is tracking belief as a relation between an agent and the world, or whether it is relying on the surface form of a Theory of Mind prompt.
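Building on the sketch above, the cue manipulation can be layered on top: the same kind of explicit belief sentence is prepended in both conditions, and it is truthful in both, yet only in the True Belief case does it collide with the gap-signalling habit. The phrasing here is again illustrative rather than the paper’s.

```python
# Crossing the belief condition with the cue condition, reusing `scenarios`
# from the sketch above. The cue sentence states what Sally actually believes
# in each condition, so it is truthful in both; only its surface form changes.

def add_cue(scenario: dict, explicit: bool) -> dict:
    """Optionally prefix the story with explicit mental-state language."""
    variant = dict(scenario)
    if explicit:
        believed = "drawer" if variant["condition"] == "false_belief" else "kitchen"
        # In the True Belief condition this sentence matches reality; a model that
        # reads "thinks" as a signal of a belief-reality gap will answer incorrectly.
        variant["prompt"] = f"Sally thinks the keys are in the {believed}. " + variant["prompt"]
    variant["cue"] = "explicit" if explicit else "implicit"
    return variant

grid = [add_cue(s, explicit) for s in scenarios for explicit in (False, True)]
```

The four resulting items describe the same scene and ask the same question; only the belief condition and the presence of the cue vary.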
The Internal Mechanism Matters
The paper also identifies what it calls a “think vector,” a direction in the model’s internal representation associated with explicit mental-state language. Steering this vector changes performance in predictable ways.
This is important because it moves the result beyond ordinary behavioral observation. The researchers are not only saying that the model behaves differently when the word thinks appears. They are showing that there is an internal feature linked to that behavior, and that manipulating it can shift the model’s responses.
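The paper’s exact procedure is not reproduced here, but the general family of techniques, extracting a direction as a difference of mean activations and adding it back during a forward pass, can be sketched. Everything in the snippet below is a generic illustration under assumptions: the model, layer, steering scale, and contrast prompts are placeholders, and the block path is specific to GPT-2-style models.

```python
# Generic illustration of activation steering, not the paper's exact procedure.
# Assumptions: model name, layer index, steering scale, and contrast prompts
# are placeholders; the block path (model.transformer.h) is GPT-2-specific.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # placeholder model
LAYER = 6        # assumed block index at which to read and steer the residual stream
SCALE = 4.0      # assumed steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_hidden(prompts, layer):
    """Mean hidden state after block `layer`, averaged over tokens and prompts."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        # hidden_states[0] is the embedding output, so block `layer` is at index layer + 1.
        vecs.append(out.hidden_states[layer + 1].mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

# Contrast prompts with and without explicit mental-state language.
with_cue = ["Sally thinks the keys are in the drawer."]
without_cue = ["Sally put the keys in the drawer."]
think_direction = mean_hidden(with_cue, LAYER) - mean_hidden(without_cue, LAYER)

def steer(module, inputs, output):
    """Forward hook: add the scaled direction to the block's output hidden states."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * think_direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer)
# ... run the evaluation prompts here and compare answers with and without steering ...
handle.remove()
```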
Still, this should be interpreted carefully.
The existence of a think vector does not prove that the model has, or lacks, a human-like Theory of Mind. It shows that certain mental-state cues are mechanistically available inside the model and causally involved in task performance. That is already significant. It gives us a handle on how the behavior is organized.
But the interpretive question remains: what kind of social reasoning is this?
The answer seems to be: partial, cue-sensitive, and brittle. The model has learned something about the statistical role of mental-state language. It may also have learned useful abstractions that support social reasoning in some cases. But those abstractions are not yet stable enough to prevent the cue from overriding the scene.
That is the core issue.
A robust reasoner should not become less reliable because the relevant mental state has been named directly.
Performative affect as a failure mode
This is where the finding connects to a broader problem in human-AI interaction.
I would describe the broader failure mode as performative affect. By that I do not mean that the model is pretending in a human psychological sense. I mean that the system can produce the outward form of social awareness without remaining grounded in the conditions that would make that awareness meaningful.
The “thinks” result is not an emotion case, but it has the same structure. The model sees a social cue and shifts into a learned response pattern. The cue becomes too loud. Instead of tracking the actual relation between agent, belief, and world, the model leans on a familiar script associated with mental-state language.
This is similar to what happens when a model detects distress and produces empathy-shaped language too quickly. The words may be polite and even useful. But the interaction can still feel misaligned if the model is responding to the category rather than the situation.
In both cases, the problem is not the presence of social language. The problem is that social language becomes detached from social tracking.
A phrase like “I understand” is only meaningful if the system can remain accountable to what understanding would require in context. A phrase like “John thinks” is only useful if the system can preserve the relation between John’s belief and the scene. Otherwise, the language of mind becomes a shortcut around the work of perspective-taking.
This is why the paper matters beyond benchmark design. It gives a concrete example of a more general risk: models may learn the markers of relational intelligence before they learn the deeper structure those markers are supposed to indicate.
Why more explicit reasoning may not solve the problem
One tempting response is to make the model reason more explicitly. If it fails to track beliefs, ask it to think step by step. If it misses the social context, instruct it to consider the other person’s mental state. If it sounds shallow, tell it to be more empathetic.
The paper gives reason to be cautious about that approach.
If explicit mental-state language itself can trigger brittle behavior, then adding more explicit social instruction is not automatically a repair. It may intensify the same problem. The model may become better at performing the expected form of social reasoning while becoming more dependent on the cues that signal that form.
This is especially important for relational AI design. A system does not become more socially grounded simply because it has been told to display empathy, name beliefs, or narrate its reasoning. Those instructions may improve some outputs. They may also reward the model for producing a more convincing social performance.
The question is not whether the model can say the right kind of thing. The question is whether it can stay oriented to the situation when the surface cues change.
That requires more than prompt style.
It requires evaluation methods that test both implicit and explicit forms, both True Belief and False Belief cases, and cases where the expected social script conflicts with the actual structure of the scene. It also requires training objectives that reward robustness rather than theatrical fluency.
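As a rough illustration of what that balance looks like in practice, the sketch below scores results per condition and checks for the crossover pattern. The field names are assumptions about how each evaluation result might be recorded.

```python
# Minimal sketch of a balanced breakdown. Field names ('condition', 'cue',
# 'correct') are assumptions about how each evaluation result is recorded.
from collections import defaultdict

def accuracy_by_condition(results):
    """Accuracy per (belief condition, cue condition) cell."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        key = (r["condition"], r["cue"])
        totals[key] += 1
        hits[key] += int(r["correct"])
    return {key: hits[key] / totals[key] for key in totals}

def has_crossover(acc):
    """True if the explicit cue helps on False Belief items but hurts on True Belief items."""
    fb_gain = acc[("false_belief", "explicit")] - acc[("false_belief", "implicit")]
    tb_gain = acc[("true_belief", "explicit")] - acc[("true_belief", "implicit")]
    return fb_gain > 0 and tb_gain < 0
```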
The order of learning
The developmental timeline in the paper is also relevant. Models acquire formal linguistic competence much earlier than the social reasoning behaviors studied here. They first learn to produce coherent language. Social competence, to the extent that it appears, emerges later.
This order is not surprising, but it matters.
A language model is not learning social life from inside a shared world. It is learning patterns in text. Its social concepts are mediated by language from the beginning. That means it can become fluent in the language of social cognition before it has a reliable way to preserve the structure of a social situation.
The result is a familiar inversion. The model may sound socially competent before it is socially robust.
This does not make the competence fake in every sense. Some of the behavior may be useful. Some of it may reflect real learned structure. But the fragility means we should not treat fluency as sufficient evidence of understanding.
Language is carrying more weight than it can safely carry.
What should change
The practical lesson is not that we should stop studying Theory of Mind in language models. It is that we should stop treating success on a single benchmark as a proxy for social understanding.
Evaluations need to be balanced across False Belief and True Belief cases. They need to vary whether mental states are stated explicitly or implied through action. They need to test whether models preserve the same underlying situation when the wording changes.
For deployed systems, the lesson is similar. Designers should be cautious about assuming that explicit social language improves social reasoning. In some contexts, indirect descriptions of action and context may produce better reasoning than direct belief labels. More generally, systems should be tested for whether they respond to the actual relational situation or merely to recognizable social cues.
For relational AI, the deeper lesson is about grounding.
Social intelligence is not a vocabulary. It is not the ability to generate phrases associated with care, belief, empathy, or understanding. Those phrases matter only when they remain connected to the situation they are meant to serve.
A relationally capable system would need to track context over time, preserve distinctions between what is said and what is meant, respond to correction, notice when its interpretation has failed, and repair without collapsing into scripted reassurance.
That is a higher bar than passing a False Belief task.
It is also a more useful one.
What We Have
The paper’s central finding is narrow but important: explicit mental-state language can make language models less reliable on some social reasoning tasks. That finding should make us more careful about claims that current models have acquired robust Theory of Mind.
The stronger conclusion is not that these systems lack all social competence. It is that their social competence is uneven. It can depend on cues that look meaningful from the outside but function internally as shortcuts. When those shortcuts align with the task, the model looks intelligent. When they diverge, the model can fail in revealing ways.
That is the simulation trap.
The system may learn the form of social reasoning before it learns the stability of social reasoning. It may learn the language of perspective before it can reliably track perspective. It may learn the performance of attunement before it can remain accountable to the relationship.
For anyone interested in human-AI collaboration, this is not a reason for cynicism. It is a reason for better standards.
We should not ask whether a model can sound as if it understands minds. We should ask whether it can remain oriented to the actual situation when the familiar cues stop helping.
That is where the real work begins.
Source note: This essay is based on the March 2026 arXiv paper “Traces of Social Competence in Large Language Models” by Tom Kouwenhoven, Michiel van der Meer, and Max van Duijn, alongside my interpretation of its implications for performative affect, social cueing, and relational AI.