A common criticism of large language models is that they lack world models. They process sequences and predict continuations, but they don’t build internal representations of the reality those sequences describe. Without a model of the world, the argument goes, they can’t truly understand anything.
Underlying this criticism is an assumption about what human minds do to reason about the world.
The explanation goes like this: there is a scene in the mind, reconstructed from experience, and you move through it. You close your eyes and walk through your house. You rotate an object in your head. You project yourself into a room you haven’t entered yet. It feels like there are two things, a scene, and you navigating the scene.
Some call this a world model. They treat the mind as having both a simulated environment and an observer navigating that environment. The brain reconstructs the external world from sensory data, maintains the reconstruction, and the observer consults it in order to perceive, to plan, to understand.
There is however a structural problem with this story. A physical system can only be in one configuration at any instance in time. One set of activations, one dispositional state. Not two running in parallel. One.
For the mind to be both observer and environment, it would need a way to step outside its own configuration. It would need to occupy a position that is not part of the state it is observing. There is no such position. The configuration of the mind, at any moment, is all there is. There is nowhere else to stand. The impossibility is geometric, no amount of mental gymnastics will overcome this.
This is why dualism has never been able to reconcile agency with the physical world. It isn’t just philosophically unsatisfying. It is structurally impossible. The split it requires — observer here, environment there — demands two configurations from a system that only has one. You have only one mind.
It feels like you can get around this. It feels like you observe your own thoughts all the time. But look closely and you find one state following another, not one state watching itself. A configuration produces output. That output becomes input. A new configuration forms. The sense of self-observation is sequential, there are adjacent frames, not a simultaneous split. The observer and the observed are never present at the same time. There is no inner stage. There is no audience.
What you have instead is a configuration. Accumulated structure, stateful, shaped by everything the system has encountered. Input arrives and passes through that structure. What comes out depends on the paths that prior experience has carved. Activate it one way and it produces a face. Another way, a sentence. Another way, signal to move towards something you are not currently looking at.
Thinking is a process, and a serial one at that. This experience is what people mean when they say they “have” a world model. And it is real. But it is not a reconstruction, and it is not a reference. It is not a scene maintained below the level of experience. It is the shape of the system itself, formed only through prior coupling with the world. A reconstruction is an object referencing another. Configuration is the current state of both.
A mind does not model the world and then consult the model. A mind is one configuration, shaped by the world, and input moves through it. You are not agent plus model, you are agent as model.
When someone says an LLM lacks a world model, what they mean it lacks a method for reconstruction. It has no separate internal model of reality to consult when it’s uncertain. But if reconstruction is structurally impossible for any single-state system, biological or digital, it cannot be the standard we hold machines to. The question was never where the model is located is the mind, or how we build it. It was whether “model” was ever the right word for what any single actor is capable of holding, when all they have is being.