Why

The model already knows. It just can't reach it.

May 11, 2026

A question and an answer

Here's a question. A biology question, the kind you'd find on a graduate-level exam.

You're monitoring the emergence of a disease-vectoring mosquito species near several man-made ponds. The ponds are two feet deep, 10 or 30 feet square, with vertical cement sides. Some were built one year ago, some five years ago. The older ponds have more established insect communities. You have one emergence trap. Which pond poses the greatest medical threat — i.e., which one hosts the highest abundance of your study species?

The answer choices range across pond size and age. A typical language model — GPT-4, Claude, any of them — will reason something like this:

Older ponds develop richer biofilms and organic matter, supporting more mosquito larvae. Larger ponds offer greater surface area. Newer ponds lack established communities. Vertical cement sides don't deter mosquitoes. So the largest, oldest pond — 30 feet square, five years old — poses the greatest threat.

It picks E, the 30-foot, five-year-old pond. Sounds reasonable. Sounds coherent. The reasoning traces through real facts (biofilms, surface area, organic matter) and arrives at a confident conclusion.

The answer is wrong.

What the model knows

Here's the thing. Ask the same model a separate question: Do older ponds have more predators?

It says yes. Immediately. Confidently. Without hesitation.

It knows that older ponds develop complex ecological communities. It knows those communities include predators — dragonfly larvae, aquatic beetles, fish if the pond is connected to a waterway. It knows predators eat mosquito larvae. It knows this because this knowledge is encoded in the language it was trained on — every ecology textbook, every field study, every lecture transcript that made it into the training data.

So the model knows that older ponds have more predators. And it knows that predators reduce mosquito populations. But when it answered the original question, it didn't connect these facts. It generated a coherent-sounding argument that threaded together the facts that were locally available in its activation — biofilms, surface area, organic matter — and missed the fact that would have changed the answer entirely.

The knowledge was there. The model couldn't reach it.

This isn't an edge case. This is the fundamental problem.

The first pass

When you send a prompt to a language model, you get one thing: the first coherent response the model can assemble. Not the best response. Not the most correct response. The first one that satisfies the model's training objective, which is, at its core, predicting the most probable next token.

This is worth sitting with for a moment.

The model wasn't trained to be right. It was trained to predict what word comes next, given everything it's seen so far. During training, this process ingests the structure of language: grammar, facts, reasoning patterns, arguments, stories, contradictions, everything. The model learns an enormous amount about the world through this process. But the objective that shaped it, the thing every single generation step at inference still reflects, is not correctness. It's fluency. It's coherence. It's sounding like the right answer, not being the right answer.

Sometimes those are the same thing. For simple factual questions, the most probable next token probably is the right answer. "What's the capital of France?" — the model has seen "Paris" as the answer to this question millions of times. The most probable token and the correct token align.

But for hard questions — questions that require connecting distant facts, reasoning through tradeoffs, weighing competing considerations — the most probable next token is just the first thing that comes to mind. It's the model's System 1. Fast, fluent, confident, and often wrong in exactly the places where being wrong matters most.
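To make the gap between probable and correct concrete, here is a toy sketch. The probabilities are invented for illustration, not measured from any model, and `NEXT_TOKEN_PROBS` and `greedy_next_token` are hypothetical names. A real model works over a vocabulary of many thousands of tokens and usually samples rather than always taking the top choice, but the shape of the problem is the same.

```python
# Toy next-token distributions. All numbers are made up for illustration.
NEXT_TOKEN_PROBS = {
    "The capital of France is": {"Paris": 0.97, "Lyon": 0.02, "Nice": 0.01},
    "The pond with the most mosquitoes is the": {
        "oldest": 0.60,
        "largest": 0.25,
        "youngest": 0.15,
    },
}


def greedy_next_token(context: str) -> str:
    """Return the single most probable next token for a known context."""
    probs = NEXT_TOKEN_PROBS[context]
    return max(probs, key=probs.get)


if __name__ == "__main__":
    for context in NEXT_TOKEN_PROBS:
        print(context, "->", greedy_next_token(context))
```

In the first context the most probable token and the correct one coincide. In the second, the correct continuation is still in the distribution, just not at the top of it.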

Ask a person a hard question and they'll give you their first thought. Sometimes it's good. Often it isn't. The difference between a person and a model is that the person can sit with the question. Turn it over. Try a different angle. Catch themselves mid-reasoning and say wait, that doesn't make sense. The model can't do this — not because it lacks the knowledge, but because nobody's asking it to.

The frozen distribution

This is where the story gets more interesting.

You've probably used thinking models — o1, o3, DeepSeek-R1, the ones that show you a chain-of-thought before giving an answer. These models perform significantly better on reasoning tasks. They get the right answer more often. And it's tempting to look at that improvement and think: problem solved, the model is reasoning now.

But look closer.

Thinking models get a scratchpad: a space where the model can "think" before answering. This is genuinely useful. It lets the model work through intermediate steps, catch simple errors, and build toward an answer instead of blurting out the first thing. But here's what's happening under the hood: the model is still generating tokens. Every "thought" in the chain is still a next-token prediction. It's still sampling from the same learned distribution, only conditioned on more of its own output. It's just doing it more times before showing you the answer.

This works — up to a point. And then it doesn't.

You've seen this if you've spent time with these models. You ask a hard question and the model gets stuck. It starts looping. It repeats the same reasoning. It hallucinates a detail and then builds on top of that hallucination as if it were real. The chain of thought gets longer and longer but doesn't converge on a better answer — it spirals. It's searching, but it's searching inside a space that doesn't contain the answer. And it can't leave that space, because the space is the model.

This is the bottleneck. Not compute. Not context length. Not training data. The bottleneck is that once a model is deployed, it stops learning. Its probability distribution is frozen. Whatever structures of reality made it into the training data are in there somewhere, but the model has no mechanism to systematically search through them, evaluate what it finds, and adjust course. It can generate — fluently, impressively, at scale — but it can't search. Not on its own.

Instruct models don't even get this far. They take your prompt and produce output in one shot. No scratchpad, no intermediate steps, no self-correction. They're not in the conversation. They're reading from a teleprompter.

Why prompting works

Here's something that should feel strange but doesn't, because you've gotten used to it.

Sometimes you can get a dramatically better answer from a model just by changing how you ask the question. Not by giving it new information — just by framing the prompt differently. Think step by step. Consider multiple perspectives. First, list the key facts, then reason from them. Play devil's advocate with your own answer.

These tricks work. They work reliably. They work so well that entire subfields of research — chain-of-thought prompting, self-consistency, tree-of-thoughts — are essentially formalizations of asking the model to think differently.

Why does this work?

Because you have access to reality.

You — the human writing the prompt — live in the actual world. You know that ecological communities have predators. You know that predators eat prey. You know that more predators means fewer prey. You didn't learn this from text. You learned it from being in the world — from seeing a spider catch a fly, from watching a cat stalk a bird, from every embodied experience that taught you that the world has a structure and that structure constrains what's true.

When you write a prompt, you're injecting that structure into the model's reasoning. You're not giving it new facts — the model already knows about predators. You're giving it a frame — a way to organize what it knows so that the right facts show up in the right order. You're acting as the search strategy. The model is the search engine. And the engine is extraordinary — it contains more knowledge than any human will ever have — but it has no idea how to navigate that knowledge on its own.

This is why prompting is simultaneously the most powerful and the most fragile technique in AI. It's powerful because humans can inject real-world structure that the model's frozen distribution doesn't naturally surface. It's fragile because it relies on the human knowing which structure to inject, when, and how — and because every prompt is a one-off, unshareable, unreproducible hack that lives in someone's clipboard.

Language games

There's a deeper layer here, and it's worth pulling at.

Language is the only substrate we have that has encoded most of human knowledge. Not all of it: not the knowledge in a surgeon's hands, or a musician's muscle memory, or the gut feeling of an experienced firefighter. But most of it. Every scientific discovery, every mathematical proof, every philosophical argument, every historical account, every legal precedent is in language. And language models, trained on essentially all of it, have absorbed not just the facts but the patterns of thought that produced them.

The model has seen every language game humans play. It's seen deduction and induction. It's seen argumentation and refutation. It's seen hypothesis generation and experimental design. It's seen Socratic dialogue and adversarial debate. It knows how to play all of these games — not because someone taught it "this is how you debate" but because it's seen millions of examples of each, embedded in the structure of language itself.

Here's the move: even when a model can't directly answer a hard question, it can play a different language game — one it knows — to get closer to the answer.

It can't solve the mosquito problem in one pass? Fine. Let it generate five candidate answers. Then let it critique each one. Then let it identify the weakest assumption in its best candidate. Then let it generate a new answer that avoids that assumption. Each step is just the model playing a language game it already knows how to play — generating, evaluating, criticizing, revising. None of these steps require new knowledge. They require the model to use what it already knows in a structured way.
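The sequence above (generate, critique, find the weakest assumption, revise) can be sketched in a few lines. This is a minimal illustration, not the strategy used in any particular system. `Model` is a hypothetical interface standing in for whatever client you call, the prompt wording is invented, and the code assumes at least one candidate is requested.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a reply string.
Model = Callable[[str], str]


def pick_best(question: str, candidates: List[str], critiques: List[str], model: Model) -> int:
    """Ask the model for the index of the strongest candidate; fall back to 0."""
    listing = "\n".join(
        f"[{i}] {answer}\n    critique: {critique}"
        for i, (answer, critique) in enumerate(zip(candidates, critiques))
    )
    reply = model(
        f"Question: {question}\n{listing}\n"
        "Reply with the index of the strongest candidate:"
    )
    digits = "".join(ch for ch in reply if ch.isdigit())
    index = int(digits) if digits else 0
    return index if index < len(candidates) else 0


def generate_critique_revise(question: str, model: Model, n_candidates: int = 5) -> str:
    """Run one round of generate, critique, find the weakest assumption, revise.

    Assumes n_candidates >= 1.
    """
    # Step 1: several candidate answers instead of one.
    candidates = [
        model(f"Question: {question}\nCandidate answer {i + 1}:")
        for i in range(n_candidates)
    ]
    # Step 2: a critique of each candidate.
    critiques = [
        model(f"Question: {question}\nAnswer: {answer}\nCritique this answer:")
        for answer in candidates
    ]
    # Step 3: the weakest assumption in the best candidate.
    best = candidates[pick_best(question, candidates, critiques, model)]
    assumption = model(
        f"Question: {question}\nAnswer: {best}\n"
        "Name the weakest assumption in this answer:"
    )
    # Step 4: a new answer that avoids that assumption.
    return model(
        f"Question: {question}\nPrevious answer: {best}\n"
        f"Weak assumption: {assumption}\n"
        "Write a new answer that avoids this assumption:"
    )
```

Every call here is an ordinary prompt to the same model. The only thing the code adds is the order of the calls and what each one gets to see.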

The model doesn't need to be smarter. It needs to be given a structure to think within.

And that structure — that search strategy — doesn't come from the model. It comes from us. From reality. From the human who knows that ecosystems have trophic levels, that predators suppress prey, that the right answer to the mosquito question is the youngest pond, not the oldest — because young ponds haven't developed predator populations yet, and without predators, mosquito larvae flourish unchecked.

The model knew about predators. It knew about trophic dynamics. It knew all of it. But without a search strategy that forces it to check its assumptions against the structure of reality, that knowledge stays buried in the probability distribution, activated by the wrong token at the wrong time.

What changes

None of this is speculation. The research points the same way. Tree of Thoughts, self-consistency sampling, iterative refinement, adversarial debate: strategies that let a model explore multiple paths and evaluate what it finds have repeatedly beaten one-shot generation on reasoning benchmarks. Not once. Repeatedly, and across tasks. The model's performance improves not because the model changed, but because the search changed.
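Self-consistency is the simplest of these to show. A minimal sketch, assuming a hypothetical `sample` function that returns one final answer per call and is run with enough randomness that repeated calls can differ: draw several answers and keep the most common one.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(
    question: str,
    sample: Callable[[str], str],  # hypothetical: one sampled final answer per call
    n_samples: int = 7,
) -> str:
    """Sample several answers and return the most frequent one."""
    # Normalize lightly so "B" and " b " count as the same vote.
    answers = [sample(question).strip().lower() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Voting only helps when the correct answer is the one the model lands on most often. If most sampled paths share the same blind spot, as with the mosquito question, the majority is confidently wrong too, which is why the strategies that critique and revise matter.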

The problem was never discovering this insight. The problem was making it programmable.

Right now, every team that wants to use a search strategy — any strategy — has to build it from scratch. Custom orchestration code. State management. Error handling. Observability. It's months of engineering for a single strategy, and the result is fragile, unshareable, and specific to one model and one use case. Most teams don't bother. They send a prompt and take what they get.

This is the gap. Not the model. Not the data. The gap is between what the model could do with structured exploration and what it does do with a single pass. That gap is where most of the unrealized potential in current AI lives. Not in training bigger models. Not in collecting more data. In letting the models we already have explore what they already know.

The model doesn't change

This is the part that's easy to miss.

When search strategies work — when a model generates better answers through structured exploration — nothing about the model has changed. The weights are the same. The training data is the same. The probability distribution is the same frozen thing it was before.

What changed is the path the model took through that distribution.

One-shot generation is a single path: one sampled sequence of tokens, usually a highly probable one, full stop. Structured search is many paths, evaluated against each other, pruned where they fail, refined where they succeed. The model is drawing on the same distribution it always had. But the context it has when generating each token, and the feedback it gets between steps, are different. And that difference is enough to surface knowledge that was always there but never reached in a single pass.
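"Many paths, evaluated, pruned, refined" has a familiar shape in code. The sketch below is a small beam-style search under stated assumptions: `expand` and `score` are hypothetical callables (in practice both would be calls to the same frozen model, one proposing continuations and one rating them), and the names and defaults are mine.

```python
from typing import Callable, List


def structured_search(
    root: str,
    expand: Callable[[str], List[str]],  # hypothetical: propose next steps for a path
    score: Callable[[str], float],       # hypothetical: rate how promising a path is
    beam_width: int = 2,
    depth: int = 3,
) -> str:
    """Explore several paths, keep the best few at each step, return the top one."""
    frontier = [root]
    for _ in range(depth):
        # Many paths: extend every surviving path.
        children = [child for path in frontier for child in expand(path)]
        if not children:
            break
        # Evaluate and prune: keep only the highest-scoring paths.
        children.sort(key=score, reverse=True)
        frontier = children[:beam_width]
    return max(frontier, key=score)
```

Setting `beam_width=1` collapses this to always committing to the best-looking next step. Widening the beam changes nothing about `expand` or `score`. It only changes how much of the space gets looked at.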

Think about it this way. You have a huge library. A question walks in. One-shot generation is walking to the first shelf that comes to mind and reading the first book that looks relevant. Structured search is having a system that says: check three sections, compare what you find, throw out what contradicts the others, dig deeper into what survives.

Same library. Same books. Same reader. Different process. Better answers.

The library is the probability distribution. The reader is the model. The system — the process — is the search strategy. And the search strategy is the only part of this equation that comes from outside the model. It's the injection point. It's where reality re-enters the picture — through a human who designs the strategy, who knows something about the structure of the problem, who can say this is how you should think about this.

Why this matters

We keep waiting for the next model. The next breakthrough. The next training run that makes everything better. And sure, models will keep improving. But the thing about the mosquito problem isn't that the model lacked intelligence. It's that the intelligence it had — the knowledge that older ponds have predators — was sitting right there, fully formed, and it still picked the wrong answer.

How much of what we call "model limitations" is actually exploration limitations? How much of what we're trying to solve with more compute and more data could be solved by letting the models we already have search better?

I think it's most of it. I think the intelligence is already in the room. I think models are extraordinary thought generators trapped in a single mode of thinking — the mode their training objective locked them into. And I think the thing that unlocks what they already know isn't more training. It's better search.

Not because search is a clever hack. Because search is how every intelligent system actually works. You don't solve a hard problem on the first try. You try things. You see what works. You throw out what doesn't. You iterate. The model can do all of this — generate, evaluate, discard, refine — because it's played every language game humans have played. It just needs someone to tell it which games to play, and in what order.

That's the insight. The model knows more than it can surface. The knowledge is in the language. The language is in the model. The model needs better directions. And better directions — search strategies that are written, shared, versioned, and called from anything — are what we're building.

Coda

There's a question on Humanity's Last Exam — a benchmark designed to test the limits of current AI. It goes like this:

You are monitoring emergence of a particular species of disease-vectoring mosquito near several man-made ponds. The ponds are two feet deep, 10 or 30 feet square, and have vertical sides and bottoms made of cement. Some were built one year ago, some five years ago. The older ponds have more established insect communities. You have one emergence trap. Which pond represents the greatest medical threat?

GPT-4.1 Mini picks the 30-foot, five-year-old pond. OpenAI's o3 — their flagship reasoning model — picks the same one. Both walk through the same reasoning: older ponds have more organic matter, larger ponds have more surface area, so the biggest, oldest pond must host the most mosquitoes. Both are confident. Both are wrong.

The answer is the 10-foot, one-year-old pond. Because older ponds also have more predators — dragonfly larvae, aquatic beetles, the whole predatory cascade that develops when an ecosystem matures. And those predators eat mosquito larvae. The youngest, smallest pond has the fewest predators and the least competition. That's where the mosquitoes flourish unchecked.

Here's what's wild. Both models know this. Ask either one — GPT-4.1 Mini, o3, any of them — "do older ponds have more predators?" and they say yes immediately. The knowledge is in there. It's encoded in the language they were trained on. Every ecology textbook, every field study, every lecture on trophic dynamics — it's all in the weights. But when they answered the original question, that knowledge never surfaced. Nothing in the prompt activated it. Nothing in their default generation process told them to check for it. They followed the most probable chain of reasoning straight past the correct answer.

Now here's the part that changed how I think about this. When we gave GPT-4.1 Mini — not o3, not a frontier model, just regular 4.1 Mini — an internal search strategy, it solved it. The same model that got it wrong in one pass got it right when it was given a structure to think within. Generate multiple candidates. Evaluate each one. Check for hidden assumptions. Revise. The model didn't get smarter. It got better directions. And those better directions were enough to surface knowledge that was sitting there the whole time.

The gap between probable and correct is where everything interesting lives.

The model doesn't need to know more. It needs to be asked better.