AI Effectiveness
The Sparsity Revolution · 2 / 3 Individual · Agents & Emergence

Is AI Smarter Than We Think, or Just Luckier?

When AI suddenly solves a complex physics problem, is it reasoning or pattern matching? The grokking phenomenon shows the answer is stranger than either: models that have memorized their training data sometimes develop genuine generalization long after they appear to have stopped learning, and the conditions under which this happens are not yet well-characterized.

By Ashwin Pingali · May 16, 2026 · 4 min read

The Feynman Question

Richard Feynman drew a sharp line between knowing the name of something and understanding it. You can memorize that a bird is called a "thrush" in English and a "Drossel" in German and still know absolutely nothing about the bird itself. Real understanding is different from fluent labeling.

This distinction shapes every honest conversation about AI capability. When a large language model solves a complex physics problem, writes elegant code, or produces a nuanced legal analysis, is it understanding the underlying structure, or is it performing the world's most sophisticated pattern matching? Is the model doing what Feynman meant by understanding, or is it simply very good at naming birds in every language?

Grokking is the phenomenon that suggests the honest answer is stranger than either camp in that debate has argued.

The Grokking Phenomenon

In typical machine learning, a model trains on data and gradually improves. Performance curves are smooth and predictable. In grokking, something different happens. The model memorizes the training data early on while its performance on the held-out evaluation set plateaus, and then, long after it has apparently stopped learning anything, it suddenly develops genuine generalization.

The performance curve looks like a flat line followed by a step function. The model goes from reciting answers it has seen during training to understanding the underlying pattern, in what looks structurally like a phase transition. The transition can happen tens of thousands of training steps after the model first achieved its memorization plateau.
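
One way to make the "flat line followed by a step function" concrete is to measure the gap between the step where training accuracy saturates and the step where held-out accuracy catches up. The sketch below is a minimal illustration, not a standard metric: the function names, the 0.99 threshold, and the accuracy curves are all made up for this example and do not come from a real training run.

```python
from typing import Optional, Sequence, Tuple


def first_step_at_or_above(
    steps: Sequence[int], accuracies: Sequence[float], threshold: float
) -> Optional[int]:
    """Return the first logged step whose accuracy reaches the threshold."""
    for step, acc in zip(steps, accuracies):
        if acc >= threshold:
            return step
    return None


def grokking_gap(
    steps: Sequence[int],
    train_acc: Sequence[float],
    val_acc: Sequence[float],
    threshold: float = 0.99,  # assumed cutoff for "solved", chosen for illustration
) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Return (memorized_at, generalized_at, gap_in_steps)."""
    memorized_at = first_step_at_or_above(steps, train_acc, threshold)
    generalized_at = first_step_at_or_above(steps, val_acc, threshold)
    gap = None
    if memorized_at is not None and generalized_at is not None:
        gap = generalized_at - memorized_at
    return memorized_at, generalized_at, gap


# Synthetic curves, invented to mimic the shape described above.
steps = [0, 1000, 2000, 5000, 10000, 20000, 30000, 40000]
train_acc = [0.01, 0.60, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]
val_acc = [0.01, 0.02, 0.03, 0.03, 0.04, 0.05, 0.70, 1.00]

memorized_at, generalized_at, gap = grokking_gap(steps, train_acc, val_acc)
print(f"memorized at step {memorized_at}, generalized at step {generalized_at}, gap {gap}")
```

On these made-up curves the training set is solved at step 2000 and the held-out set at step 40000, so the gap is 38000 steps. A run that never generalizes returns `None` for the last two values, which is itself worth logging.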

The analogy to human learning is striking. Students often memorize formulas before understanding them. Then, sometimes after a period of apparent stagnation, the underlying structure clicks. The formulas stop being arbitrary sequences and start being expressions of a deeper logic. Grokking is the analogous phenomenon in artificial neural networks, and it has been demonstrated cleanly on small algorithmic tasks where the underlying pattern is well-characterized enough that the researchers can verify the model has actually learned the pattern rather than memorized more examples.
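
To see why small algorithmic tasks make verification possible, consider modular addition, one task of this kind. Every possible input can be enumerated, so the held-out set can be made strictly disjoint from the training set, and the true rule is known exactly. The sketch below assumes a modulus of 97 and a 30% training split purely for illustration, and it uses a lookup table and the exact rule as stand-ins for a memorizing and a generalizing model. No neural network is trained here.

```python
import random
from typing import Callable, List, Optional, Tuple

Example = Tuple[Tuple[int, int], int]


def modular_addition_split(
    p: int = 97, train_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Split all (a, b) pairs for (a + b) mod p into disjoint train and held-out sets."""
    pairs = [(a, b) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(pairs)
    n_train = int(train_fraction * len(pairs))

    def label(pair: Tuple[int, int]) -> Example:
        return pair, (pair[0] + pair[1]) % p

    train = [label(pair) for pair in pairs[:n_train]]
    held_out = [label(pair) for pair in pairs[n_train:]]
    return train, held_out


def accuracy(predict: Callable[[Tuple[int, int]], Optional[int]], examples: List[Example]) -> float:
    """Fraction of examples where the prediction equals the label."""
    return sum(1 for x, y in examples if predict(x) == y) / len(examples)


P = 97  # illustrative modulus
train, held_out = modular_addition_split(p=P)
lookup = dict(train)


def memorizer(pair: Tuple[int, int]) -> Optional[int]:
    """Stand-in for pure memorization: answers only inputs seen in training."""
    return lookup.get(pair)


def rule(pair: Tuple[int, int]) -> int:
    """Stand-in for a model that has learned the underlying pattern."""
    return (pair[0] + pair[1]) % P


for name, predict in [("memorizer", memorizer), ("rule", rule)]:
    print(name, accuracy(predict, train), accuracy(predict, held_out))
```

Both stand-ins are perfect on the training pairs. Only the rule is correct on the held-out pairs, because none of those inputs appear in the lookup table. That disjointness is what lets researchers say a model has learned the pattern rather than memorized more examples.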

The mechanistic interpretability work on grokking has shown that during the apparent plateau, the model's internal representations are slowly reorganizing. The early memorization is implemented through one set of internal circuits; the late-developing generalization is implemented through a different set. The phase transition is the moment the generalization circuits become dominant enough to drive the model's behavior on inputs it has not seen in training.

What Grokking Does and Does Not Tell Us

If grokking is real generalization, and the evidence is growing that it is, then the "AI is only pattern matching" dismissal is too simple. Pattern matching does not explain why a model would develop new capabilities long after memorizing its training data. Something more is happening during that extended training period.

The "AI truly understands" claim is also too strong, for different reasons. Grokking is fragile. It happens for some problems and not others. It depends on training dynamics in ways that are not yet well understood. And the model has no way to know whether it is in a memorized state or a grokked state, so it produces outputs with equal confidence in both regimes. From the outside, the user often cannot tell the two regimes apart either.

The mechanism is real. The conditions under which it occurs are not yet well-characterized. Both claims have to live with each other.

What This Changes for Operators

For practitioners, the question "does AI really understand?" is less useful than "under what conditions does AI reason reliably?" Grokking research suggests three operational lessons.

Training duration matters more than the early loss curves suggest. Models can appear converged while still developing deeper capabilities. Stopping training at apparent convergence can leave genuine understanding unrealized. The implication for production model selection is that a model trained longer than another model at the same nominal capacity may have qualitatively different reasoning behavior, and the difference is not visible from the training loss curve alone.
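
The point about stopping at apparent convergence can be illustrated with a generic patience-based early-stopping rule. The sketch below is a hand-rolled rule, not any particular library's implementation. The `patience` and `min_delta` values and the validation curve are invented for illustration, and patience is counted in evaluation checks rather than training steps.

```python
from typing import Optional, Sequence


def early_stop_step(
    steps: Sequence[int],
    val_acc: Sequence[float],
    patience: int,
    min_delta: float = 0.05,
) -> Optional[int]:
    """Return the step where patience-based early stopping would halt, or None."""
    best = float("-inf")
    checks_without_improvement = 0
    for step, acc in zip(steps, val_acc):
        if acc > best + min_delta:
            # Meaningful improvement: record it and reset the counter.
            best = acc
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return step
    return None


# Synthetic held-out accuracy with a long plateau and a late jump (made up).
steps = [0, 1000, 2000, 3000, 4000, 5000, 10000, 20000, 30000, 40000]
val_acc = [0.01, 0.02, 0.03, 0.03, 0.03, 0.03, 0.04, 0.05, 0.70, 1.00]

print("patience=3 stops at:", early_stop_step(steps, val_acc, patience=3))
print("patience=10 stops at:", early_stop_step(steps, val_acc, patience=10))
```

On this made-up curve, a patience of 3 checks halts at step 3000, well before the jump at step 30000, while a patience of 10 never triggers and the run reaches full held-out accuracy. Nothing in the curve before the jump signals which setting is the right one. That is the practical difficulty.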

Evaluation is harder than it looks. A model that performs well on a benchmark can be memorizing the benchmark distribution rather than generalizing the underlying skill. The difference only becomes visible on novel problems outside the training distribution, which is exactly the kind of problem most production deployments care about. The standard practice of evaluating on held-out test sets close to the training distribution can mask the difference between a memorized model and a grokked one.
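
A toy version of this masking effect is sketched below, under deliberately extreme assumptions. The "benchmark" is integer addition over a small operand range that a lookup table has fully covered, and the "novel" set uses operands outside that range. The task, the ranges, and the two stand-in predictors are hypothetical and chosen only to make the gap visible.

```python
from typing import Callable, Dict, List, Optional, Tuple

Example = Tuple[Tuple[int, int], int]


def make_examples(lo: int, hi: int) -> List[Example]:
    """All addition problems with both operands in [lo, hi)."""
    return [((a, b), a + b) for a in range(lo, hi) for b in range(lo, hi)]


def accuracy(predict: Callable[[Tuple[int, int]], Optional[int]], examples: List[Example]) -> float:
    """Fraction of examples where the prediction equals the label."""
    return sum(1 for x, y in examples if predict(x) == y) / len(examples)


benchmark = make_examples(0, 10)   # stands in for a test set close to the training distribution
novel = make_examples(10, 20)      # stands in for out-of-distribution problems

table = dict(benchmark)


def memorizer(pair: Tuple[int, int]) -> Optional[int]:
    """Answers only problems from the covered range."""
    return table.get(pair)


def generalizer(pair: Tuple[int, int]) -> int:
    """Applies the underlying rule to any input."""
    return pair[0] + pair[1]


report: Dict[str, Tuple[float, float]] = {
    name: (accuracy(fn, benchmark), accuracy(fn, novel))
    for name, fn in [("memorizer", memorizer), ("generalizer", generalizer)]
}
for name, (in_dist, out_dist) in report.items():
    print(f"{name}: benchmark={in_dist:.2f} novel={out_dist:.2f}")
```

Both predictors score 1.00 on the benchmark, so the benchmark alone cannot separate them. Only the novel set does, where the memorizer drops to 0.00. Real models sit somewhere between these two extremes, which is why reporting both numbers side by side is more informative than a single benchmark score.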

Training-dynamic luck plays a real role. The line between a model that groks and one that only memorizes can come down to subtle decisions in training: learning rate schedule, data ordering, random initialization. The same architecture trained on the same data can end up in fundamentally different capability regimes depending on these choices. This is uncomfortable from an engineering reproducibility standpoint, and it is what the published research actually shows.

The Honest Answer

The honest answer to the Feynman question, as of today, is that we do not yet know. AI is neither only pattern matching nor genuine understanding in the human sense. It is something the field does not yet have the right vocabulary for: a form of capability that is real but alien, capable but fragile, with internal representations that sometimes converge on genuine generalization through dynamics nobody fully predicted in advance.

That uncertainty is itself important operational information. It says that current evaluation practices are insufficient, that training-time decisions matter more than they appear to, and that the distance between a model that looks competent and a model that is competent on out-of-distribution problems can be larger than benchmark scores suggest. None of those facts make the technology less useful. They make the engineering around it more demanding.
