Two Failure Modes, One Diagnosis
Two failure modes show up in production AI systems often enough that they deserve names. The first is catastrophic forgetting, a model trained on a new task overwriting the knowledge needed for previous tasks, with the loss invisible until somebody asks the model the old question. The second is context collapse, a model with a stated million-token context window degrading sharply when its input grows past a much smaller threshold, and the output becoming confidently wrong in a way that looks like hallucination but is not.
Both failures share the same underlying diagnosis. The system has no architecture for deciding what to keep and what to let go. It either holds onto everything until storage overflows, or it compresses the window into a summary that strips out the load-bearing detail. Neither is what you want from a system that is supposed to reason over a body of evidence.
The biology of forgetting, which I covered in Forgetting Makes You Smarter, evolved a different answer over hundreds of millions of years: dedicated machinery for selective forgetting that works alongside the memory system rather than against it. AI is now converging on the same answer from a different direction, through architectures like Google Research's Titans and MIRAS that build adaptive forgetting into the model itself. This article is about those two failure modes, the architectural shift that addresses them, and what changes for builders working on production AI systems.
Catastrophic Forgetting and Context Collapse
Catastrophic forgetting has been a known limitation of neural networks since the late 1980s, when McCloskey and Cohen first documented it in their work on connectionist networks. When a model is trained on Task B after being trained on Task A, the weights that encoded Task A drift toward configurations that serve Task B, and the model's performance on Task A degrades sharply. The naive responses, bigger models and more training data and larger context windows, do not scale. A model with a two-million-token context that stores everything indiscriminately faces the same problem as a person who cannot distinguish signal from noise. Information overload degrades performance rather than improving it.
Context collapse is the inference-time cousin of catastrophic forgetting, and it is the more operationally important one for anyone shipping AI systems today.

The phenomenon does not appear gradually. It behaves like a phase transition. Performance holds steady as context grows, then at a critical threshold it drops sharply. In one published study, an LLM tasked with rewriting a growing document showed accuracy holding at 66.7% as the document expanded, then dropping to 57.1% at the threshold, which is worse than running the same task with no context at all. The collapsed context was not just incomplete. It was actively misleading, because the model treated its compressed summary as authoritative.
This is the reason a model advertising a 128K or one-million token context window does not necessarily mean it can effectively use that much context. The window is a capacity limit, not a quality guarantee. Effective context, the amount of information the model can actually reason over without degradation, is often a fraction of the advertised window, and the threshold is model-specific and use-case-specific.
The practical consequence is one of the most consistent patterns I have watched in production AI deployments. Teams scale a system by adding more context to the prompt, and find their output quality plateauing or regressing past a threshold they cannot predict in advance. The fix is not more context; it is selective context.
The Architectural Response: Titans and MIRAS
Google Research's Titans architecture, published in early 2025, was the first widely discussed attempt to build adaptive forgetting into a transformer-class model rather than wrap it around one. The architectural shift is fundamental. Traditional transformers compress information into fixed-size representations and lose nuance in the process. Titans uses deep neural networks as memory modules that actively learn and update while processing input. The capability researchers call test-time memorization is the ability to maintain and refine long-term knowledge without offline retraining.
Three mechanisms make this work, and they map directly onto principles that biology has been demonstrating for hundreds of millions of years.
The first is surprise metrics. The system detects significant differences between what it currently knows and what it encounters. Unexpected information receives higher priority for storage, while routine data is safely skipped. This mirrors how the brain retains anomalous events better than routine ones, the same reason a fire alarm is remembered and a familiar commute is not.
The second is momentum. Individual data points do not always signal their importance immediately. Momentum lets the model consider recent patterns alongside immediate surprises, capturing contextually relevant information even when each individual element looks unremarkable in isolation. The biological analog is contextual binding. The brain uses surrounding state to determine what matters now, even when the immediate cue is subtle.
The third is adaptive forgetting. Weight decay mechanisms continuously deweight outdated information during inference, managing finite memory capacity even when sequences exceed two million tokens. Rather than treating all stored information as equally valuable, the system continuously deweights older, less relevant material to free capacity for new learning. The biology I covered in Forgetting Makes You Smarter, the Rac1/cofilin pathway that actively degrades synapses, the synaptic pruning via long-term depression, the working memory decay that lets unreinforced traces fade, is all serving the same function. Continuously remove information that no longer serves current objectives. Free capacity for what does.
Titans is one architecture. The category is going to grow. MIRAS, also from Google Research, applies adjacent principles to retrieval-augmented systems by treating the retrieval index as a memory module with its own forgetting policy rather than as a static document store. The pattern underneath all of this is that selective forgetting at the right cadence is treated as a performance feature, not as a limitation to engineer around.
Why the Biology and the Engineering Are Converging
The convergence between active forgetting in biology and adaptive forgetting in AI is not coincidence. Both systems are solving the same finite-system-meets-infinite-stream problem, and both converged on the same shape of solution: a separate machinery for selectively reducing access to information that no longer serves the current objective. The biology has been evolving for several hundred million years. The engineering is roughly five years into being taken seriously as a first-class design concern rather than as a workaround for memory limits.
What the convergence suggests is that this is not a problem you can solve by adding capacity. You can buy a larger context window. You cannot buy a better selectivity policy. The selectivity is what determines whether the system continues to work as the input grows, and the substrate matters less than the existence of the policy. A brain solves it through molecular machinery in dopamine neurons. An AI solves it through gradient-based weight updates during inference. The interface is different. The function is the same.
What Changes for Builders
For anyone shipping AI systems in production, three things shift once the architectural picture comes into focus.
The first is context budget tuning. The instinct on most teams is to fill the context window, here is the model spec and the full document and the conversation history and the relevant code. That instinct produces worse output past the effective-context threshold, and the worse output is not loud. It looks like a confident answer that happens to be subtly wrong. The discipline that pays off is treating the context budget as a resource to optimize rather than a container to fill, and measuring where the per-use-case threshold actually sits. The threshold cannot be inferred from the model spec. It has to be empirically located by running the same task at increasing input sizes and watching for the phase transition.
The second is model and architecture selection. For applications that demand long-horizon reasoning over evolving input, coding agents, document research, multi-turn customer support, the architectural choice matters as much as the model size. A model built on a standard transformer at one-million token context will degrade differently from a model with built-in adaptive forgetting at the same nominal capacity. The right comparison is not parameter count or context size. It is effective context under sustained load, and the only way to find it is by measuring it on your workload.
The third is the RAG pipeline architecture itself. RAG systems traditionally focus on retrieval, finding the right documents for a given query. The architectural shift suggests they should also be focusing on selective retention, deciding which retrieved documents to keep in context and which to deprioritize as the conversation evolves. The retrieval problem and the forgetting problem are two faces of the same coin. Both are decisions about what should be in the model's attention at the moment of reasoning. A retrieval pipeline that always pulls the top-k documents without a forgetting policy will eventually run into the same context-collapse pattern that single-prompt deployments do.
None of this is glamorous engineering. It is the kind of work that compounds. Better selectivity at the architecture layer makes the model layer cheaper, the prompting layer simpler, and the output more reliable. The teams I have watched ship reliable production AI are not the teams that use the largest models. They are the teams that have figured out their effective context and design around it.
What I Do Not Yet Know
The piece of this picture I am least confident about is whether the architectural shift generalizes past Titans-style memory architectures or whether the gains are largely specific to that family. Several research groups are now publishing variants, but the empirical comparisons are early, and the production deployments I am aware of are small enough that the operational signal is noisy.
My current hypothesis is that the principle generalizes. Selective forgetting at the right cadence is the load-bearing piece, and the specific implementation of how to compute the right cadence matters less than the fact that the system has one at all. But I have not seen this tested at production scale across multiple architecture families, and the few comparative benchmarks I have read measure things adjacent to the property that actually matters.
If you have run production AI systems on a Titans-class architecture against a standard-transformer baseline at the same nominal capacity, and you have measurements of effective context under sustained load, I would genuinely like to compare notes. The operational evidence on this is going to develop faster than the academic benchmarks, and the working knowledge of a few engineering teams is going to be ahead of the published literature for the next year or two on a question that affects every production AI deployment.