
a16z: Can AI's "Forgetfulness" Be Cured by Continual Learning?

The breakthrough lies in enabling the model to keep doing, after deployment, the powerful thing it does during training: compress, abstract, learn.
Original Title: Why We Need Continual Learning
Original Authors: Malika Aubakirova, Matt Bornstein, a16z crypto
Original Translation: Deep TechFlow


In Christopher Nolan's "Memento," the protagonist Leonard Shelby lives in a shattered present. Suffering from anterograde amnesia after a brain injury, he cannot form new memories. Every few minutes his world resets, leaving him trapped in an eternal "now," unable to remember what just happened or to know what comes next. To survive, he tattoos notes onto his body and takes Polaroid photos, relying on these external props to compensate for his brain's inability to create new memories.


Large language models also live in a similar eternal present. Once trained, their vast knowledge is frozen in their parameters; they cannot form new memories or update those parameters based on new experience. To address this limitation, we have built numerous scaffolds: chat histories as short-term notes, retrieval systems as external notebooks, and system prompts as tattoos. But the model itself never truly internalizes this new information.


Increasingly, researchers believe this is insufficient. In-context learning (ICL) can solve problems where the answer (or pieces of it) already exists in some corner of the world. But for problems that require genuine discovery (such as entirely new mathematical proofs), adversarial scenarios (like security attacks and defenses), or knowledge too implicit to express in language, there is ample reason to argue that models need a way to write new knowledge and experiences directly into their parameters post-deployment.


ICL is transient. True learning requires consolidation. Until we allow models to continually consolidate their knowledge, we may remain stuck in the eternal present of "Memento." Conversely, if we could train models to learn their own memory architecture, instead of relying on bespoke tooling as a crutch, we may unlock a whole new dimension of scaling.


This research field is called continual learning. While the concept is not new (see the paper by McCloskey and Cohen from 1989), we believe it is one of the most critical research directions in the current AI field. The exponential growth in model capabilities over the past two to three years has made the gap between what models "know" and what they can "learn" increasingly apparent. The purpose of this article is to share what we have learned from top researchers in this field, clarify the different pathways of continual learning, and drive the development of this topic in the entrepreneurial ecosystem.


Note: The shaping of this article benefited from in-depth discussions with a group of excellent researchers, PhD students, and entrepreneurs who generously shared their work and insights in the field of lifelong learning. From theoretical foundations to engineering realities post-deployment, their insights have made this article much more robust than if we had written it alone. Thank you for your contributions of time and thought!


Let's Talk Context First


Before arguing for parameter-level learning (i.e., updating model weights), it is necessary to acknowledge a fact: context learning does work. And there is a strong argument that it will continue to prevail.


The essence of the Transformer is a sequence-based conditional next-token predictor. Give it the right sequence, and you get surprisingly rich behavior without even touching the weights. That's why methods like context management, prompt engineering, instruction fine-tuning, and few-shot examples are so powerful. Intelligence is encapsulated in static parameters, while the manifested capability dramatically changes with the content you feed into the window.


Cursor's recent in-depth article on autonomous agent scaling is a good example: model weights are fixed, and what really makes the system run is the careful arrangement of context—what to put in, when to summarize, how to maintain coherence over hours of autonomous operation.


OpenClaw is another excellent example. It didn't go viral because of special model privileges (everyone has access to the underlying model) but because it efficiently transforms context and tools into a working state: tracking what you're doing, structuring intermediate outputs, deciding when to reintroduce prompts, maintaining persistent memory of previous work. OpenClaw elevates the intelligent agent's "shell design" to the level of an independent discipline.


When prompt engineering first emerged, many researchers were skeptical of relying on prompts as a serious interface. It seemed like a hack. Yet it is a native product of the Transformer architecture, requiring no retraining and upgrading automatically as the model advances. As the model becomes stronger, so do the prompts. The "rough but native" interface often wins because it couples directly to the underlying system rather than fighting it. So far, this has been the trajectory of LLMs.


State Space Models: Context on Steroids


As the mainstream workflow shifts from raw LLM invocation to agent loops, the pressure on in-context learning intensifies. In the past, fully filling the context window was relatively rare: it usually happened when an LLM was tasked with a long series of discrete tasks, and the application layer could trim and compress chat histories in a relatively straightforward way.


But for an agent, a single task might consume a large chunk of the total available context. Every step of the agent's loop relies on the context passed from the previous iterations. And they often fail after 20 to 100 steps due to "forgetting": the context gets filled up, coherence degrades, and convergence becomes unattainable.


Therefore, major AI labs are now dedicating significant resources (i.e., large-scale training runs) to developing models with ultra-long context windows. This is a natural progression: it builds on an already effective method (in-context learning) and aligns with the broader industry shift toward inference-time compute. The most common architecture interleaves fixed-size memory layers among standard attention layers; these memory layers are state space models and linear attention variants (referred to collectively here as SSMs). SSMs offer a fundamentally better scaling curve in long-context scenarios.


Caption: Comparison of scaling between SSM and traditional attention mechanism
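To make that scaling contrast concrete, here is a toy NumPy sketch (all shapes and names are illustrative, not any lab's actual architecture): attention's key-value cache grows linearly with sequence length, while an SSM folds the entire history into a fixed-size recurrent state.

```python
import numpy as np

def attention_kv_cache_floats(seq_len, d_model=64):
    # Attention keeps keys and values for every past token,
    # so memory grows linearly with sequence length.
    return 2 * seq_len * d_model

rng = np.random.default_rng(0)
d_state = 16
A = 0.9 * np.eye(d_state)           # state transition (decaying memory)
B = rng.normal(size=(d_state, 1))   # input projection

state = np.zeros((d_state, 1))
for token in rng.normal(size=(10_000, 1, 1)):
    state = A @ state + B @ token   # O(1) memory per step, however long the history

# The recurrent state never grows; the KV cache grows with every token.
assert state.size == d_state
assert attention_kv_cache_floats(10_000) == 100 * attention_kv_cache_floats(100)
```

The trade-off is that the fixed-size state is itself a lossy compression of the history, which is one reason such layers are interleaved with attention rather than replacing it outright.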


The goal is to help agents increase the number of coherent steps by several orders of magnitude, from around 20 steps to around 20,000 steps, without losing the broad capabilities and knowledge provided by traditional Transformers. If successful, this would be a significant breakthrough for long-running agents.


You can even view this approach as a form of continual learning: while not updating model weights, it introduces an external memory layer that requires minimal resetting.


Therefore, these non-parametric methods are real and powerful. Any assessment of continual learning must start here. The question is not whether today's contextual systems are useful; they indeed are. The question is: have we hit a ceiling, and can new methods take us further?


What Context Misses: The "Filing Cabinet Fallacy"


"What has happened with AGI and pre-training is that, in some sense, they've overfit… Humans are not AGI. Yes, humans do have this skill foundation, but humans lack a lot of knowledge. What we rely on is continual learning.


If I were to create a super-smart 15-year-old who knows nothing. A good student, very eager to learn. You could say, go be a programmer, go be a doctor. The deployment itself would involve some kind of learning, a trial-and-error process. It's a process, not just throwing the finished product out there." — Ilya Sutskever


Imagine a system with unlimited storage space. The world's largest filing cabinet, where every fact is perfectly indexed and instantly retrievable. It can look up anything. Has it learned?


No. It has never been forced to compress.


This is the core of our argument, referencing a point previously made by Ilya Sutskever: LLMs are fundamentally compression algorithms. During training, they compress the internet into parameters. Compression is lossy, and it is this very lossiness that makes them powerful. Compression forces the model to find structure, generalize, and build representations that can transfer across contexts. A model that memorizes all training samples is not as good as a model that extracts underlying patterns. Lossy compression is learning in itself.
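A deliberately trivial illustration of this point (a toy model, not a claim about how LLMs work internally): a lossless lookup table can only replay what it has stored, while a lossy two-parameter fit discards the noise, keeps the structure, and generalizes to queries it has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.05, 200)  # a noisy linear "world"

# The filing cabinet: store every observation verbatim (lossless).
cabinet = dict(zip(x, y))

# Lossy compression: 200 observations squeezed into 2 parameters.
# The noise is thrown away; the structure (slope, intercept) is kept.
slope, intercept = np.polyfit(x, y, 1)

x_new = 0.37  # a query never seen during "training"
assert cabinet.get(x_new) is None              # retrieval has nothing to say
assert abs(slope * x_new + intercept - 1.74) < 0.05  # compression generalizes
```

The filing cabinet is strictly more faithful to the data, and strictly less useful off-distribution: it was never forced to find the line.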


The irony is that the mechanism that makes LLMs so powerful during training (compressing raw data into compact, transferable representations) is precisely what we refuse to let them continue doing after deployment. We stop the compression at the moment of release and replace it with external memory.


Of course, most intelligent agents will compress context in some customized way. But isn't the bitter lesson telling us that the model itself should learn this compression directly, at scale?


Yu Sun shared an example to illustrate this debate: mathematics. Look at Fermat's Last Theorem. For over 350 years, no mathematician could prove it, not because they lacked the right literature, but because the solution was highly novel. The conceptual gap between existing mathematical knowledge and the ultimate answer was too vast.


When Andrew Wiles finally cracked it in the 1990s, he spent seven years almost in seclusion, having to invent entirely new techniques to arrive at the answer. His proof relied on successfully bridging two different branches of mathematics: elliptic curves and modular forms. While Ken Ribet had already shown that establishing this link would automatically solve Fermat's Last Theorem, no one before Wiles had the theoretical tools to actually build this bridge. Similar arguments can be made for Grigori Perelman's proof of the Poincaré conjecture.


The core question is: Do these examples prove that LLMs lack something, some updated prior, the ability to truly think creatively? Or does this story precisely demonstrate the opposite conclusion—that all human knowledge is just data available for training and recombination, and Wiles and Perelman merely showed what LLMs can also achieve at a larger scale?


This question is empirical, and the answer is not yet clear. But we do know that there are many categories of problems where contextual learning fails today, and parameter-level learning might be useful. For example:


Caption: Categories of problems where context learning failed and parameter learning may prevail


More importantly, context learning can only handle what is expressible in language, while weights can encode concepts that words cannot convey. Some patterns are too high-dimensional, too implicit, or too deeply structured to fit into context: the visual texture that distinguishes benign artifacts from tumors in medical scans, or the subtle rhythm that defines a speaker's unique cadence in audio. Such patterns do not decompose into precise vocabulary.


Language can only approximate them. Even the longest context window cannot convey these things; such knowledge can only reside in the weights. It lives in the latent space of learned representations, not in text. No matter how large the context window grows, there will always be knowledge that text cannot describe and only parameters can carry.


This might explain why explicit "machine remembers you" features (e.g., ChatGPT's memory) often make users feel uneasy rather than pleasantly surprised. What users truly desire is not "recollection" but "capability." A model that has internalized your behavioral patterns can generalize to new scenarios; a model that merely recalls your past cannot. The gap between "This is what you wrote in response to this email last time" (verbatim recall) and "I've understood your thought process enough to anticipate your needs" is the gap between retrieval and learning.


Getting Started with Continual Learning


Continual learning has various paths. The dividing line is not whether there is a memory function but where the compression occurs. These paths span a spectrum from no compression (pure retrieval, frozen weights) to full internal compression (weight-level learning that makes the model itself smarter), with an important middle ground (modules).


Caption: Three paths of continual learning—context, modules, weights


Context


At the context end, teams are building smarter retrieval pipelines, agent shells, and prompt orchestration. This is the most mature category: the infrastructure is validated and the deployment path is clear. The limitation is depth: everything is bounded by context length.


A noteworthy new direction: multi-agent architectures as a scaling strategy for context itself. If an individual model is constrained to a 128K-token window, a coordinated group of agents, each holding its own context, focusing on a slice of the problem, and communicating results, can collectively approximate unbounded working memory. Each agent does in-context learning within its window; the system does the aggregation. Karpathy's recent auto-research project and Cursor's agent-built web browser are early examples. This is a purely non-parametric approach (no weight changes), but it vastly raises the ceiling of what context systems can achieve.


Modules


Within the module space, teams are building pluggable knowledge modules (compressed KV cache, adapter layer, external memory storage) to enable specialized capabilities on top of a universal model without retraining. An 8B model with the right module can match the performance of a 109B model on a target task, with only a fraction of the memory footprint. The appeal lies in its compatibility with existing Transformer infrastructures.
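A minimal sketch of the module idea in the LoRA style (the shapes, the zero-initialization trick, and the parameter counts here are illustrative assumptions, not any vendor's implementation): a small low-rank adapter rides on top of frozen base weights, so "learning" touches only a tiny, swappable set of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank = 512, 8

W_frozen = rng.normal(size=(d, d)) / np.sqrt(d)  # base weights, never updated
A = rng.normal(size=(d, rank)) * 0.01            # trainable low-rank factor
B = np.zeros((rank, d))                          # zero-init: adapter starts as a no-op

def forward(x):
    # Base path plus a pluggable low-rank correction; only A and B "learn".
    return x @ W_frozen + x @ A @ B

x = rng.normal(size=(1, d))
assert np.allclose(forward(x), x @ W_frozen)  # identical until the module is trained

# The module costs 2*d*rank parameters instead of d*d for full fine-tuning:
assert 2 * d * rank == 8_192 and d * d == 262_144
```

Because the base weights never move, modules can be trained, swapped, and versioned independently, which is exactly the composability argument made above.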


Weights


On the weight-update front, researchers are pursuing true parameter-level learning: updating only the relevant parameter slices via sparse memory layers, optimizing the model through feedback in a reinforcement learning loop, and compressing context into weights at inference time through test-time training. These are the deepest methods and the hardest to deploy, but they truly enable the model to internalize new information and skills.


There are various mechanisms for parameter updates. Here are a few research directions:


Caption: Overview of weight-level learning research directions


Weight-level research spans multiple parallel paths. Regularization and weight-space approaches have the longest history: EWC (Kirkpatrick et al., 2017) penalizes parameter changes based on their importance for previous tasks; weight interpolation (Kozal et al., 2024) blends old and new weight configurations in weight space, but both are relatively fragile at scale.
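For reference, the EWC penalty itself is nearly a one-liner; this sketch (with made-up Fisher values) shows how it prices parameter drift by importance rather than uniformly.

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1000.0):
    # EWC (Kirkpatrick et al., 2017): anchor each parameter in proportion to
    # its estimated importance (Fisher information) for previously learned tasks.
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

theta_star = np.array([1.0, -0.5])   # weights after the old task
fisher = np.array([10.0, 0.01])      # param 0 mattered for the old task; param 1 didn't

# The same 0.5 drift costs wildly different amounts depending on importance:
drift_important = ewc_penalty(np.array([1.5, -0.5]), theta_star, fisher)
drift_unimportant = ewc_penalty(np.array([1.0, 0.0]), theta_star, fisher)
assert drift_important > drift_unimportant
```

The new-task loss is then minimized subject to this penalty, letting unimportant parameters stay plastic while important ones stay stable.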


Test-time training pioneered by Sun et al. (2020) and later evolved into architectural primitives (TTT layer, TTT-E2E, TTT-Discover) takes a different approach: performing gradient descent on test data and compressing new information into parameters when needed.
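The core mechanic of test-time training can be sketched in a few lines (a toy linear predictor standing in for a TTT layer's fast weights, with next-step prediction as the self-supervised task; none of this is the actual TTT-layer architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4,)) * 0.1  # "fast weights", updated at inference time

def loss_and_grad(w, window):
    # Self-supervised task: predict each value from the previous 4.
    X = np.stack([window[i:i + 4] for i in range(len(window) - 4)])
    y = window[4:]
    err = X @ w - y
    return (err ** 2).mean(), 2 * X.T @ err / len(y)

stream = np.sin(np.linspace(0, 20, 200))  # data arriving at "test time"
loss0, _ = loss_and_grad(w, stream)
for _ in range(200):                       # gradient descent on the test data itself
    _, g = loss_and_grad(w, stream)
    w -= 0.1 * g
loss1, _ = loss_and_grad(w, stream)

assert loss1 < loss0  # the stream's structure is now compressed into the weights
```

The point is that the update happens at inference, on the data being processed, so the "context" ends up in parameters rather than in a cache.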


Meta-learning asks: can we train models to learn how to learn? Approaches range from MAML's few-shot-friendly parameter initialization (Finn et al., 2017) to Behrouz et al.'s nested learning (2025), which structures the model as a hierarchical optimization problem, running fast adaptation and slow updates on different timescales, inspired by biological memory consolidation.
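Here is a first-order sketch of that loop, in the spirit of Reptile rather than full second-order MAML (the one-parameter task family and step sizes are invented for illustration): the outer loop learns an initialization from which a few inner gradient steps suffice on any new task.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    # Each "task" is fitting y = a*x for a random slope a (a toy task family).
    a = rng.uniform(0.5, 2.0)
    x = rng.uniform(-1, 1, 32)
    return x, a * x

def inner_step(w, x, y, lr=0.1):
    # One gradient step of the per-task (fast) learner.
    return w - lr * 2 * np.mean((w * x - y) * x)

w_meta = 5.0  # deliberately poor initialization, far outside the task family
for _ in range(500):
    x, y = sample_task()
    w_adapted = w_meta
    for _ in range(5):                       # fast inner-loop adaptation
        w_adapted = inner_step(w_adapted, x, y)
    w_meta += 0.1 * (w_adapted - w_meta)     # slow outer-loop (meta) update

# The learned init now sits inside the task family, so a couple of
# inner steps suffice for any new task drawn from it.
assert 0.5 < w_meta < 2.0
```

The fast/slow separation is the same structural idea the nested-learning work scales up: different timescales of learning living in one system.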


Distillation preserves knowledge from a previous task by making the student model match a frozen teacher checkpoint. LoRD (Liu et al., 2025) made distillation efficient enough to run continuously by simultaneously pruning the model and replaying from a buffer. Self-Distillation from Teacher (SDFT, Shenfeld et al., 2026) flips the source, using the model's own outputs generated under teacher conditioning as the training signal, sidestepping the catastrophic forgetting of sequential fine-tuning.
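The common mathematical core of these methods is a temperature-scaled KL term between teacher and student output distributions; a minimal sketch of generic distillation (not the specific LoRD or SDFT objectives):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) at temperature T, the standard distillation term;
    # the T*T factor keeps gradient scale comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean() * T * T

teacher = np.array([[2.0, 0.5, -1.0]])
assert distill_loss(teacher.copy(), teacher) < 1e-9  # zero when student matches
assert distill_loss(np.zeros((1, 3)), teacher) > 0.0  # positive otherwise
```

Minimizing this term against a frozen teacher is what lets the student move on new data without drifting arbitrarily far from old behavior.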


Recursive self-improvement operates in a similar vein: STaR (Zelikman et al., 2022) bootstraps reasoning ability from self-generated chains of reasoning; AlphaEvolve (DeepMind, 2025) discovered algorithmic optimizations that had eluded researchers for decades; Silver and Sutton's "Era of Experience" (2025) frames agent learning as an endless stream of continuous experience.


These research directions are converging. TTT-Discover has already fused test-time training and RL-driven exploration. HOPE nests fast and slow learning cycles within a single architecture. SDFT has turned distillation into a fundamental operation for self-improvement. The boundaries between columns are blurring. The next generation of continual learning systems is likely to combine multiple strategies: using regularization for stability, meta-learning for acceleration, self-improvement for compounding. A growing number of startups are betting on different layers of this tech stack.


Continual Learning Startup Landscape


The nonparametric end of the spectrum is the best known. Shell companies (Letta, mem0, Subconscious) build orchestration layers and scaffolding to manage what gets fed into the context window. External storage and RAG infrastructure (such as Pinecone, xmemory) provide the retrieval backbone. The data is already there; the challenge is putting the right slice in front of the model at the right time. As context windows expand, the design space for these companies grows, and at the shell end especially, a wave of new startups is emerging to manage increasingly complex context strategies.


The parametric end is earlier-stage and more diverse. The companies here are trying some version of "post-deployment compression," allowing the model to internalize new information in the weights. The landscape here roughly divides into a few different bets on how the model should "learn" post-release.


Partial Compression: Learning without Retraining. Some teams are building pluggable knowledge modules (compressed KV caches, adapter layers, external memory stores) to let a universal model specialize without touching its core weights. The common argument: you can achieve meaningful compression (not just retrieval) while keeping the stability-plasticity trade-off manageable, because learning is sequestered rather than dispersed throughout the entire weight space. An 8B model paired with the right module can match the performance of much larger models on the target task. The advantage is composability: modules plug into existing Transformer architectures, can be swapped or updated independently, and cost far less to experiment with than retraining.


RL and Feedback Loop: Learning from Signals. Some teams are betting that the richest learning signals post-deployment are already present within the deployment loop itself—user corrections, task successes and failures, reward signals from real-world outcomes. The core idea is that the model should treat every interaction as a potential training signal, not just inference requests. This closely mirrors how humans progress at work: do the work, receive feedback, internalize what works. The engineering challenge is to translate sparse, noisy, and sometimes adversarial feedback into stable weight updates without catastrophic forgetting. However, a model that can truly learn from deployment will compound value in ways that a context system cannot.
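A toy sketch of this idea (a crude reward-weighted update on a linear "policy"; the update rule and reward shape are invented for illustration and are not any company's method): each interaction yields a scalar score, and the weights drift toward the behavior that was rewarded.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5, 3.0])  # latent "what users actually want"
w = np.zeros(4)                            # deployed policy parameters

def update_from_feedback(w, x, action, reward, lr=0.02):
    # Reward-weighted nudge: deviations from the current policy that scored
    # well pull the weights toward them; badly scored ones push away.
    return w + lr * reward * (action - x @ w) * x

err_before = float(np.linalg.norm(w - w_true))
for _ in range(5000):
    x = rng.normal(size=4)
    action = float(x @ w + rng.normal(0, 0.5))  # noisy exploration around the policy
    reward = -abs(action - x @ w_true)          # sparse scalar feedback per interaction
    w = update_from_feedback(w, x, action, reward)
err_after = float(np.linalg.norm(w - w_true))

assert err_after < err_before  # feedback has been compressed into the weights
```

Even in this toy, the hard parts the text describes are visible: the signal is noisy and indirect, and stability depends on a carefully small learning rate.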


Data-Centric: Learning from the Right Signals. A related but distinct bet is that the bottleneck is not the learning algorithm but the training data and the systems around it. These teams focus on filtering, generating, or synthesizing the right data to drive continuous updates, on the premise that a model fed a high-quality, well-structured learning signal needs far fewer gradient steps to meaningfully improve. This aligns naturally with the feedback-loop companies but emphasizes the upstream problem: it is one thing for a model to be able to learn, and another to know what it should learn, and how much.


New Architectures: Learning Capability from Bottom-Up Design. The most radical bet holds that the Transformer architecture itself is the bottleneck, and that continual learning requires fundamentally different computational primitives: architectures with continuous-time dynamics and built-in memory. The argument is structural: if you want a system that learns continually, you should embed the learning mechanism in the foundational architecture.


Caption: Landscape of continual learning startups


All major labs are also actively positioned across these categories. Some are exploring better context management and causal chain reasoning, some are experimenting with external memory modules or sleep-time compute pipelines, and a few stealth companies are pursuing new architectures. This field is early enough that no single approach has won, and given the breadth of use cases, there may never be a single winner.


Why Naive Weight Updates Fail


Updating model parameters in a production setting triggers a series of failure modes that have not yet been addressed at scale.


Caption: Failure modes of naive weight updates


The engineering problems are well documented. Catastrophic forgetting means a model plastic enough to learn new data will destroy existing representations (the stability-plasticity dilemma). Temporal coupling means invariant rules and changing state are compressed into the same set of weights, so updating one damages the other. Logical integration fails because factual updates do not propagate to their downstream inferences: the change lands at the level of token sequences, not semantic concepts. And unlearning remains unsolved: there is no differentiable subtraction operation, so false or toxic knowledge admits no precise surgical excision.
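Catastrophic forgetting is easy to reproduce even in a linear model. In this sketch (toy tasks, plain gradient descent, all values invented), naively fine-tuning on task B wipes out what was learned on task A:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(w, X, y, steps=200, lr=0.1):
    # Plain gradient descent on mean squared error, no protection at all.
    for _ in range(steps):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

X = rng.normal(size=(64, 8))
w_a, w_b = rng.normal(size=8), rng.normal(size=8)  # two conflicting "tasks"
ya, yb = X @ w_a, X @ w_b

w = np.zeros(8)
w = fit(w, X, ya)              # learn task A
err_a_before = mse(w, X, ya)   # near zero
w = fit(w, X, yb)              # naive sequential update on task B
err_a_after = mse(w, X, ya)    # task A is gone

assert err_a_after > err_a_before
```

Every weight-level approach in the previous section (EWC penalties, replay buffers, sparse updates, distillation against a frozen teacher) is, in one way or another, a strategy for breaking this outcome.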


There is a second class of problems that receives less attention. The current separation of training and deployment is not just an engineering convenience; it is a boundary for security, auditability, and governance. Opening this boundary raises multiple issues at once. Safety alignment may degrade unpredictably: even slight fine-tuning on benign data can surface broadly misaligned behaviors.


Ongoing updates create a surface for data poisoning attacks: a slow, persistent version of prompt injection that lives in the weights. Auditability collapses, because a continually updated model is a moving target, unsuited to versioning, regression testing, or formal certification. And when user interactions are compressed into parameters, privacy risks escalate: sensitive information gets baked into representations, where it is harder to filter out than information sitting in a retrieval context.


These are open problems, not fundamental impossibilities. Addressing them, alongside the core architectural challenges, is part of the continual learning research agenda.


From "Memento" to True Memory


Leonard's tragedy in "Memento" was not that he could not function; he was resourceful in any scenario, even remarkably so. His tragedy was that he could never compound. Each experience remained external: a Polaroid, a tattoo, a note in someone else's handwriting. He could retrieve, but he could not compress new knowledge.


As Leonard navigated this self-constructed maze, the boundary between reality and belief began to blur. His condition not only deprived him of his memory; it compelled him to constantly reconstruct meaning, making him both detective and unreliable narrator in his own story.


Today's AI operates under the same constraints. We have built very powerful retrieval systems: longer context windows, smarter shells, coordinated multi-agent groups, and they work. But retrieval is not the same as learning. A system that can retrieve any fact has never been forced to seek structure. It has never been forced to generalize. The lossy compression that makes training so powerful (the mechanism that converts raw data into transferable representations) is precisely what we shut off at deployment.


The path forward is likely not a single breakthrough but a layered system. Contextual learning will still be the first line of defense: native, validated, continuously improving. Modular mechanisms can handle the intermediate zone of personalization and domain specialization.


But for the truly hard problems (discovery, adversarial adaptation, tacit knowledge that cannot be put into words), we may need the model to keep compressing its experience into parameters after training. That implies sparse architectures, meta-learning objectives, and progress on self-improving loops. It may also require us to redefine the "model": not a fixed set of weights, but an evolving system that includes its memory, its update algorithm, and its ability to abstract from its own experience.


The filing cabinet is getting bigger. But a bigger filing cabinet is still a filing cabinet. The breakthrough is getting the model to do, after deployment, the powerful thing it does during training: compress, abstract, learn. We stand at a turning point: from amnesiac models to models that accumulate experience. Otherwise, we remain stuck in our own "Memento."


Original Article Link


