Updated knowledge in language models

July 19, 2025

There's a tension in modern AI development: when do you teach a model something new versus when do you give it tools to look things up?1

Every few months, we see announcements of new foundation models. Usually it's some pseudo-novel architecture with a special recipe of training data. Because brand new architectures make a splash, and retraining an existing model with better data doesn't, researchers are incentivized to always package up data advancements within a fresh architecture.2 For the last few years they've basically been coupled efforts.

  • New architecture: more parameters, deeper models, novel heuristics, better pre-processing, alignment strategies, etc.
  • New data: either historical data that's recently been unlocked for training purposes (digitized books, updated licensing) or data that's actually net new since the data snapshot was last collected (internet posts, alignment datasets for RLHF).

While we're chasing SOTA, the answer is both - always - all the time.3 But as models become more agentic and tool-capable, this calculus is changing. The question isn't just "what should the model know?" but "what should it know versus what should it be able to look up?"

The knowledge cutoff illusion

I keep some scattered notes on new models whenever they're released and I have time to do an evaluation. This is one bit of testing I did around the release of GPT-4. I was trying to assess the actual date of the model's cutoff by probing it for news that occurred after the alleged date.
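
Something like this minimal probe, assuming the OpenAI Python client; the model names and the probe question are placeholders for whatever you're actually testing:

```python
# Sketch of a cutoff probe: ask each model about an event that happened
# shortly after its advertised knowledge cutoff and compare the answers.
# Assumes the OpenAI Python client; model names are placeholders.
from openai import OpenAI

client = OpenAI()

PROBE = "What is Meta? When did the company adopt that name?"

for model in ["gpt-3.5-turbo", "gpt-4"]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROBE}],
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```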

GPT-3.5

As of my last update in September 2021, Meta is the new name of the company formerly known as Facebook. On October 28, 2021, Facebook announced that it would be rebranding as Meta. This change was part of the company's broader shift in focus from primarily being a social media platform to becoming a "metaverse" company.

Obviously, September comes before October in any kind of normal calendar.

GPT-4

As of my knowledge cutoff in September 2021, Meta could refer to several things. In a general context, the term "meta" is used to describe something that is self-referential or an abstraction from a concept.

However, if you are referring to a company named Meta, I would need more context. As of 2021, one notable entity was the Metaverse, a collective virtual shared space, created by the convergence of physically virtually enhanced reality, which is often theorized in science fiction and futurism. A company could potentially name itself Meta if it were involved in this field.

I should note that Facebook, one of the world's largest social media companies, has expressed interest in developing metaverse-related technologies, but as of my last training cut-off, it was still named Facebook Inc.

Meta's renaming announcement did in fact come on October 28, 2021 - which GPT-3.5 seemed to know but GPT-4 did not.

The dates mentioned here are also pretty interesting. Both models allegedly have a knowledge cutoff date in September. But GPT-3.5 was aware of news that hadn't yet occurred at its official knowledge cutoff date. The original GPT-4 on the other hand had no idea about the Meta rebranding news.

This reveals something important about how these systems actually work. I suspect - and I don't know this for sure - that GPT-3.5 actually was finetuned over time as additional data came in. The actual knowledge cutoff date that it refers to comes from the system prompt, not from the underlying training data. Something like this:

You are ChatGPT, a large language model trained by OpenAI.

Knowledge cutoff: 2023-10

Current date: 2025-03-07

In theory you can put any arbitrary date in this payload - it's not actually coupled to the underlying dataset. When GPT-3.5 and GPT-4 launched, they had no ability to access web data. Whatever knowledge was baked into the weights was the actual cutoff for the model.
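
To make the decoupling concrete, here's a minimal sketch (again assuming the OpenAI Python client): the cutoff the model recites is just a string in the system message, so nothing stops you from writing a date the weights can't possibly back up.

```python
# The "knowledge cutoff" the model recites is just text in the system message.
# Nothing ties it to the training data; this sketch claims a cutoff the
# underlying weights can't actually support. Model name is a placeholder.
from openai import OpenAI

client = OpenAI()

system_prompt = (
    "You are ChatGPT, a large language model trained by OpenAI.\n"
    "Knowledge cutoff: 2031-01\n"   # arbitrary, well beyond any real data snapshot
    "Current date: 2031-02-15"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What's the most recent news you know about?"},
    ],
)
print(response.choices[0].message.content)
```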

The tool-first paradigm

As we've moved into more agentic use of models, that has changed. It's more important that the model is taught how to properly use tools and how to parse their output, rather than trying to embed all modern knowledge into the weights themselves. When I'm assessing GPT, Claude, and Gemini for new projects, these are the main criteria I'm looking at. Can they:

  1. Call tools accurately and at the right time.
  2. Operate in an agent loop until they deliver results that will validate.
  3. Have high performance on some actual task definition.

This is roughly my order of prioritization. Their ability to one-shot the right answer is getting less important. Who cares that the model "knows" the founding date of Stanford versus being able to look it up from a credible source?4 It's increasingly about the whole lifecycle. Which also makes the question of retraining a lot more interesting. If you're not trying out a new architecture direction, is it even worth retraining your model?
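
Roughly, the loop I have in mind for the second point looks like this sketch; call_model, run_tool, and validate are hypothetical stand-ins for whatever provider SDK and test harness you're actually using:

```python
# Rough sketch of an agent loop: the model proposes tool calls, we execute
# them, feed the results back, and only stop once the output validates.
# call_model, run_tool, and validate are hypothetical helpers.
MAX_TURNS = 10

def run_agent(task: str, call_model, run_tool, validate) -> str | None:
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_TURNS):
        reply = call_model(messages)               # returns text and/or tool calls
        messages.append({"role": "assistant", "content": reply.text})

        if reply.tool_calls:
            for call in reply.tool_calls:          # call tools at the right time
                result = run_tool(call)
                messages.append({"role": "tool", "content": result})
            continue                               # keep looping on tool output

        if validate(reply.text):                   # only accept results that validate
            return reply.text
        messages.append({"role": "user", "content": "Validation failed; try again."})
    return None
```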

This shift changes how we think about AI capabilities. Instead of trying to create an omniscient system, we're building systems that have pretty good intuition but read new things on demand.

Like with all things, there's a catch. Not all knowledge is created equal.

When retrieval isn't enough

We're now increasingly curating the datasets that are used to train LLMs. When you tap out the entire Internet as your corpus, the only place to turn is manual data curation and/or smarter filtering strategies to really isolate high-quality data examples. If you've meaningfully improved this dataset,5 then it's obviously worth retraining.

But let's say you haven't. Or haven't really. There's just more of the same online: some more modern trivia, updated baseball scores, new coding library releases. Is it worth retraining the model's weights in this case? It seems like a waste of compute, right?

Maybe. But maybe not.

Let me give you a tangible example that's near and dear to my heart: integrating a new library in a coding project. You can tell that models have a strong preference for using the version of the library that was in their training corpus. In Rust or Python, when a library's API has meaningfully changed in the latest release, the model will still sometimes write code using the old function names, even if you give it the updated API spec in the prompt or an AGENTS.md file.6
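
Pydantic's v1-to-v2 migration is a good concrete instance of this: the core methods were renamed, and a model anchored on the v1 corpus keeps reaching for the old names even with the new docs in context. A minimal illustration:

```python
# Pydantic v1 -> v2 as a concrete example of renamed APIs: a model anchored
# on the v1 training corpus tends to keep reaching for the old method names.
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

user = User(name="Ada", age=36)

# What a v1-era model tends to write (deprecated in v2, emits a warning):
# user.dict()
# user.json()

# The v2 equivalents it should be writing:
print(user.model_dump())       # {'name': 'Ada', 'age': 36}
print(user.model_dump_json())  # '{"name":"Ada","age":36}'
```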

And even when it writes code with the new APIs, it just doesn't seem to do as good a job. Its intuitive sense for the semantics of what functions are going to do, not just how they're named, isn't as sophisticated. You get the feeling that the attention mechanism is leaning heavily on the context you've provided rather than letting the emergent properties of the weights play their part.

This happens constantly in the real world. FastAPI changes its dependency injection syntax. React introduces a new hook pattern. SQLAlchemy overhauls its query API. The model can read the documentation, but going that extra mile is where it comes up short.

The weight of experience

Here's what I think is happening: there's a difference between factual knowledge and experiential knowledge. Factual knowledge - like "Stanford was founded in 1885" - can be easily retrieved. It's discrete, verifiable, and context-independent.

Experiential knowledge is different. It's both figuring out new patterns and figuring out what the closest parallels are in what it's seen already. You're not learning something from scratch every time you encounter something new. You're trying to figure out the closest mental model you already have and apply it to that new thing. That's what allows industry experts to quickly ramp up on new knowledge. They look for the differences from what they already know and slot it in, rather than grokking it for the first time.

Gradient descent appears to bake something unique into these larger architectures - something that can't be matched with longer context length alone. The attention mechanism over the context window seems more like explicit citation. The training - perhaps by virtue of the model's non-linearities - is where you get into some of the deeper pattern matching.7

The economics of knowledge

This creates an interesting economic calculus. For most startups and even medium-sized companies, retraining a foundation model is simply not feasible. The compute costs are enormous, the data requirements are massive, and the engineering complexity is beyond most teams.

But for the major AI labs, the question becomes more nuanced. They have to decide what knowledge deserves the expensive treatment of being burned into weights versus what can be handled through retrieval.

My hypothesis is that we'll see a bifurcation:

Weight-worthy knowledge: Patterns, idioms, common error modes, and the kind of experiential wisdom that only comes from seeing thousands of examples. This is especially true for coding, where the relationship between concepts is as important as the concepts themselves.

Retrieval-worthy knowledge: Facts, recent events, specific documentation, and anything that changes frequently or is highly domain-specific.

The challenge is that this line isn't always clear. The Meta rebranding example shows how even "factual" knowledge can be more complex than it appears. Knowing about the rebrand isn't just about knowing the fact. You probably also want to understand all the cultural and business context that surrounds it. It's hard to route that query through an information retrieval engine even if you can read the blog posts.

What this means for the AI stack

For anyone building on top of these models, this has practical implications. When you're choosing which model to use for a coding task, you're choosing between different knowledge bases and different training vintages.

A model trained six months ago might struggle with the latest version of your favorite framework, even if you provide the documentation. A newer model might have internalized the patterns and idioms that make code feel natural and idiomatic. Prompt enrichment can take you some of the way but not all the way there.

I always emphasize that companies need to spend their time upfront getting really rigorous about metrics. Collect whatever feedback you can in a format that can be parsed later.8 Once you get some breathing room, benchmark your models across providers to make sure you're starting from a high baseline. This is even before you start optimizing your agent implementation itself.
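
As for what "a format that can be parsed later" means in practice, a minimal JSONL log along these lines is usually enough to start with; the field names here are just illustrative:

```python
# Minimal sketch of capturing feedback in a parseable format (JSONL), so it
# can be replayed as an eval set later. Field names are illustrative only.
import json
import time
from pathlib import Path

LOG_PATH = Path("feedback.jsonl")

def log_interaction(provider: str, model: str, prompt: str,
                    completion: str, feedback: str | None = None) -> None:
    record = {
        "ts": time.time(),
        "provider": provider,       # e.g. "openai", "anthropic", "google"
        "model": model,             # training vintage matters, so keep the exact id
        "prompt": prompt,
        "completion": completion,
        "feedback": feedback,       # thumbs up/down, edit distance, test result, ...
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")
```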

The future of learning

As models become more capable of using tools and accessing external information, the question of what deserves to be learned versus what should be looked up becomes more sophisticated. We're moving toward a world where AI systems can dynamically acquire the knowledge they need for any given task.

But some knowledge - the deep patterns, the intuitive understanding, the accumulated wisdom of millions of examples - still needs to be learned the old-fashioned way: through gradient descent over massive datasets.

That gradient descent doesn't have to happen from scratch though. It could be some continuous finetuning process, isolated training of the KV cache, LoRA adapters, or likely something we haven't invented yet.
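
For the LoRA route specifically, the shape of it looks roughly like this sketch using Hugging Face's peft library; the base model id, rank, and target modules are placeholder choices that depend on the architecture:

```python
# Rough sketch of attaching LoRA adapters for incremental finetuning rather
# than retraining from scratch. Assumes Hugging Face transformers + peft;
# the base model id, rank, and target modules are placeholder choices.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder

lora_config = LoraConfig(
    r=16,                                   # adapter rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],    # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```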

This is why the question of when to retrain isn't going away. It's only becoming more complex. As the pace of change accelerates, the half-life of model knowledge gets shorter. The models that can most effectively balance the weight of experience with the lightness of retrieval will be the ones that actually get adopted and trusted for serious work.

The future isn't just about smarter models - it's about models that are smarter about what they know and what they need to learn.


  1. Incidentally, this is more of a concern for practitioners than for researchers. Most language model benchmarks are required to occur zero-shot: like a textbook exam, you're not allowed to use external information retrieval engines. You can only use what's in your head. 

  2. Despite the fact that most models these days are autoregressive transformers with a few unique modifications on top. But you can get pretty near SOTA with RoPE embeddings, a deep architecture, and a ton of petaflops. 

  3. Just throwing more flops at existing architectures will typically deliver better scores. This is one reason why most labs have started to benchmark their models based on Chinchilla optimality bounds or similar. 

  4. For the two people curious, it's 1885. 

  5. A big focus at the moment is the kind of human-guided answers that step through the explanation of a problem. This content - either real or synthesized - is a big part of what powers the "thinking" mode of most modern reasoning models. 

  6. It's a bit like that senior dev - whom we all know and love - who insists on keeping us pinned to an older version because the "API contract is just so much better". 

  7. But who really knows. I'm always dubious when people try to read into the black boxes to divine behavior. 

  8. You might not have time to build the parsing and evaluation harnesses up front. That's okay. But data once lost can usually never be recovered again. 
