# The tenacity of modern LLMs
June 21, 2025
If you compare the raw benchmarks for Claude 4 Sonnet against 3.7, you see solid gains but not a huge relative difference. AIME improved from 80% to 85%.[^1] MMLU climbed from 86.1% to around 86.5%. The SWE-bench score jumped from 70.3% to 80.2%.
But like all metrics, those don't tell the whole story.
What I've been surprised to experience is the tenacity of modern LLMs. Give them a scoped task with some provable success criteria (i.e. unit tests) and these models will keep iterating until they find a solution, even if that means burning through a ton of context or reasoning tokens to get there.
My approach to kicking off these async runs is pretty straightforward. Here's the template of my usual prompt with some color commentary.
> my problem is abc. This will mostly affect @file1 and @file2.
Establish the specific goal that should be accomplished and the file targets that serve as jumping-off points. For net-new logic that doesn't tie to an existing file, I usually still guide it to some files that I know to be relevant.
I'm not too rigorous about providing this project specification. I usually describe the feature/bug, my pre-existing knowledge of what I think is going on and how I would solve it, and how it would affect the user (if it's a frontend feature). I haven't found that writing a PRD particularly improves the outcome, aside from forcing you to formalize your thinking about what's in scope.
> make sure to follow the context that's already provided to you about the codebase. start by researching the current state of affairs in this code by recalling the most relevant files for this refactor.
Even with our initial files, the model still needs to warm up its internal state about how these files relate to the rest of the codebase and where we might need to make changes.
But having these files as a starting place helps the model compose more specific RAG searches, since it now has the additional context of the headers and classes that might be imported in other files.
> let's first write a stub version of the logic. this should implement the core functionality (correct function signatures, class definitions, and dataclasses).
>
> then write a simple happy path unit test that tests this new flow
As much as possible we're trying to follow a TDD pattern here. But in a conventional TDD setting we'd usually sketch out the full API contract on paper first before we write the test, to make sure that the unit test is really baked before we start writing the code. But in practice I've found that the models are better at outputting first-draft code than deriving the right API contract without the code. So we cheat a little here and code a draft before we do an initial unit test, to get a sense for the scope of the happy path test.
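For concreteness, here's roughly what that stub-plus-happy-path step might look like in a Python project. The `slugify` helper is a made-up example of mine, not something from the prompt or any real codebase:

```python
# Stub: correct signature, types, and docstring; no real logic yet.
def slugify(title: str, max_length: int = 80) -> str:
    """Convert a post title into a URL-safe slug."""
    raise NotImplementedError


# Happy-path test: one representative input, no edge cases yet.
def test_slugify_happy_path():
    assert slugify("Hello, World!") == "hello-world"
```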
> once that is done, run the test and iterate on the code until it executes properly.
Run an initial version of the test to get feedback, then instruct the model to keep iterating until the problem is solved. Behind the scenes these agentic settings let the model choose whether it should continue processing the next message[^2], so we need to indicate that it should always be making changes in tandem with running the tests. The tests inject more information into the system that can be used to write the subsequent message. Without a specific instruction to keep looping, the model sometimes won't, and will just bail out early.
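A minimal sketch of that outer loop, assuming hypothetical `client.send()` and `run_tool()` interfaces (real agent harnesses are more elaborate, but the break condition is the one described in the footnote):

```python
# Hypothetical client/agent interfaces; the loop structure is the point.
messages = [{"role": "user", "content": prompt}]

while True:
    reply = client.send(messages)       # one model turn
    messages.append(reply.message)

    if not reply.tool_calls:            # no tool calls left: the model
        break                           # considers itself done

    for call in reply.tool_calls:       # e.g. edit a file, run the tests
        result = run_tool(call)         # test output, build errors, etc.
        messages.append({"role": "tool", "content": result})
```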
> every time you make a change, sanity check your changes by running `uv run pytest -vvv -k {new unit test}`. only when you're done with the full refactor should you run the unit tests on the entire codebase.
Focus just on making the current test pass, mostly for speed of iteration but also to keep the task focused. When this scoped logic is complete, we can zoom out and make sure we didn't break other areas of the test suite. Scoping also helps rein in model verbosity; if you've used these agents much, you've probably experienced the model liberally modifying or creating files that are clearly not relevant to the core task.
Certainly a prompt like this won't work every time, but it produces a solid first pass most of the time (especially for webapp development or targeted data-processing jobs). I'm working on a Swift project right now, and this was the loop it followed for a recent feature:
- Read 3 other Swift files
- Update todos
- Update files (x6 changes): initial implementation using NSData.decompressed
- Run build
- Build fails because of syntax (missing enum)
- Update files (x1 change): fix syntax
- Run build
- Build fails because .gzip isn't available in the NSData compression API
- Update files (x1 change): implement proper gzip decompression using the Compression framework
- Run build
- Build fails because it was trying to use a zlib wrapper instead of gzip
- Update files (x2 changes): switch to /usr/bin/gunzip via an external Process
- Run build
- Build successful
It tried three wholly different approaches here: one with NSData, one with the Compression APIs, and one with an external executable bundled with macOS. This comes back to tenacity.
Tenacity requires a feedback loop. And coding happens to be one of the few domains where that feedback can be nearly instantaneous. Write some code, run it, see what breaks, fix it, repeat. When you give an LLM access to this loop (whether through unit tests, script output, or webapp console logs) it's able to acquire more context that drives the next step. If the method call is failing, check the signature. If an edge case is failing, consider more complete handling.
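When the harness is a test suite, that loop is mostly plumbing: run the scoped test, capture whatever it prints, and hand the output back as the next message. A small sketch of the capture step, reusing the pytest invocation from the prompt above (the helper name is mine):

```python
import subprocess


def run_scoped_test(test_name: str) -> tuple[bool, str]:
    """Run a single test and return (passed, output) to feed back to the model."""
    proc = subprocess.run(
        ["uv", "run", "pytest", "-vvv", "-k", test_name],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr
```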
There's something that feels very information-theoretic about this setting. You're injecting more information into the closed system without having to do anything yourself. But since you're injecting information into the system, you can still run into the same garbage-in, garbage-out problem that plagues all language models.
Unit tests have been my most reliable guardrail on agent behavior. After all, when a model manages to make them pass, you get both working functionality and a regression test for future changes. Agents are way too willing to spam you with unit tests if left to their own devices, however, so making sure the happy path is stable before adding additional coverage has been the best way I've found to avoid code slop.
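Continuing the hypothetical `slugify` example from earlier: only once the happy-path test is green do I let the agent broaden coverage, and even then I nudge it toward property-style checks rather than a pile of near-duplicate cases:

```python
import re


# Edge cases added only after the happy path is stable.
def test_slugify_only_emits_url_safe_characters():
    assert re.fullmatch(r"[a-z0-9-]+", slugify("C'est la vie, friends!"))


def test_slugify_respects_max_length():
    assert len(slugify("word " * 50, max_length=20)) <= 20
```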
There seem to be some common approaches to how LLMs go about debugging. There's nothing empirical about this analysis; it's just casual observation from extensive usage over the last month:
- Get the code running first. If something won't compile or execute after several attempts with minor fixes, the model will often step back and try a different approach entirely, maybe switching libraries or reframing the problem.
- When logic fails, simplify. I've seen models struggle with complex regex patterns for text parsing, fail repeatedly, then suddenly pivot to writing explicit parsing logic with for loops and string operations (a sketch of that pivot follows this list). Less elegant, perhaps, but I suspect clearer logic about how it works. This I find quite interesting: since regexes are a formal language, you would think a machine would be quite efficient at writing them. But much like how LLMs often fail at math, my guess is that reasoning through the full internal state of what a state machine would do for a given regex pattern is just too complicated. Loops are more explicit.
- Fall back to bash. If RAG searches aren't returning the expected context, the models are pretty liberal about using a bash terminal to navigate the disk hierarchy (cd, ls) and inspect files manually (head, cat). This helps in contexts where dependency files are not otherwise mapped as part of the project. But I've also seen it used quite a bit to "try out" code syntax in a more isolated environment and make sure it passes.
- Liberal prints. There's definitely an MCP server out there that can set breakpoints in running code, but it doesn't seem like the current generation of architectures is particularly optimized for stepping through stack traces. My hunch is that it requires more concurrent context than models are optimized for in their turn-based setting: the full stack, the current breakpoint, the suite of variables.[^3] Instead, models prefer to add a lot of print statements as a way of post-facto stepping through the behavior of the executing code and making changes based on it.
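Here's the kind of regex-to-loop pivot I mean, on a made-up example of parsing `key=value` pairs. Both versions are mine rather than model output, but the shape matches what I've watched agents do:

```python
import re

line = "name=widget;qty=3;price=4.50"

# Regex version: compact, but the pattern carries all of the logic.
pairs_regex = dict(re.findall(r"(\w+)=([^;]+)", line))

# Explicit version: the plainer loop models tend to fall back to.
pairs_loop = {}
for chunk in line.split(";"):
    if "=" in chunk:
        key, _, value = chunk.partition("=")
        pairs_loop[key] = value

assert pairs_regex == pairs_loop == {"name": "widget", "qty": "3", "price": "4.50"}
```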
At the risk of anthropomorphizing these models too much, we've definitely RLHF'd our way to a workflow that mirrors what people do. There's something almost human about watching a model work through a hard problem: the false starts, the gradual understanding, the moment when the pieces finally click together. Even if this isn't reasoning in the conventional sense, it wouldn't surprise me if much of the software development workflow can be brute-forced by injecting ever-increasing context into the system. You're converging on the right code by trying different things.
If you squint, there's some deeper link between software development and gradient descent. You start at an initial point in parameter space (your code), you take a step (modify a file), you run a forward pass (try to run the unit tests), and you apply the backward gradients (make a new modification that tries to fix the code). It just so happens that humans have been approximating this behavior with Stack Overflow and breakpoints. Models have their own strategy for doing the same thing.
There's a cliché loved by coaches and teachers: hard work beats talent when talent doesn't work hard. At least in the case of language models, the talent is certainly willing to work pretty damn hard.
We'll see how far that gets them.
[^1]: Not an exact comparison because 3.7 was benchmarked on the 2024 edition of the dataset and 4.0 was benchmarked on the 2025 edition.

[^2]: This itself is a bit of a simplification. Since the models are just being called by client callers in an infinite loop, what we're really looking for is the model to output a message with no more tool calls (or an empty text context). At this point we have no more information to inject into the model, so we break out of the loop.

[^3]: Past a certain context length, all IDEs (Claude Code, Cursor, etc.) will truncate your message history to avoid reaching the actual limit of the inference model. Some do it explicitly. Others more opaquely transform your past messages into RAG structs and then just query for the most relevant ones when the next message comes in. My hunch is the RAG approach is more flexible, but there's certainly something appealing about the simplicity of the single-shot summarization approach.