# Coercing agents to follow code conventions

August 11, 2025
Coding agents are getting really good. You're hard-pressed to find a professional engineer these days who doesn't use Cursor or Claude Code for at least some part of their workflow.
LLMs are structurally advantaged at writing code: there's a huge training dataset online and code is immediately validatable. Few other domains have a reinforcement learning environment with as clean a loop as that. Try compiling legal docs and getting a binary outcome.
My pipeline for making agents work as reliably as possible is:
1. Scope out some clear functionality for the feature and the testing scope[^1]
2. Implement the initial feature
3. Add associated tests for the new feature
4. Ask the model to iterate until the new tests all pass
5. Ask the model to iterate until all the old tests still pass and lints pass
This loop ends up spending the majority of its time in steps 4-5. We can one-shot the first draft, but refining it over time requires a lot of trial and error. The reason models can be tenacious on coding tasks now is that they can accept more context/knowledge from the environment and make adjustments. Simply asking the model to fix itself doesn't inject any new knowledge into the system and won't lead to much improvement. The more context you can feed forward into the model, the better the results will get.
## Bad LLM Code
Usually the code that comes out of this iteration isn't wrong (i.e. it technically compiles and passes tests), but it's messy, or simply against my own coding conventions. Here are some things I've seen[^2], a few of which are sketched after the list:
- When in a long-running loop trying to fix some tests, models will sometimes wrap the test logic in a try/except just to get it to pass
- Models will reach for `patch()` to monkeypatch by default instead of testing the underlying end-to-end logic
- They'll use bare `assert` statements instead of raising more descriptive `ValueError` subclasses
- `time.sleep()` in tests as a stopgap for investigating race conditions, or just littered around excessively
- Using `List[]` and `Optional[]` on Python 3.12 instead of the modern `list[]` and `A | None`
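To make that concrete, here's a contrived sketch, written for this post rather than lifted from a real agent transcript, that crams a few of these patterns into one test file:

```python
import time
from typing import Optional
from unittest.mock import patch


def lookup_name(user_id: int) -> Optional[str]:  # should be `str | None` on 3.12
    return {1: "John"}.get(user_id)


def test_lookup_name():
    try:  # failure is silently swallowed, so this test can never go red
        with patch(f"{__name__}.lookup_name", return_value="John"):
            time.sleep(0.5)  # papers over a race condition nobody investigated
            assert lookup_name(1) == "John"  # asserts against the mock, not the real logic
    except Exception:
        pass
```

Everything here runs green, which is exactly the problem.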
All of these happen regardless of how strongly I try to coerce the behavior in the system prompt. We've managed to make models pretty tenacious problem solvers, but that doesn't help much if they're able to cheat on the given problems while convincing themselves[^3] that they succeeded.
## Determystic
Static analysis is a pretty good antidote to this behavior. At its simplest, that feedback loop is just tests (pytest) and a linter (ruff). I put these checks in a Makefile by default and instruct the agent in my system prompt to run them every time.
I'm very bullish on the long-term opportunity for static analysis to improve agents. But that goes beyond flagging things that are technically incorrect; we need to move into flagging things that aren't quite the behavior we want.
What we ideally want is to specify our own rules. And it makes sense to specify these rules in code, not in an LLM prompt, since we need some determinism: the lints should return the same results after every pass. Thankfully, LLMs are able to generate AST parsing code pretty accurately.
Could we just use LLMs to create deterministic guardrails for themselves? Determystic is a proof of concept to answer that question.
Let's say we want to ensure that our models always use modern Python type hints and don't use `Optional`. Suppose your LLM generated some code that we don't like:
```python
from typing import Optional
from pydantic import BaseModel


class MyModel(BaseModel):
    name: Optional[str] = None
    age: int


def main():
    model = MyModel(name="John", age=30)
    print(model)


if __name__ == "__main__":
    main()
```
We can describe this convention explicitly by referencing the bad code that the agent just wrote.[^4]
```
uvx determystic new-validator

1. Enter the bad Python code that your Agent generated: var: Optional[int] = None
2. Describe the issues: Don't use Optional - use A | None
3. Name your validator: Modern typehinting
```
After this you can run `uvx determystic validate` and get a full validation of your project based on your new rule:
```
Custom Validator main.py:6: Use 'T | None' instead of 'Optional[T]' for type hints
    4 |
    5 | class MyModel(BaseModel):
>>> 6 |     name: Optional[str] = None
    7 |     age: int
    8 |
```
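The fix is the one-line change the message points at: swap the `Optional[str]` annotation for the modern union syntax (and drop the now-unused `typing` import).

```python
from pydantic import BaseModel


class MyModel(BaseModel):
    name: str | None = None  # was: Optional[str] = None
    age: int
```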
## How it works
Determystic creates validators for your coding conventions, using the AST of your programming language. Running the `new-validator` CLI will kick off a Claude job to draft an AST validator. The validator will look for this `Optional[]` pattern and flag an error when it finds one.
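As a rough sketch of the kind of check involved (this is my own illustration, not determystic's actual generated code, and the names are mine), an AST validator for this rule can be as small as a `NodeVisitor` that flags `Optional[...]` subscripts:

```python
import ast


class OptionalUsageChecker(ast.NodeVisitor):
    """Flags `Optional[...]` (or `typing.Optional[...]`) annotations."""

    def __init__(self) -> None:
        self.errors: list[tuple[int, str]] = []

    def visit_Subscript(self, node: ast.Subscript) -> None:
        # Works for both `Optional[...]` (a Name) and `typing.Optional[...]` (an Attribute)
        base = node.value
        name = getattr(base, "id", None) or getattr(base, "attr", None)
        if name == "Optional":
            self.errors.append(
                (node.lineno, "Use 'T | None' instead of 'Optional[T]' for type hints")
            )
        self.generic_visit(node)


def check_source(source: str) -> list[tuple[int, str]]:
    checker = OptionalUsageChecker()
    checker.visit(ast.parse(source))
    return checker.errors
```

Because it's plain `ast` traversal, it returns the same findings on every run, which is the determinism we were missing from prompt-level instructions.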
To make sure the AST parser does what we intend, we also have the internal determystic agent write a series of test cases that exercise good and bad behavior. We iterate until the parser passes those tests, so you get a validator you don't have to spot-check yourself. All of this is stored in a local `.determystic` hidden folder in your current project that you can introspect.
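Continuing the sketch above, those good/bad test cases are roughly this shape (again illustrative: the module name and test names below are made up, not what determystic writes to disk):

```python
from my_validator import check_source  # hypothetical module holding the sketch above


def test_flags_optional_annotation():
    errors = check_source("x: Optional[int] = None")
    assert errors == [(1, "Use 'T | None' instead of 'Optional[T]' for type hints")]


def test_flags_qualified_optional():
    assert check_source("import typing\nx: typing.Optional[int] = None")


def test_allows_modern_union_syntax():
    assert check_source("x: int | None = None") == []
```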
If this approach catches on, you can also share these validators across projects or across team members. In a pipeline like this I think the shareability of the validators is mission critical.
To make determystic a bit more convenient for all my projects, I also bundled in some default validators that I install on every project these days: ruff for code formatting and ty for type checking. ty is still in early beta but moving quickly. I imagine once it's released as stable, a lot of people are going to move over from pyright and mypy[^5].
## The future
As LLMs get better and we put more compute behind RL environments, I think a lot of this will be solved at the model layer. But even then the system prompt makes us rely on natural language to describe what is otherwise a more precise problem.
In the future we might not even look at our code, and at that point it won't really matter what the code looks like. But for as long as humans need to interact with the code we write, and therefore have opinions about coding style and library usage, this is still very helpful. I've found it can also result in better code, by enforcing ways of implementing library subclasses that I know will lead to a more efficient end draft.
I'm still in the very early days on this approach. But I'd love some feedback about ergonomics and the unusual coding conventions I'm sure people are going to throw at it. Raise an issue if you have examples of things that aren't working and I'll try to get them fixed.
[^1]: There are more guides than I can read on how exactly to do this, so I won't spend much time on that here.

[^2]: These are based on pretty extensive use of Claude lately, but similar things crop up when I'm using the GPT family of models.

[^3]: And, sometimes, us.

[^4]: uvx is just a runner here to execute the determystic package from PyPI.

[^5]: These are more comprehensive typechecking solutions but can be meaningfully slower for large projects.