# Language servers for AI

September 25, 2025
Human beings are still undoubtedly the main consumers of language servers.[^1] They're integrated into pretty much every IDE on the planet and provide the syntax highlighting that makes thousands of LOC a whole lot more readable.
But if I had to put money on it, I'd bet their accelerating capabilities will be driven more by automatic calls from LLMs than by manual use by humans.
## Static analysis
For the sake of simplicity here, I'm defining language servers as broadly as possible. This includes basic semantic parsing as well as the logical mapping of functionality. A non-exhaustive list of things that language servers can validate:
- Is this grammar compatible with the language's syntax?
- Is this function actually defined in this scope?
- Does this attribute access actually exist on the underlying instance object?
Language extensions like TypeScript, or type annotations in Python, give our language servers more context when checking for type validity or access patterns. Compiled languages typically enforce stricter typing anyway, so the overlap between compilation toolchains and static analysis is a bit more obvious there.
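To make the last two checks concrete, here's a toy snippet; the `Store` class and the typo are invented for illustration, and a type-aware checker like mypy or pyright will flag the bad attribute access without ever executing the code:

```python
class Store:
    name: str


def open_sign(store: Store) -> str:
    # Typo: `Store` has no attribute `nmae`. A language server reports this
    # statically, long before the function is ever called at runtime.
    return f"{store.nmae} is open"
```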
## Zero-shotting in a complex project
LLMs are very strong at generating code that's grammatically valid, i.e. it compiles or executes without any syntax errors. They struggle more with long-range external dependencies: some class defines function A or parameter B, and you need to call that same function over in file C. Let's make this a bit more concrete:
```python
from store_enums import ApartmentType


class Apartment:
    type: ApartmentType
    floor: int
    sq_ft: int

    def collect_rent(self):
        # to implement by an LLM, depending on the size of the apartment
        ...
```
If you were to write this yourself, you'd probably already have an approach in mind after reading that simple stub: consider the types of apartments we have to support, price each per square foot accordingly, and multiply by the square footage of the actual apartment.
If we're trying to automate that logical flow, our goal would be for the LLM to understand what we need to do and:
- Search for the store object
- Identify that there are different types of apartments
- Read the values of `ApartmentType` so we can `switch` on them in the implementation
- Access `ApartmentType` via a relative import
This should be readily solvable by modern agents at a small scale. But the more classes you have, the more you fill your context. And the more code you ask the model to write that depends on the outputs of other code, the more opportunity there is for even small access errors to snowball into code that doesn't run at all.
Some of that limitation can be blamed on the search/RAG systems we use to surface code. Whenever there's a feature to implement, the LLM has to re-familiarize itself with the contents of your codebase. It either greps your codebase for a string literal, or it provides some semantic description of what it's looking for. It's the ultimate guess-and-check of trying to find the needle in the haystack.[^2] Sometimes it misses the context entirely, in which case you have no hope that the predicted code will correctly reference some out-of-scope object's properties.
People have long hoped that increasing the context length will solve this problem: put all your code in the context and let the model figure out what to do next. We've already seen a meaningful improvement in the context lengths that modern models can support; most of the frontier labs host options that can parse 1M-2M tokens. However, thus far we typically see diminishing performance when we stuff the context with information irrelevant to the task at hand.[^3]
## Agent correction loops
I find that agents work best when you're able to give them some deterministic validators. This pretty closely follows the RL paradigm that they were trained on in the first place; they loop over multiple tool calls, then periodically call some test function to act as the evaluator / reward function. We have already optimized the weights to support this workflow. It seems reasonable to mirror that pattern more closely at inference time as well.
What makes a good validator? Something that's (A) deterministic, so the model can optimize against the same benchmark criteria, and (B) fast, so you spend the majority of your time generating tokens. Token generation will almost always be your primary latency bottleneck, and there's not much we can control there other than using a smaller model or getting an ASIC chip online. So we might as well make the rest of the out-of-band tool calls as fast as possible.
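As a rough illustration of the loop, here's a minimal sketch assuming `pyright` is installed as the fast, deterministic validator; `request_revision` is a hypothetical stand-in for whatever call your agent framework exposes:

```python
import subprocess


def request_revision(file_path: str, diagnostics: str) -> None:
    """Hypothetical hook: ask the model to rewrite `file_path` given the validator output."""
    raise NotImplementedError  # wire this up to your agent / LLM client


def validate(file_path: str) -> tuple[bool, str]:
    # Deterministic, fast validator: a type checker run as a subprocess.
    # Assumes `pyright` is on PATH; swap in your own language server or linter.
    result = subprocess.run(["pyright", file_path], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def correction_loop(file_path: str, max_iterations: int = 5) -> bool:
    for _ in range(max_iterations):
        ok, diagnostics = validate(file_path)
        if ok:
            return True
        # Feed the diagnostics back so the model can repair its own output.
        request_revision(file_path, diagnostics)
    return False
```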
Unit tests that you've written are the golden baseline for agent correction loops. If you know exactly the criteria you want to test against, and you add enough guard rails as part of the LLM generation[^4], the agent can usually keep working long enough to find something that passes the tests. But it's usually pretty painful to write tests before you actually have the logic stubbed out, no matter what TDD advocates say. It also takes a lot of unit tests to fully cover all code paths and edge cases. It's certainly easier to let an agent do a lot of that heavy lifting for you.
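For example, a single test like the following (reusing the hypothetical rates from the earlier sketch, with an assumed `apartments` module path) already gives the agent a concrete target to iterate against:

```python
import pytest

# Hypothetical import path; adjust to wherever Apartment actually lives.
from apartments import Apartment, ApartmentType


def test_collect_rent_prices_by_type_and_square_footage():
    apartment = Apartment(type=ApartmentType.STUDIO, floor=3, sq_ft=400)
    # 400 sq ft at the assumed studio rate of $3.50 per sq ft.
    assert apartment.collect_rent() == pytest.approx(1400.0)
```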
But then we're left with a conundrum: can you tell an agent to write good one-shot tests even if you don't have faith they can write good one-shot code? They feel like a similar level of complexity to me.
Static analysis is by definition a more general solution. You can run it on your codebase before you have a single line written. It provides a baseline that your code is internally consistent: it might not do what you intend, but it's almost guaranteed not to raise exceptions at runtime. It's also so fast that you can run it on every file modification without having to decide whether this is the right time to spend X minutes waiting on unit tests.
I kick off almost all my agent tasks with the forced requirement that, before they yield back to me, they run static analysis and our existing unit tests. In the worst case, this is still useful even when the logic isn't correct, since it gives you a reasonable code-quality baseline to pick up from again. In the best case, I've seen a few situations where the language server flagged errors in external function calls and guided the LLM to pull that file back into its context cache, which then let it fix other parts of the code to be compatible with the actual definition.
## Code conventions in language servers
`eslint` used to have a ton of different flavors of linting syntax that people would use. At this point I feel like most people have given up on the style wars, so long as everyone within a project is using the same standard. I wouldn't be surprised if we see a resurgence of some of this opinionated design when it comes to broader codebase conventions.
LLMs are exposed to all of the code that's online with all manner of competing conventions. Error handling (proactive vs. reactive), web requests (sync vs. async), design patterns (factory vs. dependency injection), etc. We expect our model to intuit the design patterns we use in an existing project, or choose some reasonable ones when it gets started on a new project. That's putting a lot of pressure on the model.
But if we can enforce these conventions deterministically, instead of just praying to the probabilistic gods, why wouldn't we? Create a language server that verifies all web requests happen async; that errors are always thrown and caught in a `main` handler instead of being handled in the function itself; maybe even that functions can only instantiate some classes via dependency injection instead of a direct allocation. Not all coding conventions can be expressed with AST validators. Perhaps not even most. But I could see this readily becoming an extension of language servers, especially for the most egregious style violations.
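As a sketch of what I mean, here's a minimal AST-based check for one hypothetical rule: web requests must not go through the synchronous `requests` library. The rule and the message wording are illustrative, not from any existing tool:

```python
import ast


class SyncRequestVisitor(ast.NodeVisitor):
    """Flags calls like `requests.get(...)` under a hypothetical async-only convention."""

    def __init__(self) -> None:
        self.violations: list[tuple[int, str]] = []

    def visit_Call(self, node: ast.Call) -> None:
        func = node.func
        # Match attribute calls whose receiver is the bare name `requests`.
        if (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "requests"
        ):
            self.violations.append(
                (node.lineno, f"sync web request `requests.{func.attr}`; use the async client")
            )
        self.generic_visit(node)


def check_conventions(source: str) -> list[tuple[int, str]]:
    visitor = SyncRequestVisitor()
    visitor.visit(ast.parse(source))
    return visitor.violations


# Reports a violation on line 2 of this toy snippet.
print(check_conventions("import requests\nresp = requests.get('https://example.com')\n"))
```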
We'll see how this side of the ecosystem develops. It would certainly be a departure from the scope of conventional language servers. But you could have said the same thing about expanding language server capabilities from basic code highlighting to validating function arguments. In the meantime, I have a proof of concept that shows some of the power of using agents to write these style guide parsers. Maybe that will soon even come to a language server near you.
[^1]: This being in any dispute in 2025 was not on my 2020 bingo card.

[^2]: I suspect this is part of the reason why beginners are so impressed by LLMs and "pros" are not; the pros have much more complicated codebases that they need to work around, while beginners can one-shot something that works very quickly because the LLM still has everything in context.

[^3]: I'm optimistic that with enough training and RL execution on long-horizon tasks we'll be able to overcome some of this bias. But information density is a tricky fundamental challenge to overcome; humans certainly aren't able to keep that same amount of context in our heads concurrently either, so we can't look to our own capabilities as a useful baseline.

[^4]: You want to avoid the model "cheating," like silencing errors in try/catch blocks or reading the test function so it just returns the expected I/O outputs. I still see this happen occasionally with SOTA models.