# How language servers work

June 14, 2025
I'm sure you've used a language server even if you haven't explicitly installed one. All modern IDEs (from VSCode to Zed to Cursor to Neovim) use one to provide syntax highlighting and error underlining when they notice that you've written incorrect or subpar code.
This is a dive into how language servers work under the hood.
A language server is half static analyzer and half server. The server part is trivial: it implements a common spec (the Language Server Protocol, or LSP) for communicating linting errors in a way that every IDE can consume. The analyzer is where the magic happens. When you modify a file, the language server parses the code into an Abstract Syntax Tree (AST), analyzes the semantic relationships, and applies a set of rules to detect issues.
Here's the typical flow:
- Parsing: Raw source code gets tokenized and parsed into an AST
- Semantic Analysis: The server builds symbol tables, resolves imports, tracks variable scopes
- Rule Application: Static analysis rules check for type mismatches, undefined variables, unreachable code
- Incremental Updates: Only the changed portions get re-analyzed
The key is that this all happens without executing any code. Language servers provide deterministic feedback about code correctness through static analysis alone, which keeps things fast and, importantly, free of side effects.
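To make the server half concrete: when the analysis finds something, the server reports it as a JSON-RPC notification. Here's a minimal sketch assuming the standard LSP `textDocument/publishDiagnostics` method; the framing helper and file URI are just illustrative.

```python
import json

def publish_diagnostics(uri: str, diagnostics: list) -> bytes:
    """Frame a textDocument/publishDiagnostics notification in LSP's wire format."""
    body = json.dumps({
        "jsonrpc": "2.0",
        "method": "textDocument/publishDiagnostics",
        "params": {"uri": uri, "diagnostics": diagnostics},
    })
    # LSP messages are length-prefixed, HTTP-header style
    return f"Content-Length: {len(body.encode())}\r\n\r\n{body}".encode()

message = publish_diagnostics("file:///project/main.py", [{
    "range": {"start": {"line": 6, "character": 19},
              "end": {"line": 6, "character": 32}},
    "severity": 1,  # 1 = Error in the LSP spec
    "message": "Undefined variable 'undefined_var'",
}])
```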
## Parsing in Action
Here's a minimal example of how tokenization and AST generation work. Python conveniently exposes its own parser through the standard library's `ast` module. For lower-level compiled languages, there's usually an ecosystem of lexers and parsers that support the language. Sometimes they're defined in terms of more generic grammars that can be parsed into an AST with more standardized tooling like ANTLR (cross-language parser generator), Tree-sitter (incremental parsing for editors), Yacc/Bison (classic LR parsers), or PEG parsers like PEG.js and TatSu.
```python
import ast
from dataclasses import dataclass
from typing import Dict, List, Optional
from json import dumps as json_dumps

# Raw source code
source = """
def calculate(x: int, y: int) -> int:
    result = x + y
    return result
"""

# Parse into AST
tree = ast.parse(source)

# Simple AST visitor to extract function definitions
class FunctionAnalyzer(ast.NodeVisitor):
    def __init__(self):
        self.functions: List[Dict] = []

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        func_info = {
            'name': node.name,
            'args': [arg.arg for arg in node.args.args],
            'return_type': ast.unparse(node.returns) if node.returns else None,
            'line': node.lineno
        }
        self.functions.append(func_info)
        self.generic_visit(node)

analyzer = FunctionAnalyzer()
analyzer.visit(tree)
print(json_dumps(analyzer.functions, indent=4))
```
The value of the AST lies in converting raw program text into a symbolic representation that we can actually reason about, independent of specific variable names, function decomposition, etc. It's what underpins every interpreter and compiler pipeline.
```json
[
    {
        "name": "calculate",
        "args": [
            "x",
            "y"
        ],
        "return_type": "int",
        "line": 2
    }
]
```
## Semantic Analysis in Practice
Once we have an AST, the language server needs to understand what the code actually means. This is where semantic analysis comes in - it's not enough to know that `x + y` is syntactically valid; we need to verify that both `x` and `y` are actually defined and that they're compatible types.
The foundation of semantic analysis is the symbol table - a data structure that tracks every variable, function, and identifier in your code. Think of it as the language server's memory of "what exists where."
```python
class SymbolTable:
    def __init__(self, parent: Optional['SymbolTable'] = None):
        self.parent = parent
        self.symbols: Dict[str, Dict] = {}

    def define(self, name: str, symbol_type: str, line: int) -> None:
        self.symbols[name] = {'type': symbol_type, 'line': line, 'used': False}

    def lookup(self, name: str) -> Optional[Dict]:
        if name in self.symbols:
            self.symbols[name]['used'] = True
            return self.symbols[name]
        elif self.parent:
            return self.parent.lookup(name)
        return None
```
The clever part here is the `parent` reference. This allows us to model nested scopes - when you define a variable inside a function, it should be accessible within that function but not outside it. The symbol tables form a chain: if we can't find a variable in the current scope, we look in the parent scope, then its parent, and so on.
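A quick sketch of the chain in action (the names here are just for illustration):

```python
global_scope = SymbolTable()
global_scope.define("config", "variable", line=1)

function_scope = SymbolTable(parent=global_scope)
function_scope.define("x", "parameter", line=3)

print(function_scope.lookup("x"))        # found directly in the function scope
print(function_scope.lookup("config"))   # falls through to the parent scope
print(function_scope.lookup("missing"))  # None - defined nowhere in the chain
```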
Now we need to actually walk through the code and build these symbol tables. The AST visitor pattern is usually the cleanest way to traverse the tree:
```python
class ScopeAnalyzer(ast.NodeVisitor):
    def __init__(self):
        self.current_scope = SymbolTable()
        self.errors: List[str] = []

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        # Create new scope for function
        new_scope = SymbolTable(parent=self.current_scope)
        old_scope = self.current_scope
        self.current_scope = new_scope

        # Add parameters to function scope
        for arg in node.args.args:
            self.current_scope.define(arg.arg, 'parameter', node.lineno)

        # Visit function body
        self.generic_visit(node)

        # Restore previous scope
        self.current_scope = old_scope
```
When we encounter a function definition, we temporarily switch to a new scope, analyze everything inside the function, then switch back. The function's parameters become part of that inner scope, so they're accessible throughout the function body but nowhere else.
Maintaining this state then lets us catch code that tries to use variables that don't exist:
```python
    # A method on ScopeAnalyzer, alongside visit_FunctionDef
    def visit_Name(self, node: ast.Name) -> None:
        if isinstance(node.ctx, ast.Store):
            # Variable assignment - add to current scope
            self.current_scope.define(node.id, 'variable', node.lineno)
        elif isinstance(node.ctx, ast.Load):
            # Variable use - must exist somewhere in the scope chain
            if not self.current_scope.lookup(node.id):
                self.errors.append(f"Undefined variable '{node.id}' at line {node.lineno}")
```
This is where the AST context comes in handy. Python distinguishes between storing to a variable (`x = 5`) and loading from it (`print(x)`). When we see a load context, we need to verify the variable exists - which we can do just by checking the scope chain.
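You can see both contexts by dumping the AST of a two-line snippet:

```python
print(ast.dump(ast.parse("x = 5\nprint(x)"), indent=2))
# The first x appears as Name(id='x', ctx=Store()),
# the second as Name(id='x', ctx=Load())
```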
Let's see this in action with some problematic code:
```python
source = """
def process_data(items):
    total = 0
    unused_var = "never used"
    for item in items:
        total += item
    return total + undefined_var  # Error: undefined_var not defined
"""

tree = ast.parse(source)
analyzer = ScopeAnalyzer()
analyzer.visit(tree)
print(analyzer.errors)
```
The analyzer catches our mistake:
```
[
    "Undefined variable 'undefined_var' at line 7"
]
```
Real language servers extend this concept much further - tracking types, method calls, import resolution, and more. But the fundamental pattern remains: parse the code into a structure you can reason about, then systematically verify that all the relationships make sense.
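As a taste of that extension, here's one way type tracking might start: recording the type of literal assignments in the same visitor style. This is a deliberately naive sketch; real inference handles expressions, calls, and control flow.

```python
class TypeTracker(ast.NodeVisitor):
    """Naive type inference: only infers from assignments of literal constants."""
    def __init__(self):
        self.types: Dict[str, str] = {}

    def visit_Assign(self, node: ast.Assign) -> None:
        if isinstance(node.value, ast.Constant):
            inferred = type(node.value.value).__name__
            for target in node.targets:
                if isinstance(target, ast.Name):
                    self.types[target.id] = inferred
        self.generic_visit(node)

tracker = TypeTracker()
tracker.visit(ast.parse("x = 5\nname = 'hello'"))
print(tracker.types)  # {'x': 'int', 'name': 'str'}
```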
## Rule Application Engine
Parsing and semantic analysis give us a structured understanding of the code, but they don't automatically tell us what's problematic about it. That's where the rule engine comes in - it's the component that actually decides "this looks suspicious" and produces those red squiggly lines you see in your editor.
Rules are individual inspectors, each looking for a specific type of issue. One rule might scan for unused imports, another checks for overly complex functions, and another flags potential security vulnerabilities. The rule engine coordinates all these inspectors and consolidates their findings into a single report.
First, we need a way to represent what these inspectors find:
```python
from enum import Enum
from typing import Any, Callable, List, Dict, Set
from dataclasses import dataclass
import ast

class ErrorSeverity(Enum):
    ERROR = "error"
    WARNING = "warning"
    INFO = "info"

@dataclass
class Diagnostic:
    line: int
    column: int
    message: str
    severity: ErrorSeverity
    rule_id: str
```
The `Diagnostic` is the language server's way of saying "I found something worth mentioning." It captures what the issue is, where it occurs, and how serious it is. The `rule_id` is used in some configurations to choose which rules to enable or disable.
For the sake of a demo, the rule engine can stay simple. Here it's just a coordinator that runs each rule and aggregates the results:
```python
class RuleEngine:
    def __init__(self):
        self.rules: List[Callable[[ast.AST], List[Diagnostic]]] = []

    def add_rule(self, rule: Callable[[ast.AST], List[Diagnostic]]) -> None:
        self.rules.append(rule)

    def analyze(self, tree: ast.AST) -> List[Diagnostic]:
        diagnostics = []
        for rule in self.rules:
            diagnostics.extend(rule(tree))
        return diagnostics
```
Each rule is just a function that takes an AST and returns a list of diagnostics. This keeps the architecture clean - rules can be developed independently and plugged in as needed.
Let's look at a concrete rule. Here's one that catches unused imports:
```python
def check_unused_imports(tree: ast.AST) -> List[Diagnostic]:
    """Find unused import statements"""
    diagnostics = []

    class ImportChecker(ast.NodeVisitor):
        def __init__(self):
            self.imports: Dict[str, int] = {}
            self.used_names: Set[str] = set()

        def visit_Import(self, node: ast.Import) -> None:
            for alias in node.names:
                name = alias.asname if alias.asname else alias.name
                self.imports[name] = node.lineno

        def visit_Name(self, node: ast.Name) -> None:
            if isinstance(node.ctx, ast.Load):
                self.used_names.add(node.id)

    checker = ImportChecker()
    checker.visit(tree)

    for name, line in checker.imports.items():
        if name not in checker.used_names:
            diagnostics.append(Diagnostic(
                line=line,
                column=0,
                message=f"Unused import '{name}'",
                severity=ErrorSeverity.WARNING,
                rule_id="unused-import"
            ))
    return diagnostics
```
The pattern here is familiar: walk the AST, collect information, then check for violations. We track all the imports we see, track all the names that get used, then flag any imports that never appear in the usage set.
Here's another rule that catches overly complex functions:
```python
def check_function_complexity(tree: ast.AST) -> List[Diagnostic]:
    """Check cyclomatic complexity of functions"""
    diagnostics = []

    class ComplexityChecker(ast.NodeVisitor):
        def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
            complexity = 1  # Base complexity
            for child in ast.walk(node):
                if isinstance(child, (ast.If, ast.While, ast.For, ast.With)):
                    complexity += 1
                elif isinstance(child, ast.BoolOp):
                    complexity += len(child.values) - 1

            if complexity > 5:  # Arbitrary threshold
                diagnostics.append(Diagnostic(
                    line=node.lineno,
                    column=node.col_offset,
                    message=f"Function '{node.name}' has high complexity ({complexity})",
                    severity=ErrorSeverity.WARNING,
                    rule_id="high-complexity"
                ))

    checker = ComplexityChecker()
    checker.visit(tree)
    return diagnostics
```
This rule approximates cyclomatic complexity - a measure of how many independent paths through a function exist. Each branching construct adds a path; in the demo below, the base score of 1 plus one `for` loop and four nested `if` statements yields 6. Too many branches and the function becomes hard to test and maintain.
Now we can wire everything together and see it in action:
```python
engine = RuleEngine()
engine.add_rule(check_unused_imports)
engine.add_rule(check_function_complexity)

source = """
import os  # unused import
import sys  # unused import
from typing import List

def complex_function(items: List[int]) -> int:
    total = 0
    for item in items:
        if item > 0:
            if item % 2 == 0:
                if item > 10:
                    if item < 100:
                        total += item * 2
                    else:
                        total += item
                else:
                    total += item
            else:
                total += item // 2
        else:
            total -= abs(item)
    return total
"""

tree = ast.parse(source)
diagnostics = engine.analyze(tree)

for diag in diagnostics:
    print(f"{diag.severity.value}: {diag.message} (line {diag.line})")
```
The engine catches both issues:
```
warning: Unused import 'os' (line 2)
warning: Unused import 'sys' (line 3)
warning: Function 'complex_function' has high complexity (6) (line 6)
```
Production language servers ship with hundreds or thousands of these rules. Some check for style consistency, others catch common mistakes, and more sophisticated ones can even detect security vulnerabilities or performance issues. Each rule should be independent and composable - you can mix and match them to create the linting behavior you want.
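That mixing and matching typically hangs off the `rule_id`. Here's a hypothetical sketch of how a lint configuration might gate which rules get registered; the `config` dict stands in for whatever settings file the tool reads.

```python
# Hypothetical lint configuration: rule_id -> enabled
config = {"unused-import": True, "high-complexity": False}

available_rules = {
    "unused-import": check_unused_imports,
    "high-complexity": check_function_complexity,
}

engine = RuleEngine()
for rule_id, rule in available_rules.items():
    if config.get(rule_id, True):  # rules default to enabled
        engine.add_rule(rule)

# Now only unused-import diagnostics are reported
diagnostics = engine.analyze(tree)
```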
## Rust sets a high bar
I'm a huge advocate for Rust in general, but especially for the care the core team takes to provide interpretable errors. It's now a core focus of the project. Rust's error reporting was redesigned in RFC 1644:
> Rust offers a unique value proposition in the landscape of languages in part by codifying concepts like ownership and borrowing. Because these concepts are unique to Rust, it's critical that the learning curve be as smooth as possible. And one of the most important tools for lowering the learning curve is providing excellent errors that serve to make the concepts less intimidating, and to help 'tell the story' about what those concepts mean in the context of the programmer's code.
The RFC set specific goals for error messages:
- Create something that's visually easy to parse
- Remove noise/unnecessary information
- Present information that works well for new developers, post-onboarding, and experienced developers
- Use labels on the source itself rather than sentence "notes" at the end
This resulted in the now-familiar Rust error format with color-coded labels that explain both the what (primary labels in red) and the why (secondary labels in blue) of errors:
```
error[E0499]: cannot borrow `foo.bar1` as mutable more than once at a time
  --> src/main.rs:29:22
   |
28 |     let bar1 = &mut foo.bar1;
   |                     -------- first mutable borrow occurs here
29 |     let _bar2 = &mut foo.bar1;
   |                      ^^^^^^^^ second mutable borrow occurs here
30 |     *bar1;
31 | }
   | - first borrow ends here
```
## Language Server vs. Compiler
It's worth pointing out that `rustc` is not actually the Rust language server. That job falls to `rust-analyzer`.[^3]

While `rustc` prioritizes correctness and educational error messages, `rust-analyzer` prioritizes responsiveness for IDE features. Since `rustc` has access to the full AST context that's used to compile the code, it can pull error messages directly from that context - but it takes longer. `rust-analyzer` needs to provide as much contextual feedback as possible from the AST alone, skipping the heavy compilation pipeline.
**Speed vs. Correctness Trade-offs:** `rust-analyzer` uses incremental, query-based compilation that can provide feedback in milliseconds, while `rustc` does thorough analysis that takes seconds. The language server maintains its own compilation database and only re-analyzes minimal sets of files when changes occur.
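To make "query-based" concrete, here's a toy sketch of the idea in Python: a query caches its result keyed on a revision counter and only recomputes when its input changes. (`rust-analyzer` builds this on the salsa crate with full dependency tracking; this sketch hand-wires a single parse query.)

```python
import ast

class AnalysisDatabase:
    """Toy incremental query: re-parse a file only when its text has changed."""
    def __init__(self):
        self.files: dict[str, str] = {}      # path -> source text
        self.revisions: dict[str, int] = {}  # path -> change counter
        self.parse_cache: dict[str, tuple[int, ast.AST]] = {}

    def set_file(self, path: str, text: str) -> None:
        self.files[path] = text
        self.revisions[path] = self.revisions.get(path, 0) + 1

    def parse(self, path: str) -> ast.AST:
        revision = self.revisions[path]
        cached = self.parse_cache.get(path)
        if cached and cached[0] == revision:
            return cached[1]  # cache hit: input unchanged, skip the work
        tree = ast.parse(self.files[path])
        self.parse_cache[path] = (revision, tree)
        return tree

db = AnalysisDatabase()
db.set_file("main.py", "x = 1")
db.parse("main.py")              # computed
db.parse("main.py")              # served from cache
db.set_file("main.py", "x = 2")  # bumps the revision
db.parse("main.py")              # recomputed
```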
**Different Analysis Goals:** `rustc` needs to generate correct machine code, so it performs exhaustive analysis. `rust-analyzer` only needs enough analysis to power IDE features like completion and error highlighting, so it can make approximations that would be unacceptable in a compiler.
**Laziness vs. Incrementality:** `rust-analyzer` is designed around laziness - the ability to skip analyzing most code and only look at what's immediately relevant to the current file. This is fundamentally different from `rustc`'s approach of analyzing entire crates.
## Interpreted languages
Interpreted languages (Python, Node, etc.) don't have the benefit[^1] of a compiler. They can't fall back on something like `rustc` to surface, ahead of time, all the errors that might occur at runtime. Instead you're left to discover errors by exercising every code path through tests or manual execution.
Static analysis is the only way for an interpreted language to run a correctness check on the entire codebase right as the code is written.
This fundamental limitation shapes how language servers for interpreted languages must operate. They can't rely on a compilation step to catch type mismatches, undefined variables, or import errors, so they've turned to increasingly complex heuristics - to the point where they're practically starting to resemble compiler front ends themselves.
Python's type hints have no runtime enforcement outside of specialized libraries like Pydantic or FastAPI - they exist mainly for static analysis tools like `mypy` and language servers like `pylsp`.[^2] The language server parses type annotations, tracks variable assignments, and infers types across function calls without any guarantee that the runtime behavior will match:
```python
def process_items(items: list[str]) -> int:
    total = 0
    for item in items:
        total += len(item)  # Language server assumes item is str
    return total

# Python won't complain until execution reaches len(item):
process_items([1, 2, 3])  # Passes integers, not strings
```
The language server will flag this as an error, but Python will happily start executing it until you hit the `len()` call on an integer. This disconnect between static analysis and runtime behavior forces interpreted language servers to make conservative assumptions.
JavaScript's situation is even more complex. Without TypeScript annotations, the language server must perform type inference on a language where variables can change type mid-execution:
```javascript
let value = "hello";
value = 42;        // Now it's a number
value = [1, 2, 3]; // Now it's an array
```
Language servers like `typescript-language-server` handle this by maintaining multiple possible types for each variable and propagating that uncertainty through the analysis. When they can't be sure, they either suppress warnings (reducing false positives but missing real issues) or flag everything (increasing false positives). Neither approach is perfect.
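A minimal sketch of that bookkeeping, with Python standing in for the analyzer's internals: track the set of types each variable has been assigned, and the union is what the analysis has to reason with. (The literal-only inference here is an illustrative simplification.)

```python
import ast

class TypeUnionTracker(ast.NodeVisitor):
    """Record every type a variable has been assigned, as a union of possibilities."""
    def __init__(self):
        self.possible_types: dict[str, set[str]] = {}

    def visit_Assign(self, node: ast.Assign) -> None:
        if isinstance(node.value, ast.Constant):
            inferred = type(node.value.value).__name__
        elif isinstance(node.value, ast.List):
            inferred = "list"
        else:
            inferred = "unknown"
        for target in node.targets:
            if isinstance(target, ast.Name):
                self.possible_types.setdefault(target.id, set()).add(inferred)
        self.generic_visit(node)

tracker = TypeUnionTracker()
tracker.visit(ast.parse("value = 'hello'\nvalue = 42\nvalue = [1, 2, 3]"))
print(tracker.possible_types)  # {'value': {'str', 'int', 'list'}}
```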
The result is that interpreted language servers must work harder for less certain results. They need to implement all the control-flow analysis, type inference algorithms, and import resolution strategies to approximate what a compiler gets for free. Tools like `pyright` and `typescript-language-server` represent hundreds of thousands of lines of code dedicated to reconstructing the semantic information that compiled languages maintain throughout their toolchain.
Some of this is baked into the language designs themselves: a few language choices make it impossible to build a fully type-safe analyzer even if you tried. It's also why adding type hints to Python or adopting TypeScript for JavaScript projects can dramatically improve the language server experience. You're essentially providing the static information that the analysis engine needs to give you compiler-level feedback.
## Conclusion
The architecture of language servers is deceptively simple: parse source code into ASTs, build symbol tables to track semantic relationships, apply rule engines to detect issues, then serialize the results using the LSP. But the implementation details reveal the complexity - managing incremental updates, handling partial parse trees, maintaining performance with large codebases, and dealing with the uncertainty inherent in static analysis.
Compiled languages have a natural advantage here. Their language servers can leverage existing compiler infrastructure and benefit from type systems that enforce correctness. Interpreted languages must reconstruct this semantic information through increasingly sophisticated heuristics alone, and those heuristics are ultimately limited by the typing decisions the language made in the first place.
The real innovation of language servers isn't in any single algorithm or technique - it's in the standardization of developer tooling across editors and languages. The LSP has created an ecosystem where language-specific analysis can be developed once and consumed everywhere. As more code is written automatically, that standardization becomes even more important.
[^1]: Or drawback, depending on who you're talking to.

[^2]: At least, that was the original purpose. I think runtime annotation sniffing is becoming more useful in devtooling than the language servers themselves.

[^3]: `rust-analyzer` was specifically designed to be independent of the Rust compiler (`rustc`) to enable faster feedback and richer IDE integration. While it shares some components with the compiler, it maintains its own analysis pipeline optimized for interactive use rather than code generation.