How speculative decoding works
# July 13, 2025
Context window length aside, the biggest thing holding LLMs back is inference speed. Once you creep up into 300B-param territory, models get pretty damn slow even on the fastest hardware. When a single token takes milliseconds to generate, and you're generating hundreds of tokens in a response, those milliseconds add up fast. Agents and recursive language model calls make that even more true.
Speculative decoding has been used for a few years now to speed up inference by up to 4x in the best cases1. It's one of those techniques that feels almost like cheating: you get faster inference without sacrificing quality. No quantization tricks or anything. Here's how it works.
The bottleneck of autoregressive generation
As models have gotten bigger, they've had to be served across multiple GPUs linked with NVLink or InfiniBand. An A100 80 GB fits roughly 40B params in FP16; anything larger shards across multiple GPUs2. Every time you want to generate a token, you need to coordinate across all these GPUs, run the forward pass, and then do it all over again for the next token.
This is the constraint of autoregressive inference: you can only generate one token at a time. Since each new token has to be conditioned on everything that came before it, there's no way around the sequential steps.
The problem gets worse as models scale. You need more GPUs (which adds coordination overhead), and each forward pass requires more computation. A 70B parameter model takes O(10x) longer per token than a 7B model, since it's doing an order of magnitude more floating point arithmetic3. While individual matrix multiplies are fast, larger architectures mean more transformer blocks, which means more sequential work per token.
Here's what this looks like in practice:
Every token requires a full forward pass through the entire model. Token 2 can't start until token 1 is completely finished. Token 3 has to wait for both 1 and 2.
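To make the bottleneck concrete, here's a minimal sketch of plain autoregressive decoding. It assumes a hypothetical `large_model(token_ids) -> logits` callable (not anything from the post) and uses greedy decoding for simplicity; the point is just that every new token costs a full forward pass over everything generated so far.

```python
import torch

def generate_naive(large_model, prompt_ids: list[int], n_new: int) -> list[int]:
    # One full forward pass per generated token -- strictly sequential.
    tokens = list(prompt_ids)
    for _ in range(n_new):
        logits = large_model(torch.tensor([tokens]))  # shape: [1, seq_len, vocab]
        next_id = int(torch.argmax(logits[0, -1]))    # greedy decoding for simplicity
        tokens.append(next_id)                        # token i+1 can't start until token i exists
    return tokens
```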
Obvious tokens
Not all tokens are created equal. Some next-token choices are obvious, even trivial. If I write "The capital of France is...", you know the next token should be "Paris" without needing a 70B parameter model to figure it out. A much smaller model could predict this just as accurately.
Speculative decoding exploits this insight. Instead of using your slow model to predict every single token, you can use a faster "draft" model to make initial guesses about what the next several tokens might be. Then you use the big model to verify these guesses via a single pass.
This is a bit of a hack, but it's a clever hack. The intuition is that many token sequences in natural language are predictable enough that a small model can get them right. And when the small model is wrong, you haven't lost much time because you were going to need to run the big model anyway.
How speculative decoding works
The process works in two phases: speculation and verification.
Speculation Phase: First, you run a small, fast model (maybe 1-7B parameters) to generate a sequence of k tokens. This is your "draft" or "speculative" sequence. Since this model is much smaller, it can generate these k tokens much faster than the large model could generate even a single token.
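Here's a sketch of the speculation phase under the same hypothetical interface, now with a `draft_model(token_ids) -> logits` callable standing in for the small model. It runs k cheap autoregressive steps and records the probability the draft model assigned to each drafted token, which the verification step will need.

```python
import torch

def draft_k_tokens(draft_model, tokens: list[int], k: int):
    # Run the small model autoregressively for k cheap steps.
    drafted, draft_probs = [], []
    ctx = list(tokens)
    for _ in range(k):
        logits = draft_model(torch.tensor([ctx]))       # shape: [1, seq_len, vocab]
        probs = torch.softmax(logits[0, -1], dim=-1)
        next_id = int(torch.multinomial(probs, num_samples=1))
        drafted.append(next_id)
        draft_probs.append(float(probs[next_id]))       # remember p_draft for this token
        ctx.append(next_id)
    return drafted, draft_probs
```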
Verification Phase: Next, you take all k speculative tokens and feed them into your large model in a single forward pass. The large model processes the entire sequence at once and outputs a prediction for the token at the very end, just as it would in a regular autoregressive pass. But because all of the drafted tokens went through the model together, you also get the large model's full probability distribution at every position - that is, how likely it thinks each of those drafted tokens is.
Then comes the verification. Conceptually, you accept speculative tokens from left to right as long as they match (or are close to) what the large model predicted. The moment you hit a mismatch, you stop, reject that token and every speculative token after it, and use the large model's own prediction for that position instead.
The actual sampling scheme is a bit more subtle. We want the accepted tokens to follow exactly the same distribution the large model would have produced on its own, even when the two models disagree. This means accepting each drafted token with probability min(1, p_large / p_draft), where p_large is the probability the large model assigns to that token and p_draft is the probability the draft model assigned to it. This rejection sampling scheme ensures the final distribution is mathematically identical to what you'd get from running the large model alone.
You don't need the drafted token to be the large model's top pick - you can accept many tokens even when the two models' probabilities differ significantly. If the large model assigns the token a higher probability than the draft model did, you always accept it (the acceptance probability clips to 1). If it assigns a lower probability, you accept it probabilistically.
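Putting the two phases together, here's a sketch of the verification step, again against the same hypothetical model interface: one large-model forward pass over the context plus the drafted tokens, then a left-to-right accept/reject loop using min(1, p_large / p_draft). The full algorithm also resamples a corrective token from the residual distribution when it rejects - that's what keeps the output distribution exactly identical - but that part is elided here.

```python
import torch

def verify_drafts(large_model, tokens: list[int], drafted: list[int], draft_probs: list[float]):
    # One forward pass over context + all drafted tokens at once.
    ctx = torch.tensor([tokens + drafted])
    probs = torch.softmax(large_model(ctx), dim=-1)[0]    # shape: [seq_len, vocab]

    accepted = []
    n_ctx = len(tokens)
    for i, (tok, p_draft) in enumerate(zip(drafted, draft_probs)):
        # The large model's distribution for drafted position i comes from the logits one step earlier.
        p_large = float(probs[n_ctx + i - 1, tok])
        if torch.rand(1).item() < min(1.0, p_large / p_draft):
            accepted.append(tok)              # large model agrees enough: keep the draft token
        else:
            break                             # reject this token and everything after it
    return accepted
```

The key design point is that this single large-model pass costs roughly the same as generating one token the normal way, because the extra positions are processed in parallel along the sequence dimension.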
The math works out
Let's think about the performance characteristics:
Best case: All k speculative tokens are accepted. You get k tokens for the price of one large model forward pass (plus the cheap cost of the small model). If k=4, you've just achieved 4x speedup.
Worst case: The very first speculative token is wrong. You reject it and all subsequent tokens, keeping only the large model's prediction for the first position. You've generated 1 token but had to run both the small model and the large model. This is actually slower than normal autoregressive generation.
Average case: Some prefix of the speculative tokens are correct. If you accept 2 out of 4 speculative tokens on average, you're getting 2x speedup overall.
You only need the small model to be right some of the time for this to be worthwhile. Even if it's wrong 60% of the time, you can still see significant speedups because the verification happens in parallel.
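A rough back-of-envelope, assuming each draft token is accepted independently with probability alpha and ignoring the draft model's overhead (both simplifications): every verification pass yields at least one token from the large model, plus a geometrically decaying chance that each additional draft token survives.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    # 1 guaranteed token, plus an alpha^i chance that the i-th draft token is still accepted.
    return sum(alpha ** i for i in range(k + 1))

print(expected_tokens_per_pass(alpha=0.4, k=4))  # draft wrong 60% of the time -> ~1.65 tokens/pass
print(expected_tokens_per_pass(alpha=0.8, k=4))  # well-aligned draft          -> ~3.36 tokens/pass
```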
Choosing the right draft model
The draft model needs to be fast enough that running it k times is cheaper than running the large model once. This usually means the draft model should be 10-20x smaller than the target model. A 7B draft model paired with a 70B target is a common configuration.
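As a quick sanity check on that budget, here's a sketch with made-up, purely illustrative per-token latencies (the numbers are assumptions, not measurements): the k draft steps plus one verification pass have to come in under what the large model would spend generating the same tokens on its own.

```python
def estimated_speedup(t_draft_ms: float, t_large_ms: float, k: int, tokens_per_pass: float) -> float:
    # Plain autoregressive cost for the same number of tokens...
    baseline = tokens_per_pass * t_large_ms
    # ...versus k cheap draft steps plus a single large-model verification pass.
    speculative = k * t_draft_ms + t_large_ms
    return baseline / speculative

# Hypothetical numbers: 3 ms/token draft, 40 ms/token target, k=4, ~3 tokens kept per pass.
print(estimated_speedup(t_draft_ms=3.0, t_large_ms=40.0, k=4, tokens_per_pass=3.0))  # ~2.3x
```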
But size isn't the only consideration. The draft model should also be aligned with the target model - meaning it should make similar predictions in similar contexts. If the draft model has a completely different training distribution or instruction-following behavior, it'll produce speculative tokens that the target model consistently rejects.
Some teams train draft models specifically for this purpose, using the same data distribution as the target model but with fewer parameters. Others just use an off-the-shelf checkpoint that has been benchmarked to a high enough level of alignment. The key is finding the sweet spot between speed and alignment.
Real-world performance
In practice, speculative decoding typically achieves 2-3x speedup on average, with some workloads seeing up to 4x. The actual speedup depends heavily on:
- Text predictability: Technical documentation and code see higher speedups than creative writing
- Model alignment: Better-aligned draft models achieve higher acceptance rates than unaligned ones
- Hardware setup: The speedup depends on your specific GPU configuration and memory bandwidth
- Sequence length: Longer sequences tend to have more predictable tokens
The technique is now deployed in production at most frontier labs. It's one of those optimizations that provides meaningful user-experience improvements without requiring new model architectures or training procedures. And importantly, it guarantees the same output distribution as running the full model alone. There's no tradeoff of making the model a bit dumber to make it a lot faster.4
When coupled with more quantization and better chips, it's working behind the scenes to get you predictions way faster than even a couple of years ago. Same parameters, same hardware, faster models.
- More typically it's 2-3x in wall-clock speed. ↩
- In FP16 the raw parameters of a 100B model are ~200 GB, and that's before the KV cache and activations. With those factored in, it quickly scales up to several A100s. ↩
- With modern kernels this relationship is non-linear, so in expectation it's less than 10x. But it's still certainly much slower. ↩
- This is a much easier tradeoff to make in the abstract than when you have to assign percentage-point tradeoffs to each choice. ↩