Definition
Speculative decoding speeds up inference by having a small "draft" model propose multiple tokens, then having the large "target" model verify all of them in a single parallel pass.
How It Works:
1. The small draft model generates N candidate tokens sequentially.
2. The large target model verifies all N in one parallel forward pass.
3. Correct predictions are accepted; the first incorrect one is rejected and replaced with the target model's token.
4. Generation continues from the last accepted token.
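The loop above can be sketched in a few lines. This is a toy illustration, not a real implementation: the two "models" below are deterministic stand-ins invented for the demo, and the verification step, written as a Python loop here for clarity, is a single batched forward pass in practice.

```python
import random

def target_model(context):
    # Large "target" model: a fixed rule standing in for greedy decoding.
    return (sum(context) * 31 + 7) % 50

def draft_model(context):
    # Small "draft" model: agrees with the target roughly 80% of the time.
    guess = target_model(context)
    return guess if random.random() < 0.8 else (guess + 1) % 50

def speculative_decode(prompt, n_new, draft_len=4):
    tokens = list(prompt)
    target_passes = 0  # each round costs one (batched) target forward pass
    while len(tokens) < len(prompt) + n_new:
        # 1. Draft model proposes draft_len candidate tokens sequentially.
        ctx, draft = list(tokens), []
        for _ in range(draft_len):
            t = draft_model(tuple(ctx))
            draft.append(t)
            ctx.append(t)
        # 2. Target model checks all candidates (one parallel pass in practice).
        target_passes += 1
        ctx = list(tokens)
        for t in draft:
            correct = target_model(tuple(ctx))
            if t == correct:            # 3. accept a correct prediction
                tokens.append(t)
                ctx.append(t)
            else:                       # reject the wrong one, keep the
                tokens.append(correct)  # target's token instead, and
                break                   # 4. continue from the last good token
    return tokens[: len(prompt) + n_new], target_passes
```

Because every kept token is one the target model itself would have produced, the output is identical to decoding with the target alone; the saving is that several tokens can be committed per target pass.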
Speed Benefits:
- Typically 2-3x faster inference
- Greater gains on longer generations
- With proper rejection sampling, the output distribution exactly matches the target model's
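The 2-3x figure can be estimated with a standard back-of-envelope formula: if each drafted token is accepted independently with probability alpha and gamma tokens are drafted per round, the expected number of tokens committed per target-model pass is (1 - alpha^(gamma+1)) / (1 - alpha). The independence assumption is a simplification; real acceptance rates vary by position.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens committed per target forward pass, assuming each of
    the gamma drafted tokens is accepted independently with probability
    alpha (a simplifying assumption for this estimate)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# e.g. an 80% acceptance rate with 4 drafted tokens per round:
# expected_tokens_per_pass(0.8, 4) ≈ 3.36
```

The net wall-clock speedup is lower than this number, since each round also pays for gamma cheap draft-model steps.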
Requirements:
- The draft model must be much faster than the target model
- Draft and target should produce similar distributions (ideally sharing a tokenizer and training data)
- Works best when the draft's acceptance rate is high
Variants:
- Self-speculative decoding (the target model drafts using a subset of its own layers)
- Medusa (extra decoding heads on the target model predict several tokens at once)
- Lookahead decoding
Examples
Using a 1B-parameter model to draft tokens for a 70B-parameter target model, achieving a 2.5x speedup.