
Speculative Decoding

Acceleration technique using a small model to draft tokens that a large model verifies.


Definition

Speculative decoding speeds up inference by having a small "draft" model propose multiple tokens, which the large "target" model then verifies in a single parallel pass.

How It Works:

1. The small draft model generates N candidate tokens autoregressively
2. The large target model verifies all N candidates in one parallel forward pass
3. Tokens matching the target model's predictions are accepted; the first mismatch and everything after it is rejected
4. Generation continues from the last accepted token
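The loop above can be sketched with toy stand-ins for both models. The `target_next` and `draft_next` functions below are hypothetical deterministic next-token rules, not real models; the point is the draft-then-verify control flow, including the "free" corrected token the target contributes at the first mismatch.

```python
def target_next(seq):
    # Toy "large model": deterministically predicts (last token + 1) mod 10.
    return (seq[-1] + 1) % 10

def draft_next(seq):
    # Toy "draft model": usually agrees with the target, but guesses
    # wrong whenever the last token is 3.
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10

def speculative_decode(prompt, num_tokens, k=4):
    """Greedy speculative decoding sketch: draft k tokens, verify them,
    keep the longest matching prefix plus one corrected token."""
    seq = list(prompt)
    target_calls = 0
    while len(seq) - len(prompt) < num_tokens:
        # 1. Draft model proposes k candidate tokens autoregressively.
        drafted, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            drafted.append(t)
            ctx.append(t)
        # 2. Target model checks every drafted position. A real model does
        #    this in one batched forward pass; here we just count it as
        #    a single (expensive) target-model call.
        target_calls += 1
        verified, ctx = [], list(seq)
        for t in drafted:
            correct = target_next(ctx)
            if t == correct:
                # 3a. Match: accept the drafted token.
                verified.append(t)
                ctx.append(t)
            else:
                # 3b. First mismatch: reject it and everything after, but
                #     keep the target's own prediction for this position.
                verified.append(correct)
                break
        # 4. Continue from the last accepted token.
        seq.extend(verified)
    return seq[len(prompt):][:num_tokens], target_calls
```

Running `speculative_decode([0], 8, k=4)` produces 8 tokens with only 2 target-model calls, where plain autoregressive decoding would need 8.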

Speed Benefits:

- 2-3x faster inference is typical
- Longer generations benefit more, since drafting overhead is amortized
- With proper rejection sampling, the target model's exact output distribution is preserved
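As a rough sanity check on the 2-3x figure, assume each drafted token is accepted independently with probability `alpha` (an idealization; real acceptance rates are context-dependent). Each verification step then yields the geometric sum 1 + α + ... + α^k tokens on average, which bounds the reduction in target-model calls:

```python
def expected_tokens_per_verify(alpha, k):
    # Expected tokens produced per target-model verification step when
    # each of the k drafted tokens is independently accepted with
    # probability alpha: 1 + alpha + alpha^2 + ... + alpha^k.
    return sum(alpha ** i for i in range(k + 1))

# With an 80% acceptance rate and 4 drafted tokens per step, each
# target call yields ~3.36 tokens on average.
rate = expected_tokens_per_verify(0.8, 4)
```

This ignores the draft model's own cost, so realized speedups land below this ceiling, consistent with the 2-3x typically reported.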

Requirements:

- The draft model must be much faster than the target model
- Draft and target models should make similar predictions (e.g., same tokenizer and similar training data)
- Works best when the draft model's acceptance rate is high

Variants:

- Self-speculative decoding (the target model drafts using a subset of its own layers)
- Medusa (extra decoding heads predict several future tokens at once)
- Lookahead decoding (parallel n-gram drafting and verification without a separate draft model)

Examples

Using a 1B-parameter model to draft tokens for a 70B-parameter model from the same family, achieving a 2.5x speedup.
