
Speculative Decoding

Acceleration technique using a small model to draft tokens that a large model verifies.


Definition

Speculative decoding speeds up inference by having a small "draft" model propose multiple tokens, which the large "target" model then verifies in a single parallel pass.

How It Works:

1. The small draft model generates N candidate tokens autoregressively
2. The large target model verifies all N candidates in one parallel forward pass
3. Tokens matching the target model's predictions are accepted; the first mismatch and everything after it is rejected
4. Generation continues from the last accepted token
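The loop above can be sketched with toy stand-ins for both models. The `target_next` and `draft_next` functions below are hypothetical deterministic next-token rules, not real models; the point is the draft-then-verify control flow, including the "free" corrected token the target contributes at the first mismatch.

```python
def target_next(seq):
    # Toy "large model": deterministically predicts (last token + 1) mod 10.
    return (seq[-1] + 1) % 10

def draft_next(seq):
    # Toy "draft model": usually agrees with the target, but guesses
    # wrong whenever the last token is 3.
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10

def speculative_decode(prompt, num_tokens, k=4):
    """Greedy speculative decoding sketch: draft k tokens, verify them,
    keep the longest matching prefix plus one corrected token."""
    seq = list(prompt)
    target_calls = 0
    while len(seq) - len(prompt) < num_tokens:
        # 1. Draft model proposes k candidate tokens autoregressively.
        drafted, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            drafted.append(t)
            ctx.append(t)
        # 2. Target model checks every drafted position. A real model does
        #    this in one batched forward pass; here we just count it as
        #    a single (expensive) target-model call.
        target_calls += 1
        verified, ctx = [], list(seq)
        for t in drafted:
            correct = target_next(ctx)
            if t == correct:
                # 3a. Match: accept the drafted token.
                verified.append(t)
                ctx.append(t)
            else:
                # 3b. First mismatch: reject it and everything after, but
                #     keep the target's own prediction for this position.
                verified.append(correct)
                break
        # 4. Continue from the last accepted token.
        seq.extend(verified)
    return seq[len(prompt):][:num_tokens], target_calls
```

Running `speculative_decode([0], 8, k=4)` produces 8 tokens with only 2 target-model calls, where plain autoregressive decoding would need 8.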

Speed Benefits:

- 2-3x faster inference is typical
- Longer generations benefit more, since drafting overhead is amortized
- With proper rejection sampling, the target model's exact output distribution is preserved
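As a rough sanity check on the 2-3x figure, assume each drafted token is accepted independently with probability `alpha` (an idealization; real acceptance rates are context-dependent). Each verification step then yields the geometric sum 1 + α + ... + α^k tokens on average, which bounds the reduction in target-model calls:

```python
def expected_tokens_per_verify(alpha, k):
    # Expected tokens produced per target-model verification step when
    # each of the k drafted tokens is independently accepted with
    # probability alpha: 1 + alpha + alpha^2 + ... + alpha^k.
    return sum(alpha ** i for i in range(k + 1))

# With an 80% acceptance rate and 4 drafted tokens per step, each
# target call yields ~3.36 tokens on average.
rate = expected_tokens_per_verify(0.8, 4)
```

This ignores the draft model's own cost, so realized speedups land below this ceiling, consistent with the 2-3x typically reported.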

Requirements:

- The draft model must be much faster than the target model
- Draft and target models should make similar predictions (e.g., same tokenizer and similar training data)
- Works best when the draft model's acceptance rate is high

Variants:

- Self-speculative decoding (the target model drafts using a subset of its own layers)
- Medusa (extra decoding heads predict several future tokens at once)
- Lookahead decoding (parallel n-gram drafting and verification without a separate draft model)

Examples

Using a 1B-parameter model to draft tokens for a 70B-parameter model from the same family, achieving a 2.5x speedup.
