Model Distillation

Training a smaller model to mimic the behavior of a larger, more capable model.

Definition

Model distillation transfers knowledge from a large "teacher" model to a smaller "student" model.

How It Works:

1. The teacher model generates outputs for a training set.
2. The student is trained to match the teacher's outputs.
3. The student learns from soft labels (the teacher's full probability distribution), not just hard answers (see the sketch below).
4. Result: a smaller model with similar capabilities.
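A minimal sketch of the soft-label (response) distillation step described above, written in PyTorch. The function name, temperature, and loss weighting here are illustrative assumptions, not any particular lab's recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend the soft-label KL term (teacher) with hard-label cross-entropy."""
    # Soften both distributions with the temperature, then match them with KL divergence.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    kd = kd * temperature ** 2  # standard correction for the softened gradients

    # Ordinary cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Usage: the teacher is frozen and run in inference mode; only the student gets gradients.
# teacher.eval()
# with torch.no_grad():
#     teacher_logits = teacher(inputs)
# student_logits = student(inputs)
# loss = distillation_loss(student_logits, teacher_logits, labels)
# loss.backward()
```

The temperature flattens the teacher's distribution so the student also sees the relative probabilities of wrong answers, which carries more signal than a hard label alone.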

Why Distillation:

- Smaller models are faster and cheaper to run
- They can be deployed on edge devices
- Serving costs drop
- Quality stays close to the teacher's

Types:

- Response Distillation: match the teacher's final outputs
- Feature Distillation: match intermediate representations (sketched after this list)
- Attention Distillation: match attention patterns
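For the second and third types, the quantities being matched are internal tensors rather than final outputs. Below is a minimal sketch of feature distillation, assuming a frozen teacher and a small linear projection to bridge the two models' hidden sizes; the class name and layer choice are illustrative.

```python
import torch
import torch.nn as nn

class FeatureDistiller(nn.Module):
    """MSE loss between student and teacher hidden states at a chosen layer."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Project student features into the teacher's hidden size.
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.mse = nn.MSELoss()

    def forward(self, student_hidden: torch.Tensor,
                teacher_hidden: torch.Tensor) -> torch.Tensor:
        # student_hidden: (batch, seq, student_dim)
        # teacher_hidden: (batch, seq, teacher_dim), detached so no gradients
        # flow back into the frozen teacher.
        return self.mse(self.proj(student_hidden), teacher_hidden.detach())

# Attention distillation follows the same pattern, except the tensors being
# matched are attention maps rather than hidden states.
```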

Real Examples:

- GPT-4 → GPT-4o mini
- Claude 3 Opus → Claude 3 Haiku
- Gemini Pro → Gemini Flash

Examples

OpenAI reportedly used distillation to build GPT-4o mini, matching much of GPT-4's quality at a fraction of the cost.
