Definition
Model distillation transfers knowledge from a large "teacher" model to a smaller "student" model.
How It Works:
1. The teacher model generates outputs for a set of training inputs.
2. The student is trained to match the teacher's outputs.
3. The student learns from soft labels (the teacher's full probability distributions), not just hard class labels, as sketched in the example below.
4. The result is a smaller model with similar capabilities.
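A minimal sketch of the soft-label training step, assuming PyTorch and a classification-style setup. The function name, temperature `T`, mixing weight `alpha`, and the shapes in the usage comment are illustrative assumptions, not details from any specific lab's pipeline.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-label loss against the teacher with hard-label cross-entropy."""
    # Soften both distributions with temperature T, then match them via KL divergence.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable to the hard-label term
    # Standard cross-entropy on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Usage (shapes are illustrative): a batch of 8 examples over 10 classes.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)            # from the frozen teacher, no gradients
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

The temperature flattens both distributions so the student also learns the teacher's relative rankings among wrong answers, which is where much of the transferred knowledge lives.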
Why Distillation:
- Smaller models are faster and cheaper to run
- They can be deployed on edge devices
- Serving costs drop
- Quality stays close to the teacher's
Types:
- Response Distillation: Match final outputs
- Feature Distillation: Match intermediate representations
- Attention Distillation: Match attention patterns
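To make the contrast with response distillation concrete, here is a hedged PyTorch sketch of a feature-distillation term: a learned linear projection maps the student's hidden states onto the teacher's width before an MSE loss. The class name, layer choice, and dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistiller(nn.Module):
    """Match a student hidden layer to a teacher hidden layer via a learned projection."""

    def __init__(self, student_dim=512, teacher_dim=1024):
        super().__init__()
        # Project the narrower student representation into the teacher's space.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        # MSE between projected student features and the frozen teacher's features.
        return F.mse_loss(self.proj(student_hidden), teacher_hidden.detach())

# Usage (shapes are illustrative): batch of 8 token representations.
distiller = FeatureDistiller()
feat_loss = distiller(torch.randn(8, 512), torch.randn(8, 1024))
```

Attention distillation follows the same pattern, but the quantities being matched are attention maps rather than hidden states.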
Real Examples:
- GPT-4 → GPT-4o mini
- Claude 3 Opus → Claude 3 Haiku
- Gemini Pro → Gemini Flash
Examples
OpenAI trained GPT-4o mini to approach GPT-4 quality at a fraction of the cost.