Definition
Model evaluation measures how well AI models perform against defined tasks, benchmarks, and quality criteria.
- **Key Metrics:** (accuracy, F1, and perplexity are sketched in code after this list)
- Accuracy: correct predictions divided by total predictions
- Perplexity: how uncertain a language model is about the next token (lower is better)
- BLEU/ROUGE: n-gram overlap scores between generated text and reference text
- F1 Score: harmonic mean of precision and recall
- Human preference: user studies and Elo-style ratings from head-to-head comparisons
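A minimal sketch of how three of these metrics are computed. The function names and the small prediction and log-probability lists are purely illustrative, not taken from any particular library:

```python
import math

def accuracy(preds, labels):
    # Fraction of predictions that exactly match the labels.
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def f1_score(preds, labels, positive=1):
    # Harmonic mean of precision and recall for the positive class.
    tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
    fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
    fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def perplexity(token_log_probs):
    # exp of the average negative log-likelihood per token;
    # lower values mean the model is less "surprised" by the text.
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical predictions and per-token log-probabilities for illustration.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))    # 0.75
print(f1_score([1, 0, 1, 1], [1, 0, 0, 1]))    # 0.8
print(perplexity([-2.3, -0.7, -1.1, -3.0]))    # ~5.9
```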
- **Popular Benchmarks:**
- MMLU (knowledge)
- HumanEval (coding)
- MT-Bench (conversation)
- BigBench (diverse tasks)
- HELM (holistic evaluation)
- **Evaluation Challenges:**
- Benchmark contamination: test items leaking into training data (a simple overlap check is sketched below)
- Gaming metrics: optimizing for the score rather than the underlying capability
- Real-world performance gap: strong benchmark scores that do not transfer to deployment
- Emergent capabilities: abilities existing benchmarks were not designed to measure
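Contamination is often screened with a rough n-gram overlap check between benchmark items and the training corpus. The sketch below shows that heuristic; the n-gram length and the flag-on-any-overlap rule are assumptions chosen for illustration, not a standard procedure:

```python
def ngrams(text, n=8):
    # Set of overlapping n-word windows from a text.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items, training_corpus, n=8):
    # Fraction of benchmark items that share at least one n-gram with the
    # training corpus -- a rough proxy for test data leaking into training data.
    corpus_ngrams = set()
    for doc in training_corpus:
        corpus_ngrams |= ngrams(doc, n)
    flagged = sum(bool(ngrams(item, n) & corpus_ngrams) for item in benchmark_items)
    return flagged / len(benchmark_items)
```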
- **Best Practices:**
- Multiple benchmarks rather than a single headline number (see the aggregation sketch below)
- Human evaluation alongside automated metrics
- Task-specific metrics matched to the intended use case
- Regular re-evaluation as models and benchmarks change
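One simple way to follow the multiple-benchmarks practice is to keep per-benchmark scores and only then compute an overall average, so weaknesses on one suite stay visible. The report structure and example scores below are hypothetical:

```python
from statistics import mean

def evaluation_report(results):
    # results maps benchmark name -> list of per-item scores (0/1 or graded).
    # Reporting each benchmark separately avoids hiding weaknesses behind one average.
    per_benchmark = {name: mean(scores) for name, scores in results.items()}
    per_benchmark["macro_average"] = mean(per_benchmark.values())
    return per_benchmark

# Hypothetical per-item scores on three suites.
print(evaluation_report({
    "mmlu": [1, 0, 1, 1],
    "humaneval": [1, 1, 0, 0],
    "mt_bench": [0.8, 0.6, 0.9, 0.7],
}))
```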
Examples
Claude 3 Opus scored 86.8% on the MMLU knowledge benchmark.