Definition
Tokens are the fundamental units that language models use to process text. Tokenization breaks text into these smaller pieces before processing.
**Tokenization Examples:**
- "Hello world" → ["Hello", " world"] (2 tokens)
- "unhappiness" → ["un", "happiness"] or ["unhapp", "iness"]
- Spaces, punctuation, and special characters are often separate tokens
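A quick way to see these splits for yourself is with a tokenizer library. The sketch below uses OpenAI's tiktoken package and its cl100k_base encoding as an assumption; other tokenizers will split the same strings differently.

```python
# Minimal sketch using the tiktoken library (an assumption; any BPE tokenizer
# exposes a similar encode/decode interface). Exact splits vary by tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

for text in ["Hello world", "unhappiness"]:
    token_ids = enc.encode(text)                   # text -> list of integer token IDs
    pieces = [enc.decode([t]) for t in token_ids]  # decode each ID back to its text piece
    print(f"{text!r} -> {len(token_ids)} tokens: {pieces}")
```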
**Why Tokens Matter:**
- Context Windows: Models have fixed token limits (e.g., 128K tokens)
- Pricing: API costs are billed per token (see the cost sketch below)
- Performance: Longer inputs mean slower processing
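Because billing is per token, the dollar cost of a call is just token counts multiplied by per-token prices. The sketch below illustrates that arithmetic; the prices are placeholder assumptions, not any provider's actual rates.

```python
# Hypothetical per-token prices (placeholder assumptions, not real rates).
PRICE_PER_INPUT_TOKEN = 2.50 / 1_000_000    # assumed $2.50 per 1M input tokens
PRICE_PER_OUTPUT_TOKEN = 10.00 / 1_000_000  # assumed $10.00 per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one API call from its token counts."""
    return (input_tokens * PRICE_PER_INPUT_TOKEN
            + output_tokens * PRICE_PER_OUTPUT_TOKEN)

# e.g. a 50K-token prompt with a 2K-token reply
print(f"${estimate_cost(input_tokens=50_000, output_tokens=2_000):.4f}")
```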
**Rules of Thumb:**
- ~4 characters per token (English)
- ~3/4 of a word per token (100 tokens ≈ 75 words)
- 1 page ≈ 500-600 tokens
- Non-English languages often use more tokens for the same text
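These rules of thumb translate directly into quick back-of-the-envelope estimators. The helpers below are rough approximations for English text only; for exact counts you would use the model's actual tokenizer.

```python
def estimate_tokens_from_chars(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return round(len(text) / 4)

def estimate_tokens_from_words(text: str) -> int:
    """Rough estimate: ~3/4 of a word per token (100 tokens ≈ 75 words)."""
    return round(len(text.split()) / 0.75)

sample = "Tokens are the fundamental units that language models use to process text."
print(estimate_tokens_from_chars(sample))  # character-based estimate
print(estimate_tokens_from_words(sample))  # word-based estimate
```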
Examples
GPT-4 Turbo has a 128K-token context window, enough for roughly 250-300 pages of text.
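A common practical question is whether a document fits in a model's context window. The sketch below counts exact tokens with tiktoken (an assumption) and compares against an assumed 128K limit; real limits vary by model and must also leave room for the response.

```python
import tiktoken

CONTEXT_WINDOW = 128_000  # assumed limit; varies by model

def fits_in_context(document: str) -> bool:
    """Return True if the document's token count is within the assumed window."""
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(document))
    print(f"{n_tokens:,} tokens out of {CONTEXT_WINDOW:,}")
    return n_tokens <= CONTEXT_WINDOW

print(fits_in_context("A short example document. " * 1_000))
```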