
Token

The basic unit of text that language models process; one token is roughly 3/4 of an English word.


Definition

Tokens are the fundamental units that language models use to represent text. Before a model processes any input, a tokenizer splits the text into these smaller pieces and maps each one to an integer ID from a fixed vocabulary.

Tokenization Examples:

- "Hello world" → ["Hello", " world"] (2 tokens)
- "unhappiness" → ["un", "happiness"] or ["unhapp", "iness"], depending on the tokenizer
- Spaces, punctuation, and special characters are often separate tokens
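A minimal sketch of tokenization in practice, using OpenAI's tiktoken library as one concrete example (other models ship different tokenizers with different vocabularies, so the exact splits will vary):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello world", "unhappiness"]:
    ids = enc.encode(text)                   # list of integer token IDs
    pieces = [enc.decode([i]) for i in ids]  # decode each ID back to its text
    print(f"{text!r} -> {pieces} ({len(ids)} tokens)")
```

Running this shows, for example, that "Hello world" splits into ["Hello", " world"], where the leading space belongs to the second token; this is why spaces and punctuation count toward token totals.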

**Why Tokens Matter:**

- Context Windows: Models have token limits (e.g., 128K tokens)
- Pricing: API usage is billed per token (see the sketch after this list)
- Performance: Longer inputs take longer to process
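The same arithmetic shows up when budgeting a prompt against a context window and estimating its cost. This is a hedged sketch: CONTEXT_LIMIT and PRICE_PER_1K below are illustrative placeholders, not real model limits or API rates.

```python
# Illustrative placeholders: these are NOT real limits or prices.
CONTEXT_LIMIT = 128_000  # max tokens the model accepts (prompt + output)
PRICE_PER_1K = 0.01      # hypothetical dollars per 1,000 input tokens

def check_prompt_budget(prompt_tokens: int, reserved_output: int = 1_000) -> None:
    """Check whether a prompt fits the context window and estimate its cost."""
    if prompt_tokens + reserved_output > CONTEXT_LIMIT:
        print("Too long: truncate, summarize, or split the prompt.")
    cost = prompt_tokens / 1_000 * PRICE_PER_1K
    print(f"{prompt_tokens:,} tokens ~= ${cost:.2f} at the assumed rate")

check_prompt_budget(90_000)   # fits, with room for a 1,000-token reply
check_prompt_budget(130_000)  # exceeds the assumed 128K limit
```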

Rules of Thumb:

- ~4 characters per token (English)
- ~3/4 of a word per token
- 1 page ≈ 500-600 tokens
- Non-English languages often use more tokens
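When a real tokenizer isn't handy, these rules of thumb translate directly into a rough estimator. The sketch below assumes English text and the ratios listed above; non-English text often needs more tokens per word, so treat the output as a ballpark, not a count.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count via the ~4 characters/token rule (English text)."""
    return max(1, round(len(text) / chars_per_token))

def tokens_for_pages(pages: float, tokens_per_page: int = 550) -> int:
    """Rough token count via the 1 page ~= 500-600 tokens rule (550 midpoint)."""
    return round(pages * tokens_per_page)

print(estimate_tokens("Hello world, this is a short sentence."))  # ~10
print(tokens_for_pages(300))                                      # ~165,000
```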

Examples

GPT-4 Turbo has a 128K-token context window, enough for roughly 200-300 pages of text by the rules of thumb above.
