Definition
Vision-Language Models (VLMs) combine visual understanding with language capabilities, enabling AI systems to see, interpret, and discuss images.
- **Architecture Approaches:**
  - Encoder-Decoder: Separate vision and language components, with encoded visual features passed to a language decoder
  - Unified: A single model trained on both modalities
  - Adapter-based: A lightweight projection layer connects a pretrained vision encoder to an LLM
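The adapter-based approach can be sketched at the tensor-shape level. This is a minimal illustration, not any specific model's implementation: the dimensions (768-d vision features, 4096-d LLM embeddings, 256 patches) and the single linear projection are assumptions chosen for clarity, though open-source models like LLaVA use a similar projection connector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (illustrative, not from any specific model):
NUM_PATCHES = 256   # patch tokens produced by the vision encoder
VISION_DIM = 768    # vision encoder feature size
LLM_DIM = 4096      # LLM token-embedding size

# Stand-in for the output of a frozen vision encoder: one feature per patch.
patch_features = rng.normal(size=(NUM_PATCHES, VISION_DIM))

# The adapter: a learned linear projection from vision space to LLM space.
W_proj = rng.normal(size=(VISION_DIM, LLM_DIM)) * 0.02

def project_to_llm_space(features: np.ndarray) -> np.ndarray:
    """Map vision features into the LLM's token-embedding space."""
    return features @ W_proj

visual_tokens = project_to_llm_space(patch_features)

# Stand-in text prompt embeddings; the LLM consumes the concatenation,
# attending over visual and text tokens alike.
text_tokens = rng.normal(size=(12, LLM_DIM))
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)

print(llm_input.shape)  # (268, 4096): visual tokens followed by text tokens
```

The key design point is that only the small projection needs training to "teach" an existing LLM to accept visual input; the vision encoder and LLM weights can stay largely untouched.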
**Capabilities:**
- Image description and captioning
- Visual question answering
- Document understanding (OCR + reasoning)
- Chart and diagram analysis
- Multi-image reasoning
**Leading Models:**
- GPT-4V / GPT-4o (OpenAI)
- Claude 3 (Anthropic)
- Gemini Pro Vision (Google)
- LLaVA (open source)
- Qwen-VL (Alibaba)
Examples
Uploading a receipt photo and asking Claude to extract all line items into a spreadsheet format.
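A receipt-extraction request like this is typically sent as a multimodal message: the image goes in as a base64-encoded content block alongside a text instruction. The sketch below builds such a payload as a plain dictionary following the shape of Anthropic's Messages API; the model name is an illustrative placeholder, and in practice you would hand the payload to an API client rather than print it.

```python
import base64
import json

def build_receipt_request(image_bytes: bytes, media_type: str = "image/jpeg") -> dict:
    """Build a multimodal request: image content block + extraction instruction.

    Payload shape follows Anthropic's Messages API; the model name is
    a placeholder, not a recommendation.
    """
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": "claude-3-5-sonnet-latest",  # placeholder model name
        "max_tokens": 1024,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": encoded,
                        },
                    },
                    {
                        "type": "text",
                        "text": "Extract all line items from this receipt "
                                "as CSV with columns: item, quantity, price.",
                    },
                ],
            }
        ],
    }

# Placeholder bytes stand in for a real receipt photo.
payload = build_receipt_request(b"fake-jpeg-bytes")
print(json.dumps(payload["messages"][0]["content"][0]["type"]))
```

Putting the image block before the text instruction keeps the request readable; the model sees both regardless of order.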
Related Terms
- **Vision Transformer (ViT):** Applying transformer architecture to image recognition by treating image patches as tokens.
- **Multimodal AI:** AI systems that can understand and generate multiple types of content like text, images, audio, and video.
- **CLIP:** OpenAI model connecting images and text in a shared embedding space.
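CLIP's shared embedding space can be illustrated with toy vectors: image and text embeddings are L2-normalized, and cosine similarity selects the best caption for each image. The 3-d vectors below are hand-made stand-ins, not real CLIP outputs (which are typically 512-d or larger).

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize rows so dot products become cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy stand-ins for CLIP embeddings of two images and two captions.
image_embeddings = normalize(np.array([
    [0.9, 0.1, 0.0],   # image of a dog
    [0.0, 0.2, 0.95],  # image of a car
]))
text_embeddings = normalize(np.array([
    [1.0, 0.0, 0.1],   # caption: "a dog"
    [0.1, 0.1, 1.0],   # caption: "a car"
]))

# Similarity matrix: rows = images, columns = captions.
similarity = image_embeddings @ text_embeddings.T

# For each image, the highest-similarity caption is the match.
best_caption = similarity.argmax(axis=1)
print(best_caption)  # [0 1]: image 0 -> "a dog", image 1 -> "a car"
```

This image-to-caption matching in a single vector space is what makes CLIP useful for zero-shot classification and as the vision encoder in many VLMs.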