
Vision-Language Model (VLM)

AI models that can process and reason about both images and text together.


Definition

Vision-Language Models combine visual understanding with language capabilities, enabling AI to see and discuss images.

**Architecture Approaches** (see the sketch after this list):

  • Encoder-Decoder: separate vision and language components
  • Unified: a single model handles both modalities
  • Adapter-based: a pretrained vision encoder connected to an LLM via a projection layer

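To make the adapter-based approach concrete, here is a minimal PyTorch sketch: a frozen vision encoder produces patch features, a small projection layer maps them into the LLM's embedding space, and the projected "image tokens" are prepended to the text embeddings. The class, dimensions, and the Hugging Face-style LLM interface are illustrative assumptions, not any particular model's implementation.

```python
import torch
import torch.nn as nn

class AdapterVLM(nn.Module):
    """Minimal adapter-based VLM sketch (illustrative, not a real model)."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a CLIP-style ViT, kept frozen
        self.llm = llm                        # a decoder-only language model
        # The "adapter": a small projection bridging the two embedding spaces
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, pixel_values, input_ids):
        with torch.no_grad():  # vision encoder stays frozen
            patch_feats = self.vision_encoder(pixel_values)    # (B, N_patches, vision_dim)
        image_tokens = self.projector(patch_feats)              # (B, N_patches, llm_dim)
        # Assumes a Hugging Face-style LLM exposing get_input_embeddings()
        text_embeds = self.llm.get_input_embeddings()(input_ids)
        # Prepend projected image tokens so the LLM attends to them as context
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

In practice only the projector (and sometimes the LLM) is trained on image-text pairs, which is what makes this approach cheap compared with training a unified model from scratch.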
**Capabilities** (an inference sketch follows this list):

  • Image description and captioning
  • Visual question answering
  • Document understanding (OCR + reasoning)
  • Chart and diagram analysis
  • Multi-image reasoning
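As a usage sketch for visual question answering, the snippet below runs an open-source LLaVA checkpoint through the Hugging Face transformers library. The model id, file name, and prompt template are assumptions based on the llava-hf releases and may differ for other checkpoints.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint; any LLaVA-style model works
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chart.png")  # placeholder path
# LLaVA-1.5 expects an <image> placeholder token in the prompt
prompt = "USER: <image>\nWhat trend does this chart show? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```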

**Leading Models:**

  • GPT-4V / GPT-4o (OpenAI)
  • Claude 3 (Anthropic)
  • Gemini Pro Vision (Google)
  • LLaVA (open source)
  • Qwen-VL

Examples

Uploading a receipt photo and asking Claude to extract all line items into a spreadsheet format.
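A minimal sketch of how that receipt example might look with the Anthropic Python SDK; the file name and model string are placeholder assumptions, and the prompt asks for CSV rather than a true spreadsheet file.

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Encode the receipt photo as base64 for the Messages API image block
with open("receipt.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed model id; use any vision-capable Claude model
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64",
                                         "media_type": "image/jpeg",
                                         "data": image_data}},
            {"type": "text", "text": "Extract every line item from this receipt "
                                     "as CSV with columns: item, quantity, price."},
        ],
    }],
)
print(message.content[0].text)
```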
