Definition
Vision-Language Models (VLMs) combine visual understanding with language capabilities, enabling AI systems to see, interpret, and discuss images.
- **Architecture Approaches:**
  - Encoder-Decoder: Separate vision and language components, with encoded visual features passed to a language decoder
  - Unified: A single model trained on both modalities
  - Adapter-based: A lightweight projection layer connects a pretrained vision encoder to an LLM
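The adapter-based approach can be sketched at the tensor-shape level. This is a minimal illustration, not any specific model's implementation: the dimensions (768-d vision features, 4096-d LLM embeddings, 256 patches) and the single linear projection are assumptions chosen for clarity, though open-source models like LLaVA use a similar projection connector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (illustrative, not from any specific model):
NUM_PATCHES = 256   # patch tokens produced by the vision encoder
VISION_DIM = 768    # vision encoder feature size
LLM_DIM = 4096      # LLM token-embedding size

# Stand-in for the output of a frozen vision encoder: one feature per patch.
patch_features = rng.normal(size=(NUM_PATCHES, VISION_DIM))

# The adapter: a learned linear projection from vision space to LLM space.
W_proj = rng.normal(size=(VISION_DIM, LLM_DIM)) * 0.02

def project_to_llm_space(features: np.ndarray) -> np.ndarray:
    """Map vision features into the LLM's token-embedding space."""
    return features @ W_proj

visual_tokens = project_to_llm_space(patch_features)

# Stand-in text prompt embeddings; the LLM consumes the concatenation,
# attending over visual and text tokens alike.
text_tokens = rng.normal(size=(12, LLM_DIM))
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)

print(llm_input.shape)  # (268, 4096): visual tokens followed by text tokens
```

The key design point is that only the small projection needs training to "teach" an existing LLM to accept visual input; the vision encoder and LLM weights can stay largely untouched.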
**Capabilities:**
- Image description and captioning
- Visual question answering
- Document understanding (OCR + reasoning)
- Chart and diagram analysis
- Multi-image reasoning
**Leading Models:**
- GPT-4V / GPT-4o (OpenAI)
- Claude 3 (Anthropic)
- Gemini Pro Vision (Google)
- LLaVA (open source)
- Qwen-VL (Alibaba)
Examples
Uploading a receipt photo and asking Claude to extract all line items into a spreadsheet format.
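A receipt-extraction request like this is typically sent as a multimodal message: the image goes in as a base64-encoded content block alongside a text instruction. The sketch below builds such a payload as a plain dictionary following the shape of Anthropic's Messages API; the model name is an illustrative placeholder, and in practice you would hand the payload to an API client rather than print it.

```python
import base64
import json

def build_receipt_request(image_bytes: bytes, media_type: str = "image/jpeg") -> dict:
    """Build a multimodal request: image content block + extraction instruction.

    Payload shape follows Anthropic's Messages API; the model name is
    a placeholder, not a recommendation.
    """
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": "claude-3-5-sonnet-latest",  # placeholder model name
        "max_tokens": 1024,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": encoded,
                        },
                    },
                    {
                        "type": "text",
                        "text": "Extract all line items from this receipt "
                                "as CSV with columns: item, quantity, price.",
                    },
                ],
            }
        ],
    }

# Placeholder bytes stand in for a real receipt photo.
payload = build_receipt_request(b"fake-jpeg-bytes")
print(json.dumps(payload["messages"][0]["content"][0]["type"]))
```

Putting the image block before the text instruction keeps the request readable; the model sees both regardless of order.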
Related Terms
- **Vision Transformer (ViT):** Applying transformer architecture to image recognition by treating image patches as tokens.
- **Multimodal AI:** AI systems that can understand and generate multiple types of content like text, images, audio, and video.
- **CLIP:** OpenAI model connecting images and text in a shared embedding space.
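CLIP's shared embedding space can be illustrated with toy vectors: image and text embeddings are L2-normalized, and cosine similarity selects the best caption for each image. The 3-d vectors below are hand-made stand-ins, not real CLIP outputs (which are typically 512-d or larger).

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize rows so dot products become cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy stand-ins for CLIP embeddings of two images and two captions.
image_embeddings = normalize(np.array([
    [0.9, 0.1, 0.0],   # image of a dog
    [0.0, 0.2, 0.95],  # image of a car
]))
text_embeddings = normalize(np.array([
    [1.0, 0.0, 0.1],   # caption: "a dog"
    [0.1, 0.1, 1.0],   # caption: "a car"
]))

# Similarity matrix: rows = images, columns = captions.
similarity = image_embeddings @ text_embeddings.T

# For each image, the highest-similarity caption is the match.
best_caption = similarity.argmax(axis=1)
print(best_caption)  # [0 1]: image 0 -> "a dog", image 1 -> "a car"
```

This image-to-caption matching in a single vector space is what makes CLIP useful for zero-shot classification and as the vision encoder in many VLMs.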