How Vision-Language Models Actually 'See': Inside the Architecture

When you upload an image to GPT-4o or Claude and ask about it, the model isn't running a separate vision system. The image gets converted into tokens that flow through the same transformer that processes text. Understanding this unified architecture clarifies why VLMs work and where they still struggle.

Modern vision-language models like GPT-4o, Claude, and Gemini handle images through a unified architecture rather than bolting a vision model onto a language model. The pipeline starts with a vision encoder, typically a Vision Transformer (ViT) that splits the input image into a grid of patches (commonly 14×14 pixels each) and converts each patch into a vector embedding. For a typical image this produces a few hundred to a few thousand visual tokens. The visual tokens then pass through a projection layer that maps them into the same vector space as text tokens. The combined sequence of visual and text tokens flows through the language model's transformer, where attention can reason across modalities: a text token can attend to relevant visual tokens and vice versa. (A code sketch below walks through this input pipeline.)

The training process matters as much as the architecture. Early multimodal models were trained contrastively on image-text pairs scraped from the web, producing systems like CLIP that excel at matching images to captions but reason poorly. Modern VLMs use richer training: visual instruction tuning with detailed image descriptions, document understanding tasks, chart and diagram interpretation, and synthetic data generated by stronger models.

Architectures also differ in how the two modalities meet. Some use cross-attention layers (Flamingo, IDEFICS), keeping visual tokens in a separate stream that the language model attends to; others concatenate visual tokens directly into the input sequence alongside the text (LLaVA, Claude).

Despite impressive capabilities, VLMs still struggle in predictable ways: counting objects in cluttered scenes, reading small text, understanding 3D spatial relationships, and following multi-step visual instructions like 'find the third item from the left in the second row'. Patch-based tokenization loses fine spatial detail that humans use intuitively. Better visual tokenization, mixture-of-resolution approaches, and explicit spatial reasoning are the active research frontier in 2026.
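To make the input pipeline concrete, here is a minimal PyTorch sketch of the concatenation route: patchify the image, embed each patch, project the result into the text embedding space, and join it with the text-token embeddings into one sequence. Every name and size here (14-pixel patches, a 1024-dimensional vision encoder stood in for by a single linear layer, a 4096-dimensional language model, a 32,000-token vocabulary) is an illustrative assumption, not any production model's code.

```python
import torch
import torch.nn as nn

class ToyVLMInput(nn.Module):
    """Builds one combined visual+text token sequence (hypothetical sizes)."""

    def __init__(self, patch_size=14, vision_dim=1024, text_dim=4096, vocab_size=32000):
        super().__init__()
        self.patch_size = patch_size
        # Stand-in for the ViT: a single linear layer over flattened patch pixels.
        self.patch_embed = nn.Linear(3 * patch_size * patch_size, vision_dim)
        # The projection layer that maps visual tokens into the text embedding space.
        self.projector = nn.Linear(vision_dim, text_dim)
        # Ordinary text-token embedding table, as in any decoder-only LM.
        self.text_embed = nn.Embedding(vocab_size, text_dim)

    def forward(self, image, text_ids):
        # image: (3, H, W) with H and W divisible by patch_size; text_ids: (T,)
        p = self.patch_size
        # Cut the image into an (H/p) x (W/p) grid of p x p patches.
        patches = image.unfold(1, p, p).unfold(2, p, p)             # (3, H/p, W/p, p, p)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * p * p)
        visual_tokens = self.projector(self.patch_embed(patches))   # (num_patches, text_dim)
        text_tokens = self.text_embed(text_ids)                     # (T, text_dim)
        # One sequence: the transformer's self-attention now mixes both modalities.
        return torch.cat([visual_tokens, text_tokens], dim=0)

model = ToyVLMInput()
image = torch.rand(3, 336, 336)            # 336 / 14 = 24 patches per side
text_ids = torch.randint(0, 32000, (12,))  # a short tokenized text prompt
sequence = model(image, text_ids)
print(sequence.shape)                      # torch.Size([588, 4096]): 576 visual + 12 text tokens
```

A real model replaces the linear patch embedding with a full ViT and often adds resampling or pooling, but the shape bookkeeping is the same: a 336×336 input at 14-pixel patches contributes 576 visual tokens to the sequence, which is why higher-resolution or tiled images quickly push token counts into the thousands.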
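The cross-attention alternative mentioned above (Flamingo, IDEFICS) can be sketched in a few lines as well: visual tokens never enter the language model's input sequence; instead, inserted cross-attention layers let text positions query them. The zero-initialised tanh gate follows Flamingo's published design, but the sizes and names here are illustrative assumptions, not the model's implementation.

```python
import torch
import torch.nn as nn

d_model = 512  # illustrative hidden size shared by both streams
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

text_hidden = torch.rand(1, 12, d_model)     # 12 text positions act as queries
visual_tokens = torch.rand(1, 576, d_model)  # 576 visual tokens act as keys/values

# Text queries attend to visual keys/values; the visual stream stays separate
# and is never concatenated into the language model's input sequence.
attended, _ = cross_attn(query=text_hidden, key=visual_tokens, value=visual_tokens)

# Flamingo gates this output with a tanh gate initialised at zero before adding
# it back to the text stream, so training starts from the frozen language model.
gate = nn.Parameter(torch.zeros(1))
text_hidden = text_hidden + torch.tanh(gate) * attended
print(text_hidden.shape)  # torch.Size([1, 12, 512])
```

Because the gate starts at zero, the pretrained language model's behaviour is initially unchanged and visual information is blended in gradually as training proceeds.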