Chapter 5: Beyond Text: Multimodal LLMs
5.1 Text+Image Models (LLaVA, Flamingo, GPT-4o, DeepSeek-VL)
So far, we have focused on models that live in the world of words. But human intelligence is multimodal: we learn by reading, seeing, hearing, and interacting with the world. For AI to approach this kind of understanding, language models must also expand beyond text.
This limitation of text-only models becomes evident when we consider how humans perceive and process information. We don't experience the world as isolated streams of text—we integrate visual cues, sounds, and physical interactions to form a comprehensive understanding. Traditional LLMs, despite their impressive capabilities with language, lack this holistic perception that comes naturally to humans.
This is where multimodal LLMs come in. By combining text with images, audio, or video, these models can:
- Describe what they "see" in pictures, recognizing objects, scenes, actions, and even emotional context within visual content.
- Answer questions about charts or diagrams, interpreting visual data representations and translating visual patterns into meaningful insights.
- Connect written descriptions to visual understanding, bridging the gap between abstract concepts described in words and their concrete visual manifestations.
- Support real-world tasks like tutoring, accessibility tools, and robotics, where understanding multiple forms of communication is essential for effective assistance.
Multimodal systems represent a significant leap forward in AI capabilities. Rather than processing each type of data in isolation, these models create connections between different forms of information, much like the human brain integrates signals from our various senses. This cross-modal reasoning allows for richer understanding and more natural interactions with AI systems.
In this chapter, we'll explore how researchers are pushing LLMs beyond text, starting with one of the most active areas: Text+Image models.
Text+Image models extend language models by integrating visual encoders with text-based transformers. This integration represents a significant advancement in AI, allowing models to process and understand both visual and textual information simultaneously. In practice, this integration involves several key components working together:
- An image encoder (like CLIP's vision transformer or a convolutional net) processes an image into embeddings. This encoder analyzes the visual content pixel by pixel, identifying features such as shapes, colors, objects, spatial relationships, and even contextual elements. The encoder works through multiple processing layers, each extracting increasingly complex information:
- Low-level features: First, the encoder detects basic elements like edges, textures, and color patterns across the image. This initial layer of processing works similarly to how our eyes first perceive visual information - identifying contrasts between light and dark, detecting boundaries between colors, and registering texture variations (like smooth vs. rough surfaces).
This stage is computationally intensive as the model must analyze every pixel and its relationship to neighboring pixels. For example, when processing a photograph of a forest, the encoder might identify:
- Vertical lines representing tree trunks
- Irregular patterns of green representing foliage
- Textural differences between rough bark and smooth leaves
- Shadow gradients indicating depth and lighting direction
- Color transitions between sky and terrain
The encoder uses specialized filters that respond to specific patterns - some detect horizontal lines, others vertical lines, while others identify specific color gradients or textural elements. These filters work in parallel across the entire image, creating feature maps that highlight where each pattern appears most strongly.
These fundamental visual elements form the building blocks for all higher-level recognition, much like how letters combine to form words and sentences in language processing. Without accurate detection at this stage, the more complex recognition tasks in subsequent layers would fail.
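To make the idea of low-level filters concrete, here is a minimal, self-contained sketch (not taken from any production vision encoder) that applies a single hand-crafted vertical-edge kernel to a toy image and prints the resulting feature map:
# Minimal sketch: one hand-crafted vertical-edge filter producing a feature map
import torch
import torch.nn.functional as F

# A tiny grayscale "image": dark on the left half, bright on the right (one vertical edge)
image = torch.zeros(1, 1, 8, 8)
image[:, :, :, 4:] = 1.0

# A 3x3 Sobel-style kernel that responds strongly to vertical edges
kernel = torch.tensor([[-1.0, 0.0, 1.0],
                       [-2.0, 0.0, 2.0],
                       [-1.0, 0.0, 1.0]]).view(1, 1, 3, 3)

feature_map = F.conv2d(image, kernel, padding=1)
print(feature_map[0, 0])  # large values only in the columns where the edge sits
Learned encoders apply thousands of such filters in parallel, and in vision transformers the analogous low-level mixing happens in the earliest patch-embedding and attention layers.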
- Mid-level features: These basic elements are then combined to recognize more complex structures such as specific shapes, object parts, and spatial arrangements. At this stage, the model begins to identify meaningful patterns - recognizing that certain edges form the outline of a face, or that particular textures likely represent fur, fabric, or foliage.
This mid-level processing is crucial because it bridges the gap between raw visual data and semantic understanding. For example, when processing an image of a person walking a dog in a park:
- The model might recognize curved lines and color patterns that form the silhouette of a human figure
- It identifies four-legged shapes with characteristic proportions that indicate "dog"
- It detects textural patterns of grass, trees, and sky that suggest "outdoor environment"
- It recognizes spatial configurations that establish the relationship between person and dog (connected by a leash)
The model also starts to understand spatial relationships, determining when objects are above, below, or inside others. These spatial relationships provide critical context - a cup on a table has different implications than a table on a cup. The model learns to recognize standard spatial arrangements (like furniture in a room) and unusual configurations that might require special attention.
- High-level features: Finally, the encoder identifies complete objects, scenes, actions, and the relationships between elements in the image. This is where true "understanding" emerges, as the model recognizes not just isolated objects but meaningful context - distinguishing between a dog sitting on a sofa versus running through a park, or understanding that a person holding a tennis racket near a net represents a specific activity.
At this highest level of processing, the model performs several sophisticated cognitive tasks:
- Object recognition and classification: The model can identify whole entities (people, animals, vehicles, furniture) and categorize them into specific types or classes (German Shepherd dog, mid-century sofa, professional tennis player).
- Scene understanding: Beyond individual objects, the model comprehends entire environments - recognizing a kitchen from its appliances and layout, or a beach scene from the combination of sand, water, and distinctive lighting.
- Action recognition: The model can interpret dynamic elements - differentiating between someone running versus walking, or throwing versus catching - based on posture, positioning, and contextual cues.
- Relationship detection: Perhaps most impressively, the model identifies how objects relate to each other spatially and functionally - recognizing that a person is walking a dog (connected by a leash), riding a bicycle (positioned on top), or cooking food (performing actions on ingredients).
- Contextual inference: The model makes educated guesses about the broader situation - inferring a birthday celebration from candles on a cake and gathering of people, or a professional meeting from business attire and a conference room setting.
The model can also interpret emotional content, social interactions, and even infer potential narratives within the scene. It might recognize facial expressions indicating happiness or concern, body language suggesting tension or relaxation, or social dynamics like a teacher instructing students or friends enjoying a meal together. Through extensive training on millions of images with corresponding descriptions, the model learns to associate visual patterns with rich semantic concepts, enabling it to "see" at a level that approximates human understanding.
The result is a dense representation of the image's content in a numerical format that the model can process - essentially translating visual information into a "language" that the AI can understand and reason with.
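As a quick illustration of what this dense representation looks like in practice, the short sketch below loads a small CLIP vision encoder from Hugging Face and prints the shape of its output embeddings; the model name and image file are just convenient examples:
# Illustrative sketch: inspecting the embeddings a CLIP vision encoder produces
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("colosseum.jpg").convert("RGB")  # any local RGB image works
pixels = processor(images=image, return_tensors="pt")["pixel_values"]

with torch.no_grad():
    outputs = encoder(pixel_values=pixels)

# One 768-dimensional embedding per image patch plus a [CLS] token,
# e.g. torch.Size([1, 50, 768]) for ViT-B/32 at 224x224 input
print(outputs.last_hidden_state.shape)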
- A projection layer maps those embeddings into the same space as the language model's tokens. This critical alignment step ensures that visual information and text information can be processed together. Without this projection, the model would struggle to make meaningful connections between what it sees and what it understands through language.
The projection layer essentially translates the "language of images" into a format compatible with the "language of text," allowing both modalities to coexist in the same computational space. This process involves several sophisticated transformations:
Dimensionality alignment: Image embeddings and text embeddings often have different dimensions and structures. The projection layer reshapes visual features to match the exact dimensions expected by the language model, ensuring that every visual concept can be represented in a way the text processing components can interpret. This process involves complex mathematical transformations that convert the high-dimensional tensors from the vision encoder (which might have shapes like [batch_size, sequence_length, vision_dimension]) into the format required by the language model (typically [batch_size, sequence_length, hidden_dimension]).
For example, a vision encoder might output features with 1024 dimensions per token, while the language model might work with 768-dimensional embeddings. The projection layer would then implement a learned linear transformation (essentially a matrix multiplication) that maps each 1024-dimensional vector to a 768-dimensional vector while preserving as much semantic information as possible.
This alignment is not just about matching numbers - it's about preserving the rich semantic relationships captured in the visual domain. The projection parameters are learned during training, allowing the model to discover optimal mappings between visual concepts and their linguistic counterparts. This ensures that when the language model attends to these projected visual features, it can extract meaningful information that corresponds to concepts it understands through language.
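A minimal sketch of such a projection, assuming the 1024-dimensional vision features and 768-dimensional language embeddings from the example above (the tensor sizes here are illustrative):
# Sketch: projecting vision tokens into the language model's embedding space
import torch
import torch.nn as nn

vision_dim, hidden_dim = 1024, 768                # assumed sizes from the example above
projection = nn.Linear(vision_dim, hidden_dim)    # parameters are learned during multimodal training

visual_tokens = torch.randn(1, 257, vision_dim)   # [batch, num_patches, vision_dim]
projected = projection(visual_tokens)             # [batch, num_patches, hidden_dim]
print(projected.shape)                            # torch.Size([1, 257, 768])
Some models (LLaVA-1.5 among them) use a small multi-layer perceptron rather than a single linear layer, but the role is the same: re-express visual features in the language model's embedding space.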
Semantic mapping: Beyond simple dimension matching, the projection layer learns to map visual concepts to their linguistic counterparts. For example, the visual features representing "a red apple" must be projected into a space where they can interact meaningfully with the text tokens for "red" and "apple."
This semantic mapping is a sophisticated translation process that bridges two fundamentally different representational systems. When processing an image of a red apple, the vision encoder extracts features capturing its roundness, smooth texture, red coloration, and stem. These visual features exist as abstract numerical patterns distributed across multiple embedding dimensions. The projection layer must transform these distributed visual patterns into representations that align with how language models understand concepts like "red" (a color attribute) and "apple" (a fruit category).
The challenge is significant because visual and linguistic representations are structured differently:
- In vision, concepts are often entangled - the "redness" and "appleness" exist simultaneously in the same pixels and are processed together.
- In language, concepts are more discrete - "red" and "apple" are separate tokens with distinct meanings that compose together.
Through extensive training on paired image-text data, the projection layer learns to disentangle these visual features and map them to their linguistic counterparts. When successful, the projected visual features will activate similar neural patterns as would be activated by the text "red apple" in the language model. This enables the language model to reason about the visual content using its language understanding capabilities - for instance, answering questions like "What color is the apple?" by connecting the visual representation to the appropriate linguistic concept "red".
This semantic alignment is what allows multimodal models to perform cross-modal reasoning tasks, such as describing unseen objects, answering questions about visual content, or generating text that references visual elements in contextually appropriate ways.
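The sketch below illustrates the intuition with purely hypothetical vectors: if the projection works well, a projected visual feature for a red apple should land closer to the language model's embeddings for "red" and "apple" than to unrelated words (the tensors are random stand-ins, not real model weights):
# Conceptual sketch with hypothetical tensors: measuring how close a projected
# visual feature lands to the embeddings of related words
import torch
import torch.nn.functional as F

hidden_dim = 768
token_embeddings = {                 # stand-ins for rows of an LM's embedding matrix
    "red": torch.randn(hidden_dim),
    "apple": torch.randn(hidden_dim),
    "bicycle": torch.randn(hidden_dim),
}

# Pretend the projector mapped the image features near the "red" + "apple" directions
projected_visual = 0.5 * token_embeddings["red"] + 0.5 * token_embeddings["apple"]

for word, emb in token_embeddings.items():
    sim = F.cosine_similarity(projected_visual, emb, dim=0)
    print(f"{word}: {sim.item():.2f}")   # "red" and "apple" score much higher than "bicycle"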
Contextual integration: The projection ensures that contextual relationships in the visual domain (like spatial relationships between objects) are preserved in a way that the language model can access and reason about. This allows the model to answer questions about relative positions or interactions between objects in an image.
This contextual integration is particularly crucial because visual scenes contain rich spatial and relational information that must be translated into a format the language model can process. For example, when looking at an image of a dining table, the model needs to understand not just that there are plates, glasses, and utensils, but their arrangement (plates in front of chairs, glasses above plates, forks to the left of plates), their groupings (place settings), and their functional relationships (napkins folded on plates).
The projection layer preserves these spatial hierarchies by maintaining relative positional information between visual features. Through specialized attention mechanisms, it ensures that:
- Proximity relationships ("the book is next to the lamp") are encoded in ways that language models can interpret
- Containment relationships ("the apple is in the bowl") maintain their hierarchical structure
- Directional relationships ("the dog is facing the camera") preserve orientation information
- Scale relationships ("the elephant is larger than the mouse") retain relative size information
This sophisticated mapping enables the model to correctly interpret questions like "What's above the bookshelf?", "Is the child holding the balloon?", or "Which way is the car facing?" - questions that require understanding not just what objects are present but how they relate to one another in physical space.
Without proper contextual integration, a model might recognize all objects in an image but fail to understand their meaningful relationships, severely limiting its ability to reason about scenes as humans naturally do.
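One common way to preserve this layout information, sketched below under assumed sizes, is to add positional embeddings to the patch features before they are projected, so each visual token carries a record of where in the image it came from (a generic illustration, not the mechanism of any specific model):
# Sketch: adding learned 2-D positional embeddings so spatial layout survives projection
import torch
import torch.nn as nn

grid, vision_dim, hidden_dim = 16, 1024, 768            # assumed patch grid and feature sizes
patch_features = torch.randn(1, grid * grid, vision_dim)

# One learned position vector per patch location in the 16x16 grid
position_embeddings = nn.Parameter(torch.randn(grid * grid, vision_dim))

spatially_aware = patch_features + position_embeddings   # inject "where" alongside "what"
projected = nn.Linear(vision_dim, hidden_dim)(spatially_aware)
print(projected.shape)                                    # torch.Size([1, 256, 768])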
- The language model treats visual embeddings as if they were special tokens, allowing it to "attend" to both words and pixels. Through self-attention mechanisms, the model can create connections between visual elements and textual concepts, forming a comprehensive understanding that spans both modalities.
This integration happens through a sophisticated process where the transformer architecture's self-attention mechanism simultaneously processes both text tokens and visual tokens. When a user asks "What color is the car in this image?", the model's attention heads can focus on:
- The visual embeddings representing the car in the image
- The textual tokens related to "color" and "car" in the query
- The contextual relationship between these elements
The self-attention weights form a complex web of connections, allowing information to flow bidirectionally between modalities. For example, when processing an image of a red sports car alongside text mentioning "vehicle," the model can:
- Associate visual features of the car with the word "vehicle" in the text
- Connect color properties from the visual embedding to potential color descriptions
- Link spatial relationships in the image (car on road) to potential scene descriptions
This cross-modal attention enables the model to perform tasks like visual question answering, image captioning, and text-conditional reasoning about visual content. The attention maps themselves reveal how the model distributes focus across different parts of both the image and text when forming its understanding.
This allows the model to reason about relationships between what it "sees" and what it "reads."
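The sketch below shows the mechanics in miniature: projected image tokens are spliced into the text token sequence and fed to the language model as embeddings, so ordinary self-attention runs over both (GPT-2 and random visual tokens are used purely for illustration; real systems use their own LLM and real projected features):
# Sketch: letting a language model attend over visual tokens and text tokens together
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

lm = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text_ids = tokenizer("What color is the car?", return_tensors="pt").input_ids
text_embeds = lm.transformer.wte(text_ids)        # [1, num_text_tokens, 768]

image_tokens = torch.randn(1, 32, 768)            # stand-in for projected visual tokens

# Concatenate image tokens and text embeddings into one sequence
inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
with torch.no_grad():
    outputs = lm(inputs_embeds=inputs_embeds)

print(outputs.logits.shape)  # one next-token prediction per visual and text position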
This fusion of visual and textual processing creates a powerful system that can understand context across modalities, enabling it to answer prompts like:
- "What's written on the sign in this photo?" - requiring text recognition within images and understanding of visual context. The model must identify text elements embedded within the visual scene, distinguish them from other visual features, and accurately transcribe the text while maintaining awareness of the sign's context in the broader image (whether it's a street sign, store front, warning notice, etc.).
- "Describe this chart in plain English." - requiring interpretation of data visualizations and translation into natural language. Here, the model must recognize the chart type (bar graph, pie chart, line graph, etc.), identify axes labels, data points, and trends, then synthesize this information into coherent prose that captures the key relationships and insights presented in the visualization.
- "Write a story about this image." - requiring creative generation based on visual stimuli and understanding of narrative elements. This complex task requires the model to recognize not just objects but their relationships, potential emotional content, implied actions or intentions, and then use these elements to create a coherent narrative with characters, setting, plot, and thematic elements that plausibly extend from what's visible in the image.
5.1.1 LLaVA (Large Language and Vision Assistant)
LLaVA is an open-source model that combines CLIP for vision with Vicuna as its language model. CLIP (Contrastive Language-Image Pre-training) serves as the vision encoder that processes and extracts features from images, while Vicuna, a fine-tuned version of LLaMA, handles the language processing. The architecture leverages CLIP's powerful visual representation ability, which was trained on 400 million image-text pairs to understand visual concepts, and combines it with Vicuna's advanced language understanding and generation capabilities.
LLaVA follows a two-stage training process. First, it's pretrained on a large corpus of image-text pairs to establish basic connections between visual and linguistic information. Then, it's specifically trained on instruction-following data that pairs images with text prompts. This training approach enables LLaVA to understand and respond to specific instructions about visual content, going beyond simple image captioning to more complex reasoning about what it sees. This instruction-tuning is what gives LLaVA its ability to follow nuanced directions when analyzing images, rather than just generating generic descriptions.
The training dataset includes approximately 158,000 image-text instruction pairs, carefully curated to cover a wide range of visual reasoning tasks, from simple object identification to complex scene interpretation. This instruction-tuning phase is crucial as it teaches the model to follow specific directives when analyzing visual content. The dataset incorporates diverse image types including natural photographs, diagrams, charts, screenshots, and artistic images, ensuring the model can handle various visual formats. The text instructions are similarly diverse, ranging from simple requests like "What color is the car?" to more complex ones like "Explain the relationship between the people in this image and what they might be feeling."
Example task: describing an image in detail. LLaVA can generate comprehensive descriptions that include object identification, spatial relationships, attributes, actions, and even infer context or emotions from visual scenes. Its descriptions can range from factual observations to more interpretative analyses depending on the prompt.
For instance, when shown an image of a city street, LLaVA can identify not only the vehicles, pedestrians, and buildings, but also describe their relationships (e.g., "a person crossing the street while cars wait at a red light"), infer weather conditions based on visual cues (e.g., "wet pavement suggests recent rainfall"), and even comment on the likely time of day based on lighting conditions and shadows. The model can also perform more specialized tasks like reading text in images, analyzing charts or graphs, identifying landmarks, and recognizing famous people or artwork, demonstrating its versatility across different visual analysis scenarios.
LLaVA stands out for its efficient architecture that achieves strong performance while requiring relatively modest computational resources compared to proprietary alternatives. Its open-source nature has made it a popular choice for researchers and developers working on vision-language applications. The model's architecture is notably streamlined, using a simple projection layer to connect CLIP's vision embeddings with Vicuna's language processing capabilities. This approach avoids the computational overhead of more complex cross-attention mechanisms while still enabling effective communication between the visual and language components. The smaller variants of LLaVA can run on consumer-grade GPUs with 16GB of memory, making advanced multimodal AI accessible to a much broader range of researchers and developers than closed-source alternatives that may require specialized hardware.
The model achieves competitive performance on benchmarks such as VQAv2 (Visual Question Answering) and GQA (a benchmark for compositional visual question answering), while being significantly more resource-efficient than closed-source alternatives like GPT-4V. On the VQAv2 benchmark, which evaluates a model's ability to answer questions about images, LLaVA-1.5 achieves scores comparable to much larger proprietary models. Its accessibility allows developers to fine-tune it for specific domains or applications, such as medical image analysis (interpreting X-rays, CT scans, and other medical imaging), retail product recognition (identifying products on shelves or in catalog images), or educational content development (explaining scientific diagrams or historical artifacts), fostering a growing ecosystem of specialized multimodal AI applications. The model has inspired numerous derivatives and extensions in the open-source community, including versions optimized for different languages, specialized for particular domains like document understanding, or modified to work with video input rather than static images.
Code Example: Using LLaVA for Multimodal Processing
# Complete LLaVA implementation example
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Step 1: Load the pre-trained LLaVA model and processor
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Step 2: Prepare the image
image = Image.open("colosseum.jpg")

# Step 3: Define your prompt (the <image> token marks where the visual tokens go)
prompt = "USER: <image>\nDescribe this image in detail.\nASSISTANT:"

# Step 4: Process the inputs
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt"
).to(model.device)

# Step 5: Generate the response
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )

# Step 6: Decode and print the response
generated_text = processor.decode(output[0], skip_special_tokens=True)
print(generated_text)
For this example, download the Colosseum image here: https://files.cuantum.tech/images/colosseum.jpg
Code Breakdown: Using LLaVA for Multimodal Processing
This code demonstrates how to use the LLaVA (Large Language and Vision Assistant) model to process images and generate descriptive text. Let's break down each part in detail:
1. Imports and Setup
- torch: The PyTorch library provides tensor computation and neural networks functionality.
- PIL.Image: The Python Imaging Library allows us to open and manipulate image files.
- AutoProcessor: Automatically selects the appropriate processor for the model, handling both text tokenization and image preprocessing.
- LlavaForConditionalGeneration: The main LLaVA model class that combines vision and language capabilities.
2. Model Loading
The code loads the LLaVA 1.5 7B model from Hugging Face, which is a moderate-sized variant balancing performance and resource requirements:
- torch_dtype=torch.float16: Uses half-precision floating-point format to reduce memory usage.
- device_map="auto": Automatically determines the optimal device placement strategy, distributing model components across available GPUs or using CPU as needed.
3. Input Preparation
The code prepares two key inputs:
- An image loaded using PIL's Image.open() function.
- A text prompt that specifies the task ("Describe this image in detail"), wrapped in LLaVA's chat template with the special <image> token marking where the visual tokens will be inserted.
The processor then:
- Resizes and normalizes the image to the fixed input resolution expected by the CLIP vision encoder (336x336 pixels for LLaVA-1.5).
- Tokenizes the text prompt into input IDs for the language model component.
- Creates attention masks and other required tensor inputs.
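A quick way to confirm what the processor produces, continuing from the example above (exact shapes depend on the checkpoint and prompt length):
# Inspect the tensors the LLaVA processor hands to the model
inputs = processor(
    text="USER: <image>\nDescribe this image in detail.\nASSISTANT:",
    images=image,
    return_tensors="pt"
)
for name, tensor in inputs.items():
    print(name, tuple(tensor.shape))
# Typically input_ids and attention_mask for the text, plus pixel_values for the image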
4. Generation Process
The model.generate() method creates the text response with several parameters controlling the generation:
- max_new_tokens=256: Limits the response length to a maximum of 256 new tokens.
- do_sample=True: Enables sampling-based generation rather than greedy decoding.
- temperature=0.6: Controls randomness in the generation (lower values are more deterministic).
- top_p=0.9: Implements nucleus sampling, considering only tokens whose cumulative probability exceeds 90%.
5. Behind the Scenes: How LLaVA Processes the Image
When you run this code, LLaVA performs several sophisticated operations:
- The CLIP vision encoder extracts visual features from the image, creating a high-dimensional representation that captures objects, attributes, spatial relationships, and other visual information.
- The projection layer transforms these visual embeddings into a format compatible with the language model's embedding space, essentially "translating" visual concepts into a language the LLM can understand.
- The Vicuna language model (based on LLaMA) receives both the projected visual embeddings and the tokenized prompt, treating the visual information as special tokens in its context window.
- The self-attention mechanism allows the model to focus on relevant parts of both the image representation and the text prompt when generating each token of the response.
- The decoder generates a coherent, contextually appropriate text response based on both the visual content and the text instruction.
6. Advanced Customization Options
The basic example above can be extended with additional parameters for more control:
# Advanced parameters for more control
output = model.generate(
    **inputs,
    max_new_tokens=512,        # Generate longer responses
    do_sample=True,            # Enable sampling-based generation
    temperature=0.7,           # Slightly more creative responses
    top_p=0.9,                 # Nucleus sampling parameter
    top_k=50,                  # Limit vocabulary to top 50 tokens
    repetition_penalty=1.2,    # Discourage repetition of phrases
    length_penalty=1.0,        # No penalty based on length
    no_repeat_ngram_size=3,    # Avoid repeating 3-grams
)
7. Practical Applications
This code structure can be adapted for various multimodal tasks by modifying the prompt:
- Visual question answering: "What color is the car in this image?"
- Image reasoning: "Explain what might happen next in this scene."
- Content extraction: "Extract all text visible in this image."
- Creative generation: "Write a short story inspired by this image."
LLaVA's architecture effectively bridges vision and language, enabling these diverse applications with the same underlying model.
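As a sketch of that flexibility, the loop below reuses the model, processor, and image from the example above and simply swaps the task text inside the same prompt template (default greedy decoding is used here, so results are deterministic for a given image):
# Reuse the same model and image with different task prompts
tasks = [
    "What color is the car in this image?",
    "Explain what might happen next in this scene.",
    "Extract all text visible in this image.",
    "Write a short story inspired by this image.",
]

for task in tasks:
    prompt = f"USER: <image>\n{task}\nASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=128)
    answer = processor.decode(output[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()
    print(f"{task}\n{answer}\n{'-' * 40}")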
Advanced Example: Interactive Visual Question Answering with LLaVA
The following code demonstrates a more sophisticated use case for LLaVA: building an interactive visual question answering application that can process uploaded images and answer questions about them in real-time.
# Advanced LLaVA application: Interactive Visual QA with Gradio
import torch
import gradio as gr
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load the LLaVA model and processor
model_id = "llava-hf/llava-1.5-13b-hf"  # Using larger 13B parameter version
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

def process_image_and_question(image, question, temperature=0.7, max_length=500):
    """Process an image and a question to generate a response using LLaVA."""
    # Prepare the prompt with the user's question (the <image> token marks the visual input)
    prompt = f"USER: <image>\nAnswer this question about the image: {question}\nASSISTANT:"

    # Process inputs
    inputs = processor(
        text=prompt,
        images=image,
        return_tensors="pt"
    ).to(model.device)

    # Generate the response
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
        )

    # Decode the response
    generated_text = processor.decode(output[0], skip_special_tokens=True)

    # Return just the model's answer, removing the prompt portion
    response = generated_text.split("ASSISTANT:")[-1].strip()
    return response

# Set up the Gradio interface
with gr.Blocks() as demo:
    gr.Markdown("# LLaVA Visual Question Answering")
    gr.Markdown("Upload an image and ask a question about it.")

    with gr.Row():
        with gr.Column():
            image_input = gr.Image(type="pil", label="Upload Image")
            question_input = gr.Textbox(label="Your Question", placeholder="What's happening in this image?")
            temperature = gr.Slider(0.1, 1.0, value=0.7, label="Temperature (creativity)")
            max_length = gr.Slider(50, 1000, value=500, step=50, label="Maximum response length")
            submit_button = gr.Button("Get Answer")
        with gr.Column():
            output_text = gr.Textbox(label="LLaVA's Answer", lines=10)

    # Connect the interface to the processing function
    submit_button.click(
        fn=process_image_and_question,
        inputs=[image_input, question_input, temperature, max_length],
        outputs=output_text
    )

    # Add example images and questions
    gr.Examples(
        examples=[
            ["example_street_scene.jpg", "What safety hazards do you see in this image?"],
            ["example_chart.jpg", "Explain the main trend shown in this chart."],
            ["example_food.jpg", "What ingredients might be in this dish?"]
        ],
        inputs=[image_input, question_input]
    )

# Launch the application
demo.launch()
For this example, download the required images from these links:
Street Scene: https://files.cuantum.tech/images/example_street_scene.jpg
Chart: https://files.cuantum.tech/images/example_chart.jpg
Food: https://files.cuantum.tech/images/example_food.jpg
Code Breakdown: Interactive Visual QA Application
This advanced example demonstrates how to build a user-friendly application for visual question answering using LLaVA. Let's break down the key components:
1. Model Selection and Setup
- LLaVA 1.5-13B: This code uses the larger 13B parameter version of LLaVA (compared to the 7B in the previous example), which offers improved reasoning capabilities at the cost of requiring more computational resources.
- The same initialization approach is used, with float16 precision and automatic device mapping to optimize for available hardware.
2. Core Processing Function
The process_image_and_question() function handles the core multimodal processing:
- It takes four inputs: an image, a question, and two generation parameters (temperature and max length).
- The question is formatted into a standardized prompt format that helps guide LLaVA's response generation.
- After processing, it extracts just the relevant answer portion, removing the original prompt for a cleaner user experience.
3. Gradio Interface Construction
The code uses Gradio to create an intuitive web interface for the application:
- User inputs: Image upload, question text box, and generation parameter sliders for fine-tuning responses.
- Layout organization: Arranged in a two-column layout for inputs on the left and outputs on the right.
- Examples: Pre-configured example images and questions to demonstrate the system's capabilities.
4. Behind the Scenes: Enhanced Multimodal Processing
When a user interacts with this application, several sophisticated processes occur:
- The uploaded image is automatically preprocessed by the Gradio interface to ensure compatibility with LLaVA's input requirements.
- The LLaVA processor handles both the text tokenization and image preprocessing, ensuring proper alignment between modalities.
- The question is formatted into a directive that helps the model understand the specific visual reasoning task required.
- Generation parameters provide user control over the response style - higher temperature produces more creative but potentially less precise answers.
- Post-processing extracts just the relevant answer, creating a cleaner conversational experience.
5. Potential Applications
This interactive application template could be adapted for numerous real-world use cases:
- Educational tools: Students could upload diagrams or historical images and ask for explanations.
- Accessibility services: Visually impaired users could ask detailed questions about photographs or documents.
- E-commerce: Shoppers could upload product images and ask specific questions about features or compatibility.
- Technical support: Users could share screenshots of error messages or hardware setups and ask for troubleshooting advice.
- Content moderation: Platforms could use a modified version to help analyze uploaded images for policy compliance.
6. Technical Considerations and Limitations
When implementing this type of application, it's important to consider:
- Hardware requirements: The 13B parameter model requires a GPU with at least 24GB VRAM for optimal performance.
- Inference speed: Response generation typically takes 2-10 seconds depending on hardware and response length.
- Image resolution: LLaVA processes images at a fixed input resolution (336x336 pixels for LLaVA-1.5), which may limit detailed analysis of very small elements.
- Privacy considerations: For sensitive applications, consider running this locally rather than on cloud infrastructure.
This example illustrates how LLaVA's capabilities can be packaged into user-friendly applications that bring multimodal AI's power to non-technical users. The combination of visual understanding, language generation, and interactive controls creates a flexible system for a wide range of visual reasoning tasks.
5.1.2 Flamingo (DeepMind)
Flamingo is a groundbreaking multimodal model developed by DeepMind, specifically engineered to excel at few-shot learning across text and image domains. Unlike models that require extensive task-specific training, Flamingo can adapt to new visual tasks with minimal examples. This represents a significant advancement in multimodal AI, as most earlier systems required dedicated training datasets for each new type of visual reasoning task they needed to perform.
At its architectural core, Flamingo uses a frozen language model (LLM) as its foundation and introduces specialized cross-attention layers that create bridges between visual representations and textual understanding. These cross-attention mechanisms serve as effective translators, allowing visual information to be meaningfully incorporated into the language model's processing pipeline without disrupting its pre-trained linguistic capabilities. The visual processing component of Flamingo utilizes a vision encoder based on a Normalizer-Free ResNet (NFNet), which transforms images into dense feature representations. These visual features are then processed through a perceiver resampler module that converts the variable-sized visual representations into a fixed number of visual tokens that can be efficiently processed by the language model.
What makes Flamingo particularly impressive is its ability to perform "in-context learning" with visual data. It can answer questions about previously unseen image-text tasks with remarkably little training data - often needing just 1-16 examples to achieve strong performance. This capability allows Flamingo to generalize to novel visual reasoning scenarios without extensive retraining, making it adaptable across domains like visual question answering, image captioning, and visual reasoning with minimal setup time. The model was trained on a massive multimodal dataset comprising hundreds of millions of image-text pairs gathered from diverse web sources, enabling it to develop a rich understanding of the relationships between visual and textual concepts.
During inference, Flamingo can process interleaved sequences of images and text, making it particularly well-suited for conversational interactions about visual content. For example, a user could show Flamingo several images of animals with corresponding descriptions as examples, then present a new animal image and ask for a similar description. The model would leverage its few-shot learning capabilities to generate an appropriate response following the pattern established in the examples. This flexibility extends to complex reasoning tasks as well, such as comparing multiple images, answering questions about specific visual details, or even generating creative content inspired by visual inputs.
The model's architecture has inspired subsequent research in efficient multimodal learning, particularly in how to effectively combine pre-trained unimodal models (like vision-only and language-only systems) into powerful multimodal reasoners without requiring extensive joint training from scratch. This approach has proven valuable for developing more accessible multimodal AI systems while leveraging the strengths of specialized models in each modality.
Flamingo Implementation Example: Multimodal Few-shot Learning
Below is a simplified implementation example of a Flamingo-inspired architecture using PyTorch. This example demonstrates the core components of Flamingo: a vision encoder, a perceiver resampler, and cross-attention layers integrated with a language model.
import torch
import torch.nn as nn
import torchvision.models as models
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class PerceiverResampler(nn.Module):
    """
    Perceiver Resampler module that converts variable-sized visual features
    to a fixed number of tokens that can be processed by the language model.
    """
    def __init__(self, input_dim=2048, latent_dim=768, num_latents=64, num_layers=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(embed_dim=latent_dim, num_heads=8, batch_first=True)
            for _ in range(num_layers)
        ])
        self.input_proj = nn.Linear(input_dim, latent_dim)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, visual_features):
        # Project visual features to latent dimension
        visual_features = self.input_proj(visual_features)

        # Expand latents to batch size
        batch_size = visual_features.shape[0]
        latents = self.latents.unsqueeze(0).expand(batch_size, -1, -1)

        # Process through cross-attention layers
        for layer in self.layers:
            latents = latents + layer(
                query=latents,
                key=visual_features,
                value=visual_features,
                need_weights=False
            )[0]
            latents = self.norm(latents)

        return latents
class CrossAttentionBlock(nn.Module):
    """
    Cross-attention block that integrates visual information into the LLM.
    """
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        self.cross_attention = nn.MultiheadAttention(
            embed_dim=hidden_size,
            num_heads=num_heads,
            batch_first=True
        )
        self.layer_norm1 = nn.LayerNorm(hidden_size)
        self.layer_norm2 = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states, visual_features):
        normed_hidden_states = self.layer_norm1(hidden_states)

        # Apply cross-attention
        attn_output = self.cross_attention(
            query=normed_hidden_states,
            key=visual_features,
            value=visual_features,
            need_weights=False
        )[0]

        # Residual connection and layer norm
        hidden_states = hidden_states + attn_output
        hidden_states = self.layer_norm2(hidden_states)

        return hidden_states
class FlamingoModel(nn.Module):
    """
    Simplified Flamingo model combining vision encoder, perceiver resampler,
    and a language model with cross-attention layers.
    """
    def __init__(self, vision_model_name="resnet50", num_visual_tokens=64):
        super().__init__()

        # Vision encoder (frozen)
        self.vision_encoder = models.__dict__[vision_model_name](pretrained=True)
        self.vision_encoder.fc = nn.Identity()  # Remove classification head
        for param in self.vision_encoder.parameters():
            param.requires_grad = False

        # Perceiver resampler
        self.perceiver = PerceiverResampler(
            input_dim=2048,   # ResNet50 feature dim
            latent_dim=768,   # Match GPT2 hidden size
            num_latents=num_visual_tokens
        )

        # Language model (frozen)
        self.language_model = GPT2LMHeadModel.from_pretrained("gpt2")
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        for param in self.language_model.parameters():
            param.requires_grad = False

        # Cross-attention layers (one per transformer block)
        self.cross_attentions = nn.ModuleList([
            CrossAttentionBlock(hidden_size=768, num_heads=12)
            for _ in range(len(self.language_model.transformer.h))
        ])

        # Save original forward methods
        self.original_block_forward = self.language_model.transformer.h[0].forward

        # Monkey patch the transformer blocks to include cross-attention
        for i, block in enumerate(self.language_model.transformer.h):
            block.flamingo_cross_attn = self.cross_attentions[i]
            block.forward = self._make_new_forward(block, i)

        # Visual features buffer for storing current visual context
        self.register_buffer("visual_features", None, persistent=False)
    def _make_new_forward(self, block, block_index):
        """Creates a new forward method for transformer blocks that includes cross-attention."""
        original_forward = block.forward
        cross_attn = self.cross_attentions[block_index]

        def new_forward(x, **kwargs):
            # Run the original transformer block
            outputs = original_forward(x, **kwargs)
            if self.visual_features is None:
                return outputs

            # Apply cross-attention with visual features to the block's hidden states
            if isinstance(outputs, tuple):
                hidden_states = cross_attn(outputs[0], self.visual_features)
                return (hidden_states,) + outputs[1:]
            return cross_attn(outputs, self.visual_features)

        return new_forward
    def process_images(self, images):
        """Extract visual features from images and prepare them for conditioning."""
        with torch.no_grad():
            # Extract features from vision encoder
            features = self.vision_encoder(images)  # [batch_size, 2048]
            features = features.unsqueeze(1)        # Add sequence dimension [batch_size, 1, 2048]

        # Process through perceiver resampler
        visual_tokens = self.perceiver(features)     # [batch_size, num_latents, hidden_size]

        # Store visual features for cross-attention
        self.visual_features = visual_tokens

    def generate(self, prompt, images=None, max_length=100, temperature=0.7):
        """Generate text conditioned on images and text prompt."""
        # Process images if provided
        if images is not None:
            self.process_images(images)
        else:
            self.visual_features = None

        # Tokenize prompt
        inputs = self.tokenizer(prompt, return_tensors="pt", padding=True)
        input_ids = inputs.input_ids.to(next(self.parameters()).device)
        attention_mask = inputs.attention_mask.to(next(self.parameters()).device)

        # Generate text
        output_ids = self.language_model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
        )

        # Decode output
        generated_text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return generated_text
# Example usage
def flamingo_example():
    from PIL import Image
    import torchvision.transforms as transforms

    # Initialize model
    model = FlamingoModel().to("cuda" if torch.cuda.is_available() else "cpu")

    # Prepare image transform
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    # Load and process image
    image = Image.open("eiffel-tower.jpg").convert("RGB")
    image_tensor = transform(image).unsqueeze(0).to(next(model.parameters()).device)

    # Example prompts for few-shot learning
    few_shot_prompt = """
Image: [A photo of a busy street in Tokyo]
Description: The image shows a crowded street in Tokyo with neon signs, many pedestrians, and small restaurants.

Image: [A photo of the Grand Canyon]
Description: The image depicts the vast expanse of the Grand Canyon with its layered rock formations and deep ravines.

Image: [Current image]
Description:
"""

    # Generate text based on image
    output = model.generate(few_shot_prompt, images=image_tensor, max_length=200)
    print(output)

if __name__ == "__main__":
    flamingo_example()
For this example, download the Eiffel Tower image here: https://files.cuantum.tech/images/eiffel-tower.jpg
Code Breakdown: Flamingo-inspired Multimodal Model
The above implementation represents a simplified version of DeepMind's Flamingo architecture. Let's break down the key components:
1. Architecture Components
- Vision Encoder: A pretrained ResNet50 model that extracts visual features from images. In the full Flamingo model, this would be a more advanced vision model like NFNet.
- Perceiver Resampler: This critical component transforms variable-sized visual features into a fixed number of visual tokens. It uses cross-attention between learned latent vectors and visual features to condense the visual information.
- Language Model: A pretrained GPT-2 model serves as the language foundation. The original Flamingo used a more powerful Chinchilla LLM.
- Cross-Attention Layers: These layers are inserted into each transformer block of the language model, allowing visual information to influence text generation at multiple levels of processing.
2. Key Design Decisions
- Frozen Backbone Models: Both the vision encoder and language model are kept frozen, preserving their pretrained capabilities while only training the connecting components.
- Parameter Efficiency: By only training the perceiver resampler and cross-attention layers, Flamingo achieves multimodal capabilities with relatively few trainable parameters.
- Monkey Patching: The implementation uses a technique called "monkey patching" to insert cross-attention into the language model without modifying its original architecture.
3. How Visual Processing Works
- The image is passed through the vision encoder to extract high-level visual features (2048-dimensional for ResNet50).
- These features are then processed by the perceiver resampler, which condenses them into a fixed set of tokens (64 in this example).
- The resulting visual tokens are stored in a buffer and made available to all cross-attention layers during text generation.
4. How Few-Shot Learning Is Implemented
- The example demonstrates few-shot learning through a carefully formatted prompt containing example image-text pairs.
- Each example follows a pattern of "Image: [description]" followed by "Description: [detailed text]".
- The final prompt ends with "Image: [Current image]" and "Description:", prompting the model to generate a description for the new image following the pattern established by the examples.
- This in-context learning approach allows the model to adapt to specific tasks without parameter updates.
5. Practical Considerations and Limitations
- Computational Efficiency: The real Flamingo model uses sophisticated techniques for handling larger contexts and more efficiently processing visual information.
- Training Requirements: To fully train this model, you would need a large dataset of image-text pairs and significant computational resources.
- Simplified Architecture: This example omits some details of the full Flamingo architecture for clarity, such as gated cross-attention and more advanced visual processing.
6. Real-world Applications
- Visual question answering: Answering specific questions about image content with few or no examples.
- Image captioning: Generating detailed descriptions of images in various styles based on examples.
- Visual reasoning: Performing complex reasoning tasks about visual content, such as comparing images or identifying relationships.
- Multimodal chat: Enabling conversational interactions that seamlessly incorporate visual information.
This implementation provides a starting point for understanding and experimenting with Flamingo-style multimodal architectures. The real power of such models comes from their ability to perform in-context learning across modalities, adapting to new tasks with minimal examples.
Enhanced Flamingo Implementation with In-Context Learning
Let's explore a more comprehensive implementation of the Flamingo architecture that better demonstrates its in-context learning capabilities for visual question answering:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer, ViTModel, ViTImageProcessor
from PIL import Image
import requests
from io import BytesIO

class GatedCrossAttentionBlock(nn.Module):
    """
    Enhanced cross-attention block with gating mechanism as used in Flamingo.
    """
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        self.hidden_size = hidden_size
        self.cross_attention = nn.MultiheadAttention(
            embed_dim=hidden_size,
            num_heads=num_heads,
            batch_first=True
        )

        # Gating mechanism
        self.gate = nn.Linear(hidden_size, hidden_size)
        self.gate_activation = nn.Sigmoid()

        # Layer normalization
        self.layer_norm1 = nn.LayerNorm(hidden_size)
        self.layer_norm2 = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states, visual_features):
        normed_hidden_states = self.layer_norm1(hidden_states)

        # Apply cross-attention
        attn_output, _ = self.cross_attention(
            query=normed_hidden_states,
            key=visual_features,
            value=visual_features
        )

        # Apply gating mechanism
        gate_values = self.gate_activation(self.gate(normed_hidden_states))
        attn_output = gate_values * attn_output

        # Residual connection and layer norm
        hidden_states = hidden_states + attn_output
        hidden_states = self.layer_norm2(hidden_states)

        return hidden_states
class PerceiverResampler(nn.Module):
    """
    Perceiver Resampler that converts variable-length visual features into
    a fixed number of tokens through cross-attention with learned queries.
    """
    def __init__(self, input_dim=768, latent_dim=768, num_latents=64, num_layers=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(
                embed_dim=latent_dim,
                num_heads=8,
                batch_first=True
            )
            for _ in range(num_layers)
        ])
        self.input_projection = nn.Linear(input_dim, latent_dim)
        self.layer_norm = nn.LayerNorm(latent_dim)

    def forward(self, x):
        batch_size = x.shape[0]

        # Project input features to match latent dimension
        x = self.input_projection(x)

        # Expand latents for each item in the batch
        latents = self.latents.unsqueeze(0).expand(batch_size, -1, -1)

        # Apply layers of cross-attention
        for layer in self.layers:
            latents, _ = layer(
                query=latents,
                key=x,
                value=x
            )
            latents = self.layer_norm(latents)

        return latents
class EnhancedFlamingoModel(nn.Module):
    """
    Enhanced Flamingo model with improved components for in-context learning
    and visual question answering tasks.
    """
    def __init__(self, num_visual_tokens=64, vision_model_name="google/vit-base-patch16-224"):
        super().__init__()

        # Vision encoder (frozen ViT)
        self.vision_encoder = ViTModel.from_pretrained(vision_model_name)
        self.vision_processor = ViTImageProcessor.from_pretrained(vision_model_name)
        for param in self.vision_encoder.parameters():
            param.requires_grad = False

        # Perceiver resampler
        self.perceiver = PerceiverResampler(
            input_dim=768,    # ViT feature dim
            latent_dim=768,   # Match GPT2 hidden size
            num_latents=num_visual_tokens,
            num_layers=4
        )

        # Language model (frozen GPT-2)
        self.language_model = GPT2LMHeadModel.from_pretrained("gpt2")
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token

        # Keep LM frozen except for final layer norm and unembedding
        for name, param in self.language_model.named_parameters():
            if "ln_f" in name or "wte" in name:
                param.requires_grad = True
            else:
                param.requires_grad = False

        # Special tokens for marking image inputs
        self.image_start_token = "<image>"
        self.image_end_token = "</image>"

        # Add special tokens to vocabulary
        special_tokens = {"additional_special_tokens": [self.image_start_token, self.image_end_token]}
        num_added = self.tokenizer.add_special_tokens(special_tokens)
        self.language_model.resize_token_embeddings(len(self.tokenizer))

        # Cross-attention blocks
        self.cross_attentions = nn.ModuleList([
            GatedCrossAttentionBlock(hidden_size=768, num_heads=12)
            for _ in range(len(self.language_model.transformer.h))
        ])

        # Create image token IDs
        self.image_start_token_id = self.tokenizer.convert_tokens_to_ids(self.image_start_token)
        self.image_end_token_id = self.tokenizer.convert_tokens_to_ids(self.image_end_token)

        # Register hooks to modify the transformer layers
        for i, block in enumerate(self.language_model.transformer.h):
            block.register_forward_hook(self._make_cross_attention_hook(i))

        # Buffer for storing visual features
        self.register_buffer("visual_features", None, persistent=False)
    def _make_cross_attention_hook(self, block_idx):
        """Create a forward hook for adding cross-attention at the specified layer."""
        cross_attn = self.cross_attentions[block_idx]

        def hook(module, inputs, outputs):
            if self.visual_features is None:
                return outputs
            hidden_states = outputs[0] if isinstance(outputs, tuple) else outputs
            modified_hidden_states = cross_attn(hidden_states, self.visual_features)
            if isinstance(outputs, tuple):
                return (modified_hidden_states,) + outputs[1:]
            return modified_hidden_states

        return hook

    def _encode_image(self, image_tensor):
        """Process a single image through the vision encoder and perceiver."""
        with torch.no_grad():
            vision_outputs = self.vision_encoder(image_tensor)
            hidden_states = vision_outputs.last_hidden_state

        # Process through perceiver resampler to get a fixed number of tokens
        visual_tokens = self.perceiver(hidden_states)
        return visual_tokens
    def _encode_images_batch(self, image_list):
        """Process a batch of images through the vision pipeline."""
        processed_images = []
        for image in image_list:
            if isinstance(image, str):
                # Load from URL if string
                response = requests.get(image)
                img = Image.open(BytesIO(response.content)).convert("RGB")
            else:
                # Assume PIL Image otherwise
                img = image

            # Preprocess for vision model
            processed = self.vision_processor(img, return_tensors="pt")
            processed_images.append(processed["pixel_values"])

        # Stack into batch
        image_tensors = torch.cat(processed_images, dim=0).to(next(self.parameters()).device)
        visual_tokens = self._encode_image(image_tensors)  # [num_images, num_latents, hidden_size]

        # Flatten the per-image tokens into one sequence so a single text prompt
        # can cross-attend over all images in the context
        return visual_tokens.reshape(1, -1, visual_tokens.shape[-1])

    def format_prompt_with_images(self, text_prompt, images):
        """Format a prompt with image placeholders and encode the images."""
        # Encode images first
        self.visual_features = self._encode_images_batch(images)

        # Replace placeholders with special tokens
        formatted_prompt = text_prompt.replace("[IMAGE]", f"{self.image_start_token}{self.image_end_token}")
        return formatted_prompt
    def generate_answer(self, prompt, images=None, max_length=200, temperature=0.7):
        """Generate an answer for a visual question answering prompt with images."""
        if images:
            prompt = self.format_prompt_with_images(prompt, images)

        # Tokenize prompt
        inputs = self.tokenizer(prompt, return_tensors="pt").to(next(self.parameters()).device)

        # Generate text (max_length here bounds the number of newly generated tokens,
        # since the few-shot prompt itself can already be longer than max_length)
        with torch.no_grad():
            output_ids = self.language_model.generate(
                inputs.input_ids,
                max_new_tokens=max_length,
                do_sample=True,
                temperature=temperature,
                top_p=0.9,
                pad_token_id=self.tokenizer.eos_token_id
            )

        # Get only the generated text (not the prompt)
        generated_ids = output_ids[0][inputs.input_ids.shape[1]:]
        generated_text = self.tokenizer.decode(generated_ids, skip_special_tokens=True)

        # Clear visual features after generation
        self.visual_features = None

        return generated_text.strip()
def run_visual_qa_demo():
    """Demonstrate visual question answering with the Flamingo model."""
    # Initialize model
    model = EnhancedFlamingoModel().to("cuda" if torch.cuda.is_available() else "cpu")

    # Example images (use URLs for convenience)
    example_images = [
        "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg",  # Image of a dog on a beach
        "https://files.cuantum.tech/images/dog_drawing.jpg"  # Drawing of a dog
    ]

    # Few-shot prompt for VQA
    few_shot_prompt = """
I will answer questions about images.

[IMAGE]
Question: What animal is in the image?
Answer: The image shows a dog running on the beach. It appears to be a golden retriever enjoying the sand and ocean.

[IMAGE]
Question: What is this a drawing of?
Answer: This is a simple drawing of a dog. It appears to be a cartoon-style sketch with basic lines representing a dog's features.

[IMAGE]
Question: What is shown in this image?
Answer:
"""

    # New test image (Eiffel Tower)
    test_image = "https://files.cuantum.tech/images/eiffel-tower.jpg"

    # Generate answer
    answer = model.generate_answer(
        few_shot_prompt,
        images=example_images + [test_image],
        max_length=100
    )

    print("Model's answer:", answer)

if __name__ == "__main__":
    run_visual_qa_demo()
Code Breakdown: Advanced Flamingo Implementation
This enhanced implementation of the Flamingo architecture includes several important improvements that make it more similar to the original DeepMind model:
1. Key Architecture Enhancements
- Gated Cross-Attention: Unlike the basic implementation, this version includes a gating mechanism that controls how much visual information flows into the language model at each layer. This prevents visual information from dominating and allows for more nuanced integration.
- Multi-layer Perceiver Resampler: The perceiver now uses multiple layers of cross-attention to refine the visual tokens, creating a more sophisticated visual representation.
- ViT Vision Encoder: Uses a modern Vision Transformer instead of ResNet, providing better visual feature extraction.
- Special Tokens: Adds special image tokens to the vocabulary, allowing the model to recognize where images appear in the context.
2. In-Context Learning Implementation
- Few-Shot Visual QA: The prompt structure demonstrates how Flamingo enables few-shot learning by showing examples of image-question-answer triplets.
- Image Placeholders: Uses [IMAGE] placeholders in the prompt that get replaced with special tokens, mimicking how the real Flamingo handles multiple images in context.
- Contextual Memory: The model processes multiple images and remembers their features during generation, allowing it to reference different examples.
3. Technical Implementation Details
- Forward Hooks: Uses PyTorch hooks instead of monkey patching to inject cross-attention into the transformer blocks, which is a cleaner implementation (see the sketch after this list).
- Selective Fine-tuning: Only certain parts of the language model are trainable (final layer norm and embedding), while keeping most parameters frozen.
- Batched Image Processing: Handles multiple images efficiently by batching them through the vision pipeline.
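To make the forward-hook pattern concrete, here is a minimal, self-contained sketch in plain PyTorch. TinyBlock and SimpleCrossAttention are toy stand-ins invented for illustration (they are not the classes from the implementation above); the point is only how register_forward_hook lets you rewrite a block's output with cross-attention over cached visual features instead of monkey patching the block's forward method.
import torch
import torch.nn as nn

class SimpleCrossAttention(nn.Module):
    """Toy gated cross-attention: text hidden states attend to visual tokens."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh gate starts "closed"

    def forward(self, hidden_states, visual_features):
        attended, _ = self.attn(hidden_states, visual_features, visual_features)
        return hidden_states + torch.tanh(self.gate) * attended

class TinyBlock(nn.Module):
    """Stand-in for a frozen transformer block."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Linear(dim, dim)

    def forward(self, x):
        return self.ff(x)

dim = 32
block = TinyBlock(dim)
cross_attn = SimpleCrossAttention(dim)
visual_features = torch.randn(2, 16, dim)   # e.g., 16 visual tokens per sample

def hook(module, inputs, outputs):
    # Returning a value from a forward hook replaces the block's output.
    return cross_attn(outputs, visual_features)

handle = block.register_forward_hook(hook)
text_hidden = torch.randn(2, 10, dim)        # 10 text tokens per sample
out = block(text_hidden)                     # the hook runs automatically
print(out.shape)                             # torch.Size([2, 10, 32])
handle.remove()                              # detach the hook when done
Because the hook returns a new tensor, the frozen block's own code never changes, which is what makes this approach cleaner than patching the model.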
4. User-Friendly Features
- URL Image Loading: Supports loading images directly from URLs, making demonstrations easier.
- Structured API: Provides a clean interface for formatting prompts with images and generating answers.
- Memory Management: Clears visual features after generation to free up memory.
5. Real-world Applications
This implementation demonstrates how Flamingo can be used for:
- Visual Question Answering: Answering specific questions about image content.
- Few-Shot Learning: Learning new tasks from just a few examples without parameter updates.
- Multi-image Reasoning: Processing information across multiple images to provide coherent answers.
The enhanced implementation shows how multimodal models can maintain the powerful in-context learning capabilities of large language models while incorporating rich visual information. This approach allows for flexible adaptation to new visual tasks without specialized fine-tuning, making it particularly valuable for real-world applications.
5.1.3 GPT-5 (OpenAI)
Launched on August 7, 2025, GPT-5 marks a new milestone in OpenAI’s large language model lineage. It is the first fully native multimodal model, trained jointly on text, images, and audio from the ground up, with a composed system design that integrates fast responses, deep reasoning, and intelligent routing. More than an incremental upgrade over GPT-4o, GPT-5 represents a paradigm shift: a model architected from the beginning to process and reason across modalities as a unified whole.
Native Multimodal Architecture
Unlike earlier models that retrofitted speech or vision modules onto a text-first transformer, GPT-5 is fundamentally multimodal. Text, image, and audio are processed in the same transformer backbone, creating shared internal representations that seamlessly connect concepts across formats.
This design produces fluid cross-modal reasoning. For example, if a user submits a photo of a math problem, GPT-5 not only recognizes the characters but also interprets the underlying mathematical structure. It then generates a step-by-step solution that references specific symbols in the image, checks for ambiguities, and explains the reasoning in natural language. This integrated comprehension extends to scientific diagrams, financial charts, architectural blueprints, and medical imagery.
By aligning modalities during training, GPT-5 develops deeper semantic coherence—understanding how textual descriptions, visual data, and spoken language reinforce or contradict each other. It can, for instance, highlight inconsistencies between a historical photograph and a written account, or correlate radiology images with patient notes.
Composed System and Intelligent Routing
GPT-5 is not a monolithic model but a composed system:
- A main fast model handles everyday queries with low latency.
- A thinking model engages when complex, multi-step reasoning is required, offering real-time chain-of-thought.
- Mini and nano variants optimize cost and speed for lightweight applications.
- A Pro reasoning variant (API only) extends test-time reasoning for the hardest problems.
An intelligent router automatically decides which component to use, sparing users from manually picking between “light” and “heavy” models. This dynamic composition ensures efficiency for simple prompts and depth for challenging ones.
Reasoning and Context Management
With real-time chain-of-thought reasoning, GPT-5 excels in tasks that require logic, multi-step deduction, or tool use. On external benchmarks, it sets new records: 74.9% accuracy on SWE-bench Verified (software engineering) and 88% on Aider polyglot (code editing).
The model’s expanded context window—up to 400,000 tokens via the API, with output lengths of up to 128,000 tokens—supports the analysis of entire books, multi-hour meetings, or large codebases without losing track of earlier information. This scale makes it suitable for legal discovery, research synthesis, and full-repository debugging.
Voice and Multilingual Capabilities
Through the Realtime API, GPT-5 offers natural speech-in/speech-out interactions with millisecond-level latency. The voice system is robust to accents, can modulate tone on command, and integrates with SIP protocols, enabling real-world phone calls and live agents. Users can now hold fluid conversations where GPT-5 reasons, speaks, and listens in real time.
Multilingual fluency has also advanced, making GPT-5 a practical tool for cross-border communication, customer support, education, and accessibility.
Developer Controls and Tool Integration
Developers gain fine-grained control via new parameters:
- reasoning_effort: from minimal (fast) to extensive (deep reasoning).
- verbosity: low, medium, or high detail in responses.
The API exposes three model families—gpt-5, gpt-5-mini, and gpt-5-nano—to balance accuracy, cost, and latency. Pricing (per million tokens) at launch was $1.25 input / $10 output for GPT-5, with cheaper mini and nano tiers.
GPT-5 also supports custom tools: lightweight, plaintext tool calls with optional grammar constraints, allowing more reliable integration with external APIs. Enterprises can connect GPT-5 directly into Microsoft Copilot, Apple Intelligence, GitLab, Notion, and custom pipelines.
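As a quick illustration, the snippet below sketches how these controls might be passed alongside the usual request fields, following the same HTTP pattern used in the examples later in this section. The parameter names reasoning_effort and verbosity and the model name gpt-5-mini come from the description above; their exact placement in the request body is an assumption, so confirm against the current API reference before relying on it.
# Hedged sketch: exercising the developer controls described above.
# The placement of reasoning_effort and verbosity is an assumption.
import os
import requests

API_URL = "https://api.openai.com/v1/chat/completions"
headers = {
    "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY', 'YOUR_OPENAI_API_KEY')}",
    "Content-Type": "application/json",
}

payload = {
    "model": "gpt-5-mini",          # cheaper tier for a quick, low-effort reply
    "messages": [
        {"role": "user", "content": "Give me a two-sentence status summary."}
    ],
    "reasoning_effort": "minimal",  # assumed top-level field, per the text above
    "verbosity": "low",             # assumed top-level field, per the text above
}

resp = requests.post(API_URL, headers=headers, json=payload, timeout=30)
print(resp.json()["choices"][0]["message"]["content"])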
Accuracy, Safety, and Bias Reduction
OpenAI introduced safe-completions training in GPT-5. Instead of choosing between over-compliance and refusal, the model aims to generate the safest useful answer. Internal evaluations show:
- Substantially fewer hallucinations than GPT-4o.
- Lower sycophancy (over-agreeableness).
- Reduced deception, meaning the model is less likely to feign success on impossible tasks.
Safety frameworks classify GPT-5 Thinking as High capability in biology and chemistry, with layered safeguards, red-teaming, and monitoring.
Use Cases and Industry Impact
- Coding & Engineering: GPT-5 generates functional front-end code, debugs large repositories, and coordinates multi-tool development workflows.
- Automation & Productivity: From grading and summarizing to document review, it frees human bandwidth for higher-order work.
- Knowledge Work: Enterprises use GPT-5 for legal analysis, financial reporting, and R&D, where its long context and reasoning shine.
- Creative Workflows: Designers, writers, and researchers can mix text, images, and audio in prompts—e.g., analyzing a chart and drafting a report in one go.
- Voice Agents: Customer service and sales teams deploy GPT-5 via Realtime API to deliver human-like support, capturing alphanumeric details and following strict protocols.
The New Standard
GPT-5 establishes a new baseline for large multimodal models. Its unified architecture, dynamic routing, reasoning capabilities, and developer controls make it a versatile foundation for both consumer and enterprise AI. By natively fusing text, vision, and audio, GPT-5 doesn’t just respond across modalities—it reasons through them, enabling a generation of AI systems that operate more like collaborators than tools.
Basic Example: Multimodal Prompt with JSON Output (Chat Completions API)
A beginner-friendly example showing how to send an image and text together and receive a structured JSON response.
import requests
import json # You need this to parse the JSON string from the response
API_KEY = "YOUR_OPENAI_API_KEY"
# Use the correct API endpoint
API_URL = "https://api.openai.com/v1/chat/completions"
headers = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
}
# Example: Provide an image URL and a text query jointly
# Corrected input structure using 'type' and 'image_url' keys
image_part = {
"type": "image_url",
"image_url": {
"url": "https://files.cuantum.tech/images/chart.png" # Can also use a data URL for base64 images
}
}
# Corrected text part structure
text_part = {
"type": "text",
"text": "Summarize the main trend shown in the chart. Also, generate the Python code to recreate this visualization. Format the response as a JSON object with the keys 'summary', 'python_code', and 'key_points'."
}
# Corrected payload
payload = {
"model": "gpt-5",
"messages": [
{
"role": "user",
"content": [
image_part,
text_part
]
}
],
# Correct way to request JSON output
"response_format": { "type": "json_object" },
# The max_tokens parameter is standard
"max_tokens": 400
}
response = requests.post(API_URL, headers=headers, json=payload)
result = response.json()
# Correct way to handle the API response
try:
# The API returns a JSON string inside the message content, so we parse it
response_content = result['choices'][0]['message']['content']
parsed_output = json.loads(response_content)
# Print structured output from the parsed JSON
print("Summary:", parsed_output.get("summary"))
print("Python code:", parsed_output.get("python_code"))
print("Key points:", parsed_output.get("key_points"))
except (KeyError, IndexError, json.JSONDecodeError) as e:
print("Error parsing the API response:", e)
print("Raw response:", result)
Code Breakdown
This example demonstrates how to send a multimodal request to OpenAI's GPT-5 model, combining an image URL with a text query, and specifically asking for a structured JSON response.
1. Import Libraries
import requests
import json

- requests: This library is essential for making HTTP requests in Python. We use it to send our data to the OpenAI API and receive the response.
- json: This library is used for working with JSON (JavaScript Object Notation) data. We'll use it to construct our request payload and, critically, to parse the JSON string that GPT-5 will return to us when we ask for structured output.
2. API Configuration
API_KEY = "YOUR_OPENAI_API_KEY"
API_URL = "https://api.openai.com/v1/chat/completions"API_KEY: This is a placeholder for your unique OpenAI API key. You must replace"YOUR_OPENAI_API_KEY"with your actual key, which you can obtain from the OpenAI developer dashboard. This key authenticates your requests.API_URL: This is the specific endpoint for OpenAI's chat completion API. All conversational and multimodal requests go to this URL. It's crucial that this is correct.
3. Request Headers
headers = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
}

- headers: This dictionary contains metadata sent with our HTTP request.
- "Authorization": f"Bearer {API_KEY}": This header authenticates your request using your API key. The Bearer token prefix is a standard for OAuth 2.0.
- "Content-Type": "application/json": This header tells the server that the body of our request is formatted as JSON.
4. Defining Multimodal Input Parts
GPT-5 can process different types of input simultaneously. Here, we define an image and a text part.
image_part = {
"type": "image_url",
"image_url": {
"url": "https://files.cuantum.tech/images/chart.png"
}
}

- image_part: This dictionary represents the visual input.
- "type": "image_url": Specifies that this content block is an image provided via a URL.
- "image_url": {"url": "..."}: This nested structure is where the actual image URL is provided. The model will fetch and process the image from this link. You could also provide base64-encoded images here instead of a URL.
text_part = {
"type": "text",
"text": "Summarize the main trend shown in the chart. Also, generate the Python code to recreate this visualization. Format the response as a JSON object with the keys 'summary', 'python_code', and 'key_points'."
}

- text_part: This dictionary holds the textual instruction for the model.
- "type": "text": Indicates this content block is plain text.
- "text": "...": This is the actual prompt to GPT-5. Notice how we explicitly ask for a JSON object with specific keys (summary, python_code, key_points). This is crucial for getting structured output from the model.
5. Constructing the Request Payload
This is the main body of the request, containing all the instructions for the API.
payload = {
"model": "gpt-5",
"messages": [
{
"role": "user",
"content": [
image_part,
text_part
]
}
],
"response_format": { "type": "json_object" },
"max_tokens": 400
}"model": "gpt-5": Specifies which OpenAI model to use. In this case, it's the latest GPT-5."messages": [...]: This is a list of message objects, forming the conversation.- Each message has a
"role"(e.g.,"user","system","assistant") and"content". "role": "user": Indicates that this message comes from the user."content": [image_part, text_part]: This is the crucial part for multimodal input. Thecontentis a list containing both ourimage_partandtext_partdictionaries. The model will process them together.
- Each message has a
"response_format": { "type": "json_object" }: This parameter explicitly tells the API to constrain the model's output to a valid JSON object. This is essential when you want structured data back from the model, as we requested in ourtext_part."max_tokens": 400: Sets the maximum number of tokens (words or word pieces) the model should generate in its response. This helps control cost and response length.
6. Sending the Request
response = requests.post(API_URL, headers=headers, json=payload)
result = response.json()

- requests.post(...): This function sends an HTTP POST request to the API_URL with our headers and the payload (converted to JSON by requests.post).
- response.json(): The API's reply comes back as a JSON string. This method parses that string into a Python dictionary, making it easy to access the data.
7. Handling and Parsing the Response
The API's response structure is standard, but the actual content we asked GPT-5 to generate is nested within it as a string.
try:
response_content = result['choices'][0]['message']['content']
parsed_output = json.loads(response_content)
print("Summary:", parsed_output.get("summary"))
print("Python code:", parsed_output.get("python_code"))
print("Key points:", parsed_output.get("key_points"))
except (KeyError, IndexError, json.JSONDecodeError) as e:
print("Error parsing the API response:", e)
print("Raw response:", result)try...except: This block is crucial for robust error handling. API calls can fail for many reasons (network issues, incorrect API key, malformed requests, or the model might not return valid JSON).result['choices'][0]['message']['content']: This is the path to extract the actual text generated by GPT-5.result['choices']: The API can return multiplechoices(different possible completions) based on parameters liken. We usually take the first one ([0]).['message']: Within each choice, themessageobject contains therole(e.g., "assistant") and the generatedcontent.
json.loads(response_content): Since we specifically asked the model to format its output as a JSON string within thecontentfield, we need to usejson.loads()to parse this string into a Python dictionary.parsed_output.get("summary"),parsed_output.get("python_code"),parsed_output.get("key_points"): Onceresponse_contentis parsed into a dictionary, we can access the individual fields we requested from GPT-5. Using.get()is safer than direct dictionary access ([]) as it preventsKeyErrorif a key is missing.- The
exceptblock catches potential errors during parsing or if the expected keys are not found, printing both the error and the raw API response for debugging.
Advanced Example: Production-Ready Multimodal Workflow (Responses API with JSON Schema)
A robust example demonstrating best practices for reliability, schema validation, retries, and safe execution of returned code.
"""
Multimodal (image + text) → structured JSON with GPT-5
- Uses the Responses API (recommended)
- Strict JSON schema for reliable structured output
- Optional: safely execute returned Matplotlib code in a subprocess to render a PNG
"""
import os
import json
import time
import base64
import requests
import tempfile
import subprocess
import sys
from textwrap import dedent
from typing import Dict, Any, List, Optional
# =========================
# Configuration
# =========================
API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
API_URL = "https://api.openai.com/v1/responses"
MODEL = "gpt-5" # or: gpt-5-mini / gpt-5-nano
HEADERS = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json",
}
# Use a public image URL OR a local file encoded as a data URL (see helper below).
IMAGE_URL = "https://cdn.example.com/chart.png" # <- replace for your test
# Strict JSON schema for the model’s response
RESPONSE_SCHEMA: Dict[str, Any] = {
"name": "ChartInsight",
"schema": {
"type": "object",
"properties": {
"summary": {"type": "string"},
"python_code": {"type": "string"},
"key_points": {
"type": "array",
"items": {"type": "string"},
"minItems": 3,
"maxItems": 7
}
},
"required": ["summary", "python_code", "key_points"],
"additionalProperties": False
},
"strict": True
}
PROMPT_TEXT = (
"You are a meticulous data analyst.\n"
"Tasks:\n"
"1) Summarize the main trend in the chart.\n"
"2) Generate minimal, runnable Python (matplotlib) code that recreates a similar visualization "
" using inferred placeholder data. Include clear axis labels and a title.\n"
"3) Provide 3–7 bullet key points.\n"
"Return a JSON object that matches the provided JSON schema exactly."
)
# =========================
# Helpers
# =========================
def local_image_to_data_url(path: str, mime: Optional[str] = None) -> str:
"""
Convert a local image file to a data URL usable as an image input.
Example usage:
IMAGE_URL = local_image_to_data_url("chart.png")
"""
if not mime:
# naive mime inference by extension
ext = os.path.splitext(path)[1].lower()
mime = "image/png" if ext in [".png"] else "image/jpeg"
with open(path, "rb") as f:
b64 = base64.b64encode(f.read()).decode("utf-8")
return f"data:{mime};base64,{b64}"
def build_payload(image_url: str) -> Dict[str, Any]:
"""
Build a Responses API payload with multimodal input and JSON schema output.
"""
return {
"model": MODEL,
"input": [
{
"role": "user",
"content": [
{"type": "input_image", "image_url": {"url": image_url}},
{"type": "input_text", "text": PROMPT_TEXT}
]
}
],
"response_format": {
"type": "json_schema",
"json_schema": RESPONSE_SCHEMA
},
"max_output_tokens": 900,
"temperature": 0.2
}
def post_with_retries(
url: str,
headers: Dict[str, str],
json_payload: Dict[str, Any],
retries: int = 3,
backoff: float = 1.5,
timeout: int = 60
) -> Dict[str, Any]:
"""
POST with simple exponential backoff for rate limits / transient errors.
"""
for attempt in range(1, retries + 1):
try:
resp = requests.post(url, headers=headers, json=json_payload, timeout=timeout)
if resp.status_code == 200:
return resp.json()
# Retry on typical transient statuses
if resp.status_code in (429, 500, 502, 503, 504):
time.sleep(backoff ** attempt)
continue
raise RuntimeError(f"HTTP {resp.status_code}: {resp.text}")
except requests.exceptions.Timeout as e:
if attempt == retries:
raise
time.sleep(backoff ** attempt)
except requests.exceptions.RequestException as e:
if attempt == retries:
raise
time.sleep(backoff ** attempt)
raise RuntimeError("Request failed after retries")
def parse_responses_api_json(result: Dict[str, Any]) -> Dict[str, Any]:
"""
Extract the schema-validated JSON text and parse it to a dict.
Responses API returns: output[0].content[0].text for text output.
"""
try:
content_blocks = result["output"][0]["content"]
# Find first text block
for block in content_blocks:
if block.get("type") == "output_text" or block.get("type") == "text":
text = block.get("text", "")
if not text:
continue
# In schema mode, text should be strict JSON
return json.loads(text)
raise KeyError("No text block found in the response output")
except (KeyError, IndexError, json.JSONDecodeError) as e:
debug = json.dumps(result, indent=2)[:2000] # truncate for readability
raise ValueError(f"Failed to parse structured output: {e}\nPartial payload:\n{debug}")
def run_matplotlib_script(py_code: str) -> None:
"""
Safely run returned Matplotlib code in a clean subprocess (not in-process exec).
Saves 'recreated_chart.png' in the current working directory.
"""
safe_prefix = dedent("""
import matplotlib
matplotlib.use('Agg') # headless backend for servers/CI
""")
# Force a save at the end, even if the model code forgets to save
force_save = dedent("""
import os
import matplotlib.pyplot as plt
out = 'recreated_chart.png'
try:
plt.savefig(out, dpi=150, bbox_inches='tight')
except Exception:
# Some scripts call show() only; ensure we still save a figure if present
try:
plt.gcf().savefig(out, dpi=150, bbox_inches='tight')
except Exception:
pass
print(f"[Saved] {os.path.abspath(out)}")
""")
script = safe_prefix + "\n" + py_code + "\n\n" + force_save
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
f.write(script)
tmp_path = f.name
completed = subprocess.run(
[sys.executable, tmp_path],
capture_output=True,
text=True,
timeout=60
)
if completed.stdout:
print(completed.stdout)
if completed.returncode != 0:
print("Script error:\n", completed.stderr)
# =========================
# Main flow
# =========================
def main():
if not API_KEY or API_KEY == "YOUR_OPENAI_API_KEY":
raise EnvironmentError("Set OPENAI_API_KEY environment variable or hardcode API_KEY.")
# If you want to test with a local image:
# IMAGE_URL = local_image_to_data_url("path/to/chart.png")
payload = build_payload(IMAGE_URL)
result = post_with_retries(API_URL, HEADERS, payload)
data = parse_responses_api_json(result)
print("\n=== Summary ===\n", data["summary"])
print("\n=== Key points ===")
for i, kp in enumerate(data["key_points"], 1):
print(f"{i}. {kp}")
print("\n=== Python code (recreate chart) ===\n")
print(data["python_code"])
# Optional: render the returned chart
user_wants_render = True # set to False to skip rendering
if user_wants_render:
run_matplotlib_script(data["python_code"])
if __name__ == "__main__":
main()
Download the chart example image here: https://files.cuantum.tech/images/chart.png
Code breakdown:
- Configuration
API_URL = "https://api.openai.com/v1/responses"uses the Responses API (the current, multimodal-first endpoint).MODEL = "gpt-5"picks the full model; you can swap togpt-5-mini/gpt-5-nanofor cheaper/faster runs.IMAGE_URL: set a public URL or switch to a local file vialocal_image_to_data_url().
- Strict JSON via schema
- RESPONSE_SCHEMA tells the model exactly what keys and types to return.
- This is more reliable than a plain json_object hint because the model is constrained to a schema and will retry internally to satisfy it.
- Building the multimodal prompt
- build_payload() composes input with two blocks: {"type": "input_image", "image_url": {...}} for the image and {"type": "input_text", "text": PROMPT_TEXT} for instructions.
- The response_format requests schema-validated output; the model returns a single JSON string that parses cleanly.
- Network resilience
- post_with_retries() adds basic retry/backoff on rate limits or transient 5xx errors and a timeout so calls don’t hang.
- Non-retryable errors raise with the server’s message for quick diagnosis.
- Parsing the Responses API
- parse_responses_api_json() extracts result["output"][0]["content"][0]["text"] (the schema-validated JSON) and json.loads() it.
- If the shape changes (e.g., future versions), the function fails loudly with a helpful snippet.
- Optional: safe Matplotlib execution
- run_matplotlib_script() runs the code in a separate Python process, not via exec() in your main process.
- It forces a headless backend and ensures a saved file recreated_chart.png even if the script forgets.
- This pattern is good enough for demos and CI, but for production you might put further guards in place (resource limits, containers); a minimal sketch follows after this breakdown.
- Main flow
- Build payload → call API with retries → parse JSON → print summary, key_points, and python_code.
- Optionally, render the chart with the sandboxed subprocess.
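Following up on the sandboxing note above, here is a minimal sketch of one further guard: capping CPU time and address space for the child process with the standard-library resource module. It is POSIX-only, and the specific limits are arbitrary examples rather than recommendations.
# Sketch: tightening the subprocess sandbox with OS-level resource limits.
# POSIX-only; the limit values are illustrative.
import resource
import subprocess
import sys

def _limit_resources():
    # Cap CPU time at 30 seconds and address space at ~1 GiB for the child.
    resource.setrlimit(resource.RLIMIT_CPU, (30, 30))
    resource.setrlimit(resource.RLIMIT_AS, (1_000_000_000, 1_000_000_000))

def run_script_with_limits(script_path: str) -> subprocess.CompletedProcess:
    return subprocess.run(
        [sys.executable, script_path],
        capture_output=True,
        text=True,
        timeout=60,
        preexec_fn=_limit_resources,  # applied in the child before exec
    )
In practice you would call run_script_with_limits(tmp_path) in place of the plain subprocess.run call inside run_matplotlib_script, or pass preexec_fn directly into that call.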
Tool-Calling Example: “Ask GPT-5 to fetch data with your function, then analyze and plot”
"""
Tool-calling with GPT-5 (Chat Completions API)
- The model asks to call our tool `get_prices` with {symbol, days}
- We run the tool (here: mock data), send results back, then GPT-5 completes:
-> JSON with 'summary', 'key_points', and 'python_code' (Matplotlib)
"""
import os
import json
import time
import math
import requests
from datetime import datetime, timedelta
from typing import Dict, Any, List
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
API_URL = "https://api.openai.com/v1/chat/completions"
MODEL = "gpt-5"
HEADERS = {
"Authorization": f"Bearer {OPENAI_API_KEY}",
"Content-Type": "application/json",
}
# ---------- Tool: mock market data ----------
def get_prices(symbol: str, days: int = 30) -> Dict[str, Any]:
"""
Return mock OHLC data for the past N days.
Replace this with your real data source later (DB/API/cache).
"""
end = datetime.utcnow().date()
dates = [(end - timedelta(days=i)).isoformat() for i in range(days)][::-1]
# Simple deterministic waveform so every run is similar
base = 100.0
prices = []
for i, d in enumerate(dates):
v = base + 10 * math.sin(i / 4.0) + (i * 0.15)
o = round(v + math.sin(i) * 0.3, 2)
c = round(v + math.cos(i) * 0.3, 2)
h = round(max(o, c) + 0.6, 2)
l = round(min(o, c) - 0.6, 2)
prices.append({"date": d, "open": o, "high": h, "low": l, "close": c})
return {"symbol": symbol.upper(), "series": prices}
# ---------- Tool spec for the model ----------
TOOLS = [
{
"type": "function",
"function": {
"name": "get_prices",
"description": "Get recent OHLC data for a ticker symbol.",
"parameters": {
"type": "object",
"properties": {
"symbol": {"type": "string", "description": "Ticker, e.g., AAPL"},
"days": {"type": "integer", "minimum": 5, "maximum": 200, "default": 30}
},
"required": ["symbol"]
}
}
}
]
SYSTEM = (
"You are a quantitative analyst. If needed, call tools to fetch data, "
"then return a structured JSON with keys: summary (string), key_points (array of strings), "
"python_code (string that plots the series with matplotlib)."
)
USER = (
"Analyze the recent trend for the symbol AAPL (last 60 days). "
"If you need prices, use the tool. Then return JSON with summary, key_points, python_code."
)
def chat(payload: Dict[str, Any]) -> Dict[str, Any]:
r = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
if r.status_code != 200:
raise RuntimeError(f"HTTP {r.status_code}: {r.text}")
return r.json()
def main():
# 1) Ask GPT-5; allow tool calling
payload = {
"model": MODEL,
"messages": [
{"role": "system", "content": SYSTEM},
{"role": "user", "content": USER}
],
"tools": TOOLS,
"tool_choice": "auto",
# Ask for JSON if model can comply directly
"response_format": {"type": "json_object"},
"temperature": 0.2,
"max_tokens": 900
}
first = chat(payload)
msg = first["choices"][0]["message"]
# 2) If the model wants to call tools, run them and send results back
tool_messages = []
if "tool_calls" in msg:
for call in msg["tool_calls"]:
name = call["function"]["name"]
args = json.loads(call["function"]["arguments"] or "{}")
if name == "get_prices":
tool_result = get_prices(symbol=args.get("symbol", "AAPL"),
days=int(args.get("days", 60)))
else:
tool_result = {"error": f"Unknown tool {name}"}
tool_messages.append({
"role": "tool",
"tool_call_id": call["id"],
"name": name,
"content": json.dumps(tool_result)
})
# 3) Send a follow-up message containing the tool outputs
follow_payload = {
"model": MODEL,
"messages": [
{"role": "system", "content": SYSTEM},
{"role": "user", "content": USER},
msg, # the assistant message that requested tools
*tool_messages
],
"response_format": {"type": "json_object"},
"temperature": 0.2,
"max_tokens": 1200
}
final = chat(follow_payload)
out = final
else:
out = first # Model answered without tools
# 4) Parse the final JSON
content = out["choices"][0]["message"]["content"]
try:
data = json.loads(content)
except json.JSONDecodeError:
print("Model did not return valid JSON. Raw content:\n", content)
return
print("\n=== Summary ===\n", data.get("summary"))
print("\n=== Key points ===")
for i, kp in enumerate(data.get("key_points", []), 1):
print(f"{i}. {kp}")
print("\n=== Python code (plot) ===\n")
print(data.get("python_code"))
if __name__ == "__main__":
if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
raise SystemExit("Set OPENAI_API_KEY env var first.")
main()
Code breakdown:
The flow: let GPT-5 decide when to call your function (get_prices); you execute it (against mock or real data), feed the results back, and let GPT-5 finish with analysis plus Matplotlib code in JSON.
1) Imports & configuration
- requests handles HTTP calls to OpenAI.
- json, time, math, datetime are used for parsing, retries (if added), and mock data generation.
- OPENAI_API_KEY is read from env; never hardcode secrets in real projects.
- API_URL targets the Chat Completions endpoint (best known for tool calling).
- MODEL = "gpt-5"; you can swap to gpt-5-mini for cheaper experiments.
Tip: In production, wrap network calls with retry/backoff (429/5xx). A simple helper function can centralize that (you can reuse the post_with_retries helper from the Advanced example above).
2) The tool you expose to the model
def get_prices(symbol: str, days: int = 30) -> Dict[str, Any]:
...

- This is a mock OHLC generator. Replace with your real data source:
- A REST call (e.g., Yahoo, Polygon, your own DB/API).
- Caching layer (Redis) to keep latency/costs down.
- Output shape:
{
"symbol": "AAPL",
"series": [
{"date": "2025-07-01", "open": 101.2, "high": 102.0, "low": 100.6, "close": 101.8},
...
]
}

Keep it consistent; the LLM will rely on the keys you return.
3) Advertising the tool (the TOOLS spec)
TOOLS = [
{
"type": "function",
"function": {
"name": "get_prices",
"description": "Get recent OHLC data...",
"parameters": { ... JSON Schema ... }
}
}
]

- You define a JSON Schema (name, required fields, types).
- The model uses this to decide if and how to call your function.
- Keep schema minimal but precise (e.g., clamp days to a reasonable range).
4) System and User messages
- SYSTEM enforces role & output contract:
- “You are a quantitative analyst … return JSON with keys: summary, key_points, python_code.”
- USER asks for “Analyze AAPL last 60 days,” nudging the model to use a tool if it needs data.
Tip: Always restate your desired output format in SYSTEM (and/or USER). This increases compliance, especially if you don’t use schema mode.
5) First request: allow tool calling
payload = {
"model": MODEL,
"messages": [system, user],
"tools": TOOLS,
"tool_choice": "auto",
"response_format": {"type": "json_object"},
...
}

- tool_choice: "auto" lets the model decide if it needs the tool.
- response_format: "json_object" asks for JSON, but not as strict as schema mode. (That’s okay here; the focus is tool calling.)
- Low temperature (0.2) boosts determinism.
6) Detect and execute tool calls
msg = first["choices"][0]["message"]
if "tool_calls" in msg:
for call in msg["tool_calls"]:
# 1) parse arguments
# 2) run your function
# 3) build a "tool" message with the resultstool_callsis the assistant’s intent to call your function with arguments.- You must parse
call["function"]["arguments"](stringified JSON), run your function, and post results as atoolrole message back to OpenAI.
Security notes:
- Never directly execute arbitrary code sent via tool args.
- Validate inputs (symbols, ranges). Add allowlists/ratelimits for external APIs.
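As one way to apply those notes, here is a minimal sketch that validates tool arguments before executing get_prices. The allowlist is an illustrative assumption; the 5–200 bounds mirror the tool schema above.
# Sketch: validate tool arguments before running the tool.
ALLOWED_SYMBOLS = {"AAPL", "MSFT", "GOOG"}  # illustrative allowlist

def validate_get_prices_args(args: dict) -> dict:
    symbol = str(args.get("symbol", "")).upper()
    if symbol not in ALLOWED_SYMBOLS:
        raise ValueError(f"Symbol not allowed: {symbol!r}")
    # Clamp days into the range advertised in the tool schema (5-200).
    days = max(5, min(int(args.get("days", 30)), 200))
    return {"symbol": symbol, "days": days}

# Usage inside the tool-call loop:
# safe_args = validate_get_prices_args(args)
# tool_result = get_prices(**safe_args)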
7) Second request: provide tool outputs and ask GPT-5 to finish
follow_payload = {
"messages": [
system, user,
msg, # the assistant message that requested tools
*tool_messages # your tool outputs bound to the call IDs
],
"response_format": {"type":"json_object"}, ...
}

- You include:
- The original assistant message that requested tools (so the model keeps context).
- Your tool result messages with the proper tool_call_id.
- GPT-5 now has real data and completes the task (analysis + code).
8) Parse the final JSON
content = out["choices"][0]["message"]["content"]
data = json.loads(content)

- Print summary, key_points, python_code.
- If parsing fails, dump raw content—often a sign the model deviated (rare at low temperature, but possible).
9) Customization knobs
- Switch to schema mode: If you want stronger guarantees on the final JSON, use:
response_format: { "type": "json_schema", "json_schema": {...} }
- Multiple tools: Add more function specs to TOOLS. GPT-5 will pick the right one.
- Parallel calls: The API can return multiple tool_calls—run them all, then send all the tool messages back in one follow-up.
- Logging: Log both the tool args and outputs to audit the agent’s steps.
10) Common pitfalls
- Forgetting tool_call_id when sending the tool result message.
- Mismatched schemas: If your returned JSON structure diverges from your documented shape, the model may misinterpret it later.
- Rate limits: Add retry/backoff for 429/5xx (especially if your tool triggers 3rd-party APIs).
11) Testing tips
- Start with mock data (like the example) for deterministic outputs.
- Add a unit test that asserts the model returns valid JSON with the required keys.
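A sketch of that unit test might look like the following. validate_response is a small helper written here for illustration, and the fixture data is made up; in a live test you would feed it the content string returned by the follow-up request.
# Sketch of a unit test for the JSON contract the prompts ask for.
import json

REQUIRED_KEYS = {"summary", "key_points", "python_code"}

def validate_response(content: str) -> dict:
    """Parse the model's content string and check the required keys."""
    data = json.loads(content)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise AssertionError(f"Missing keys: {sorted(missing)}")
    return data

def test_validate_response_accepts_contract():
    fixture = json.dumps({
        "summary": "Upward drift with mild oscillation.",
        "key_points": ["Higher highs", "Stable volatility", "No gaps"],
        "python_code": "import matplotlib.pyplot as plt",
    })
    data = validate_response(fixture)
    assert isinstance(data["key_points"], list) and data["key_points"]

def test_validate_response_rejects_missing_keys():
    try:
        validate_response(json.dumps({"summary": "only one key"}))
    except AssertionError:
        pass
    else:
        raise AssertionError("Expected a missing-key failure")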
5.1.4 DeepSeek-VL
DeepSeek-VL is a Chinese open-source multimodal model developed by the DeepSeek team, designed to bridge the gap between vision and language processing. It represents China's significant contribution to the multimodal AI landscape, offering capabilities comparable to proprietary models but with open access for researchers and developers. The model emerged as part of China's growing AI research ecosystem, demonstrating the country's commitment to advancing state-of-the-art AI technologies while ensuring they remain accessible to the broader scientific community.
The model is specifically optimized for efficiency and vision-language reasoning, with architectural choices that prioritize computational performance while maintaining high-quality results. Its streamlined design makes it particularly suitable for deployment in resource-constrained environments, enabling advanced multimodal capabilities on more modest hardware configurations. DeepSeek-VL achieves this efficiency through careful attention to model size, training procedures, and inference optimizations. For example, it employs specialized vision encoders that extract rich visual features while minimizing computational overhead, and leverages knowledge distillation techniques to compress larger models' capabilities into more compact architectures.
In performance evaluations, DeepSeek-VL is often benchmarked against industry leaders like GPT-4V and Flamingo, where it demonstrates competitive results at a fraction of the computational cost. This makes it an attractive option for cost-effective deployments in production environments, particularly for organizations seeking multimodal capabilities without the expense associated with commercial API usage. Benchmark studies have shown that DeepSeek-VL achieves 85-90% of the performance of these larger models on standard vision-language tasks while requiring significantly less computational resources. This performance-to-cost ratio has made it particularly popular among startups, academic institutions, and developers in emerging markets.
The model excels in tasks requiring detailed visual understanding combined with natural language reasoning, such as image captioning, visual question answering, and complex scene interpretation. DeepSeek-VL's architecture incorporates specialized attention mechanisms that allow it to focus on relevant visual elements when answering questions or generating descriptions.
This capability enables applications ranging from assisting visually impaired users to automating content moderation and enhancing e-commerce product discovery through visual search. The model also demonstrates strong performance in cross-cultural visual contexts, making it particularly valuable for applications serving diverse global audiences.
Example: Using DeepSeek-VL for Image Understanding
# Install dependencies first
# pip install transformers torch pillow
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image
import requests
import matplotlib.pyplot as plt
# Download and load an example image
image_url = "https://files.cuantum.tech/images/deep-seek-descriptive.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
# Load DeepSeek-VL model and processor
model_name = "deepseek-ai/deepseek-vl-7b-chat"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
# Create a prompt for the model
prompt = "Describe what you see in this image in detail."
# Process the inputs
inputs = processor(text=prompt, images=image, return_tensors="pt")
# Generate a response
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=512,
do_sample=False
)
# Decode the response
generated_text = processor.decode(outputs[0], skip_special_tokens=True)
# Display the image and response
plt.figure(figsize=(10, 8))
plt.imshow(image)
plt.axis('off')
plt.title('Input Image')
plt.show()
print("DeepSeek-VL's response:")
print(generated_text.split("ASSISTANT:")[-1].strip())
Code Breakdown: Using DeepSeek-VL for Image Understanding
The example above demonstrates how to use DeepSeek-VL for a basic image understanding task. Here's a detailed breakdown of each section:
1. Dependencies and Setup
- Key libraries: The code uses transformers for model access, torch for tensor operations, and PIL for image handling.
- Image acquisition: Fetches a sample image from a URL using requests and opens it with PIL.
2. Model Initialization
- Model selection: Uses the 7B-parameter chat-tuned version of DeepSeek-VL (deepseek-ai/deepseek-vl-7b-chat).
- Processor loading: The AutoProcessor handles both tokenization of text and preprocessing of images.
- Model loading: trust_remote_code=True is required as DeepSeek-VL uses custom code for its implementation.
3. Input Processing
- Prompt creation: A simple prompt asking for image description, but you can use more specific prompts like "What objects are in this image?" or "Explain what's happening in this scene."
- Multimodal processing: The processor combines both text input (prompt) and image input into a format the model can understand.
- Return format: return_tensors="pt" specifies PyTorch tensors as the output format.
4. Response Generation
- Inference with torch.no_grad(): Disables gradient calculation for efficiency during inference.
- Generation parameters: max_new_tokens=512 limits response length to 512 tokens; do_sample=False uses greedy decoding instead of sampling for deterministic outputs.
5. Response Processing and Visualization
- Decoding: Converts token IDs back to human-readable text.
- Response extraction: Splits the output to get only the assistant's response portion.
- Visualization: Displays the input image alongside the generated description.
Advanced Usage Patterns
Beyond this basic example, DeepSeek-VL supports several advanced capabilities:
- Visual reasoning: You can ask complex questions about relationships between objects in the image.
- Multi-image analysis: Process multiple images by passing a list to the processor.
- Fine-tuning: Adapt the model to specific domains using techniques like LoRA or QLoRA.
- Memory efficiency: For resource-constrained environments, consider using quantization:
# For 8-bit quantization
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_enable_fp32_cpu_offload=True
)
model = AutoModel.from_pretrained(
model_name,
trust_remote_code=True,
quantization_config=quantization_config,
device_map="auto"
)

Implementation Considerations:
- Hardware requirements: DeepSeek-VL 7B requires at least 16GB GPU memory for full precision, but can run on consumer GPUs with quantization.
- Inference speed: First-time inference includes model loading time; subsequent calls are faster.
- Response format: The model follows a chat format with "ASSISTANT:" prefix. For cleaner outputs, always strip this prefix.
- Error handling: In production, add try-except blocks to handle image loading failures and timeout configurations for large images.
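As a concrete example of that last point, here is a minimal sketch of defensive image loading. The timeout value and the RGB conversion are illustrative choices, not requirements of DeepSeek-VL.
# Sketch: defensive image loading with a timeout and graceful failure.
from io import BytesIO
from typing import Optional

import requests
from PIL import Image

def load_image_safely(url: str, timeout: float = 10.0) -> Optional[Image.Image]:
    """Fetch and decode an image, returning None instead of raising on failure."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        image = Image.open(BytesIO(response.content))
        image.load()                 # decode now so errors surface here
        return image.convert("RGB")  # normalize mode for the vision processor
    except (requests.RequestException, OSError) as exc:
        print(f"Could not load image from {url}: {exc}")
        return None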
DeepSeek-VL represents a significant advancement in making multimodal AI accessible to developers, particularly those seeking open-source alternatives to proprietary models like GPT-4V or Gemini.
Example: Advanced Visual Question Answering with DeepSeek-VL
# Install required libraries
# pip install transformers torch pillow matplotlib requests
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import matplotlib.pyplot as plt
from io import BytesIO
# Function to load and display an image from a URL
def load_and_display_image(image_url, title="Input Image"):
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))
plt.figure(figsize=(10, 8))
plt.imshow(image)
plt.axis('off')
plt.title(title)
plt.show()
return image
# Load DeepSeek-VL model and processor
model_id = "deepseek-ai/deepseek-vl-7b-chat"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16, # Use half precision for efficiency
device_map="auto", # Automatically distribute across available GPUs
trust_remote_code=True
)
# Sample image URLs for visual reasoning tasks
image_urls = [
"https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg", # People at a table
"https://files.cuantum.tech/images/deep-seek-chart.jpg" # Charts/graphs
]
# Load and display the first image
image = load_and_display_image(image_urls[0])
# Function to generate responses for a given image and prompt
def generate_vl_response(image, prompt, max_new_tokens=256):
# Create chat message format
messages = [
{"role": "user", "content": prompt}
]
# Process inputs
inputs = processor(
messages=messages,
images=image,
return_tensors="pt"
).to(model.device)
# Generate response with customized parameters
generated_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True, # Enable sampling for more diverse outputs
temperature=0.7, # Control randomness (higher = more random)
top_p=0.9, # Nucleus sampling parameter
repetition_penalty=1.1 # Discourage repetition
)
# Decode response
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# Extract assistant's response
response = generated_text.split("ASSISTANT:")[-1].strip()
return response
# Example prompts for different visual reasoning tasks
prompts = [
"Describe this image in detail. What are the people doing?",
"Count how many people are in this image and describe what each person is wearing.",
"What emotions can you detect on people's faces in this image?",
"If you had to create a story based on this image, what would it be?"
]
# Generate and display responses
for i, prompt in enumerate(prompts):
print(f"\nPrompt {i+1}: {prompt}")
print("-" * 50)
response = generate_vl_response(image, prompt)
print(response)
print("=" * 80)
# Load the second image (charts/graphs) for technical analysis
technical_image = load_and_display_image(image_urls[1], "Technical Chart")
# Technical analysis prompt
technical_prompt = "Analyze this chart. What patterns do you observe? What conclusions can you draw from this data visualization?"
# Generate and display technical analysis
print(f"\nTechnical Analysis Prompt: {technical_prompt}")
print("-" * 50)
response = generate_vl_response(technical_image, technical_prompt, max_new_tokens=512)
print(response)
Comprehensive Code Breakdown: Advanced DeepSeek-VL Implementation
This code example demonstrates how to leverage DeepSeek-VL for sophisticated visual reasoning tasks. Let's break down each component:
1. Setup and Model Initialization
- Library imports: Beyond basic dependencies, we specifically import AutoModelForCausalLM, which provides a more flexible interface for generative tasks than the basic AutoModel used in the previous example.
- Helper function: load_and_display_image() encapsulates image loading logic, making the code more modular and reusable.
- Model optimization: torch_dtype=torch.float16 enables half-precision computation, reducing memory usage by approximately 50% with minimal impact on output quality; device_map="auto" intelligently distributes model layers across available GPUs or uses CPU offloading when needed.
2. Multi-image Processing
- Image collection: Stores multiple image URLs for different analysis scenarios, demonstrating DeepSeek-VL's versatility.
- Sequential processing: The code is structured to analyze multiple images with different prompts, showcasing how the model handles diverse visual contexts.
3. Response Generation Function
- Chat-style formatting: Unlike the previous example, this implementation uses DeepSeek-VL's chat interface through the messages parameter, which better aligns with conversational applications.
- Generation parameters:
- do_sample=True and temperature=0.7: Enables controlled randomness in outputs, producing more natural and diverse responses.
- top_p=0.9: Implements nucleus sampling, which dynamically filters the token probability distribution.
- repetition_penalty=1.1: Reduces the likelihood of generating repetitive phrases, improving response quality.
4. Task Diversification
- Multiple prompt types: The example includes different types of visual reasoning tasks:
- Descriptive: "Describe this image in detail..."
- Quantitative: "Count how many people..."
- Emotional analysis: "What emotions can you detect..."
- Creative: "If you had to create a story..."
- Technical analysis: "Analyze this chart..."
5. Performance Considerations
- Memory management: The example uses half-precision (float16) and automatic device mapping to optimize memory usage.
- Response length control: max_new_tokens is adjusted based on the complexity of the task, with technical analysis allowed a longer response (512 tokens vs. 256).
- Prompt engineering: The prompts are carefully crafted to elicit specific types of visual reasoning, demonstrating how prompt design affects model output.
6. Real-world Application Scenarios
- This implementation demonstrates DeepSeek-VL's capabilities in several practical use cases:
- Social media content analysis: Understanding context and relationships in photos.
- Data visualization interpretation: Extracting insights from charts and graphs.
- Content moderation: Detecting emotional content and potentially sensitive material in images.
- Creative assistance: Helping generate stories or content based on visual inspiration.
7. Extension Possibilities
- This code could be extended in several ways:
- Batch processing: Modify to handle multiple images simultaneously for higher throughput.
- Interactive applications: Integrate into a web interface where users can upload images and select analysis types.
- Multi-turn conversations: Expand the messages array to include previous exchanges for contextual understanding (see the sketch after this list).
- Integration with other models: Combine DeepSeek-VL's outputs with specialized models for tasks like object detection or sentiment analysis.
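To illustrate the multi-turn idea, here is a sketch of what an expanded messages list could look like before it is handed to the processor. The exact message format DeepSeek-VL's processor expects may differ, so treat the role names and the commented call as assumptions based on the chat-style call in the example above.
# Sketch: carrying earlier turns into the next request (format assumed to
# mirror the chat-style `messages` used in generate_vl_response above).
conversation = [
    {"role": "user", "content": "Describe this image in detail."},
    {"role": "assistant", "content": "Several people are seated around a table sharing a meal."},
    {"role": "user", "content": "What might they be celebrating?"},
]

# Then reuse the same processing path as before, for example:
# inputs = processor(messages=conversation, images=image, return_tensors="pt").to(model.device)
# generated_ids = model.generate(**inputs, max_new_tokens=256)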
This advanced implementation highlights DeepSeek-VL's flexibility and power for complex visual-language reasoning tasks, making it suitable for both research and production applications where understanding images in context is critical.
5.1.5 Why Text+Image Matters
Accessibility: Helping visually impaired users understand images by providing detailed descriptions of visual content. These models can identify objects, people, scenes, and even interpret spatial relationships, allowing visually impaired individuals to "see" through AI-generated descriptions. They can also assist with navigation by describing surroundings or identifying potential hazards.
For visually impaired individuals, multimodal AI serves as an essential bridge to visual content. These systems go beyond simple object recognition to provide context-rich descriptions that convey the full meaning of images. When a visually impaired person encounters an image online, in a document, or through a specialized device, multimodal models can:
- Generate comprehensive scene descriptions that include not just what objects are present, but their arrangement, colors, lighting, and overall composition
- Identify and describe people in photos, including facial expressions, clothing, actions, and apparent relationships between individuals
- Read and interpret text within images, such as signs, menus, product labels, and instructions
- Recognize landmarks and provide spatial awareness in unfamiliar environments
In real-world applications, these capabilities are being integrated into smartphone apps that can narrate the visual world in real-time, smart glasses that provide audio descriptions of surroundings, and screen readers that can interpret complex visual elements on websites. The technology is particularly valuable for educational materials, allowing visually impaired students to access diagrams, charts, and illustrations that would otherwise be inaccessible without human assistance.
The advancement of these multimodal systems represents a significant step forward in digital inclusivity, empowering visually impaired users with greater independence and access to information that was previously unavailable to them.
Education: Explaining diagrams, charts, or historical photos to enhance learning experiences. Multimodal models can break down complex visualizations into understandable components, clarify scientific diagrams, provide historical context for photographs, and even translate visual mathematical notation into explanations. This makes educational content more accessible and comprehensible across various subjects and learning styles.
In educational contexts, multimodal AI serves as a powerful teaching assistant that bridges visual and textual information:
- For STEM education, these models can analyze complex scientific diagrams and:
- Convert abstract visual concepts into clear, step-by-step explanations
- Identify and label components of biological systems, chemical structures, or engineering schematics
- Translate mathematical expressions and equations into plain language interpretations
- In history and social studies, multimodal models enhance learning by:
- Providing detailed context for historical photographs, including time period, cultural significance, and historical relevance
- Analyzing primary source documents with both textual and visual elements
- Making connections between visual artifacts and broader historical narratives
- For data literacy, these systems help students by:
- Breaking down complex charts and graphs into comprehensible insights
- Explaining statistical visualizations and data trends in accessible language
- Teaching students how to interpret different types of data representations
These capabilities are particularly valuable for students with different learning styles, allowing visual learners to receive verbal explanations and verbal learners to better understand visual content. They also support personalized learning by adapting explanations to different educational levels, from elementary to advanced university courses.
Creative work: Generating captions, stories, or descriptions that can inspire artists, writers, and content creators. These models can suggest creative interpretations of images, develop narratives based on visual scenes, assist with storyboarding by describing sequential images, and help marketers craft compelling visual content with appropriate messaging.
For creative professionals, multimodal AI serves as both muse and collaborator. Writers facing creative blocks can use these systems to generate story prompts from visual inspiration. When shown an image of a misty forest at dawn, for instance, the AI might suggest narrative elements like "a forgotten path leading to an ancient secret" or "the meeting place of two worlds." This capability transforms random visual stimuli into structured creative starting points.
Visual artists and designers benefit from AI-generated descriptions that highlight elements they might otherwise overlook. A photographer reviewing their portfolio might gain new perspective when the AI points out "the interplay of shadow and reflection creates a natural frame around the subject" or "the unexpected color contrast draws attention to the emotional center of the image."
In film and animation, these models streamline the pre-production process. Storyboard artists can quickly generate descriptive text for sequential panels, helping directors and producers visualize narrative flow before committing resources to production. The AI can suggest camera angles, lighting moods, and scene transitions based on visual references, accelerating the creative development cycle.
For content marketers, multimodal models bridge the gap between visual assets and compelling messaging. When analyzing product photography, these systems can generate targeted copy that aligns with both the visual elements and brand voice, ensuring consistent communication across channels. This capability is particularly valuable for social media campaigns where striking visuals must be paired with concise, engaging text in multiple formats and platforms.
Productivity: Extracting structured insights from documents, tables, or screenshots, which saves time and improves efficiency in professional settings. Instead of manually parsing visual data, users can leverage AI to convert tables into spreadsheets, extract key information from receipts or business cards, analyze graphs and charts in reports, and transform handwritten notes into searchable text.
This productivity advantage manifests across numerous professional workflows:
- In financial services, multimodal AI can automatically process invoices and receipts by:
- Identifying vendor information, dates, and payment amounts
- Categorizing expenses according to predefined accounting codes
- Flagging potential discrepancies or unusual charges
- For research and analysis, these systems can:
- Extract precise numerical data from complex charts and graphs
- Convert statistical visualizations into structured datasets
- Summarize key trends and outliers identified in visual data
- In administrative workflows, multimodal AI streamlines:
- Business card digitization for immediate contact database integration
- Form processing without manual data entry
- Meeting note transcription with automatic action item extraction
The time savings are substantial—tasks that would require hours of manual data entry can be completed in seconds, while also reducing human error. For organizations handling large volumes of visual documents, this capability transforms information management by making previously inaccessible data searchable, analyzable, and actionable.
Multimodal models bring us closer to AI that interacts with the world as humans do: through multiple senses, not just words. By bridging the gap between visual perception and language understanding, these technologies create more intuitive and natural human-AI interactions that reflect how we naturally process information through multiple channels simultaneously.
5.1 Text+Image Models (LLaVA, Flamingo, GPT-4o, DeepSeek-VL)
So far, we have focused on models that live in the world of words. But human intelligence is multimodal: we learn by reading, seeing, hearing, and interacting with the world. For AI to approach this kind of understanding, language models must also expand beyond text.
This limitation of text-only models becomes evident when we consider how humans perceive and process information. We don't experience the world as isolated streams of text—we integrate visual cues, sounds, and physical interactions to form a comprehensive understanding. Traditional LLMs, despite their impressive capabilities with language, lack this holistic perception that comes naturally to humans.
This is where multimodal LLMs come in. By combining text with images, audio, or video, these models can:
- Describe what they "see" in pictures, recognizing objects, scenes, actions, and even emotional context within visual content.
- Answer questions about charts or diagrams, interpreting visual data representations and translating visual patterns into meaningful insights.
- Connect written descriptions to visual understanding, bridging the gap between abstract concepts described in words and their concrete visual manifestations.
- Support real-world tasks like tutoring, accessibility tools, and robotics, where understanding multiple forms of communication is essential for effective assistance.
Multimodal systems represent a significant leap forward in AI capabilities. Rather than processing each type of data in isolation, these models create connections between different forms of information, much like the human brain integrates signals from our various senses. This cross-modal reasoning allows for richer understanding and more natural interactions with AI systems.
In this chapter, we'll explore how researchers are pushing LLMs beyond text, starting with one of the most active areas: Text+Image models.
Text+Image models extend language models by integrating visual encoders with text-based transformers. This integration represents a significant advancement in AI, allowing models to process and understand both visual and textual information simultaneously. In practice, this integration involves several key components working together:
- An image encoder (like CLIP's vision transformer or a convolutional net) processes an image into embeddings. This encoder analyzes the visual content pixel by pixel, identifying features such as shapes, colors, objects, spatial relationships, and even contextual elements. The encoder works through multiple processing layers, each extracting increasingly complex information:
- Low-level features: First, the encoder detects basic elements like edges, textures, and color patterns across the image. This initial layer of processing works similarly to how our eyes first perceive visual information - identifying contrasts between light and dark, detecting boundaries between colors, and registering texture variations (like smooth vs. rough surfaces).
This stage is computationally intensive as the model must analyze every pixel and its relationship to neighboring pixels. For example, when processing a photograph of a forest, the encoder might identify:
- Vertical lines representing tree trunks
- Irregular patterns of green representing foliage
- Textural differences between rough bark and smooth leaves
- Shadow gradients indicating depth and lighting direction
- Color transitions between sky and terrain
The encoder uses specialized filters that respond to specific patterns - some detect horizontal lines, others vertical lines, while others identify specific color gradients or textural elements. These filters work in parallel across the entire image, creating feature maps that highlight where each pattern appears most strongly.
These fundamental visual elements form the building blocks for all higher-level recognition, much like how letters combine to form words and sentences in language processing. Without accurate detection at this stage, the more complex recognition tasks in subsequent layers would fail.
- Mid-level features: These basic elements are then combined to recognize more complex structures such as specific shapes, object parts, and spatial arrangements. At this stage, the model begins to identify meaningful patterns - recognizing that certain edges form the outline of a face, or that particular textures likely represent fur, fabric, or foliage.
This mid-level processing is crucial because it bridges the gap between raw visual data and semantic understanding. For example, when processing an image of a person walking a dog in a park:
- The model might recognize curved lines and color patterns that form the silhouette of a human figure
- It identifies four-legged shapes with characteristic proportions that indicate "dog"
- It detects textural patterns of grass, trees, and sky that suggest "outdoor environment"
- It recognizes spatial configurations that establish the relationship between person and dog (connected by a leash)
The model also starts to understand spatial relationships, determining when objects are above, below, or inside others. These spatial relationships provide critical context - a cup on a table has different implications than a table on a cup. The model learns to recognize standard spatial arrangements (like furniture in a room) and unusual configurations that might require special attention.
- High-level features: Finally, the encoder identifies complete objects, scenes, actions, and the relationships between elements in the image. This is where true "understanding" emerges, as the model recognizes not just isolated objects but meaningful context - distinguishing between a dog sitting on a sofa versus running through a park, or understanding that a person holding a tennis racket near a net represents a specific activity.
At this highest level of processing, the model performs several sophisticated cognitive tasks:
- Object recognition and classification: The model can identify whole entities (people, animals, vehicles, furniture) and categorize them into specific types or classes (German Shepherd dog, mid-century sofa, professional tennis player).
- Scene understanding: Beyond individual objects, the model comprehends entire environments - recognizing a kitchen from its appliances and layout, or a beach scene from the combination of sand, water, and distinctive lighting.
- Action recognition: The model can interpret dynamic elements - differentiating between someone running versus walking, or throwing versus catching - based on posture, positioning, and contextual cues.
- Relationship detection: Perhaps most impressively, the model identifies how objects relate to each other spatially and functionally - recognizing that a person is walking a dog (connected by a leash), riding a bicycle (positioned on top), or cooking food (performing actions on ingredients).
- Contextual inference: The model makes educated guesses about the broader situation - inferring a birthday celebration from candles on a cake and gathering of people, or a professional meeting from business attire and a conference room setting.
The model can also interpret emotional content, social interactions, and even infer potential narratives within the scene. It might recognize facial expressions indicating happiness or concern, body language suggesting tension or relaxation, or social dynamics like a teacher instructing students or friends enjoying a meal together. Through extensive training on millions of images with corresponding descriptions, the model learns to associate visual patterns with rich semantic concepts, enabling it to "see" at a level that approximates human understanding.
The result is a dense representation of the image's content in a numerical format that the model can process - essentially translating visual information into a "language" that the AI can understand and reason with.
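To make this concrete, the short sketch below pulls a dense, patch-level representation out of CLIP's vision encoder using Hugging Face Transformers. The checkpoint name, image path, and printed shape are illustrative assumptions; the actual sizes depend on which encoder a given multimodal model uses.
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

# Illustrative checkpoint; multimodal models differ in which vision encoder they use
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image file
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = encoder(pixel_values=pixel_values)

# One embedding per image patch plus a [CLS] token: [batch, num_patches + 1, hidden_dim]
patch_embeddings = outputs.last_hidden_state
print(patch_embeddings.shape)  # e.g. torch.Size([1, 50, 768]) for ViT-B/32 at 224x224
This grid of patch embeddings is the "dense representation" that the projection layer, discussed next, must translate into the language model's space.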
- A projection layer maps those embeddings into the same space as the language model's tokens. This critical alignment step ensures that visual information and text information can be processed together. Without this projection, the model would struggle to make meaningful connections between what it sees and what it understands through language.
The projection layer essentially translates the "language of images" into a format compatible with the "language of text," allowing both modalities to coexist in the same computational space. This process involves several sophisticated transformations:
Dimensionality alignment: Image embeddings and text embeddings often have different dimensions and structures. The projection layer reshapes visual features to match the exact dimensions expected by the language model, ensuring that every visual concept can be represented in a way the text processing components can interpret. This process involves complex mathematical transformations that convert the high-dimensional tensors from the vision encoder (which might have shapes like [batch_size, sequence_length, vision_dimension]) into the format required by the language model (typically [batch_size, sequence_length, hidden_dimension]).
For example, a vision encoder might output features with 1024 dimensions per token, while the language model might work with 768-dimensional embeddings. The projection layer would then implement a learned linear transformation (essentially a matrix multiplication) that maps each 1024-dimensional vector to a 768-dimensional vector while preserving as much semantic information as possible.
This alignment is not just about matching numbers - it's about preserving the rich semantic relationships captured in the visual domain. The projection parameters are learned during training, allowing the model to discover optimal mappings between visual concepts and their linguistic counterparts. This ensures that when the language model attends to these projected visual features, it can extract meaningful information that corresponds to concepts it understands through language.
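As a minimal sketch of this alignment step, using the 1024-to-768 dimensions from the example above as assumed sizes, the projection can be as simple as a single learned linear layer (the original LLaVA uses a linear projection of roughly this form; LLaVA-1.5 and other models use a small MLP or cross-attention module instead):
import torch
import torch.nn as nn

# Hypothetical sizes matching the example above: a 1024-dim vision encoder
# feeding a 768-dim language model
vision_dim, text_dim = 1024, 768

# A single learned linear map; some models use a small MLP instead
projection = nn.Linear(vision_dim, text_dim)

# One image represented as 256 visual tokens of width 1024
visual_tokens = torch.randn(1, 256, vision_dim)

projected = projection(visual_tokens)
print(projected.shape)  # torch.Size([1, 256, 768]) -- now the language model's embedding width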
Semantic mapping: Beyond simple dimension matching, the projection layer learns to map visual concepts to their linguistic counterparts. For example, the visual features representing "a red apple" must be projected into a space where they can interact meaningfully with the text tokens for "red" and "apple."
This semantic mapping is a sophisticated translation process that bridges two fundamentally different representational systems. When processing an image of a red apple, the vision encoder extracts features capturing its roundness, smooth texture, red coloration, and stem. These visual features exist as abstract numerical patterns distributed across multiple embedding dimensions. The projection layer must transform these distributed visual patterns into representations that align with how language models understand concepts like "red" (a color attribute) and "apple" (a fruit category).
The challenge is significant because visual and linguistic representations are structured differently:
- In vision, concepts are often entangled - the "redness" and "appleness" exist simultaneously in the same pixels and are processed together.
- In language, concepts are more discrete - "red" and "apple" are separate tokens with distinct meanings that compose together.
Through extensive training on paired image-text data, the projection layer learns to disentangle these visual features and map them to their linguistic counterparts. When successful, the projected visual features will activate similar neural patterns as would be activated by the text "red apple" in the language model. This enables the language model to reason about the visual content using its language understanding capabilities - for instance, answering questions like "What color is the apple?" by connecting the visual representation to the appropriate linguistic concept "red".
This semantic alignment is what allows multimodal models to perform cross-modal reasoning tasks, such as describing unseen objects, answering questions about visual content, or generating text that references visual elements in contextually appropriate ways.
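One common way such alignment is learned is a CLIP-style contrastive objective over paired image-text data, sketched below with toy tensors. Models like LLaVA instead train their projection with the language model's next-token prediction loss, but the goal is the same: matched visual and textual representations should land close together.
import torch
import torch.nn.functional as F

# Toy tensors standing in for projected visual features and matching text features
batch, dim = 4, 768
image_feats = F.normalize(torch.randn(batch, dim), dim=-1)
text_feats = F.normalize(torch.randn(batch, dim), dim=-1)

# Pairwise similarities scaled by a temperature; the i-th image matches the i-th text
logits = image_feats @ text_feats.T / 0.07
labels = torch.arange(batch)

# Cross-entropy pulls matched image-text pairs together and pushes mismatches apart
loss = F.cross_entropy(logits, labels)
print(loss.item())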
Contextual integration: The projection ensures that contextual relationships in the visual domain (like spatial relationships between objects) are preserved in a way that the language model can access and reason about. This allows the model to answer questions about relative positions or interactions between objects in an image.
This contextual integration is particularly crucial because visual scenes contain rich spatial and relational information that must be translated into a format the language model can process. For example, when looking at an image of a dining table, the model needs to understand not just that there are plates, glasses, and utensils, but their arrangement (plates in front of chairs, glasses above plates, forks to the left of plates), their groupings (place settings), and their functional relationships (napkins folded on plates).
The projection layer preserves these spatial hierarchies by maintaining relative positional information between visual features. Through specialized attention mechanisms, it ensures that:
- Proximity relationships ("the book is next to the lamp") are encoded in ways that language models can interpret
- Containment relationships ("the apple is in the bowl") maintain their hierarchical structure
- Directional relationships ("the dog is facing the camera") preserve orientation information
- Scale relationships ("the elephant is larger than the mouse") retain relative size information
This sophisticated mapping enables the model to correctly interpret questions like "What's above the bookshelf?", "Is the child holding the balloon?", or "Which way is the car facing?" - questions that require understanding not just what objects are present but how they relate to one another in physical space.
Without proper contextual integration, a model might recognize all objects in an image but fail to understand their meaningful relationships, severely limiting its ability to reason about scenes as humans naturally do.
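How spatial information survives this pipeline varies by architecture. A common ingredient, sketched below with assumed sizes rather than any specific model's values, is a positional embedding added to each patch feature so that every visual token carries both its content and its location in the image grid:
import torch
import torch.nn as nn

# Assumed sizes: a 14x14 patch grid (as in ViT-B/16 at 224x224) with 768-dim features
grid_h, grid_w, dim = 14, 14, 768

patch_features = torch.randn(1, grid_h * grid_w, dim)
position_embeddings = nn.Parameter(torch.randn(1, grid_h * grid_w, dim))

# Each token now encodes "what" (patch content) plus "where" (its slot in the grid),
# so later attention layers can reason about above/below, inside/outside, and so on
tokens_with_position = patch_features + position_embeddings
print(tokens_with_position.shape)  # torch.Size([1, 196, 768])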
- The language model treats visual embeddings as if they were special tokens, allowing it to "attend" to both words and pixels. Through self-attention mechanisms, the model can create connections between visual elements and textual concepts, forming a comprehensive understanding that spans both modalities.
This integration happens through a sophisticated process where the transformer architecture's self-attention mechanism simultaneously processes both text tokens and visual tokens. When a user asks "What color is the car in this image?", the model's attention heads can focus on:
- The visual embeddings representing the car in the image
- The textual tokens related to "color" and "car" in the query
- The contextual relationship between these elements
The self-attention weights form a complex web of connections, allowing information to flow bidirectionally between modalities. For example, when processing an image of a red sports car alongside text mentioning "vehicle," the model can:
- Associate visual features of the car with the word "vehicle" in the text
- Connect color properties from the visual embedding to potential color descriptions
- Link spatial relationships in the image (car on road) to potential scene descriptions
This cross-modal attention enables the model to perform tasks like visual question answering, image captioning, and text-conditional reasoning about visual content. The attention maps themselves reveal how the model distributes focus across different parts of both the image and text when forming its understanding.
This allows the model to reason about relationships between what it "sees" and what it "reads."
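The sketch below illustrates this idea in isolation: projected visual tokens and text-token embeddings are concatenated into one sequence, and an ordinary self-attention layer runs over the combined sequence. Sizes are illustrative, and a real model stacks many such layers inside the transformer.
import torch
import torch.nn as nn

# Illustrative sizes: 64 projected visual tokens plus a 12-token question
dim = 768
visual_tokens = torch.randn(1, 64, dim)
text_tokens = torch.randn(1, 12, dim)

# Concatenate into one sequence and run ordinary self-attention over it
sequence = torch.cat([visual_tokens, text_tokens], dim=1)  # [1, 76, 768]
attention = nn.MultiheadAttention(embed_dim=dim, num_heads=12, batch_first=True)
fused, attn_weights = attention(sequence, sequence, sequence)

# attn_weights[0, 64:, :64] shows how strongly each text position attends to each
# visual token -- the cross-modal connections described above
print(fused.shape, attn_weights.shape)  # torch.Size([1, 76, 768]) torch.Size([1, 76, 76])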
This fusion of visual and textual processing creates a powerful system that can understand context across modalities, enabling it to answer prompts like:
- "What's written on the sign in this photo?" - requiring text recognition within images and understanding of visual context. The model must identify text elements embedded within the visual scene, distinguish them from other visual features, and accurately transcribe the text while maintaining awareness of the sign's context in the broader image (whether it's a street sign, store front, warning notice, etc.).
- "Describe this chart in plain English." - requiring interpretation of data visualizations and translation into natural language. Here, the model must recognize the chart type (bar graph, pie chart, line graph, etc.), identify axes labels, data points, and trends, then synthesize this information into coherent prose that captures the key relationships and insights presented in the visualization.
- "Write a story about this image." - requiring creative generation based on visual stimuli and understanding of narrative elements. This complex task requires the model to recognize not just objects but their relationships, potential emotional content, implied actions or intentions, and then use these elements to create a coherent narrative with characters, setting, plot, and thematic elements that plausibly extend from what's visible in the image.
5.1.1 LLaVA (Large Language and Vision Assistant)
Open-source model combining CLIP for vision + Vicuna (LLM). CLIP (Contrastive Language-Image Pre-training) serves as the vision encoder that processes and extracts features from images, while Vicuna, a fine-tuned version of LLaMA, handles the language processing capabilities. The architecture leverages CLIP's powerful visual representation ability, which was trained on 400 million image-text pairs to understand visual concepts, and combines it with Vicuna's advanced language understanding and generation capabilities.
LLaVA follows a two-stage training process. First, it's pretrained on a large corpus of image-text pairs to establish basic connections between visual and linguistic information. Then, it's specifically trained on instruction-following data that pairs images with text prompts. This training approach enables LLaVA to understand and respond to specific instructions about visual content, going beyond simple image captioning to more complex reasoning about what it sees. This instruction-tuning is what gives LLaVA its ability to follow nuanced directions when analyzing images, rather than just generating generic descriptions.
The training dataset includes approximately 158,000 image-text instruction pairs, carefully curated to cover a wide range of visual reasoning tasks, from simple object identification to complex scene interpretation. This instruction-tuning phase is crucial as it teaches the model to follow specific directives when analyzing visual content. The dataset incorporates diverse image types including natural photographs, diagrams, charts, screenshots, and artistic images, ensuring the model can handle various visual formats. The text instructions are similarly diverse, ranging from simple requests like "What color is the car?" to more complex ones like "Explain the relationship between the people in this image and what they might be feeling."
Example task: describing an image in detail. LLaVA can generate comprehensive descriptions that include object identification, spatial relationships, attributes, actions, and even infer context or emotions from visual scenes. Its descriptions can range from factual observations to more interpretative analyses depending on the prompt.
For instance, when shown an image of a city street, LLaVA can identify not only the vehicles, pedestrians, and buildings, but also describe their relationships (e.g., "a person crossing the street while cars wait at a red light"), infer weather conditions based on visual cues (e.g., "wet pavement suggests recent rainfall"), and even comment on the likely time of day based on lighting conditions and shadows. The model can also perform more specialized tasks like reading text in images, analyzing charts or graphs, identifying landmarks, and recognizing famous people or artwork, demonstrating its versatility across different visual analysis scenarios.
LLaVA stands out for its efficient architecture that achieves strong performance while requiring relatively modest computational resources compared to proprietary alternatives. Its open-source nature has made it a popular choice for researchers and developers working on vision-language applications. The model's architecture is notably streamlined, using a simple projection layer to connect CLIP's vision embeddings with Vicuna's language processing capabilities. This approach avoids the computational overhead of more complex cross-attention mechanisms while still enabling effective communication between the visual and language components. The smaller variants of LLaVA can run on consumer-grade GPUs with 16GB of memory, making advanced multimodal AI accessible to a much broader range of researchers and developers than closed-source alternatives that may require specialized hardware.
The model achieves competitive performance on benchmarks such as VQAv2 (Visual Question Answering) and GQA (Grounded Question Answering), while being significantly more resource-efficient than closed-source alternatives like GPT-4V. On the VQAv2 benchmark, which evaluates a model's ability to answer questions about images, LLaVA-1.5 achieves scores comparable to much larger proprietary models. Its accessibility allows developers to fine-tune it for specific domains or applications, such as medical image analysis (interpreting X-rays, CT scans, and other medical imaging), retail product recognition (identifying products in shelves or catalog images), or educational content development (explaining scientific diagrams or historical artifacts), fostering a growing ecosystem of specialized multimodal AI applications. The model has inspired numerous derivatives and extensions in the open-source community, including versions optimized for different languages, specialized for particular domains like document understanding, or modified to work with video input rather than static images.
Code Example: Using LLaVA for Multimodal Processing
# Complete LLaVA implementation example
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
# Step 1: Load the pre-trained LLaVA model and processor
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
# Step 2: Prepare the image
image = Image.open("colosseum.jpg")
# Step 3: Define your prompt (LLaVA-1.5 chat format; the <image> placeholder marks
# where the visual tokens are inserted)
prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"
# Step 4: Process the inputs
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt"
).to(model.device)
# Step 5: Generate the response
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
# Step 6: Decode and print the response
generated_text = processor.decode(output[0], skip_special_tokens=True)
print(generated_text)
For this example, download the Colosseum image here: https://files.cuantum.tech/images/colosseum.jpg
Code Breakdown: Using LLaVA for Multimodal Processing
This code demonstrates how to use the LLaVA (Large Language and Vision Assistant) model to process images and generate descriptive text. Let's break down each part in detail:
1. Imports and Setup
- torch: The PyTorch library provides tensor computation and neural networks functionality.
- PIL.Image: The Python Imaging Library allows us to open and manipulate image files.
- AutoProcessor: Automatically selects the appropriate processor for the model, handling both text tokenization and image preprocessing.
- LlavaForConditionalGeneration: The main LLaVA model class that combines vision and language capabilities.
2. Model Loading
The code loads the LLaVA 1.5 7B model from Hugging Face, which is a moderate-sized variant balancing performance and resource requirements:
- torch_dtype=torch.float16: Uses half-precision floating-point format to reduce memory usage.
- device_map="auto": Automatically determines the optimal device placement strategy, distributing model components across available GPUs or using CPU as needed.
3. Input Preparation
The code prepares two key inputs:
- An image loaded using PIL's Image.open() function.
- A text prompt in LLaVA-1.5's chat format that specifies the task ("USER: <image>\nDescribe this image in detail. ASSISTANT:"); the <image> placeholder marks where the visual tokens are injected.
The processor then (see the inspection sketch after this list):
- Resizes and normalizes the image to match the vision encoder's expected input format (336x336 pixels for LLaVA-1.5's CLIP backbone).
- Tokenizes the text prompt into input IDs for the language model component.
- Creates attention masks and other required tensor inputs.
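A quick way to see these outputs for yourself, reusing the processor, prompt, and image variables from the example above, is to print the shapes the processor returns (the exact image size depends on the checkpoint; the comments below assume LLaVA-1.5's 336x336 CLIP backbone):
inputs = processor(text=prompt, images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)    # preprocessed image tensor, e.g. torch.Size([1, 3, 336, 336])
print(inputs["input_ids"].shape)       # tokenized prompt, including the <image> placeholder
print(inputs["attention_mask"].shape)  # mask with the same shape as input_ids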
4. Generation Process
The model.generate() method creates the text response with several parameters controlling the generation:
- max_new_tokens=256: Limits the response length to a maximum of 256 new tokens.
- do_sample=True: Enables sampling-based generation rather than greedy decoding.
- temperature=0.6: Controls randomness in the generation (lower values are more deterministic).
- top_p=0.9: Implements nucleus sampling, considering only tokens whose cumulative probability exceeds 90%.
5. Behind the Scenes: How LLaVA Processes the Image
When you run this code, LLaVA performs several sophisticated operations:
- The CLIP vision encoder extracts visual features from the image, creating a high-dimensional representation that captures objects, attributes, spatial relationships, and other visual information.
- The projection layer transforms these visual embeddings into a format compatible with the language model's embedding space, essentially "translating" visual concepts into a language the LLM can understand.
- The Vicuna language model (based on LLaMA) receives both the projected visual embeddings and the tokenized prompt, treating the visual information as special tokens in its context window.
- The self-attention mechanism allows the model to focus on relevant parts of both the image representation and the text prompt when generating each token of the response.
- The decoder generates a coherent, contextually appropriate text response based on both the visual content and the text instruction.
6. Advanced Customization Options
The basic example above can be extended with additional parameters for more control:
# Advanced parameters for more control
output = model.generate(
    **inputs,
    max_new_tokens=512,        # Generate longer responses
    do_sample=True,            # Enable sampling-based generation
    temperature=0.7,           # Slightly more creative responses
    top_p=0.9,                 # Nucleus sampling parameter
    top_k=50,                  # Limit vocabulary to top 50 tokens
    repetition_penalty=1.2,    # Discourage repetition of phrases
    length_penalty=1.0,        # No penalty based on length
    no_repeat_ngram_size=3,    # Avoid repeating 3-grams
)
7. Practical Applications
This code structure can be adapted for various multimodal tasks by modifying the prompt:
- Visual question answering: "What color is the car in this image?"
- Image reasoning: "Explain what might happen next in this scene."
- Content extraction: "Extract all text visible in this image."
- Creative generation: "Write a short story inspired by this image."
LLaVA's architecture effectively bridges vision and language, enabling these diverse applications with the same underlying model.
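As a rough sketch of that flexibility, reusing the model, processor, and image objects loaded in the example above and keeping the LLaVA-1.5 chat format, switching tasks amounts to switching prompt strings:
# Reusing the model, processor, and image objects loaded above
task_prompts = [
    "USER: <image>\nWhat color is the car in this image? ASSISTANT:",
    "USER: <image>\nExplain what might happen next in this scene. ASSISTANT:",
    "USER: <image>\nExtract all text visible in this image. ASSISTANT:",
    "USER: <image>\nWrite a short story inspired by this image. ASSISTANT:",
]

for task_prompt in task_prompts:
    inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=128)
    print(processor.decode(output[0], skip_special_tokens=True))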
Advanced Example: Interactive Visual Question Answering with LLaVA
The following code demonstrates a more sophisticated use case for LLaVA: building an interactive visual question answering application that can process uploaded images and answer questions about them in real-time.
# Advanced LLaVA application: Interactive Visual QA with Gradio
import torch
import gradio as gr
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
# Load the LLaVA model and processor
model_id = "llava-hf/llava-1.5-13b-hf" # Using larger 13B parameter version
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
def process_image_and_question(image, question, temperature=0.7, max_length=500):
    """Process an image and a question to generate a response using LLaVA."""
    # Prepare the prompt with the user's question (LLaVA-1.5 chat format)
    prompt = f"USER: <image>\n{question} ASSISTANT:"
    # Process inputs
    inputs = processor(
        text=prompt,
        images=image,
        return_tensors="pt"
    ).to(model.device)
    # Generate the response
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
        )
    # Decode the response
    generated_text = processor.decode(output[0], skip_special_tokens=True)
    # Return just the model's answer, removing the prompt and question
    response = generated_text.split("ASSISTANT:")[-1].strip()
    return response

# Set up the Gradio interface
with gr.Blocks() as demo:
    gr.Markdown("# LLaVA Visual Question Answering")
    gr.Markdown("Upload an image and ask a question about it.")
    with gr.Row():
        with gr.Column():
            image_input = gr.Image(type="pil", label="Upload Image")
            question_input = gr.Textbox(label="Your Question", placeholder="What's happening in this image?")
            temperature = gr.Slider(0.1, 1.0, value=0.7, label="Temperature (creativity)")
            max_length = gr.Slider(50, 1000, value=500, step=50, label="Maximum response length")
            submit_button = gr.Button("Get Answer")
        with gr.Column():
            output_text = gr.Textbox(label="LLaVA's Answer", lines=10)
    # Connect the interface to the processing function
    submit_button.click(
        fn=process_image_and_question,
        inputs=[image_input, question_input, temperature, max_length],
        outputs=output_text
    )
    # Add example images and questions
    gr.Examples(
        examples=[
            ["example_street_scene.jpg", "What safety hazards do you see in this image?"],
            ["example_chart.jpg", "Explain the main trend shown in this chart."],
            ["example_food.jpg", "What ingredients might be in this dish?"]
        ],
        inputs=[image_input, question_input]
    )

# Launch the application
demo.launch()
For this example, download the required images from these links:
Street Scene: https://files.cuantum.tech/images/example_street_scene.jpg
Chart: https://files.cuantum.tech/images/example_chart.jpg
Food: https://files.cuantum.tech/images/example_food.jpg
Code Breakdown: Interactive Visual QA Application
This advanced example demonstrates how to build a user-friendly application for visual question answering using LLaVA. Let's break down the key components:
1. Model Selection and Setup
- LLaVA 1.5-13B: This code uses the larger 13B parameter version of LLaVA (compared to the 7B in the previous example), which offers improved reasoning capabilities at the cost of requiring more computational resources.
- The same initialization approach is used, with float16 precision and automatic device mapping to optimize for available hardware.
2. Core Processing Function
The process_image_and_question() function handles the core multimodal processing:
- It takes four inputs: an image, a question, and two generation parameters (temperature and max length).
- The question is formatted into a standardized prompt format that helps guide LLaVA's response generation.
- After processing, it extracts just the relevant answer portion, removing the original prompt for a cleaner user experience.
3. Gradio Interface Construction
The code uses Gradio to create an intuitive web interface for the application:
- User inputs: Image upload, question text box, and generation parameter sliders for fine-tuning responses.
- Layout organization: Arranged in a two-column layout for inputs on the left and outputs on the right.
- Examples: Pre-configured example images and questions to demonstrate the system's capabilities.
4. Behind the Scenes: Enhanced Multimodal Processing
When a user interacts with this application, several sophisticated processes occur:
- The uploaded image is automatically preprocessed by the Gradio interface to ensure compatibility with LLaVA's input requirements.
- The LLaVA processor handles both the text tokenization and image preprocessing, ensuring proper alignment between modalities.
- The question is formatted into a directive that helps the model understand the specific visual reasoning task required.
- Generation parameters provide user control over the response style - higher temperature produces more creative but potentially less precise answers.
- Post-processing extracts just the relevant answer, creating a cleaner conversational experience.
5. Potential Applications
This interactive application template could be adapted for numerous real-world use cases:
- Educational tools: Students could upload diagrams or historical images and ask for explanations.
- Accessibility services: Visually impaired users could ask detailed questions about photographs or documents.
- E-commerce: Shoppers could upload product images and ask specific questions about features or compatibility.
- Technical support: Users could share screenshots of error messages or hardware setups and ask for troubleshooting advice.
- Content moderation: Platforms could use a modified version to help analyze uploaded images for policy compliance.
6. Technical Considerations and Limitations
When implementing this type of application, it's important to consider:
- Hardware requirements: In float16, the 13B-parameter model needs roughly 26-28GB of VRAM; on a 24GB GPU you would typically load it with 8-bit or 4-bit quantization.
- Inference speed: Response generation typically takes 2-10 seconds depending on hardware and response length.
- Image resolution: LLaVA processes images at a fixed resolution (336x336 pixels for LLaVA-1.5), which may limit detailed analysis of very small elements.
- Privacy considerations: For sensitive applications, consider running this locally rather than on cloud infrastructure.
This example illustrates how LLaVA's capabilities can be packaged into user-friendly applications that bring multimodal AI's power to non-technical users. The combination of visual understanding, language generation, and interactive controls creates a flexible system for a wide range of visual reasoning tasks.
5.1.2 Flamingo (DeepMind)
Flamingo is a groundbreaking multimodal model developed by DeepMind, specifically engineered to excel at few-shot learning across text and image domains. Unlike models that require extensive task-specific training, Flamingo can adapt to new visual tasks with minimal examples. This represents a significant advancement in multimodal AI, as most earlier systems required dedicated training datasets for each new type of visual reasoning task they needed to perform.
At its architectural core, Flamingo uses a frozen language model (LLM) as its foundation and introduces specialized cross-attention layers that create bridges between visual representations and textual understanding. These cross-attention mechanisms serve as effective translators, allowing visual information to be meaningfully incorporated into the language model's processing pipeline without disrupting its pre-trained linguistic capabilities. The visual processing component of Flamingo utilizes a vision encoder based on a Normalizer-Free ResNet (NFNet), which transforms images into dense feature representations. These visual features are then processed through a perceiver resampler module that converts the variable-sized visual representations into a fixed number of visual tokens that can be efficiently processed by the language model.
What makes Flamingo particularly impressive is its ability to perform "in-context learning" with visual data. It can answer questions about previously unseen image-text tasks with remarkably little training data - often needing just 1-16 examples to achieve strong performance. This capability allows Flamingo to generalize to novel visual reasoning scenarios without extensive retraining, making it adaptable across domains like visual question answering, image captioning, and visual reasoning with minimal setup time. The model was trained on a massive multimodal dataset comprising hundreds of millions of image-text pairs gathered from diverse web sources, enabling it to develop a rich understanding of the relationships between visual and textual concepts.
During inference, Flamingo can process interleaved sequences of images and text, making it particularly well-suited for conversational interactions about visual content. For example, a user could show Flamingo several images of animals with corresponding descriptions as examples, then present a new animal image and ask for a similar description. The model would leverage its few-shot learning capabilities to generate an appropriate response following the pattern established in the examples. This flexibility extends to complex reasoning tasks as well, such as comparing multiple images, answering questions about specific visual details, or even generating creative content inspired by visual inputs.
The model's architecture has inspired subsequent research in efficient multimodal learning, particularly in how to effectively combine pre-trained unimodal models (like vision-only and language-only systems) into powerful multimodal reasoners without requiring extensive joint training from scratch. This approach has proven valuable for developing more accessible multimodal AI systems while leveraging the strengths of specialized models in each modality.
Flamingo Implementation Example: Multimodal Few-shot Learning
Below is a simplified implementation example of a Flamingo-inspired architecture using PyTorch. This example demonstrates the core components of Flamingo: a vision encoder, a perceiver resampler, and cross-attention layers integrated with a language model.
import torch
import torch.nn as nn
import torchvision.models as models
from transformers import GPT2LMHeadModel, GPT2Tokenizer
class PerceiverResampler(nn.Module):
    """
    Perceiver Resampler module that converts variable-sized visual features
    to a fixed number of tokens that can be processed by the language model.
    """
    def __init__(self, input_dim=2048, latent_dim=768, num_latents=64, num_layers=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(embed_dim=latent_dim, num_heads=8, batch_first=True)
            for _ in range(num_layers)
        ])
        self.input_proj = nn.Linear(input_dim, latent_dim)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, visual_features):
        # Project visual features to latent dimension
        visual_features = self.input_proj(visual_features)
        # Expand latents to batch size
        batch_size = visual_features.shape[0]
        latents = self.latents.unsqueeze(0).expand(batch_size, -1, -1)
        # Process through cross-attention layers
        for layer in self.layers:
            latents = latents + layer(
                query=latents,
                key=visual_features,
                value=visual_features,
                need_weights=False
            )[0]
            latents = self.norm(latents)
        return latents
class CrossAttentionBlock(nn.Module):
    """
    Cross-attention block that integrates visual information into the LLM.
    """
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        self.cross_attention = nn.MultiheadAttention(
            embed_dim=hidden_size,
            num_heads=num_heads,
            batch_first=True
        )
        self.layer_norm1 = nn.LayerNorm(hidden_size)
        self.layer_norm2 = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states, visual_features):
        normed_hidden_states = self.layer_norm1(hidden_states)
        # Apply cross-attention
        attn_output = self.cross_attention(
            query=normed_hidden_states,
            key=visual_features,
            value=visual_features,
            need_weights=False
        )[0]
        # Residual connection and layer norm
        hidden_states = hidden_states + attn_output
        hidden_states = self.layer_norm2(hidden_states)
        return hidden_states
class FlamingoModel(nn.Module):
    """
    Simplified Flamingo model combining vision encoder, perceiver resampler,
    and a language model with cross-attention layers.
    """
    def __init__(self, vision_model_name="resnet50", num_visual_tokens=64):
        super().__init__()
        # Vision encoder (frozen)
        self.vision_encoder = models.__dict__[vision_model_name](pretrained=True)
        self.vision_encoder.fc = nn.Identity()  # Remove classification head
        for param in self.vision_encoder.parameters():
            param.requires_grad = False
        # Perceiver resampler
        self.perceiver = PerceiverResampler(
            input_dim=2048,   # ResNet50 feature dim
            latent_dim=768,   # Match GPT2 hidden size
            num_latents=num_visual_tokens
        )
        # Language model (frozen)
        self.language_model = GPT2LMHeadModel.from_pretrained("gpt2")
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        for param in self.language_model.parameters():
            param.requires_grad = False
        # Cross-attention layers (one per transformer block)
        self.cross_attentions = nn.ModuleList([
            CrossAttentionBlock(hidden_size=768, num_heads=12)
            for _ in range(len(self.language_model.transformer.h))
        ])
        # Save original forward methods
        self.original_block_forward = self.language_model.transformer.h[0].forward
        # Monkey patch the transformer blocks to include cross-attention
        for i, block in enumerate(self.language_model.transformer.h):
            block.flamingo_cross_attn = self.cross_attentions[i]
            block.forward = self._make_new_forward(block, i)
        # Visual features buffer for storing current visual context
        self.register_buffer("visual_features", None, persistent=False)

    def _make_new_forward(self, block, block_index):
        """Creates a new forward method for transformer blocks that includes cross-attention."""
        original_forward = block.forward
        cross_attn = self.cross_attentions[block_index]

        def new_forward(x, **kwargs):
            # Run original transformer block
            outputs = original_forward(x, **kwargs)
            if self.visual_features is None:
                return outputs
            # Apply cross-attention with visual features, preserving any extra
            # outputs (e.g. cached key/values) returned by the block
            if isinstance(outputs, tuple):
                hidden_states = cross_attn(outputs[0], self.visual_features)
                return (hidden_states,) + outputs[1:]
            return cross_attn(outputs, self.visual_features)

        return new_forward

    def process_images(self, images):
        """Extract visual features from images and prepare them for conditioning."""
        with torch.no_grad():
            # Extract features from vision encoder
            features = self.vision_encoder(images)  # [batch_size, 2048]
            features = features.unsqueeze(1)        # Add sequence dimension [batch_size, 1, 2048]
        # Process through perceiver resampler
        visual_tokens = self.perceiver(features)    # [batch_size, num_latents, hidden_size]
        # Store visual features for cross-attention
        self.visual_features = visual_tokens

    def generate(self, prompt, images=None, max_length=100, temperature=0.7):
        """Generate text conditioned on images and text prompt."""
        # Process images if provided
        if images is not None:
            self.process_images(images)
        else:
            self.visual_features = None
        # Tokenize prompt
        inputs = self.tokenizer(prompt, return_tensors="pt", padding=True)
        input_ids = inputs.input_ids.to(next(self.parameters()).device)
        attention_mask = inputs.attention_mask.to(next(self.parameters()).device)
        # Generate text
        output_ids = self.language_model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
        )
        # Decode output
        generated_text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return generated_text
# Example usage
def flamingo_example():
    from PIL import Image
    import torchvision.transforms as transforms

    # Initialize model
    model = FlamingoModel().to("cuda" if torch.cuda.is_available() else "cpu")

    # Prepare image transform
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    # Load and process image
    image = Image.open("eiffel-tower.jpg")
    image_tensor = transform(image).unsqueeze(0).to(next(model.parameters()).device)

    # Example prompts for few-shot learning
    few_shot_prompt = """
Image: [A photo of a busy street in Tokyo]
Description: The image shows a crowded street in Tokyo with neon signs, many pedestrians, and small restaurants.
Image: [A photo of the Grand Canyon]
Description: The image depicts the vast expanse of the Grand Canyon with its layered rock formations and deep ravines.
Image: [Current image]
Description:
"""

    # Generate text based on image
    output = model.generate(few_shot_prompt, images=image_tensor, max_length=200)
    print(output)

if __name__ == "__main__":
    flamingo_example()
For this example, download the Eiffel Tower image here: https://files.cuantum.tech/images/eiffel-tower.jpg
Code Breakdown: Flamingo-inspired Multimodal Model
The above implementation represents a simplified version of DeepMind's Flamingo architecture. Let's break down the key components:
1. Architecture Components
- Vision Encoder: A pretrained ResNet50 model that extracts visual features from images. In the full Flamingo model, this would be a more advanced vision model like NFNet.
- Perceiver Resampler: This critical component transforms variable-sized visual features into a fixed number of visual tokens. It uses cross-attention between learned latent vectors and visual features to condense the visual information.
- Language Model: A pretrained GPT-2 model serves as the language foundation. The original Flamingo used a more powerful Chinchilla LLM.
- Cross-Attention Layers: These layers are inserted into each transformer block of the language model, allowing visual information to influence text generation at multiple levels of processing.
2. Key Design Decisions
- Frozen Backbone Models: Both the vision encoder and language model are kept frozen, preserving their pretrained capabilities while only training the connecting components.
- Parameter Efficiency: By only training the perceiver resampler and cross-attention layers, Flamingo achieves multimodal capabilities with relatively few trainable parameters.
- Monkey Patching: The implementation uses a technique called "monkey patching" to insert cross-attention into the language model without modifying its original architecture.
3. How Visual Processing Works
- The image is passed through the vision encoder to extract high-level visual features (2048-dimensional for ResNet50).
- These features are then processed by the perceiver resampler, which condenses them into a fixed set of tokens (64 in this example).
- The resulting visual tokens are stored in a buffer and made available to all cross-attention layers during text generation.
4. How Few-Shot Learning Is Implemented
- The example demonstrates few-shot learning through a carefully formatted prompt containing example image-text pairs.
- Each example follows a pattern of "Image: [description]" followed by "Description: [detailed text]".
- The final prompt ends with "Image: [Current image]" and "Description:", prompting the model to generate a description for the new image following the pattern established by the examples.
- This in-context learning approach allows the model to adapt to specific tasks without parameter updates.
5. Practical Considerations and Limitations
- Computational Efficiency: The real Flamingo model uses sophisticated techniques for handling larger contexts and more efficiently processing visual information.
- Training Requirements: To fully train this model, you would need a large dataset of image-text pairs and significant computational resources.
- Simplified Architecture: This example omits some details of the full Flamingo architecture for clarity, such as gated cross-attention and more advanced visual processing.
6. Real-world Applications
- Visual question answering: Answering specific questions about image content with few or no examples.
- Image captioning: Generating detailed descriptions of images in various styles based on examples.
- Visual reasoning: Performing complex reasoning tasks about visual content, such as comparing images or identifying relationships.
- Multimodal chat: Enabling conversational interactions that seamlessly incorporate visual information.
This implementation provides a starting point for understanding and experimenting with Flamingo-style multimodal architectures. The real power of such models comes from their ability to perform in-context learning across modalities, adapting to new tasks with minimal examples.
Enhanced Flamingo Implementation with In-Context Learning
Let's explore a more comprehensive implementation of the Flamingo architecture that better demonstrates its in-context learning capabilities for visual question answering:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer, ViTModel, ViTImageProcessor
from PIL import Image
import requests
from io import BytesIO
class GatedCrossAttentionBlock(nn.Module):
    """
    Enhanced cross-attention block with gating mechanism as used in Flamingo.
    """
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        self.hidden_size = hidden_size
        self.cross_attention = nn.MultiheadAttention(
            embed_dim=hidden_size,
            num_heads=num_heads,
            batch_first=True
        )
        # Gating mechanism
        self.gate = nn.Linear(hidden_size, hidden_size)
        self.gate_activation = nn.Sigmoid()
        # Layer normalization
        self.layer_norm1 = nn.LayerNorm(hidden_size)
        self.layer_norm2 = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states, visual_features):
        normed_hidden_states = self.layer_norm1(hidden_states)
        # Apply cross-attention
        attn_output, _ = self.cross_attention(
            query=normed_hidden_states,
            key=visual_features,
            value=visual_features
        )
        # Apply gating mechanism
        gate_values = self.gate_activation(self.gate(normed_hidden_states))
        attn_output = gate_values * attn_output
        # Residual connection and layer norm
        hidden_states = hidden_states + attn_output
        hidden_states = self.layer_norm2(hidden_states)
        return hidden_states
class PerceiverResampler(nn.Module):
    """
    Perceiver Resampler that converts variable-length visual features into
    a fixed number of tokens through cross-attention with learned queries.
    """
    def __init__(self, input_dim=768, latent_dim=768, num_latents=64, num_layers=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(
                embed_dim=latent_dim,
                num_heads=8,
                batch_first=True
            )
            for _ in range(num_layers)
        ])
        self.input_projection = nn.Linear(input_dim, latent_dim)
        self.layer_norm = nn.LayerNorm(latent_dim)

    def forward(self, x):
        batch_size = x.shape[0]
        # Project input features to match latent dimension
        x = self.input_projection(x)
        # Expand latents for each item in the batch
        latents = self.latents.unsqueeze(0).expand(batch_size, -1, -1)
        # Apply layers of cross-attention
        for layer in self.layers:
            latents, _ = layer(
                query=latents,
                key=x,
                value=x
            )
            latents = self.layer_norm(latents)
        return latents
class EnhancedFlamingoModel(nn.Module):
    """
    Enhanced Flamingo model with improved components for in-context learning
    and visual question answering tasks.
    """
    def __init__(self, num_visual_tokens=64, vision_model_name="google/vit-base-patch16-224"):
        super().__init__()
        # Vision encoder (frozen ViT)
        self.vision_encoder = ViTModel.from_pretrained(vision_model_name)
        self.vision_processor = ViTImageProcessor.from_pretrained(vision_model_name)
        for param in self.vision_encoder.parameters():
            param.requires_grad = False
        # Perceiver resampler
        self.perceiver = PerceiverResampler(
            input_dim=768,    # ViT feature dim
            latent_dim=768,   # Match GPT2 hidden size
            num_latents=num_visual_tokens,
            num_layers=4
        )
        # Language model (frozen GPT-2)
        self.language_model = GPT2LMHeadModel.from_pretrained("gpt2")
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        # Keep LM frozen except for final layer norm and unembedding
        for name, param in self.language_model.named_parameters():
            if "ln_f" in name or "wte" in name:
                param.requires_grad = True
            else:
                param.requires_grad = False
        # Special tokens for marking image inputs
        self.image_start_token = "<image>"
        self.image_end_token = "</image>"
        # Add special tokens to vocabulary
        special_tokens = {"additional_special_tokens": [self.image_start_token, self.image_end_token]}
        num_added = self.tokenizer.add_special_tokens(special_tokens)
        self.language_model.resize_token_embeddings(len(self.tokenizer))
        # Cross-attention blocks
        self.cross_attentions = nn.ModuleList([
            GatedCrossAttentionBlock(hidden_size=768, num_heads=12)
            for _ in range(len(self.language_model.transformer.h))
        ])
        # Create image token IDs
        self.image_start_token_id = self.tokenizer.convert_tokens_to_ids(self.image_start_token)
        self.image_end_token_id = self.tokenizer.convert_tokens_to_ids(self.image_end_token)
        # Register hooks to modify the transformer layers
        for i, block in enumerate(self.language_model.transformer.h):
            block.register_forward_hook(self._make_cross_attention_hook(i))
        # Buffer for storing visual features
        self.register_buffer("visual_features", None, persistent=False)

    def _make_cross_attention_hook(self, block_idx):
        """Create a forward hook for adding cross-attention at specified layer."""
        cross_attn = self.cross_attentions[block_idx]

        def hook(module, inputs, outputs):
            if self.visual_features is None:
                return outputs
            hidden_states = outputs[0] if isinstance(outputs, tuple) else outputs
            modified_hidden_states = cross_attn(hidden_states, self.visual_features)
            if isinstance(outputs, tuple):
                return (modified_hidden_states,) + outputs[1:]
            return modified_hidden_states

        return hook

    def _encode_image(self, image_tensor):
        """Process a batch of image tensors through the vision encoder and perceiver."""
        with torch.no_grad():
            vision_outputs = self.vision_encoder(image_tensor)
            hidden_states = vision_outputs.last_hidden_state
        # Process through perceiver resampler to get a fixed number of tokens
        visual_tokens = self.perceiver(hidden_states)
        return visual_tokens

    def _encode_images_batch(self, image_list):
        """Process a batch of images through the vision pipeline."""
        processed_images = []
        for image in image_list:
            if isinstance(image, str):
                # Load from URL if string
                response = requests.get(image)
                img = Image.open(BytesIO(response.content)).convert("RGB")
            else:
                # Assume PIL Image otherwise
                img = image.convert("RGB")
            # Preprocess for vision model
            processed = self.vision_processor(img, return_tensors="pt")
            processed_images.append(processed["pixel_values"])
        # Stack into batch
        image_tensors = torch.cat(processed_images, dim=0).to(next(self.parameters()).device)
        return self._encode_image(image_tensors)

    def format_prompt_with_images(self, text_prompt, images):
        """Format a prompt with image placeholders and encode the images."""
        # Encode images first: [num_images, num_latents, dim]
        visual_tokens = self._encode_images_batch(images)
        # Flatten all image tokens into a single sequence so one text sequence
        # (batch size 1) can attend over every image in the context
        self.visual_features = visual_tokens.reshape(1, -1, visual_tokens.shape[-1])
        # Replace placeholders with special tokens
        formatted_prompt = text_prompt.replace("[IMAGE]", f"{self.image_start_token}{self.image_end_token}")
        return formatted_prompt

    def generate_answer(self, prompt, images=None, max_length=200, temperature=0.7):
        """Generate an answer for a visual question answering prompt with images."""
        if images:
            prompt = self.format_prompt_with_images(prompt, images)
        # Tokenize prompt
        inputs = self.tokenizer(prompt, return_tensors="pt").to(next(self.parameters()).device)
        # Generate text
        with torch.no_grad():
            output_ids = self.language_model.generate(
                inputs.input_ids,
                attention_mask=inputs.attention_mask,
                max_new_tokens=max_length,  # cap on newly generated tokens, independent of prompt length
                do_sample=True,
                temperature=temperature,
                top_p=0.9,
                pad_token_id=self.tokenizer.eos_token_id
            )
        # Get only the generated text (not the prompt)
        generated_ids = output_ids[0][inputs.input_ids.shape[1]:]
        generated_text = self.tokenizer.decode(generated_ids, skip_special_tokens=True)
        # Clear visual features after generation
        self.visual_features = None
        return generated_text.strip()
def run_visual_qa_demo():
"""Demonstrate visual question answering with the Flamingo model."""
# Initialize model
model = EnhancedFlamingoModel().to("cuda" if torch.cuda.is_available() else "cpu")
# Example images (use URLs for convenience)
example_images = [
"https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg", # Image of a dog on a beach
"https://files.cuantum.tech/images/dog_drawing.jpg" # Drawing of a dog
]
# Few-shot prompt for VQA
few_shot_prompt = """
I will answer questions about images.
[IMAGE]
Question: What animal is in the image?
Answer: The image shows a dog running on the beach. It appears to be a golden retriever enjoying the sand and ocean.
[IMAGE]
Question: What is this a drawing of?
Answer: This is a simple drawing of a dog. It appears to be a cartoon-style sketch with basic lines representing a dog's features.
[IMAGE]
Question: What is shown in this image?
Answer:
"""
# New test image (Eiffel Tower)
test_image = "https://files.cuantum.tech/images/eiffel-tower.jpg"
# Generate answer
answer = model.generate_answer(
few_shot_prompt,
images=example_images + [test_image],
max_length=100
)
print("Model's answer:", answer)
if __name__ == "__main__":
run_visual_qa_demo()
Code Breakdown: Advanced Flamingo Implementation
This enhanced implementation of the Flamingo architecture includes several important improvements that make it more similar to the original DeepMind model:
1. Key Architecture Enhancements
- Gated Cross-Attention: Unlike the basic implementation, this version includes a gating mechanism that controls how much visual information flows into the language model at each layer. This prevents visual information from dominating and allows for more nuanced integration (a minimal sketch of the gating idea follows this list).
- Multi-layer Perceiver Resampler: The perceiver now uses multiple layers of cross-attention to refine the visual tokens, creating a more sophisticated visual representation.
- ViT Vision Encoder: Uses a modern Vision Transformer instead of ResNet, providing better visual feature extraction.
- Special Tokens: Adds special image tokens to the vocabulary, allowing the model to recognize where images appear in the context.
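The snippet below is a minimal, self-contained sketch of the tanh-gating idea mentioned in the first item above; the GatedCrossAttentionBlock used in the implementation is assumed to follow the same pattern but may add a feed-forward sublayer, per-layer gates, and masking.
import torch
import torch.nn as nn

class TinyGatedCrossAttention(nn.Module):
    """Illustrative tanh-gated cross-attention: text queries attend to visual tokens."""
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # The gate starts at zero, so tanh(gate) = 0 and the block is initially a
        # no-op; training gradually opens the gate and lets visual information in.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_tokens):
        attended, _ = self.attn(self.norm(text_hidden), visual_tokens, visual_tokens)
        return text_hidden + torch.tanh(self.gate) * attended

# Quick shape check: 1 sample, 12 text tokens, 64 visual tokens, width 768
block = TinyGatedCrossAttention()
out = block(torch.randn(1, 12, 768), torch.randn(1, 64, 768))
print(out.shape)  # torch.Size([1, 12, 768])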
2. In-Context Learning Implementation
- Few-Shot Visual QA: The prompt structure demonstrates how Flamingo enables few-shot learning by showing examples of image-question-answer triplets.
- Image Placeholders: Uses [IMAGE] placeholders in the prompt that get replaced with special tokens, mimicking how the real Flamingo handles multiple images in context.
- Contextual Memory: The model processes multiple images and remembers their features during generation, allowing it to reference different examples.
3. Technical Implementation Details
- Forward Hooks: Uses PyTorch hooks instead of monkey patching to inject cross-attention into the transformer blocks, which is a cleaner implementation.
- Selective Fine-tuning: Only certain parts of the language model are trainable (final layer norm and embedding), while keeping most parameters frozen.
- Batched Image Processing: Handles multiple images efficiently by batching them through the vision pipeline.
4. User-Friendly Features
- URL Image Loading: Supports loading images directly from URLs, making demonstrations easier.
- Structured API: Provides a clean interface for formatting prompts with images and generating answers.
- Memory Management: Clears visual features after generation to free up memory.
5. Real-world Applications
This implementation demonstrates how Flamingo can be used for:
- Visual Question Answering: Answering specific questions about image content.
- Few-Shot Learning: Learning new tasks from just a few examples without parameter updates.
- Multi-image Reasoning: Processing information across multiple images to provide coherent answers.
The enhanced implementation shows how multimodal models can maintain the powerful in-context learning capabilities of large language models while incorporating rich visual information. This approach allows for flexible adaptation to new visual tasks without specialized fine-tuning, making it particularly valuable for real-world applications.
5.1.3 GPT-5 (OpenAI)
Launched on August 7, 2025, GPT-5 marks a new milestone in OpenAI’s large language model lineage. It is the first fully native multimodal model, trained jointly on text, images, and audio from the ground up, with a composed system design that integrates fast responses, deep reasoning, and intelligent routing. More than an incremental upgrade over GPT-4o, GPT-5 represents a paradigm shift: a model architected from the beginning to process and reason across modalities as a unified whole.
Native Multimodal Architecture
Unlike earlier models that retrofitted speech or vision modules onto a text-first transformer, GPT-5 is fundamentally multimodal. Text, image, and audio are processed in the same transformer backbone, creating shared internal representations that seamlessly connect concepts across formats.
This design produces fluid cross-modal reasoning. For example, if a user submits a photo of a math problem, GPT-5 not only recognizes the characters but also interprets the underlying mathematical structure. It then generates a step-by-step solution that references specific symbols in the image, checks for ambiguities, and explains the reasoning in natural language. This integrated comprehension extends to scientific diagrams, financial charts, architectural blueprints, and medical imagery.
By aligning modalities during training, GPT-5 develops deeper semantic coherence—understanding how textual descriptions, visual data, and spoken language reinforce or contradict each other. It can, for instance, highlight inconsistencies between a historical photograph and a written account, or correlate radiology images with patient notes.
Composed System and Intelligent Routing
GPT-5 is not a monolithic model but a composed system:
- A main fast model handles everyday queries with low latency.
- A thinking model engages when complex, multi-step reasoning is required, offering real-time chain-of-thought.
- Mini and nano variants optimize cost and speed for lightweight applications.
- A Pro reasoning variant (API only) extends test-time reasoning for the hardest problems.
An intelligent router automatically decides which component to use, sparing users from manually picking between “light” and “heavy” models. This dynamic composition ensures efficiency for simple prompts and depth for challenging ones.
Reasoning and Context Management
With real-time chain-of-thought reasoning, GPT-5 excels in tasks that require logic, multi-step deduction, or tool use. On external benchmarks, it sets new records: 74.9% accuracy on SWE-bench Verified (software engineering) and 88% on Aider polyglot (code editing).
The model’s expanded context window—up to 400,000 tokens via the API, with output lengths of up to 128,000 tokens—supports the analysis of entire books, multi-hour meetings, or large codebases without losing track of earlier information. This scale makes it suitable for legal discovery, research synthesis, and full-repository debugging.
Voice and Multilingual Capabilities
Through the Realtime API, GPT-5 offers natural speech-in/speech-out interactions with millisecond-level latency. The voice system is robust to accents, can modulate tone on command, and integrates with SIP protocols, enabling real-world phone calls and live agents. Users can now hold fluid conversations where GPT-5 reasons, speaks, and listens in real time.
Multilingual fluency has also advanced, making GPT-5 a practical tool for cross-border communication, customer support, education, and accessibility.
Developer Controls and Tool Integration
Developers gain fine-grained control via new parameters:
- reasoning_effort: from minimal (fast) to extensive (deep reasoning).
- verbosity: low, medium, or high detail in responses.
The API exposes three model families—gpt-5, gpt-5-mini, and gpt-5-nano—to balance accuracy, cost, and latency. Pricing (per million tokens) at launch was $1.25 input / $10 output for GPT-5, with cheaper mini and nano tiers.
GPT-5 also supports custom tools: lightweight, plaintext tool calls with optional grammar constraints, allowing more reliable integration with external APIs. Enterprises can connect GPT-5 directly into Microsoft Copilot, Apple Intelligence, GitLab, Notion, and custom pipelines.
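To make these controls concrete, the sketch below sets a low reasoning effort and low verbosity on a Responses API call. The parameter names come from the description above; how they are nested in the payload is an assumption and should be checked against the current API reference.
import os
import requests

# Hypothetical sketch: apply the reasoning_effort and verbosity controls described above.
payload = {
    "model": "gpt-5",
    "input": "Review this contract clause and flag any unusual indemnification terms.",
    "reasoning": {"effort": "minimal"},   # assumed mapping of the reasoning_effort control
    "text": {"verbosity": "low"},         # assumed mapping of the verbosity control
}

response = requests.post(
    "https://api.openai.com/v1/responses",
    headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
    json=payload,
    timeout=60,
)
print(response.json())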
Accuracy, Safety, and Bias Reduction
OpenAI introduced safe-completions training in GPT-5. Instead of choosing between over-compliance and refusal, the model aims to generate the safest useful answer. Internal evaluations show:
- Substantially fewer hallucinations than GPT-4o.
- Lower sycophancy (over-agreeableness).
- Reduced deception, meaning the model is less likely to feign success on impossible tasks.
Safety frameworks classify GPT-5 Thinking as High capability in biology and chemistry, with layered safeguards, red-teaming, and monitoring.
Use Cases and Industry Impact
- Coding & Engineering: GPT-5 generates functional front-end code, debugs large repositories, and coordinates multi-tool development workflows.
- Automation & Productivity: From grading and summarizing to document review, it frees human bandwidth for higher-order work.
- Knowledge Work: Enterprises use GPT-5 for legal analysis, financial reporting, and R&D, where its long context and reasoning shine.
- Creative Workflows: Designers, writers, and researchers can mix text, images, and audio in prompts—e.g., analyzing a chart and drafting a report in one go.
- Voice Agents: Customer service and sales teams deploy GPT-5 via Realtime API to deliver human-like support, capturing alphanumeric details and following strict protocols.
The New Standard
GPT-5 establishes a new baseline for large multimodal models. Its unified architecture, dynamic routing, reasoning capabilities, and developer controls make it a versatile foundation for both consumer and enterprise AI. By natively fusing text, vision, and audio, GPT-5 doesn’t just respond across modalities—it reasons through them, enabling a generation of AI systems that operate more like collaborators than tools.
Basic Example: Multimodal Prompt with JSON Output (Chat Completions API)
A beginner-friendly example showing how to send an image and text together and receive a structured JSON response.
import requests
import json # You need this to parse the JSON string from the response
API_KEY = "YOUR_OPENAI_API_KEY"
# Use the correct API endpoint
API_URL = "https://api.openai.com/v1/chat/completions"
headers = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
}
# Example: Provide an image URL and a text query jointly
# Image content block using 'type' and 'image_url' keys
image_part = {
"type": "image_url",
"image_url": {
"url": "https://files.cuantum.tech/images/chart.png" # Can also use a data URL for base64 images
}
}
# Text content block
text_part = {
"type": "text",
"text": "Summarize the main trend shown in the chart. Also, generate the Python code to recreate this visualization. Format the response as a JSON object with the keys 'summary', 'python_code', and 'key_points'."
}
# Request payload
payload = {
"model": "gpt-5",
"messages": [
{
"role": "user",
"content": [
image_part,
text_part
]
}
],
# Request JSON output from the model
"response_format": { "type": "json_object" },
# The max_tokens parameter is standard
"max_tokens": 400
}
response = requests.post(API_URL, headers=headers, json=payload)
result = response.json()
# Parse and handle the API response
try:
# The API returns a JSON string inside the message content, so we parse it
response_content = result['choices'][0]['message']['content']
parsed_output = json.loads(response_content)
# Print structured output from the parsed JSON
print("Summary:", parsed_output.get("summary"))
print("Python code:", parsed_output.get("python_code"))
print("Key points:", parsed_output.get("key_points"))
except (KeyError, IndexError, json.JSONDecodeError) as e:
print("Error parsing the API response:", e)
print("Raw response:", result)
Code Breakdown
This example demonstrates how to send a multimodal request to OpenAI's GPT-5 model, combining an image URL with a text query, and specifically asking for a structured JSON response.
1. Import Libraries
import requests
import json
- requests: This library is essential for making HTTP requests in Python. We use it to send our data to the OpenAI API and receive the response.
- json: This library is used for working with JSON (JavaScript Object Notation) data. We'll use it to construct our request payload and, critically, to parse the JSON string that GPT-5 will return to us when we ask for structured output.
2. API Configuration
API_KEY = "YOUR_OPENAI_API_KEY"
API_URL = "https://api.openai.com/v1/chat/completions"
- API_KEY: This is a placeholder for your unique OpenAI API key. You must replace "YOUR_OPENAI_API_KEY" with your actual key, which you can obtain from the OpenAI developer dashboard. This key authenticates your requests.
- API_URL: This is the specific endpoint for OpenAI's Chat Completions API. All conversational and multimodal requests go to this URL. It's crucial that this is correct.
3. Request Headers
headers = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
}
- headers: This dictionary contains metadata sent with our HTTP request.
- "Authorization": f"Bearer {API_KEY}": This header authenticates your request using your API key. The Bearer token prefix is a standard for OAuth 2.0.
- "Content-Type": "application/json": This header tells the server that the body of our request is formatted as JSON.
4. Defining Multimodal Input Parts
GPT-5 can process different types of input simultaneously. Here, we define an image and a text part.
image_part = {
"type": "image_url",
"image_url": {
"url": "https://files.cuantum.tech/images/chart.png"
}
}
- image_part: This dictionary represents the visual input.
- "type": "image_url": Specifies that this content block is an image provided via a URL.
- "image_url": {"url": "..."}: This nested structure is where the actual image URL is provided. The model will fetch and process the image from this link. You could also provide base64 encoded images here instead of a URL (see the short sketch just below).
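As a quick illustration of the base64 option, a local file can be packaged as a data URL and dropped into the same image_url slot. This is a small sketch; "chart.png" is a placeholder path.
import base64

# Read a local file and wrap it as a data URL ("chart.png" is a placeholder path).
with open("chart.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

image_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/png;base64,{encoded}"}
}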
text_part = {
"type": "text",
"text": "Summarize the main trend shown in the chart. Also, generate the Python code to recreate this visualization. Format the response as a JSON object with the keys 'summary', 'python_code', and 'key_points'."
}
- text_part: This dictionary holds the textual instruction for the model.
- "type": "text": Indicates this content block is plain text.
- "text": "...": This is the actual prompt to GPT-5. Notice how we explicitly ask for a JSON object with specific keys (summary, python_code, key_points). This is crucial for getting structured output from the model.
5. Constructing the Request Payload
This is the main body of the request, containing all the instructions for the API.
payload = {
"model": "gpt-5",
"messages": [
{
"role": "user",
"content": [
image_part,
text_part
]
}
],
"response_format": { "type": "json_object" },
"max_tokens": 400
}"model": "gpt-5": Specifies which OpenAI model to use. In this case, it's the latest GPT-5."messages": [...]: This is a list of message objects, forming the conversation.- Each message has a
"role"(e.g.,"user","system","assistant") and"content". "role": "user": Indicates that this message comes from the user."content": [image_part, text_part]: This is the crucial part for multimodal input. Thecontentis a list containing both ourimage_partandtext_partdictionaries. The model will process them together.
- Each message has a
"response_format": { "type": "json_object" }: This parameter explicitly tells the API to constrain the model's output to a valid JSON object. This is essential when you want structured data back from the model, as we requested in ourtext_part."max_tokens": 400: Sets the maximum number of tokens (words or word pieces) the model should generate in its response. This helps control cost and response length.
6. Sending the Request
response = requests.post(API_URL, headers=headers, json=payload)
result = response.json()
- requests.post(...): This function sends an HTTP POST request to the API_URL with our headers and the payload (converted to JSON by requests.post).
- response.json(): The API's reply comes back as a JSON string. This method parses that string into a Python dictionary, making it easy to access the data.
7. Handling and Parsing the Response
The API's response structure is standard, but the actual content we asked GPT-5 to generate is nested within it as a string.
try:
response_content = result['choices'][0]['message']['content']
parsed_output = json.loads(response_content)
print("Summary:", parsed_output.get("summary"))
print("Python code:", parsed_output.get("python_code"))
print("Key points:", parsed_output.get("key_points"))
except (KeyError, IndexError, json.JSONDecodeError) as e:
print("Error parsing the API response:", e)
print("Raw response:", result)try...except: This block is crucial for robust error handling. API calls can fail for many reasons (network issues, incorrect API key, malformed requests, or the model might not return valid JSON).result['choices'][0]['message']['content']: This is the path to extract the actual text generated by GPT-5.result['choices']: The API can return multiplechoices(different possible completions) based on parameters liken. We usually take the first one ([0]).['message']: Within each choice, themessageobject contains therole(e.g., "assistant") and the generatedcontent.
json.loads(response_content): Since we specifically asked the model to format its output as a JSON string within thecontentfield, we need to usejson.loads()to parse this string into a Python dictionary.parsed_output.get("summary"),parsed_output.get("python_code"),parsed_output.get("key_points"): Onceresponse_contentis parsed into a dictionary, we can access the individual fields we requested from GPT-5. Using.get()is safer than direct dictionary access ([]) as it preventsKeyErrorif a key is missing.- The
exceptblock catches potential errors during parsing or if the expected keys are not found, printing both the error and the raw API response for debugging.
Advanced Example: Production-Ready Multimodal Workflow (Responses API with JSON Schema)
A robust example demonstrating best practices for reliability, schema validation, retries, and safe execution of returned code.
"""
Multimodal (image + text) → structured JSON with GPT-5
- Uses the Responses API (recommended)
- Strict JSON schema for reliable structured output
- Optional: safely execute returned Matplotlib code in a subprocess to render a PNG
"""
import os
import json
import time
import base64
import requests
import tempfile
import subprocess
import sys
from textwrap import dedent
from typing import Dict, Any, List, Optional
# =========================
# Configuration
# =========================
API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
API_URL = "https://api.openai.com/v1/responses"
MODEL = "gpt-5" # or: gpt-5-mini / gpt-5-nano
HEADERS = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json",
}
# Use a public image URL OR a local file encoded as a data URL (see helper below).
IMAGE_URL = "https://cdn.example.com/chart.png" # <- replace for your test
# Strict JSON schema for the model’s response
RESPONSE_SCHEMA: Dict[str, Any] = {
"name": "ChartInsight",
"schema": {
"type": "object",
"properties": {
"summary": {"type": "string"},
"python_code": {"type": "string"},
"key_points": {
"type": "array",
"items": {"type": "string"},
"minItems": 3,
"maxItems": 7
}
},
"required": ["summary", "python_code", "key_points"],
"additionalProperties": False
},
"strict": True
}
PROMPT_TEXT = (
"You are a meticulous data analyst.\n"
"Tasks:\n"
"1) Summarize the main trend in the chart.\n"
"2) Generate minimal, runnable Python (matplotlib) code that recreates a similar visualization "
" using inferred placeholder data. Include clear axis labels and a title.\n"
"3) Provide 3–7 bullet key points.\n"
"Return a JSON object that matches the provided JSON schema exactly."
)
# =========================
# Helpers
# =========================
def local_image_to_data_url(path: str, mime: Optional[str] = None) -> str:
"""
Convert a local image file to a data URL usable as an image input.
Example usage:
IMAGE_URL = local_image_to_data_url("chart.png")
"""
if not mime:
# naive mime inference by extension
ext = os.path.splitext(path)[1].lower()
mime = "image/png" if ext in [".png"] else "image/jpeg"
with open(path, "rb") as f:
b64 = base64.b64encode(f.read()).decode("utf-8")
return f"data:{mime};base64,{b64}"
def build_payload(image_url: str) -> Dict[str, Any]:
"""
Build a Responses API payload with multimodal input and JSON schema output.
"""
return {
"model": MODEL,
"input": [
{
"role": "user",
"content": [
{"type": "input_image", "image_url": {"url": image_url}},
{"type": "input_text", "text": PROMPT_TEXT}
]
}
],
"response_format": {
"type": "json_schema",
"json_schema": RESPONSE_SCHEMA
},
"max_output_tokens": 900,
"temperature": 0.2
}
def post_with_retries(
url: str,
headers: Dict[str, str],
json_payload: Dict[str, Any],
retries: int = 3,
backoff: float = 1.5,
timeout: int = 60
) -> Dict[str, Any]:
"""
POST with simple exponential backoff for rate limits / transient errors.
"""
for attempt in range(1, retries + 1):
try:
resp = requests.post(url, headers=headers, json=json_payload, timeout=timeout)
if resp.status_code == 200:
return resp.json()
# Retry on typical transient statuses
if resp.status_code in (429, 500, 502, 503, 504):
time.sleep(backoff ** attempt)
continue
raise RuntimeError(f"HTTP {resp.status_code}: {resp.text}")
except requests.exceptions.Timeout as e:
if attempt == retries:
raise
time.sleep(backoff ** attempt)
except requests.exceptions.RequestException as e:
if attempt == retries:
raise
time.sleep(backoff ** attempt)
raise RuntimeError("Request failed after retries")
def parse_responses_api_json(result: Dict[str, Any]) -> Dict[str, Any]:
"""
Extract the schema-validated JSON text and parse it to a dict.
Responses API returns: output[0].content[0].text for text output.
"""
try:
content_blocks = result["output"][0]["content"]
# Find first text block
for block in content_blocks:
if block.get("type") == "output_text" or block.get("type") == "text":
text = block.get("text", "")
if not text:
continue
# In schema mode, text should be strict JSON
return json.loads(text)
raise KeyError("No text block found in the response output")
except (KeyError, IndexError, json.JSONDecodeError) as e:
debug = json.dumps(result, indent=2)[:2000] # truncate for readability
raise ValueError(f"Failed to parse structured output: {e}\nPartial payload:\n{debug}")
def run_matplotlib_script(py_code: str) -> None:
"""
Safely run returned Matplotlib code in a clean subprocess (not in-process exec).
Saves 'recreated_chart.png' in the current working directory.
"""
safe_prefix = dedent("""
import matplotlib
matplotlib.use('Agg') # headless backend for servers/CI
""")
# Force a save at the end, even if the model code forgets to save
force_save = dedent("""
import os
import matplotlib.pyplot as plt
out = 'recreated_chart.png'
try:
plt.savefig(out, dpi=150, bbox_inches='tight')
except Exception:
# Some scripts call show() only; ensure we still save a figure if present
try:
plt.gcf().savefig(out, dpi=150, bbox_inches='tight')
except Exception:
pass
print(f"[Saved] {os.path.abspath(out)}")
""")
script = safe_prefix + "\n" + py_code + "\n\n" + force_save
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
f.write(script)
tmp_path = f.name
completed = subprocess.run(
[sys.executable, tmp_path],
capture_output=True,
text=True,
timeout=60
)
if completed.stdout:
print(completed.stdout)
if completed.returncode != 0:
print("Script error:\n", completed.stderr)
# =========================
# Main flow
# =========================
def main():
if not API_KEY or API_KEY == "YOUR_OPENAI_API_KEY":
raise EnvironmentError("Set OPENAI_API_KEY environment variable or hardcode API_KEY.")
# If you want to test with a local image:
# IMAGE_URL = local_image_to_data_url("path/to/chart.png")
payload = build_payload(IMAGE_URL)
result = post_with_retries(API_URL, HEADERS, payload)
data = parse_responses_api_json(result)
print("\n=== Summary ===\n", data["summary"])
print("\n=== Key points ===")
for i, kp in enumerate(data["key_points"], 1):
print(f"{i}. {kp}")
print("\n=== Python code (recreate chart) ===\n")
print(data["python_code"])
# Optional: render the returned chart
user_wants_render = True # set to False to skip rendering
if user_wants_render:
run_matplotlib_script(data["python_code"])
if __name__ == "__main__":
main()
Download the chart example image here: https://files.cuantum.tech/images/chart.png
Code breakdown:
- Configuration
- API_URL = "https://api.openai.com/v1/responses" uses the Responses API (the current, multimodal-first endpoint).
- MODEL = "gpt-5" picks the full model; you can swap to gpt-5-mini / gpt-5-nano for cheaper/faster runs.
- IMAGE_URL: set a public URL or switch to a local file via local_image_to_data_url().
- Strict JSON via schema
- RESPONSE_SCHEMA tells the model exactly what keys and types to return.
- This is more reliable than a plain json_object hint because the model is constrained to a schema and will retry internally to satisfy it.
- Building the multimodal prompt
- build_payload() composes input with two blocks: {"type": "input_image", "image_url": {...}} for the image, and {"type": "input_text", "text": PROMPT_TEXT} for the instructions.
- The response_format requests schema-validated output; the model returns a single JSON string that parses cleanly.
- Network resilience
- post_with_retries() adds basic retry/backoff on rate limits or transient 5xx errors and a timeout so calls don’t hang.
- Non-retryable errors raise with the server’s message for quick diagnosis.
- Parsing the Responses API
- parse_responses_api_json() extracts result["output"][0]["content"][0]["text"] (the schema-validated JSON) and json.loads() it.
- If the shape changes (e.g., future versions), the function fails loudly with a helpful snippet.
- Optional: safe Matplotlib execution
- run_matplotlib_script() runs the code in a separate Python process, not via exec() in your main process.
- It forces a headless backend and ensures a saved file recreated_chart.png even if the script forgets.
- This pattern is good enough for demos and CI, but for production you might add further guards (resource limits, containers).
- Main flow
- Build payload → call API with retries → parse JSON → print summary, key_points, and python_code.
- Optionally, render the chart with the sandboxed subprocess.
Tool-Calling Example: “Ask GPT-5 to fetch data with your function, then analyze and plot”
"""
Tool-calling with GPT-5 (Chat Completions API)
- The model asks to call our tool `get_prices` with {symbol, days}
- We run the tool (here: mock data), send results back, then GPT-5 completes:
-> JSON with 'summary', 'key_points', and 'python_code' (Matplotlib)
"""
import os
import json
import time
import math
import requests
from datetime import datetime, timedelta
from typing import Dict, Any, List
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
API_URL = "https://api.openai.com/v1/chat/completions"
MODEL = "gpt-5"
HEADERS = {
"Authorization": f"Bearer {OPENAI_API_KEY}",
"Content-Type": "application/json",
}
# ---------- Tool: mock market data ----------
def get_prices(symbol: str, days: int = 30) -> Dict[str, Any]:
"""
Return mock OHLC data for the past N days.
Replace this with your real data source later (DB/API/cache).
"""
end = datetime.utcnow().date()
dates = [(end - timedelta(days=i)).isoformat() for i in range(days)][::-1]
# Simple deterministic waveform so every run is similar
base = 100.0
prices = []
for i, d in enumerate(dates):
v = base + 10 * math.sin(i / 4.0) + (i * 0.15)
o = round(v + math.sin(i) * 0.3, 2)
c = round(v + math.cos(i) * 0.3, 2)
h = round(max(o, c) + 0.6, 2)
l = round(min(o, c) - 0.6, 2)
prices.append({"date": d, "open": o, "high": h, "low": l, "close": c})
return {"symbol": symbol.upper(), "series": prices}
# ---------- Tool spec for the model ----------
TOOLS = [
{
"type": "function",
"function": {
"name": "get_prices",
"description": "Get recent OHLC data for a ticker symbol.",
"parameters": {
"type": "object",
"properties": {
"symbol": {"type": "string", "description": "Ticker, e.g., AAPL"},
"days": {"type": "integer", "minimum": 5, "maximum": 200, "default": 30}
},
"required": ["symbol"]
}
}
}
]
SYSTEM = (
"You are a quantitative analyst. If needed, call tools to fetch data, "
"then return a structured JSON with keys: summary (string), key_points (array of strings), "
"python_code (string that plots the series with matplotlib)."
)
USER = (
"Analyze the recent trend for the symbol AAPL (last 60 days). "
"If you need prices, use the tool. Then return JSON with summary, key_points, python_code."
)
def chat(payload: Dict[str, Any]) -> Dict[str, Any]:
r = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
if r.status_code != 200:
raise RuntimeError(f"HTTP {r.status_code}: {r.text}")
return r.json()
def main():
# 1) Ask GPT-5; allow tool calling
payload = {
"model": MODEL,
"messages": [
{"role": "system", "content": SYSTEM},
{"role": "user", "content": USER}
],
"tools": TOOLS,
"tool_choice": "auto",
# Ask for JSON if model can comply directly
"response_format": {"type": "json_object"},
"temperature": 0.2,
"max_tokens": 900
}
first = chat(payload)
msg = first["choices"][0]["message"]
# 2) If the model wants to call tools, run them and send results back
tool_messages = []
if "tool_calls" in msg:
for call in msg["tool_calls"]:
name = call["function"]["name"]
args = json.loads(call["function"]["arguments"] or "{}")
if name == "get_prices":
tool_result = get_prices(symbol=args.get("symbol", "AAPL"),
days=int(args.get("days", 60)))
else:
tool_result = {"error": f"Unknown tool {name}"}
tool_messages.append({
"role": "tool",
"tool_call_id": call["id"],
"name": name,
"content": json.dumps(tool_result)
})
# 3) Send a follow-up message containing the tool outputs
follow_payload = {
"model": MODEL,
"messages": [
{"role": "system", "content": SYSTEM},
{"role": "user", "content": USER},
msg, # the assistant message that requested tools
*tool_messages
],
"response_format": {"type": "json_object"},
"temperature": 0.2,
"max_tokens": 1200
}
final = chat(follow_payload)
out = final
else:
out = first # Model answered without tools
# 4) Parse the final JSON
content = out["choices"][0]["message"]["content"]
try:
data = json.loads(content)
except json.JSONDecodeError:
print("Model did not return valid JSON. Raw content:\n", content)
return
print("\n=== Summary ===\n", data.get("summary"))
print("\n=== Key points ===")
for i, kp in enumerate(data.get("key_points", []), 1):
print(f"{i}. {kp}")
print("\n=== Python code (plot) ===\n")
print(data.get("python_code"))
if __name__ == "__main__":
if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
raise SystemExit("Set OPENAI_API_KEY env var first.")
main()
Code breakdown:
Let GPT-5 decide when to call your function (get_prices), you execute it (mock or real API), feed results back, and let GPT-5 finish with analysis + Matplotlib code in JSON.
1) Imports & configuration
- requests handles HTTP calls to OpenAI.
- json, time, math, datetime are used for parsing, retries (if added), and mock data generation.
- OPENAI_API_KEY is read from env; never hardcode secrets in real projects.
- API_URL targets the Chat Completions endpoint (best known for tool calling).
- MODEL = "gpt-5"; you can swap to gpt-5-mini for cheaper experiments.
Tip: In production, wrap network calls with retry/backoff (429/5xx). A simple helper function can centralize that (you can reuse the one from your Advanced example).
2) The tool you expose to the model
def get_prices(symbol: str, days: int = 30) -> Dict[str, Any]:
...
- This is a mock OHLC generator. Replace with your real data source (see the sketch after the output shape below):
- A REST call (e.g., Yahoo, Polygon, your own DB/API).
- Caching layer (Redis) to keep latency/costs down.
- Output shape:
{
"symbol": "AAPL",
"series": [
{"date": "2025-07-01", "open": 101.2, "high": 102.0, "low": 100.6, "close": 101.8},
...
]
}
Keep it consistent; the LLM will rely on the keys you return.
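As a sketch of that swap, the same function signature and output shape can be kept while the body calls a data provider. The endpoint URL and response fields below are placeholders, not a real vendor API.
import requests
from typing import Dict, Any

def get_prices(symbol: str, days: int = 30) -> Dict[str, Any]:
    """Sketch: fetch OHLC rows from a placeholder REST endpoint and
    normalize them into the shape the tool contract promises."""
    resp = requests.get(
        "https://example.com/api/ohlc",              # placeholder endpoint
        params={"symbol": symbol, "days": days},
        timeout=10,
    )
    resp.raise_for_status()
    rows = resp.json()  # assumed: a list of dicts with date/open/high/low/close fields
    series = [
        {"date": r["date"], "open": r["open"], "high": r["high"],
         "low": r["low"], "close": r["close"]}
        for r in rows
    ]
    return {"symbol": symbol.upper(), "series": series}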
3) Advertising the tool (the TOOLS spec)
TOOLS = [
{
"type": "function",
"function": {
"name": "get_prices",
"description": "Get recent OHLC data...",
"parameters": { ... JSON Schema ... }
}
}
]
- You define a JSON Schema (name, required fields, types).
- The model uses this to decide if and how to call your function.
- Keep the schema minimal but precise (e.g., clamp days to a reasonable range).
4) System and User messages
- SYSTEM enforces role & output contract:
- “You are a quantitative analyst … return JSON with keys: summary, key_points, python_code.”
- USER asks for “Analyze AAPL last 60 days,” nudging the model to use a tool if it needs data.
Tip: Always restate your desired output format in SYSTEM (and/or USER). This increases compliance, especially if you don’t use schema mode.
5) First request: allow tool calling
payload = {
"model": MODEL,
"messages": [system, user],
"tools": TOOLS,
"tool_choice": "auto",
"response_format": {"type": "json_object"},
...
}
- tool_choice: "auto" lets the model decide if it needs the tool.
- response_format: "json_object" asks for JSON, but not as strict as schema mode. (That’s okay here; the focus is tool calling.)
- Low temperature (0.2) boosts determinism.
6) Detect and execute tool calls
msg = first["choices"][0]["message"]
if "tool_calls" in msg:
for call in msg["tool_calls"]:
# 1) parse arguments
# 2) run your function
# 3) build a "tool" message with the results
- tool_calls is the assistant’s intent to call your function with arguments.
- You must parse call["function"]["arguments"] (stringified JSON), run your function, and post the results as a tool role message back to OpenAI.
Security notes:
- Never directly execute arbitrary code sent via tool args.
- Validate inputs (symbols, ranges) and add allowlists/rate limits for external APIs; a small validation sketch follows.
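A minimal sketch of that kind of validation, applied to the arguments before get_prices runs. The allowlist below is illustrative.
ALLOWED_SYMBOLS = {"AAPL", "MSFT", "GOOG"}  # illustrative allowlist

def validate_get_prices_args(args: dict) -> dict:
    """Sanity-check and clamp tool arguments before executing the tool."""
    symbol = str(args.get("symbol", "")).upper()
    if symbol not in ALLOWED_SYMBOLS:
        raise ValueError(f"Symbol {symbol!r} is not on the allowlist")
    days = max(5, min(int(args.get("days", 30)), 200))  # mirror the schema bounds
    return {"symbol": symbol, "days": days}

# Inside the tool-call loop:
#   safe_args = validate_get_prices_args(json.loads(call["function"]["arguments"] or "{}"))
#   tool_result = get_prices(**safe_args)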
7) Second request: provide tool outputs and ask GPT-5 to finish
follow_payload = {
"messages": [
system, user,
msg, # the assistant message that requested tools
*tool_messages # your tool outputs bound to the call IDs
],
"response_format": {"type":"json_object"}, ...
}
- You include:
- The original assistant message that requested tools (so the model keeps context).
- Your tool result messages with the proper tool_call_id.
- GPT-5 now has real data and completes the task (analysis + code).
8) Parse the final JSON
content = out["choices"][0]["message"]["content"]
data = json.loads(content)
- Print summary, key_points, and python_code.
- If parsing fails, dump the raw content—often a sign the model deviated (rare at low temperature, but possible).
9) Customization knobs
- Switch to schema mode: If you want stronger guarantees on the final JSON, use:
response_format: { "type": "json_schema", "json_schema": {...} }
- Multiple tools: Add more function specs to TOOLS. GPT-5 will pick the right one.
- Parallel calls: The API can return multiple tool_calls—run them all, then send all the tool messages back in one follow-up.
- Logging: Log both the tool args and outputs to audit the agent’s steps.
10) Common pitfalls
- Forgetting tool_call_id when sending the tool result message.
- Mismatched schemas: If your returned JSON structure diverges from your documented shape, the model may misinterpret it later.
- Rate limits: Add retry/backoff for 429/5xx (especially if your tool triggers 3rd-party APIs).
11) Testing tips
- Start with mock data (like the example) for deterministic outputs.
- Add a unit test that asserts the model returns valid JSON with the required keys (a minimal sketch follows).
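One way to implement that test is to factor the validation into a small helper and exercise it with pytest. check_model_output is a new helper introduced here for illustration; the same function can also be called on the real content string inside main().
import json

REQUIRED_KEYS = {"summary", "key_points", "python_code"}

def check_model_output(content: str) -> dict:
    """Validate that the model's final message is JSON with the required keys."""
    data = json.loads(content)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise AssertionError(f"Missing keys in model output: {missing}")
    if not isinstance(data["key_points"], list):
        raise AssertionError("key_points must be a list of strings")
    return data

def test_check_model_output_accepts_valid_payload():
    sample = json.dumps({
        "summary": "Upward trend over the last 60 days.",
        "key_points": ["Higher highs", "Low volatility", "Momentum intact"],
        "python_code": "import matplotlib.pyplot as plt",
    })
    assert check_model_output(sample)["summary"]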
5.1.4 DeepSeek-VL
DeepSeek-VL is a Chinese open-source multimodal model developed by the DeepSeek team, designed to bridge the gap between vision and language processing. It represents China's significant contribution to the multimodal AI landscape, offering capabilities comparable to proprietary models but with open access for researchers and developers. The model emerged as part of China's growing AI research ecosystem, demonstrating the country's commitment to advancing state-of-the-art AI technologies while ensuring they remain accessible to the broader scientific community.
The model is specifically optimized for efficiency and vision-language reasoning, with architectural choices that prioritize computational performance while maintaining high-quality results. Its streamlined design makes it particularly suitable for deployment in resource-constrained environments, enabling advanced multimodal capabilities on more modest hardware configurations. DeepSeek-VL achieves this efficiency through careful attention to model size, training procedures, and inference optimizations. For example, it employs specialized vision encoders that extract rich visual features while minimizing computational overhead, and leverages knowledge distillation techniques to compress larger models' capabilities into more compact architectures.
In performance evaluations, DeepSeek-VL is often benchmarked against industry leaders like GPT-4V and Flamingo, where it demonstrates competitive results at a fraction of the computational cost. This makes it an attractive option for cost-effective deployments in production environments, particularly for organizations seeking multimodal capabilities without the expense associated with commercial API usage. Benchmark studies have shown that DeepSeek-VL achieves 85-90% of the performance of these larger models on standard vision-language tasks while requiring significantly less computational resources. This performance-to-cost ratio has made it particularly popular among startups, academic institutions, and developers in emerging markets.
The model excels in tasks requiring detailed visual understanding combined with natural language reasoning, such as image captioning, visual question answering, and complex scene interpretation. DeepSeek-VL's architecture incorporates specialized attention mechanisms that allow it to focus on relevant visual elements when answering questions or generating descriptions.
This capability enables applications ranging from assisting visually impaired users to automating content moderation and enhancing e-commerce product discovery through visual search. The model also demonstrates strong performance in cross-cultural visual contexts, making it particularly valuable for applications serving diverse global audiences.
Example: Using DeepSeek-VL for Image Understanding
# Install dependencies first
# pip install transformers torch pillow
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image
import requests
import matplotlib.pyplot as plt
# Download and load an example image
image_url = "https://files.cuantum.tech/images/deep-seek-descriptive.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
# Load DeepSeek-VL model and processor
model_name = "deepseek-ai/deepseek-vl-7b-chat"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
# Create a prompt for the model
prompt = "Describe what you see in this image in detail."
# Process the inputs
inputs = processor(text=prompt, images=image, return_tensors="pt")
# Generate a response
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=512,
do_sample=False
)
# Decode the response
generated_text = processor.decode(outputs[0], skip_special_tokens=True)
# Display the image and response
plt.figure(figsize=(10, 8))
plt.imshow(image)
plt.axis('off')
plt.title('Input Image')
plt.show()
print("DeepSeek-VL's response:")
print(generated_text.split("ASSISTANT:")[-1].strip())
Code Breakdown: Using DeepSeek-VL for Image Understanding
The example above demonstrates how to use DeepSeek-VL for a basic image understanding task. Here's a detailed breakdown of each section:
1. Dependencies and Setup
- Key libraries: The code uses transformers for model access, torch for tensor operations, and PIL for image handling.
- Image acquisition: Fetches a sample image from a URL using requests and opens it with PIL.
2. Model Initialization
- Model selection: Uses the 7B parameter chat-tuned version of DeepSeek-VL (deepseek-ai/deepseek-vl-7b-chat).
- Processor loading: The AutoProcessor handles both tokenization of text and preprocessing of images.
- Model loading: trust_remote_code=True is required as DeepSeek-VL uses custom code for its implementation.
3. Input Processing
- Prompt creation: A simple prompt asking for image description, but you can use more specific prompts like "What objects are in this image?" or "Explain what's happening in this scene."
- Multimodal processing: The processor combines both text input (prompt) and image input into a format the model can understand.
- Return format: return_tensors="pt" specifies PyTorch tensors as the output format.
4. Response Generation
- Inference with torch.no_grad(): Disables gradient calculation for efficiency during inference.
- Generation parameters:
- max_new_tokens=512: Limits response length to 512 tokens.
- do_sample=False: Uses greedy decoding instead of sampling for deterministic outputs.
5. Response Processing and Visualization
- Decoding: Converts token IDs back to human-readable text.
- Response extraction: Splits the output to get only the assistant's response portion.
- Visualization: Displays the input image alongside the generated description.
Advanced Usage Patterns
Beyond this basic example, DeepSeek-VL supports several advanced capabilities:
- Visual reasoning: You can ask complex questions about relationships between objects in the image.
- Multi-image analysis: Process multiple images by passing a list to the processor.
- Fine-tuning: Adapt the model to specific domains using techniques like LoRA or QLoRA (a minimal LoRA sketch follows the implementation considerations below).
- Memory efficiency: For resource-constrained environments, consider using quantization:
# For 8-bit quantization
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_enable_fp32_cpu_offload=True
)
model = AutoModel.from_pretrained(
model_name,
trust_remote_code=True,
quantization_config=quantization_config,
device_map="auto"
)
Implementation Considerations:
- Hardware requirements: DeepSeek-VL 7B requires at least 16GB GPU memory for full precision, but can run on consumer GPUs with quantization.
- Inference speed: First-time inference includes model loading time; subsequent calls are faster.
- Response format: The model follows a chat format with "ASSISTANT:" prefix. For cleaner outputs, always strip this prefix.
- Error handling: In production, add try-except blocks to handle image loading failures and timeout configurations for large images.
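For the LoRA route mentioned under Advanced Usage Patterns, a minimal configuration with the peft library might look like the sketch below. It reuses the model object loaded earlier, and the target module names are assumptions that should be checked against model.named_modules() for the actual DeepSeek-VL architecture.
# pip install peft
from peft import LoraConfig, get_peft_model

# Target module names are assumptions; inspect model.named_modules() to find the
# attention projection layers actually present in DeepSeek-VL.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of all parameters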
DeepSeek-VL represents a significant advancement in making multimodal AI accessible to developers, particularly those seeking open-source alternatives to proprietary models like GPT-4V or Gemini.
Example: Advanced Visual Question Answering with DeepSeek-VL
# Install required libraries
# pip install transformers torch pillow matplotlib requests
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import matplotlib.pyplot as plt
from io import BytesIO
# Function to load and display an image from a URL
def load_and_display_image(image_url, title="Input Image"):
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))
plt.figure(figsize=(10, 8))
plt.imshow(image)
plt.axis('off')
plt.title(title)
plt.show()
return image
# Load DeepSeek-VL model and processor
model_id = "deepseek-ai/deepseek-vl-7b-chat"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16, # Use half precision for efficiency
device_map="auto", # Automatically distribute across available GPUs
trust_remote_code=True
)
# Sample image URLs for visual reasoning tasks
image_urls = [
"https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg", # People at a table
"https://files.cuantum.tech/images/deep-seek-chart.jpg" # Charts/graphs
]
# Load and display the first image
image = load_and_display_image(image_urls[0])
# Function to generate responses for a given image and prompt
def generate_vl_response(image, prompt, max_new_tokens=256):
# Create chat message format
messages = [
{"role": "user", "content": prompt}
]
# Process inputs
inputs = processor(
messages=messages,
images=image,
return_tensors="pt"
).to(model.device)
# Generate response with customized parameters
generated_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True, # Enable sampling for more diverse outputs
temperature=0.7, # Control randomness (higher = more random)
top_p=0.9, # Nucleus sampling parameter
repetition_penalty=1.1 # Discourage repetition
)
# Decode response
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# Extract assistant's response
response = generated_text.split("ASSISTANT:")[-1].strip()
return response
# Example prompts for different visual reasoning tasks
prompts = [
"Describe this image in detail. What are the people doing?",
"Count how many people are in this image and describe what each person is wearing.",
"What emotions can you detect on people's faces in this image?",
"If you had to create a story based on this image, what would it be?"
]
# Generate and display responses
for i, prompt in enumerate(prompts):
print(f"\nPrompt {i+1}: {prompt}")
print("-" * 50)
response = generate_vl_response(image, prompt)
print(response)
print("=" * 80)
# Load the second image (charts/graphs) for technical analysis
technical_image = load_and_display_image(image_urls[1], "Technical Chart")
# Technical analysis prompt
technical_prompt = "Analyze this chart. What patterns do you observe? What conclusions can you draw from this data visualization?"
# Generate and display technical analysis
print(f"\nTechnical Analysis Prompt: {technical_prompt}")
print("-" * 50)
response = generate_vl_response(technical_image, technical_prompt, max_new_tokens=512)
print(response)
Comprehensive Code Breakdown: Advanced DeepSeek-VL Implementation
This code example demonstrates how to leverage DeepSeek-VL for sophisticated visual reasoning tasks. Let's break down each component:
1. Setup and Model Initialization
- Library imports: Beyond basic dependencies, we specifically import AutoModelForCausalLM, which provides a more flexible interface for generative tasks than the basic AutoModel used in the previous example.
- Helper function: load_and_display_image() encapsulates image loading logic, making the code more modular and reusable.
- Model optimization: torch_dtype=torch.float16 enables half-precision computation, reducing memory usage by approximately 50% with minimal impact on output quality. device_map="auto" intelligently distributes model layers across available GPUs or uses CPU offloading when needed.
2. Multi-image Processing
- Image collection: Stores multiple image URLs for different analysis scenarios, demonstrating DeepSeek-VL's versatility.
- Sequential processing: The code is structured to analyze multiple images with different prompts, showcasing how the model handles diverse visual contexts.
3. Response Generation Function
- Chat-style formatting: Unlike the previous example, this implementation uses DeepSeek-VL's chat interface through the messages parameter, which better aligns with conversational applications.
- Generation parameters:
- do_sample=True and temperature=0.7: Enables controlled randomness in outputs, producing more natural and diverse responses.
- top_p=0.9: Implements nucleus sampling, which dynamically filters the token probability distribution.
- repetition_penalty=1.1: Reduces the likelihood of generating repetitive phrases, improving response quality.
4. Task Diversification
- Multiple prompt types: The example includes different types of visual reasoning tasks:
- Descriptive: "Describe this image in detail..."
- Quantitative: "Count how many people..."
- Emotional analysis: "What emotions can you detect..."
- Creative: "If you had to create a story..."
- Technical analysis: "Analyze this chart..."
5. Performance Considerations
- Memory management: The example uses half-precision (float16) and automatic device mapping to optimize memory usage.
- Response length control: max_new_tokens is adjusted based on the complexity of the task, with technical analysis allowed a longer response (512 tokens vs 256).
- Prompt engineering: The prompts are carefully crafted to elicit specific types of visual reasoning, demonstrating how prompt design affects model output.
6. Real-world Application Scenarios
- This implementation demonstrates DeepSeek-VL's capabilities in several practical use cases:
- Social media content analysis: Understanding context and relationships in photos.
- Data visualization interpretation: Extracting insights from charts and graphs.
- Content moderation: Detecting emotional content and potentially sensitive material in images.
- Creative assistance: Helping generate stories or content based on visual inspiration.
7. Extension Possibilities
- This code could be extended in several ways:
- Batch processing: Modify to handle multiple images simultaneously for higher throughput.
- Interactive applications: Integrate into a web interface where users can upload images and select analysis types.
- Multi-turn conversations: Expand the messages array to include previous exchanges for contextual understanding (see the sketch after this list).
- Integration with other models: Combine DeepSeek-VL's outputs with specialized models for tasks like object detection or sentiment analysis.
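A sketch of that multi-turn pattern, reusing the processor and model from the example above. The exact chat schema the DeepSeek-VL processor accepts is an assumption here and may need adjusting.
# Multi-turn follow-up about the same image (message schema assumed).
conversation = [
    {"role": "user", "content": "Describe this image in detail."},
    {"role": "assistant", "content": "Several people are seated around a table sharing a meal..."},
    {"role": "user", "content": "What do you think they are celebrating?"},
]

inputs = processor(
    messages=conversation,
    images=image,
    return_tensors="pt"
).to(model.device)

follow_up_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
follow_up = processor.batch_decode(follow_up_ids, skip_special_tokens=True)[0]
print(follow_up.split("ASSISTANT:")[-1].strip())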
This advanced implementation highlights DeepSeek-VL's flexibility and power for complex visual-language reasoning tasks, making it suitable for both research and production applications where understanding images in context is critical.
5.1.5 Why Text+Image Matters
Accessibility: Helping visually impaired users understand images by providing detailed descriptions of visual content. These models can identify objects, people, scenes, and even interpret spatial relationships, allowing visually impaired individuals to "see" through AI-generated descriptions. They can also assist with navigation by describing surroundings or identifying potential hazards.
For visually impaired individuals, multimodal AI serves as an essential bridge to visual content. These systems go beyond simple object recognition to provide context-rich descriptions that convey the full meaning of images. When a visually impaired person encounters an image online, in a document, or through a specialized device, multimodal models can:
- Generate comprehensive scene descriptions that include not just what objects are present, but their arrangement, colors, lighting, and overall composition
- Identify and describe people in photos, including facial expressions, clothing, actions, and apparent relationships between individuals
- Read and interpret text within images, such as signs, menus, product labels, and instructions
- Recognize landmarks and provide spatial awareness in unfamiliar environments
In real-world applications, these capabilities are being integrated into smartphone apps that can narrate the visual world in real-time, smart glasses that provide audio descriptions of surroundings, and screen readers that can interpret complex visual elements on websites. The technology is particularly valuable for educational materials, allowing visually impaired students to access diagrams, charts, and illustrations that would otherwise be inaccessible without human assistance.
The advancement of these multimodal systems represents a significant step forward in digital inclusivity, empowering visually impaired users with greater independence and access to information that was previously unavailable to them.
Education: Explaining diagrams, charts, or historical photos to enhance learning experiences. Multimodal models can break down complex visualizations into understandable components, clarify scientific diagrams, provide historical context for photographs, and even translate visual mathematical notation into explanations. This makes educational content more accessible and comprehensible across various subjects and learning styles.
In educational contexts, multimodal AI serves as a powerful teaching assistant that bridges visual and textual information:
- For STEM education, these models can analyze complex scientific diagrams and:
- Convert abstract visual concepts into clear, step-by-step explanations
- Identify and label components of biological systems, chemical structures, or engineering schematics
- Translate mathematical expressions and equations into plain language interpretations
- In history and social studies, multimodal models enhance learning by:
- Providing detailed context for historical photographs, including time period, cultural significance, and historical relevance
- Analyzing primary source documents with both textual and visual elements
- Making connections between visual artifacts and broader historical narratives
- For data literacy, these systems help students by:
- Breaking down complex charts and graphs into comprehensible insights
- Explaining statistical visualizations and data trends in accessible language
- Teaching students how to interpret different types of data representations
These capabilities are particularly valuable for students with different learning styles, allowing visual learners to receive verbal explanations and verbal learners to better understand visual content. They also support personalized learning by adapting explanations to different educational levels, from elementary to advanced university courses.
Creative work: Generating captions, stories, or descriptions that can inspire artists, writers, and content creators. These models can suggest creative interpretations of images, develop narratives based on visual scenes, assist with storyboarding by describing sequential images, and help marketers craft compelling visual content with appropriate messaging.
For creative professionals, multimodal AI serves as both muse and collaborator. Writers facing creative blocks can use these systems to generate story prompts from visual inspiration. When shown an image of a misty forest at dawn, for instance, the AI might suggest narrative elements like "a forgotten path leading to an ancient secret" or "the meeting place of two worlds." This capability transforms random visual stimuli into structured creative starting points.
Visual artists and designers benefit from AI-generated descriptions that highlight elements they might otherwise overlook. A photographer reviewing their portfolio might gain new perspective when the AI points out "the interplay of shadow and reflection creates a natural frame around the subject" or "the unexpected color contrast draws attention to the emotional center of the image."
In film and animation, these models streamline the pre-production process. Storyboard artists can quickly generate descriptive text for sequential panels, helping directors and producers visualize narrative flow before committing resources to production. The AI can suggest camera angles, lighting moods, and scene transitions based on visual references, accelerating the creative development cycle.
For content marketers, multimodal models bridge the gap between visual assets and compelling messaging. When analyzing product photography, these systems can generate targeted copy that aligns with both the visual elements and brand voice, ensuring consistent communication across channels. This capability is particularly valuable for social media campaigns where striking visuals must be paired with concise, engaging text in multiple formats and platforms.
Productivity: Extracting structured insights from documents, tables, or screenshots, which saves time and improves efficiency in professional settings. Instead of manually parsing visual data, users can leverage AI to convert tables into spreadsheets, extract key information from receipts or business cards, analyze graphs and charts in reports, and transform handwritten notes into searchable text.
This productivity advantage manifests across numerous professional workflows:
- In financial services, multimodal AI can automatically process invoices and receipts by:
- Identifying vendor information, dates, and payment amounts
- Categorizing expenses according to predefined accounting codes
- Flagging potential discrepancies or unusual charges
- For research and analysis, these systems can:
- Extract precise numerical data from complex charts and graphs
- Convert statistical visualizations into structured datasets
- Summarize key trends and outliers identified in visual data
- In administrative workflows, multimodal AI streamlines:
- Business card digitization for immediate contact database integration
- Form processing without manual data entry
- Meeting note transcription with automatic action item extraction
The time savings are substantial—tasks that would require hours of manual data entry can be completed in seconds, while also reducing human error. For organizations handling large volumes of visual documents, this capability transforms information management by making previously inaccessible data searchable, analyzable, and actionable.
Multimodal models bring us closer to AI that interacts with the world as humans do: through multiple senses, not just words. By bridging the gap between visual perception and language understanding, these technologies create more intuitive and natural human-AI interactions that reflect how we naturally process information through multiple channels simultaneously.
On the visual side of these systems, the low-level features extracted in the image encoder's earliest layers (edges, textures, and color patterns) form the building blocks for all higher-level recognition, much like how letters combine to form words and sentences in language processing. Without accurate detection at this first stage, the more complex recognition tasks in subsequent layers would fail.
- Mid-level features: These basic elements are then combined to recognize more complex structures such as specific shapes, object parts, and spatial arrangements. At this stage, the model begins to identify meaningful patterns - recognizing that certain edges form the outline of a face, or that particular textures likely represent fur, fabric, or foliage.
This mid-level processing is crucial because it bridges the gap between raw visual data and semantic understanding. For example, when processing an image of a person walking a dog in a park:
- The model might recognize curved lines and color patterns that form the silhouette of a human figure
- It identifies four-legged shapes with characteristic proportions that indicate "dog"
- It detects textural patterns of grass, trees, and sky that suggest "outdoor environment"
- It recognizes spatial configurations that establish the relationship between person and dog (connected by a leash)
The model also starts to understand spatial relationships, determining when objects are above, below, or inside others. These spatial relationships provide critical context - a cup on a table has different implications than a table on a cup. The model learns to recognize standard spatial arrangements (like furniture in a room) and unusual configurations that might require special attention.
- High-level features: Finally, the encoder identifies complete objects, scenes, actions, and the relationships between elements in the image. This is where true "understanding" emerges, as the model recognizes not just isolated objects but meaningful context - distinguishing between a dog sitting on a sofa versus running through a park, or understanding that a person holding a tennis racket near a net represents a specific activity.
At this highest level of processing, the model performs several sophisticated cognitive tasks:
- Object recognition and classification: The model can identify whole entities (people, animals, vehicles, furniture) and categorize them into specific types or classes (German Shepherd dog, mid-century sofa, professional tennis player).
- Scene understanding: Beyond individual objects, the model comprehends entire environments - recognizing a kitchen from its appliances and layout, or a beach scene from the combination of sand, water, and distinctive lighting.
- Action recognition: The model can interpret dynamic elements - differentiating between someone running versus walking, or throwing versus catching - based on posture, positioning, and contextual cues.
- Relationship detection: Perhaps most impressively, the model identifies how objects relate to each other spatially and functionally - recognizing that a person is walking a dog (connected by a leash), riding a bicycle (positioned on top), or cooking food (performing actions on ingredients).
- Contextual inference: The model makes educated guesses about the broader situation - inferring a birthday celebration from candles on a cake and a gathering of people, or a professional meeting from business attire and a conference room setting.
The model can also interpret emotional content, social interactions, and even infer potential narratives within the scene. It might recognize facial expressions indicating happiness or concern, body language suggesting tension or relaxation, or social dynamics like a teacher instructing students or friends enjoying a meal together. Through extensive training on millions of images with corresponding descriptions, the model learns to associate visual patterns with rich semantic concepts, enabling it to "see" at a level that approximates human understanding.
The result is a dense representation of the image's content in a numerical format that the model can process - essentially translating visual information into a "language" that the AI can understand and reason with.
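To make this concrete, here is a brief illustrative sketch (not part of the chapter's later examples) that uses a pretrained CLIP model from Hugging Face to convert an image into exactly this kind of dense embedding. The model name and image path are placeholders you can substitute with your own.
# Sketch: converting an image into a dense embedding with a pretrained CLIP vision model
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                      # placeholder path; use any local image
inputs = clip_processor(images=image, return_tensors="pt")

with torch.no_grad():
    image_features = clip_model.get_image_features(**inputs)

print(image_features.shape)                            # e.g. torch.Size([1, 512]): one dense vector per image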
- A projection layer maps those embeddings into the same space as the language model's tokens. This critical alignment step ensures that visual information and text information can be processed together. Without this projection, the model would struggle to make meaningful connections between what it sees and what it understands through language.
The projection layer essentially translates the "language of images" into a format compatible with the "language of text," allowing both modalities to coexist in the same computational space. This process involves several sophisticated transformations:
Dimensionality alignment: Image embeddings and text embeddings often have different dimensions and structures. The projection layer reshapes visual features to match the exact dimensions expected by the language model, ensuring that every visual concept can be represented in a way the text processing components can interpret. This process involves complex mathematical transformations that convert the high-dimensional tensors from the vision encoder (which might have shapes like [batch_size, sequence_length, vision_dimension]) into the format required by the language model (typically [batch_size, sequence_length, hidden_dimension]).
For example, a vision encoder might output features with 1024 dimensions per token, while the language model might work with 768-dimensional embeddings. The projection layer would then implement a learned linear transformation (essentially a matrix multiplication) that maps each 1024-dimensional vector to a 768-dimensional vector while preserving as much semantic information as possible.
This alignment is not just about matching numbers - it's about preserving the rich semantic relationships captured in the visual domain. The projection parameters are learned during training, allowing the model to discover optimal mappings between visual concepts and their linguistic counterparts. This ensures that when the language model attends to these projected visual features, it can extract meaningful information that corresponds to concepts it understands through language.
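As a minimal sketch of this dimensionality alignment, the snippet below projects placeholder 1024-dimensional visual features into a 768-dimensional space with a single learned linear layer; the sizes simply follow the example above, and the input tensor is random rather than real encoder output.
# Sketch: aligning vision-encoder dimensions with the language model's hidden size
import torch
import torch.nn as nn

vision_dim, hidden_dim = 1024, 768                 # example sizes from the text above
projection = nn.Linear(vision_dim, hidden_dim)     # learned during multimodal training

# Placeholder vision-encoder output: [batch_size, num_visual_tokens, vision_dim]
visual_features = torch.randn(1, 256, vision_dim)

projected = projection(visual_features)            # [1, 256, 768], ready for the language model
print(projected.shape)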
Semantic mapping: Beyond simple dimension matching, the projection layer learns to map visual concepts to their linguistic counterparts. For example, the visual features representing "a red apple" must be projected into a space where they can interact meaningfully with the text tokens for "red" and "apple."
This semantic mapping is a sophisticated translation process that bridges two fundamentally different representational systems. When processing an image of a red apple, the vision encoder extracts features capturing its roundness, smooth texture, red coloration, and stem. These visual features exist as abstract numerical patterns distributed across multiple embedding dimensions. The projection layer must transform these distributed visual patterns into representations that align with how language models understand concepts like "red" (a color attribute) and "apple" (a fruit category).
The challenge is significant because visual and linguistic representations are structured differently:
- In vision, concepts are often entangled - the "redness" and "appleness" exist simultaneously in the same pixels and are processed together.
- In language, concepts are more discrete - "red" and "apple" are separate tokens with distinct meanings that compose together.
Through extensive training on paired image-text data, the projection layer learns to disentangle these visual features and map them to their linguistic counterparts. When successful, the projected visual features will activate similar neural patterns as would be activated by the text "red apple" in the language model. This enables the language model to reason about the visual content using its language understanding capabilities - for instance, answering questions like "What color is the apple?" by connecting the visual representation to the appropriate linguistic concept "red".
This semantic alignment is what allows multimodal models to perform cross-modal reasoning tasks, such as describing unseen objects, answering questions about visual content, or generating text that references visual elements in contextually appropriate ways.
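The toy example below illustrates the kind of training signal that can drive this alignment. It is a simplified contrastive-style objective over random placeholder tensors, not the exact procedure used by any particular model: the projected image features are pushed toward the embedding of their paired caption and away from an unrelated one.
# Toy sketch: nudging a projection layer toward image-text alignment (illustrative only)
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vision_dim, text_dim = 1024, 768

image_feat = torch.randn(1, vision_dim)      # stand-in for pooled features of a red apple photo
caption_feat = torch.randn(1, text_dim)      # stand-in for the text embedding of "a red apple"
negative_feat = torch.randn(1, text_dim)     # stand-in for an unrelated caption, e.g. "a blue car"

projection = nn.Linear(vision_dim, text_dim)
projected = projection(image_feat)

# Training would push the paired similarity up and the mismatched similarity down
pos_score = F.cosine_similarity(projected, caption_feat)
neg_score = F.cosine_similarity(projected, negative_feat)
logits = torch.cat([pos_score, neg_score]).unsqueeze(0)       # shape [1, 2]
loss = F.cross_entropy(logits, torch.tensor([0]))             # index 0 marks the true caption
print(pos_score.item(), neg_score.item(), loss.item())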
Contextual integration: The projection ensures that contextual relationships in the visual domain (like spatial relationships between objects) are preserved in a way that the language model can access and reason about. This allows the model to answer questions about relative positions or interactions between objects in an image.
This contextual integration is particularly crucial because visual scenes contain rich spatial and relational information that must be translated into a format the language model can process. For example, when looking at an image of a dining table, the model needs to understand not just that there are plates, glasses, and utensils, but their arrangement (plates in front of chairs, glasses above plates, forks to the left of plates), their groupings (place settings), and their functional relationships (napkins folded on plates).
The projection layer preserves these spatial hierarchies by maintaining relative positional information between visual features. Through specialized attention mechanisms, it ensures that:
- Proximity relationships ("the book is next to the lamp") are encoded in ways that language models can interpret
- Containment relationships ("the apple is in the bowl") maintain their hierarchical structure
- Directional relationships ("the dog is facing the camera") preserve orientation information
- Scale relationships ("the elephant is larger than the mouse") retain relative size information
This sophisticated mapping enables the model to correctly interpret questions like "What's above the bookshelf?", "Is the child holding the balloon?", or "Which way is the car facing?" - questions that require understanding not just what objects are present but how they relate to one another in physical space.
Without proper contextual integration, a model might recognize all objects in an image but fail to understand their meaningful relationships, severely limiting its ability to reason about scenes as humans naturally do.
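One common mechanism that helps spatial layout survive into the token sequence (used by many vision transformers, and a reasonable stand-in for the behavior described above rather than the exact method of any specific model) is to add positional embeddings to each image patch before projection. The sketch below uses learned row and column embeddings over a hypothetical 14x14 patch grid.
# Sketch: carrying 2D patch positions into the token sequence with learned embeddings
import torch
import torch.nn as nn

grid, dim = 14, 1024                               # hypothetical 14x14 patch grid, 1024-d features
patch_feats = torch.randn(1, grid * grid, dim)     # placeholder vision-encoder output (196 patches)

row_emb = nn.Embedding(grid, dim)
col_emb = nn.Embedding(grid, dim)
rows = torch.arange(grid).repeat_interleave(grid)  # 0,0,...,0,1,1,...,13
cols = torch.arange(grid).repeat(grid)             # 0,1,...,13,0,1,...,13

pos = row_emb(rows) + col_emb(cols)                # [196, dim] positional signal
patch_feats_with_pos = patch_feats + pos.unsqueeze(0)
print(patch_feats_with_pos.shape)                  # torch.Size([1, 196, 1024])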
- The language model treats visual embeddings as if they were special tokens, allowing it to "attend" to both words and pixels. Through self-attention mechanisms, the model can create connections between visual elements and textual concepts, forming a comprehensive understanding that spans both modalities.
This integration happens through a sophisticated process where the transformer architecture's self-attention mechanism simultaneously processes both text tokens and visual tokens. When a user asks "What color is the car in this image?", the model's attention heads can focus on:
- The visual embeddings representing the car in the image
- The textual tokens related to "color" and "car" in the query
- The contextual relationship between these elements
The self-attention weights form a complex web of connections, allowing information to flow bidirectionally between modalities. For example, when processing an image of a red sports car alongside text mentioning "vehicle," the model can:
- Associate visual features of the car with the word "vehicle" in the text
- Connect color properties from the visual embedding to potential color descriptions
- Link spatial relationships in the image (car on road) to potential scene descriptions
This cross-modal attention enables the model to perform tasks like visual question answering, image captioning, and text-conditional reasoning about visual content. The attention maps themselves reveal how the model distributes focus across different parts of both the image and text when forming its understanding.
This allows the model to reason about relationships between what it "sees" and what it "reads."
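The following sketch makes this concrete with random placeholder tensors: projected image tokens and embedded text tokens are concatenated into a single sequence and passed through one self-attention layer, and the returned attention weights show how strongly each text position attends to each image token. It is a simplified stand-in for what happens inside a full multimodal transformer, not an excerpt from any specific model.
# Sketch: one self-attention pass over concatenated image tokens and text tokens
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 768
visual_tokens = torch.randn(1, 64, hidden)    # e.g., 64 projected image tokens
text_tokens = torch.randn(1, 12, hidden)      # e.g., 12 embedded prompt tokens

# The language model "sees" image tokens as if they were extra positions in its context
joint = torch.cat([visual_tokens, text_tokens], dim=1)        # [1, 76, hidden]

attn = nn.MultiheadAttention(embed_dim=hidden, num_heads=12, batch_first=True)
out, weights = attn(joint, joint, joint, need_weights=True, average_attn_weights=True)

# weights[:, 64:, :64] is how strongly each text position attends to each image token
text_to_image = weights[:, 64:, :64]
print(out.shape, text_to_image.shape)         # torch.Size([1, 76, 768]) torch.Size([1, 12, 64])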
This fusion of visual and textual processing creates a powerful system that can understand context across modalities, enabling it to answer prompts like:
- "What's written on the sign in this photo?" - requiring text recognition within images and understanding of visual context. The model must identify text elements embedded within the visual scene, distinguish them from other visual features, and accurately transcribe the text while maintaining awareness of the sign's context in the broader image (whether it's a street sign, store front, warning notice, etc.).
- "Describe this chart in plain English." - requiring interpretation of data visualizations and translation into natural language. Here, the model must recognize the chart type (bar graph, pie chart, line graph, etc.), identify axes labels, data points, and trends, then synthesize this information into coherent prose that captures the key relationships and insights presented in the visualization.
- "Write a story about this image." - requiring creative generation based on visual stimuli and understanding of narrative elements. This complex task requires the model to recognize not just objects but their relationships, potential emotional content, implied actions or intentions, and then use these elements to create a coherent narrative with characters, setting, plot, and thematic elements that plausibly extend from what's visible in the image.
5.1.1 LLaVA (Large Language and Vision Assistant)
Open-source model combining CLIP for vision + Vicuna (LLM). CLIP (Contrastive Language-Image Pre-training) serves as the vision encoder that processes and extracts features from images, while Vicuna, a fine-tuned version of LLaMA, handles the language processing capabilities. The architecture leverages CLIP's powerful visual representation ability, which was trained on 400 million image-text pairs to understand visual concepts, and combines it with Vicuna's advanced language understanding and generation capabilities.
LLaVA follows a two-stage training process. First, it's pretrained on a large corpus of image-text pairs to establish basic connections between visual and linguistic information. Then, it's specifically trained on instruction-following data that pairs images with text prompts. This training approach enables LLaVA to understand and respond to specific instructions about visual content, going beyond simple image captioning to more complex reasoning about what it sees. This instruction-tuning is what gives LLaVA its ability to follow nuanced directions when analyzing images, rather than just generating generic descriptions.
The training dataset includes approximately 158,000 image-text instruction pairs, carefully curated to cover a wide range of visual reasoning tasks, from simple object identification to complex scene interpretation. This instruction-tuning phase is crucial as it teaches the model to follow specific directives when analyzing visual content. The dataset incorporates diverse image types including natural photographs, diagrams, charts, screenshots, and artistic images, ensuring the model can handle various visual formats. The text instructions are similarly diverse, ranging from simple requests like "What color is the car?" to more complex ones like "Explain the relationship between the people in this image and what they might be feeling."
Example task: describing an image in detail. LLaVA can generate comprehensive descriptions that include object identification, spatial relationships, attributes, actions, and even infer context or emotions from visual scenes. Its descriptions can range from factual observations to more interpretative analyses depending on the prompt.
For instance, when shown an image of a city street, LLaVA can identify not only the vehicles, pedestrians, and buildings, but also describe their relationships (e.g., "a person crossing the street while cars wait at a red light"), infer weather conditions based on visual cues (e.g., "wet pavement suggests recent rainfall"), and even comment on the likely time of day based on lighting conditions and shadows. The model can also perform more specialized tasks like reading text in images, analyzing charts or graphs, identifying landmarks, and recognizing famous people or artwork, demonstrating its versatility across different visual analysis scenarios.
LLaVA stands out for its efficient architecture that achieves strong performance while requiring relatively modest computational resources compared to proprietary alternatives. Its open-source nature has made it a popular choice for researchers and developers working on vision-language applications. The model's architecture is notably streamlined, using a simple projection layer to connect CLIP's vision embeddings with Vicuna's language processing capabilities. This approach avoids the computational overhead of more complex cross-attention mechanisms while still enabling effective communication between the visual and language components. The smaller variants of LLaVA can run on consumer-grade GPUs with 16GB of memory, making advanced multimodal AI accessible to a much broader range of researchers and developers than closed-source alternatives that may require specialized hardware.
The model achieves competitive performance on benchmarks such as VQAv2 (Visual Question Answering) and GQA (Grounded Question Answering), while being significantly more resource-efficient than closed-source alternatives like GPT-4V. On the VQAv2 benchmark, which evaluates a model's ability to answer questions about images, LLaVA-1.5 achieves scores comparable to much larger proprietary models. Its accessibility allows developers to fine-tune it for specific domains or applications, such as medical image analysis (interpreting X-rays, CT scans, and other medical imaging), retail product recognition (identifying products in shelves or catalog images), or educational content development (explaining scientific diagrams or historical artifacts), fostering a growing ecosystem of specialized multimodal AI applications. The model has inspired numerous derivatives and extensions in the open-source community, including versions optimized for different languages, specialized for particular domains like document understanding, or modified to work with video input rather than static images.
Code Example: Using LLaVA for Multimodal Processing
# Complete LLaVA implementation example
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
# Step 1: Load the pre-trained LLaVA model and processor
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
# Step 2: Prepare the image
image = Image.open("colosseum.jpg")
# Step 3: Define your prompt (LLaVA-1.5 expects the "USER: <image>\n... ASSISTANT:" chat template)
prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"
# Step 4: Process the inputs
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt"
).to(model.device)
# Step 5: Generate the response
with torch.no_grad():
output = model.generate(
**inputs,
max_new_tokens=256,
do_sample=True,
temperature=0.6,
top_p=0.9,
)
# Step 6: Decode and print the response
generated_text = processor.decode(output[0], skip_special_tokens=True)
print(generated_text)
For this example, download the Colosseum image here: https://files.cuantum.tech/images/colosseum.jpg
Code Breakdown: Using LLaVA for Multimodal Processing
This code demonstrates how to use the LLaVA (Large Language and Vision Assistant) model to process images and generate descriptive text. Let's break down each part in detail:
1. Imports and Setup
- torch: The PyTorch library provides tensor computation and neural networks functionality.
- PIL.Image: The Python Imaging Library allows us to open and manipulate image files.
- AutoProcessor: Automatically selects the appropriate processor for the model, handling both text tokenization and image preprocessing.
- LlavaForConditionalGeneration: The main LLaVA model class that combines vision and language capabilities.
2. Model Loading
The code loads the LLaVA 1.5 7B model from Hugging Face, which is a moderate-sized variant balancing performance and resource requirements:
- torch_dtype=torch.float16: Uses half-precision floating-point format to reduce memory usage.
- device_map="auto": Automatically determines the optimal device placement strategy, distributing model components across available GPUs or using CPU as needed.
3. Input Preparation
The code prepares two key inputs:
- An image loaded using PIL's Image.open() function.
- A text prompt that specifies the task ("Describe this image in detail").
The processor then:
- Resizes and normalizes the image to match the vision encoder's expected input format (336x336 pixels for the CLIP ViT-L/14-336 encoder used by LLaVA-1.5).
- Tokenizes the text prompt into input IDs for the language model component.
- Creates attention masks and other required tensor inputs.
4. Generation Process
The model.generate() method creates the text response with several parameters controlling the generation:
- max_new_tokens=256: Limits the response length to a maximum of 256 new tokens.
- do_sample=True: Enables sampling-based generation rather than greedy decoding.
- temperature=0.6: Controls randomness in the generation (lower values are more deterministic).
- top_p=0.9: Implements nucleus sampling, considering only tokens whose cumulative probability exceeds 90%.
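To build intuition for these knobs, the short sketch below (illustrative only, using toy logits rather than real model outputs) shows how temperature reshapes the next-token probability distribution before top-p filtering is applied.
# Sketch: how temperature reshapes a toy next-token distribution before top-p filtering
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])      # toy logits for four candidate tokens
for t in (1.0, 0.6):
    probs = torch.softmax(logits / t, dim=-1)
    print(f"temperature={t}:", probs)
# Lower temperature concentrates probability on the most likely tokens; top_p=0.9 then
# keeps only the smallest set of tokens whose cumulative probability reaches 90%.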
5. Behind the Scenes: How LLaVA Processes the Image
When you run this code, LLaVA performs several sophisticated operations:
- The CLIP vision encoder extracts visual features from the image, creating a high-dimensional representation that captures objects, attributes, spatial relationships, and other visual information.
- The projection layer transforms these visual embeddings into a format compatible with the language model's embedding space, essentially "translating" visual concepts into a language the LLM can understand.
- The Vicuna language model (based on LLaMA) receives both the projected visual embeddings and the tokenized prompt, treating the visual information as special tokens in its context window.
- The self-attention mechanism allows the model to focus on relevant parts of both the image representation and the text prompt when generating each token of the response.
- The decoder generates a coherent, contextually appropriate text response based on both the visual content and the text instruction.
6. Advanced Customization Options
The basic example above can be extended with additional parameters for more control:
# Advanced parameters for more control
output = model.generate(
**inputs,
max_new_tokens=512, # Generate longer responses
do_sample=True, # Enable sampling-based generation
temperature=0.7, # Slightly more creative responses
top_p=0.9, # Nucleus sampling parameter
top_k=50, # Limit vocabulary to top 50 tokens
repetition_penalty=1.2, # Discourage repetition of phrases
length_penalty=1.0, # No penalty based on length
no_repeat_ngram_size=3, # Avoid repeating 3-grams
)
7. Practical Applications
This code structure can be adapted for various multimodal tasks by modifying the prompt:
- Visual question answering: "What color is the car in this image?"
- Image reasoning: "Explain what might happen next in this scene."
- Content extraction: "Extract all text visible in this image."
- Creative generation: "Write a short story inspired by this image."
LLaVA's architecture effectively bridges vision and language, enabling these diverse applications with the same underlying model.
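As a quick illustration, the snippet below reuses the model, processor, and image objects loaded in the example above (assumed to still be in scope) and simply swaps the prompt text to cover several of these tasks.
# Reusing the already-loaded model, processor, and image to try several tasks
prompts = [
    "USER: <image>\nWhat color is the car in this image? ASSISTANT:",
    "USER: <image>\nExtract all text visible in this image. ASSISTANT:",
    "USER: <image>\nWrite a short story inspired by this image. ASSISTANT:",
]

for p in prompts:
    inputs = processor(text=p, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(processor.decode(output[0], skip_special_tokens=True))
    print("-" * 40)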
Advanced Example: Interactive Visual Question Answering with LLaVA
The following code demonstrates a more sophisticated use case for LLaVA: building an interactive visual question answering application that can process uploaded images and answer questions about them in real-time.
# Advanced LLaVA application: Interactive Visual QA with Gradio
import torch
import gradio as gr
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
# Load the LLaVA model and processor
model_id = "llava-hf/llava-1.5-13b-hf" # Using larger 13B parameter version
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
def process_image_and_question(image, question, temperature=0.7, max_length=500):
"""Process an image and a question to generate a response using LLaVA."""
    # Prepare the prompt using LLaVA-1.5's expected chat template
    prompt = f"USER: <image>\n{question} ASSISTANT:"
    # Process inputs
    inputs = processor(
        text=prompt,
        images=image,
        return_tensors="pt"
    ).to(model.device)
# Generate the response
with torch.no_grad():
output = model.generate(
**inputs,
max_new_tokens=max_length,
do_sample=True,
temperature=temperature,
top_p=0.9,
)
# Decode the response
generated_text = processor.decode(output[0], skip_special_tokens=True)
    # Return just the model's answer, removing the echoed prompt before "ASSISTANT:"
    response = generated_text.split("ASSISTANT:")[-1].strip()
return response
# Set up the Gradio interface
with gr.Blocks() as demo:
gr.Markdown("# LLaVA Visual Question Answering")
gr.Markdown("Upload an image and ask a question about it.")
with gr.Row():
with gr.Column():
image_input = gr.Image(type="pil", label="Upload Image")
question_input = gr.Textbox(label="Your Question", placeholder="What's happening in this image?")
temperature = gr.Slider(0.1, 1.0, value=0.7, label="Temperature (creativity)")
max_length = gr.Slider(50, 1000, value=500, step=50, label="Maximum response length")
submit_button = gr.Button("Get Answer")
with gr.Column():
output_text = gr.Textbox(label="LLaVA's Answer", lines=10)
# Connect the interface to the processing function
submit_button.click(
fn=process_image_and_question,
inputs=[image_input, question_input, temperature, max_length],
outputs=output_text
)
# Add example images and questions
gr.Examples(
examples=[
["example_street_scene.jpg", "What safety hazards do you see in this image?"],
["example_chart.jpg", "Explain the main trend shown in this chart."],
["example_food.jpg", "What ingredients might be in this dish?"]
],
inputs=[image_input, question_input]
)
# Launch the application
demo.launch()
For this example, download the required images from these links:
Street Scene: https://files.cuantum.tech/images/example_street_scene.jpg
Chart: https://files.cuantum.tech/images/example_chart.jpg
Food: https://files.cuantum.tech/images/example_food.jpg
Code Breakdown: Interactive Visual QA Application
This advanced example demonstrates how to build a user-friendly application for visual question answering using LLaVA. Let's break down the key components:
1. Model Selection and Setup
- LLaVA 1.5-13B: This code uses the larger 13B parameter version of LLaVA (compared to the 7B in the previous example), which offers improved reasoning capabilities at the cost of requiring more computational resources.
- The same initialization approach is used, with float16 precision and automatic device mapping to optimize for available hardware.
2. Core Processing Function
The process_image_and_question() function handles the core multimodal processing:
- It takes four inputs: an image, a question, and two generation parameters (temperature and max length).
- The question is formatted into a standardized prompt format that helps guide LLaVA's response generation.
- After processing, it extracts just the relevant answer portion, removing the original prompt for a cleaner user experience.
3. Gradio Interface Construction
The code uses Gradio to create an intuitive web interface for the application:
- User inputs: Image upload, question text box, and generation parameter sliders for fine-tuning responses.
- Layout organization: Arranged in a two-column layout for inputs on the left and outputs on the right.
- Examples: Pre-configured example images and questions to demonstrate the system's capabilities.
4. Behind the Scenes: Enhanced Multimodal Processing
When a user interacts with this application, several sophisticated processes occur:
- The uploaded image is automatically preprocessed by the Gradio interface to ensure compatibility with LLaVA's input requirements.
- The LLaVA processor handles both the text tokenization and image preprocessing, ensuring proper alignment between modalities.
- The question is formatted into a directive that helps the model understand the specific visual reasoning task required.
- Generation parameters provide user control over the response style - higher temperature produces more creative but potentially less precise answers.
- Post-processing extracts just the relevant answer, creating a cleaner conversational experience.
5. Potential Applications
This interactive application template could be adapted for numerous real-world use cases:
- Educational tools: Students could upload diagrams or historical images and ask for explanations.
- Accessibility services: Visually impaired users could ask detailed questions about photographs or documents.
- E-commerce: Shoppers could upload product images and ask specific questions about features or compatibility.
- Technical support: Users could share screenshots of error messages or hardware setups and ask for troubleshooting advice.
- Content moderation: Platforms could use a modified version to help analyze uploaded images for policy compliance.
6. Technical Considerations and Limitations
When implementing this type of application, it's important to consider:
- Hardware requirements: The 13B parameter model requires a GPU with at least 24GB VRAM for optimal performance.
- Inference speed: Response generation typically takes 2-10 seconds depending on hardware and response length.
- Image resolution: LLaVA processes images at a fixed resolution (336x336 pixels for LLaVA-1.5's CLIP encoder), which may limit detailed analysis of very small elements.
- Privacy considerations: For sensitive applications, consider running this locally rather than on cloud infrastructure.
This example illustrates how LLaVA's capabilities can be packaged into user-friendly applications that bring multimodal AI's power to non-technical users. The combination of visual understanding, language generation, and interactive controls creates a flexible system for a wide range of visual reasoning tasks.
5.1.2 Flamingo (DeepMind)
Flamingo is a groundbreaking multimodal model developed by DeepMind, specifically engineered to excel at few-shot learning across text and image domains. Unlike models that require extensive task-specific training, Flamingo can adapt to new visual tasks with minimal examples. This represents a significant advancement in multimodal AI, as most earlier systems required dedicated training datasets for each new type of visual reasoning task they needed to perform.
At its architectural core, Flamingo uses a frozen language model (LLM) as its foundation and introduces specialized cross-attention layers that create bridges between visual representations and textual understanding. These cross-attention mechanisms serve as effective translators, allowing visual information to be meaningfully incorporated into the language model's processing pipeline without disrupting its pre-trained linguistic capabilities. The visual processing component of Flamingo utilizes a vision encoder based on a Normalizer-Free ResNet (NFNet), which transforms images into dense feature representations. These visual features are then processed through a perceiver resampler module that converts the variable-sized visual representations into a fixed number of visual tokens that can be efficiently processed by the language model.
What makes Flamingo particularly impressive is its ability to perform "in-context learning" with visual data. It can answer questions about previously unseen image-text tasks with remarkably little training data - often needing just 1-16 examples to achieve strong performance. This capability allows Flamingo to generalize to novel visual reasoning scenarios without extensive retraining, making it adaptable across domains like visual question answering, image captioning, and visual reasoning with minimal setup time. The model was trained on a massive multimodal dataset comprising hundreds of millions of image-text pairs gathered from diverse web sources, enabling it to develop a rich understanding of the relationships between visual and textual concepts.
During inference, Flamingo can process interleaved sequences of images and text, making it particularly well-suited for conversational interactions about visual content. For example, a user could show Flamingo several images of animals with corresponding descriptions as examples, then present a new animal image and ask for a similar description. The model would leverage its few-shot learning capabilities to generate an appropriate response following the pattern established in the examples. This flexibility extends to complex reasoning tasks as well, such as comparing multiple images, answering questions about specific visual details, or even generating creative content inspired by visual inputs.
The model's architecture has inspired subsequent research in efficient multimodal learning, particularly in how to effectively combine pre-trained unimodal models (like vision-only and language-only systems) into powerful multimodal reasoners without requiring extensive joint training from scratch. This approach has proven valuable for developing more accessible multimodal AI systems while leveraging the strengths of specialized models in each modality.
Flamingo Implementation Example: Multimodal Few-shot Learning
Below is a simplified implementation example of a Flamingo-inspired architecture using PyTorch. This example demonstrates the core components of Flamingo: a vision encoder, a perceiver resampler, and cross-attention layers integrated with a language model.
import torch
import torch.nn as nn
import torchvision.models as models
from transformers import GPT2LMHeadModel, GPT2Tokenizer
class PerceiverResampler(nn.Module):
"""
Perceiver Resampler module that converts variable-sized visual features
to a fixed number of tokens that can be processed by the language model.
"""
def __init__(self, input_dim=2048, latent_dim=768, num_latents=64, num_layers=4):
super().__init__()
self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
self.layers = nn.ModuleList([
nn.MultiheadAttention(embed_dim=latent_dim, num_heads=8, batch_first=True)
for _ in range(num_layers)
])
self.input_proj = nn.Linear(input_dim, latent_dim)
self.norm = nn.LayerNorm(latent_dim)
def forward(self, visual_features):
# Project visual features to latent dimension
visual_features = self.input_proj(visual_features)
# Expand latents to batch size
batch_size = visual_features.shape[0]
latents = self.latents.unsqueeze(0).expand(batch_size, -1, -1)
# Process through cross-attention layers
for layer in self.layers:
latents = latents + layer(
query=latents,
key=visual_features,
value=visual_features,
need_weights=False
)[0]
latents = self.norm(latents)
return latents
class CrossAttentionBlock(nn.Module):
"""
Cross-attention block that integrates visual information into the LLM.
"""
def __init__(self, hidden_size=768, num_heads=12):
super().__init__()
self.cross_attention = nn.MultiheadAttention(
embed_dim=hidden_size,
num_heads=num_heads,
batch_first=True
)
self.layer_norm1 = nn.LayerNorm(hidden_size)
self.layer_norm2 = nn.LayerNorm(hidden_size)
def forward(self, hidden_states, visual_features):
normed_hidden_states = self.layer_norm1(hidden_states)
# Apply cross-attention
attn_output = self.cross_attention(
query=normed_hidden_states,
key=visual_features,
value=visual_features,
need_weights=False
)[0]
# Residual connection and layer norm
hidden_states = hidden_states + attn_output
hidden_states = self.layer_norm2(hidden_states)
return hidden_states
class FlamingoModel(nn.Module):
"""
Simplified Flamingo model combining vision encoder, perceiver resampler,
and a language model with cross-attention layers.
"""
def __init__(self, vision_model_name="resnet50", num_visual_tokens=64):
super().__init__()
# Vision encoder (frozen)
self.vision_encoder = models.__dict__[vision_model_name](pretrained=True)
self.vision_encoder.fc = nn.Identity() # Remove classification head
for param in self.vision_encoder.parameters():
param.requires_grad = False
# Perceiver resampler
self.perceiver = PerceiverResampler(
input_dim=2048, # ResNet50 feature dim
latent_dim=768, # Match GPT2 hidden size
num_latents=num_visual_tokens
)
# Language model (frozen)
self.language_model = GPT2LMHeadModel.from_pretrained("gpt2")
self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
self.tokenizer.pad_token = self.tokenizer.eos_token
for param in self.language_model.parameters():
param.requires_grad = False
# Cross-attention layers (one per transformer block)
self.cross_attentions = nn.ModuleList([
CrossAttentionBlock(hidden_size=768, num_heads=12)
for _ in range(len(self.language_model.transformer.h))
])
# Save original forward methods
self.original_block_forward = self.language_model.transformer.h[0].forward
# Monkey patch the transformer blocks to include cross-attention
for i, block in enumerate(self.language_model.transformer.h):
block.flamingo_cross_attn = self.cross_attentions[i]
block.forward = self._make_new_forward(block, i)
# Visual features buffer for storing current visual context
self.register_buffer("visual_features", None, persistent=False)
def _make_new_forward(self, block, block_index):
"""Creates a new forward method for transformer blocks that includes cross-attention."""
original_forward = block.forward
cross_attn = self.cross_attentions[block_index]
        def new_forward(x, **kwargs):
            # Run the original transformer block
            outputs = original_forward(x, **kwargs)
            hidden_states = outputs[0] if isinstance(outputs, tuple) else outputs
            # Inject visual information via cross-attention when an image is in context
            if self.visual_features is not None:
                hidden_states = cross_attn(hidden_states, self.visual_features)
            # Preserve any extra outputs (e.g., key/value caches) the language model expects
            if isinstance(outputs, tuple):
                return (hidden_states,) + outputs[1:]
            return hidden_states
return new_forward
def process_images(self, images):
"""Extract visual features from images and prepare them for conditioning."""
with torch.no_grad():
# Extract features from vision encoder
features = self.vision_encoder(images) # [batch_size, 2048]
features = features.unsqueeze(1) # Add sequence dimension [batch_size, 1, 2048]
# Process through perceiver resampler
visual_tokens = self.perceiver(features) # [batch_size, num_latents, hidden_size]
# Store visual features for cross-attention
self.visual_features = visual_tokens
def generate(self, prompt, images=None, max_length=100, temperature=0.7):
"""Generate text conditioned on images and text prompt."""
# Process images if provided
if images is not None:
self.process_images(images)
else:
self.visual_features = None
# Tokenize prompt
inputs = self.tokenizer(prompt, return_tensors="pt", padding=True)
input_ids = inputs.input_ids.to(next(self.parameters()).device)
attention_mask = inputs.attention_mask.to(next(self.parameters()).device)
# Generate text
output_ids = self.language_model.generate(
input_ids,
attention_mask=attention_mask,
max_length=max_length,
do_sample=True,
temperature=temperature,
top_p=0.9,
)
# Decode output
generated_text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
return generated_text
# Example usage
def flamingo_example():
from PIL import Image
import torchvision.transforms as transforms
# Initialize model
model = FlamingoModel().to("cuda" if torch.cuda.is_available() else "cpu")
# Prepare image transform
transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Load and process image
image = Image.open("eiffel-tower.jpg")
image_tensor = transform(image).unsqueeze(0).to(next(model.parameters()).device)
# Example prompts for few-shot learning
few_shot_prompt = """
Image: [A photo of a busy street in Tokyo]
Description: The image shows a crowded street in Tokyo with neon signs, many pedestrians, and small restaurants.
Image: [A photo of the Grand Canyon]
Description: The image depicts the vast expanse of the Grand Canyon with its layered rock formations and deep ravines.
Image: [Current image]
Description:
"""
# Generate text based on image
output = model.generate(few_shot_prompt, images=image_tensor, max_length=200)
print(output)
if __name__ == "__main__":
flamingo_example()
For this example, download the Eiffel Tower image here: https://files.cuantum.tech/images/eiffel-tower.jpg
Code Breakdown: Flamingo-inspired Multimodal Model
The above implementation represents a simplified version of DeepMind's Flamingo architecture. Let's break down the key components:
1. Architecture Components
- Vision Encoder: A pretrained ResNet50 model that extracts visual features from images. In the full Flamingo model, this would be a more advanced vision model like NFNet.
- Perceiver Resampler: This critical component transforms variable-sized visual features into a fixed number of visual tokens. It uses cross-attention between learned latent vectors and visual features to condense the visual information.
- Language Model: A pretrained GPT-2 model serves as the language foundation. The original Flamingo used a more powerful Chinchilla LLM.
- Cross-Attention Layers: These layers are inserted into each transformer block of the language model, allowing visual information to influence text generation at multiple levels of processing.
2. Key Design Decisions
- Frozen Backbone Models: Both the vision encoder and language model are kept frozen, preserving their pretrained capabilities while only training the connecting components.
- Parameter Efficiency: By only training the perceiver resampler and cross-attention layers, Flamingo achieves multimodal capabilities with relatively few trainable parameters.
- Monkey Patching: The implementation uses a technique called "monkey patching" to insert cross-attention into the language model without modifying its original architecture.
3. How Visual Processing Works
- The image is passed through the vision encoder to extract high-level visual features (2048-dimensional for ResNet50).
- These features are then processed by the perceiver resampler, which condenses them into a fixed set of tokens (64 in this example).
- The resulting visual tokens are stored in a buffer and made available to all cross-attention layers during text generation.
4. How Few-Shot Learning Is Implemented
- The example demonstrates few-shot learning through a carefully formatted prompt containing example image-text pairs.
- Each example follows a pattern of "Image: [description]" followed by "Description: [detailed text]".
- The final prompt ends with "Image: [Current image]" and "Description:", prompting the model to generate a description for the new image following the pattern established by the examples.
- This in-context learning approach allows the model to adapt to specific tasks without parameter updates.
5. Practical Considerations and Limitations
- Computational Efficiency: The real Flamingo model uses sophisticated techniques for handling larger contexts and more efficiently processing visual information.
- Training Requirements: To fully train this model, you would need a large dataset of image-text pairs and significant computational resources.
- Simplified Architecture: This example omits some details of the full Flamingo architecture for clarity, such as gated cross-attention and more advanced visual processing.
6. Real-world Applications
- Visual question answering: Answering specific questions about image content with few or no examples.
- Image captioning: Generating detailed descriptions of images in various styles based on examples.
- Visual reasoning: Performing complex reasoning tasks about visual content, such as comparing images or identifying relationships.
- Multimodal chat: Enabling conversational interactions that seamlessly incorporate visual information.
This implementation provides a starting point for understanding and experimenting with Flamingo-style multimodal architectures. The real power of such models comes from their ability to perform in-context learning across modalities, adapting to new tasks with minimal examples.
Enhanced Flamingo Implementation with In-Context Learning
Let's explore a more comprehensive implementation of the Flamingo architecture that better demonstrates its in-context learning capabilities for visual question answering:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer, ViTModel, ViTImageProcessor
from PIL import Image
import requests
from io import BytesIO
class GatedCrossAttentionBlock(nn.Module):
"""
Enhanced cross-attention block with gating mechanism as used in Flamingo.
"""
def __init__(self, hidden_size=768, num_heads=12):
super().__init__()
self.hidden_size = hidden_size
self.cross_attention = nn.MultiheadAttention(
embed_dim=hidden_size,
num_heads=num_heads,
batch_first=True
)
# Gating mechanism
self.gate = nn.Linear(hidden_size, hidden_size)
self.gate_activation = nn.Sigmoid()
# Layer normalization
self.layer_norm1 = nn.LayerNorm(hidden_size)
self.layer_norm2 = nn.LayerNorm(hidden_size)
def forward(self, hidden_states, visual_features):
normed_hidden_states = self.layer_norm1(hidden_states)
# Apply cross-attention
attn_output, _ = self.cross_attention(
query=normed_hidden_states,
key=visual_features,
value=visual_features
)
# Apply gating mechanism
gate_values = self.gate_activation(self.gate(normed_hidden_states))
attn_output = gate_values * attn_output
# Residual connection and layer norm
hidden_states = hidden_states + attn_output
hidden_states = self.layer_norm2(hidden_states)
return hidden_states
class PerceiverResampler(nn.Module):
"""
Perceiver Resampler that converts variable-length visual features into
a fixed number of tokens through cross-attention with learned queries.
"""
def __init__(self, input_dim=768, latent_dim=768, num_latents=64, num_layers=4):
super().__init__()
self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
self.layers = nn.ModuleList([
nn.MultiheadAttention(
embed_dim=latent_dim,
num_heads=8,
batch_first=True
)
for _ in range(num_layers)
])
self.input_projection = nn.Linear(input_dim, latent_dim)
self.layer_norm = nn.LayerNorm(latent_dim)
def forward(self, x):
batch_size = x.shape[0]
# Project input features to match latent dimension
x = self.input_projection(x)
# Expand latents for each item in the batch
latents = self.latents.unsqueeze(0).expand(batch_size, -1, -1)
# Apply layers of cross-attention
for layer in self.layers:
latents, _ = layer(
query=latents,
key=x,
value=x
)
latents = self.layer_norm(latents)
return latents
class EnhancedFlamingoModel(nn.Module):
"""
Enhanced Flamingo model with improved components for in-context learning
and visual question answering tasks.
"""
def __init__(self, num_visual_tokens=64, vision_model_name="google/vit-base-patch16-224"):
super().__init__()
# Vision encoder (frozen ViT)
self.vision_encoder = ViTModel.from_pretrained(vision_model_name)
self.vision_processor = ViTImageProcessor.from_pretrained(vision_model_name)
for param in self.vision_encoder.parameters():
param.requires_grad = False
# Perceiver resampler
self.perceiver = PerceiverResampler(
input_dim=768, # ViT feature dim
latent_dim=768, # Match GPT2 hidden size
num_latents=num_visual_tokens,
num_layers=4
)
# Language model (frozen GPT-2)
self.language_model = GPT2LMHeadModel.from_pretrained("gpt2")
self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
self.tokenizer.pad_token = self.tokenizer.eos_token
# Keep LM frozen except for final layer norm and unembedding
for name, param in self.language_model.named_parameters():
if "ln_f" in name or "wte" in name:
param.requires_grad = True
else:
param.requires_grad = False
# Special tokens for marking image inputs
self.image_start_token = "<image>"
self.image_end_token = "</image>"
# Add special tokens to vocabulary
special_tokens = {"additional_special_tokens": [self.image_start_token, self.image_end_token]}
num_added = self.tokenizer.add_special_tokens(special_tokens)
self.language_model.resize_token_embeddings(len(self.tokenizer))
# Cross-attention blocks
self.cross_attentions = nn.ModuleList([
GatedCrossAttentionBlock(hidden_size=768, num_heads=12)
for _ in range(len(self.language_model.transformer.h))
])
# Create image token ID
self.image_start_token_id = self.tokenizer.convert_tokens_to_ids(self.image_start_token)
self.image_end_token_id = self.tokenizer.convert_tokens_to_ids(self.image_end_token)
# Register hook to modify the transformer layers
for i, block in enumerate(self.language_model.transformer.h):
block.register_forward_hook(self._make_cross_attention_hook(i))
# Buffer for storing visual features
self.register_buffer("visual_features", None, persistent=False)
def _make_cross_attention_hook(self, block_idx):
"""Create a forward hook for adding cross-attention at specified layer."""
cross_attn = self.cross_attentions[block_idx]
def hook(module, inputs, outputs):
if self.visual_features is None:
return outputs
hidden_states = outputs[0] if isinstance(outputs, tuple) else outputs
modified_hidden_states = cross_attn(hidden_states, self.visual_features)
if isinstance(outputs, tuple):
return (modified_hidden_states,) + outputs[1:]
return modified_hidden_states
return hook
def _encode_image(self, image_tensor):
"""Process a single image through the vision encoder and perceiver."""
with torch.no_grad():
vision_outputs = self.vision_encoder(image_tensor)
hidden_states = vision_outputs.last_hidden_state
# Process through perceiver resampler to get fixed number of tokens
visual_tokens = self.perceiver(hidden_states)
return visual_tokens
def _encode_images_batch(self, image_list):
"""Process a batch of images through the vision pipeline."""
processed_images = []
for image in image_list:
if isinstance(image, str):
# Load from URL if string
response = requests.get(image)
img = Image.open(BytesIO(response.content))
else:
# Assume PIL Image otherwise
img = image
# Preprocess for vision model
processed = self.vision_processor(img, return_tensors="pt")
processed_images.append(processed["pixel_values"])
# Stack into batch
image_tensors = torch.cat(processed_images, dim=0).to(next(self.parameters()).device)
return self._encode_image(image_tensors)
def format_prompt_with_images(self, text_prompt, images):
"""Format a prompt with image placeholders and encode the images."""
# Encode images first
self.visual_features = self._encode_images_batch(images)
# Replace placeholders with special tokens
formatted_prompt = text_prompt.replace("[IMAGE]", f"{self.image_start_token}{self.image_end_token}")
return formatted_prompt
def generate_answer(self, prompt, images=None, max_length=200, temperature=0.7):
"""Generate an answer for a visual question answering prompt with images."""
if images:
prompt = self.format_prompt_with_images(prompt, images)
# Tokenize prompt
inputs = self.tokenizer(prompt, return_tensors="pt").to(next(self.parameters()).device)
# Generate text
with torch.no_grad():
output_ids = self.language_model.generate(
inputs.input_ids,
max_length=max_length,
do_sample=True,
temperature=temperature,
top_p=0.9,
pad_token_id=self.tokenizer.eos_token_id
)
# Get only the generated text (not the prompt)
generated_ids = output_ids[0][inputs.input_ids.shape[1]:]
generated_text = self.tokenizer.decode(generated_ids, skip_special_tokens=True)
# Clear visual features after generation
self.visual_features = None
return generated_text.strip()
def run_visual_qa_demo():
"""Demonstrate visual question answering with the Flamingo model."""
# Initialize model
model = EnhancedFlamingoModel().to("cuda" if torch.cuda.is_available() else "cpu")
# Example images (use URLs for convenience)
example_images = [
"https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg", # Image of a dog on a beach
"https://files.cuantum.tech/images/dog_drawing.jpg" # Drawing of a dog
]
# Few-shot prompt for VQA
few_shot_prompt = """
I will answer questions about images.
[IMAGE]
Question: What animal is in the image?
Answer: The image shows a dog running on the beach. It appears to be a golden retriever enjoying the sand and ocean.
[IMAGE]
Question: What is this a drawing of?
Answer: This is a simple drawing of a dog. It appears to be a cartoon-style sketch with basic lines representing a dog's features.
[IMAGE]
Question: What is shown in this image?
Answer:
"""
# New test image (Eiffel Tower)
test_image = "https://files.cuantum.tech/images/eiffel-tower.jpg"
# Generate answer
answer = model.generate_answer(
few_shot_prompt,
images=example_images + [test_image],
max_length=100
)
print("Model's answer:", answer)
if __name__ == "__main__":
run_visual_qa_demo()
Code Breakdown: Advanced Flamingo Implementation
This enhanced implementation of the Flamingo architecture includes several important improvements that make it more similar to the original DeepMind model:
1. Key Architecture Enhancements
- Gated Cross-Attention: Unlike the basic implementation, this version includes a gating mechanism that controls how much visual information flows into the language model at each layer. This prevents visual information from dominating and allows for more nuanced integration (a stripped-down sketch of the gating idea follows this list).
- Multi-layer Perceiver Resampler: The perceiver now uses multiple layers of cross-attention to refine the visual tokens, creating a more sophisticated visual representation.
- ViT Vision Encoder: Uses a modern Vision Transformer instead of ResNet, providing better visual feature extraction.
- Special Tokens: Adds special image tokens to the vocabulary, allowing the model to recognize where images appear in the context.
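To make the gating idea concrete, here is a stripped-down conceptual sketch. It is not the GatedCrossAttentionBlock instantiated in the code above, only an illustration of why a zero-initialized tanh gate lets the frozen language model start out unchanged and gradually admit visual information as training proceeds.
import torch
import torch.nn as nn
class TinyGatedCrossAttention(nn.Module):
    """Conceptual sketch of tanh-gated cross-attention (illustration only)."""
    def __init__(self, hidden_size: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # The gate starts at zero, so tanh(gate) = 0 and the block is initially a no-op:
        # the frozen language model behaves exactly as before training begins.
        self.gate = nn.Parameter(torch.zeros(1))
    def forward(self, text_states: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=text_states, key=visual_tokens, value=visual_tokens)
        # tanh(gate) lies in (-1, 1) and controls how much visual signal is mixed in per layer.
        return text_states + torch.tanh(self.gate) * attended
Because tanh(0) = 0, the residual path initially passes the text states through untouched; the gate only opens as far as the training signal pushes it.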
2. In-Context Learning Implementation
- Few-Shot Visual QA: The prompt structure demonstrates how Flamingo enables few-shot learning by showing examples of image-question-answer triplets.
- Image Placeholders: Uses [IMAGE] placeholders in the prompt that get replaced with special tokens, mimicking how the real Flamingo handles multiple images in context.
- Contextual Memory: The model processes multiple images and remembers their features during generation, allowing it to reference different examples.
3. Technical Implementation Details
- Forward Hooks: Uses PyTorch hooks instead of monkey patching to inject cross-attention into the transformer blocks, which is a cleaner implementation.
- Selective Fine-tuning: Only certain parts of the language model are trainable (final layer norm and embedding), while keeping most parameters frozen.
- Batched Image Processing: Handles multiple images efficiently by batching them through the vision pipeline.
4. User-Friendly Features
- URL Image Loading: Supports loading images directly from URLs, making demonstrations easier.
- Structured API: Provides a clean interface for formatting prompts with images and generating answers.
- Memory Management: Clears visual features after generation to free up memory.
5. Real-world Applications
This implementation demonstrates how Flamingo can be used for:
- Visual Question Answering: Answering specific questions about image content.
- Few-Shot Learning: Learning new tasks from just a few examples without parameter updates.
- Multi-image Reasoning: Processing information across multiple images to provide coherent answers.
The enhanced implementation shows how multimodal models can maintain the powerful in-context learning capabilities of large language models while incorporating rich visual information. This approach allows for flexible adaptation to new visual tasks without specialized fine-tuning, making it particularly valuable for real-world applications.
5.1.3 GPT-5 (OpenAI)
Launched on August 7, 2025, GPT-5 marks a new milestone in OpenAI’s large language model lineage. It is the first fully native multimodal model, trained jointly on text, images, and audio from the ground up, with a composed system design that integrates fast responses, deep reasoning, and intelligent routing. More than an incremental upgrade over GPT-4o, GPT-5 represents a paradigm shift: a model architected from the beginning to process and reason across modalities as a unified whole.
Native Multimodal Architecture
Unlike earlier models that retrofitted speech or vision modules onto a text-first transformer, GPT-5 is fundamentally multimodal. Text, image, and audio are processed in the same transformer backbone, creating shared internal representations that seamlessly connect concepts across formats.
This design produces fluid cross-modal reasoning. For example, if a user submits a photo of a math problem, GPT-5 not only recognizes the characters but also interprets the underlying mathematical structure. It then generates a step-by-step solution that references specific symbols in the image, checks for ambiguities, and explains the reasoning in natural language. This integrated comprehension extends to scientific diagrams, financial charts, architectural blueprints, and medical imagery.
By aligning modalities during training, GPT-5 develops deeper semantic coherence—understanding how textual descriptions, visual data, and spoken language reinforce or contradict each other. It can, for instance, highlight inconsistencies between a historical photograph and a written account, or correlate radiology images with patient notes.
Composed System and Intelligent Routing
GPT-5 is not a monolithic model but a composed system:
- A main fast model handles everyday queries with low latency.
- A thinking model engages when complex, multi-step reasoning is required, offering real-time chain-of-thought.
- Mini and nano variants optimize cost and speed for lightweight applications.
- A Pro reasoning variant (API only) extends test-time reasoning for the hardest problems.
An intelligent router automatically decides which component to use, sparing users from manually picking between “light” and “heavy” models. This dynamic composition ensures efficiency for simple prompts and depth for challenging ones.
Reasoning and Context Management
With real-time chain-of-thought reasoning, GPT-5 excels in tasks that require logic, multi-step deduction, or tool use. On external benchmarks, it sets new records: 74.9% accuracy on SWE-bench Verified (software engineering) and 88% on Aider polyglot (code editing).
The model’s expanded context window—up to 400,000 tokens via the API, with output lengths of up to 128,000 tokens—supports the analysis of entire books, multi-hour meetings, or large codebases without losing track of earlier information. This scale makes it suitable for legal discovery, research synthesis, and full-repository debugging.
Voice and Multilingual Capabilities
Through the Realtime API, GPT-5 offers natural speech-in/speech-out interactions with millisecond-level latency. The voice system is robust to accents, can modulate tone on command, and integrates with SIP protocols, enabling real-world phone calls and live agents. Users can now hold fluid conversations where GPT-5 reasons, speaks, and listens in real time.
Multilingual fluency has also advanced, making GPT-5 a practical tool for cross-border communication, customer support, education, and accessibility.
Developer Controls and Tool Integration
Developers gain fine-grained control via new parameters:
- reasoning_effort: from minimal (fast) to extensive (deep reasoning).
- verbosity: low, medium, or high detail in responses.
The API exposes three model families—gpt-5, gpt-5-mini, and gpt-5-nano—to balance accuracy, cost, and latency. Pricing (per million tokens) at launch was $1.25 input / $10 output for GPT-5, with cheaper mini and nano tiers.
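As a minimal sketch of these controls in a request, the snippet below passes reasoning_effort and verbosity alongside the model choice. The placement of these fields in the request body, and the example prompt, are assumptions based on the description above; check the current API reference before relying on them.
import os
import requests
# Minimal sketch: passing the GPT-5 control parameters described above.
# The field names (reasoning_effort, verbosity) and their placement in the body
# are assumptions based on this section; verify against the API reference.
payload = {
    "model": "gpt-5-mini",  # swap for "gpt-5" or "gpt-5-nano" to trade cost against accuracy
    "messages": [
        {"role": "user", "content": "Outline a migration plan from Python 3.8 to 3.12."}
    ],
    "reasoning_effort": "minimal",  # "minimal" for fast answers, higher values for deeper reasoning
    "verbosity": "low",             # "low", "medium", or "high" level of detail
}
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json=payload,
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])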
GPT-5 also supports custom tools: lightweight, plaintext tool calls with optional grammar constraints, allowing more reliable integration with external APIs. Enterprises can connect GPT-5 directly into Microsoft Copilot, Apple Intelligence, GitLab, Notion, and custom pipelines.
Accuracy, Safety, and Bias Reduction
OpenAI introduced safe-completions training in GPT-5. Instead of choosing between over-compliance and refusal, the model aims to generate the safest useful answer. Internal evaluations show:
- Substantially fewer hallucinations than GPT-4o.
- Lower sycophancy (over-agreeableness).
- Reduced deception, meaning the model is less likely to feign success on impossible tasks.
Safety frameworks classify GPT-5 Thinking as High capability in biology and chemistry, with layered safeguards, red-teaming, and monitoring.
Use Cases and Industry Impact
- Coding & Engineering: GPT-5 generates functional front-end code, debugs large repositories, and coordinates multi-tool development workflows.
- Automation & Productivity: From grading and summarizing to document review, it frees human bandwidth for higher-order work.
- Knowledge Work: Enterprises use GPT-5 for legal analysis, financial reporting, and R&D, where its long context and reasoning shine.
- Creative Workflows: Designers, writers, and researchers can mix text, images, and audio in prompts—e.g., analyzing a chart and drafting a report in one go.
- Voice Agents: Customer service and sales teams deploy GPT-5 via Realtime API to deliver human-like support, capturing alphanumeric details and following strict protocols.
The New Standard
GPT-5 establishes a new baseline for large multimodal models. Its unified architecture, dynamic routing, reasoning capabilities, and developer controls make it a versatile foundation for both consumer and enterprise AI. By natively fusing text, vision, and audio, GPT-5 doesn’t just respond across modalities—it reasons through them, enabling a generation of AI systems that operate more like collaborators than tools.
Basic Example: Multimodal Prompt with JSON Output (Chat Completions API)
A beginner-friendly example showing how to send an image and text together and receive a structured JSON response.
import requests
import json # You need this to parse the JSON string from the response
API_KEY = "YOUR_OPENAI_API_KEY"
# Use the correct API endpoint
API_URL = "https://api.openai.com/v1/chat/completions"
headers = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
}
# Example: Provide an image URL and a text query jointly
# Corrected input structure using 'type' and 'image_url' keys
image_part = {
"type": "image_url",
"image_url": {
"url": "https://files.cuantum.tech/images/chart.png" # Can also use a data URL for base64 images
}
}
# Corrected text part structure
text_part = {
"type": "text",
"text": "Summarize the main trend shown in the chart. Also, generate the Python code to recreate this visualization. Format the response as a JSON object with the keys 'summary', 'python_code', and 'key_points'."
}
# Corrected payload
payload = {
"model": "gpt-5",
"messages": [
{
"role": "user",
"content": [
image_part,
text_part
]
}
],
# Correct way to request JSON output
"response_format": { "type": "json_object" },
# The max_tokens parameter is standard
"max_tokens": 400
}
response = requests.post(API_URL, headers=headers, json=payload)
result = response.json()
# Correct way to handle the API response
try:
# The API returns a JSON string inside the message content, so we parse it
response_content = result['choices'][0]['message']['content']
parsed_output = json.loads(response_content)
# Print structured output from the parsed JSON
print("Summary:", parsed_output.get("summary"))
print("Python code:", parsed_output.get("python_code"))
print("Key points:", parsed_output.get("key_points"))
except (KeyError, IndexError, json.JSONDecodeError) as e:
print("Error parsing the API response:", e)
print("Raw response:", result)
Code Breakdown
This example demonstrates how to send a multimodal request to OpenAI's GPT-5 model, combining an image URL with a text query, and specifically asking for a structured JSON response.
1. Import Libraries
import requests
import json
- requests: This library is essential for making HTTP requests in Python. We use it to send our data to the OpenAI API and receive the response.
- json: This library is used for working with JSON (JavaScript Object Notation) data. We'll use it to construct our request payload and, critically, to parse the JSON string that GPT-5 returns when we ask for structured output.
2. API Configuration
API_KEY = "YOUR_OPENAI_API_KEY"
API_URL = "https://api.openai.com/v1/chat/completions"API_KEY: This is a placeholder for your unique OpenAI API key. You must replace"YOUR_OPENAI_API_KEY"with your actual key, which you can obtain from the OpenAI developer dashboard. This key authenticates your requests.API_URL: This is the specific endpoint for OpenAI's chat completion API. All conversational and multimodal requests go to this URL. It's crucial that this is correct.
3. Request Headers
headers = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
}
- headers: This dictionary contains metadata sent with our HTTP request.
- "Authorization": f"Bearer {API_KEY}": This header authenticates your request using your API key. The Bearer token prefix is a standard from OAuth 2.0.
- "Content-Type": "application/json": This header tells the server that the body of our request is formatted as JSON.
4. Defining Multimodal Input Parts
GPT-5 can process different types of input simultaneously. Here, we define an image and a text part.
image_part = {
"type": "image_url",
"image_url": {
"url": "https://files.cuantum.tech/images/chart.png"
}
}
- image_part: This dictionary represents the visual input.
- "type": "image_url": Specifies that this content block is an image provided via a URL.
- "image_url": {"url": "..."}: This nested structure is where the actual image URL is provided. The model will fetch and process the image from this link. You could also provide base64-encoded images here instead of a URL.
text_part = {
"type": "text",
"text": "Summarize the main trend shown in the chart. Also, generate the Python code to recreate this visualization. Format the response as a JSON object with the keys 'summary', 'python_code', and 'key_points'."
}
- text_part: This dictionary holds the textual instruction for the model.
- "type": "text": Indicates this content block is plain text.
- "text": "...": This is the actual prompt to GPT-5. Notice how we explicitly ask for a JSON object with specific keys (summary, python_code, key_points). This is crucial for getting structured output from the model.
5. Constructing the Request Payload
This is the main body of the request, containing all the instructions for the API.
payload = {
"model": "gpt-5",
"messages": [
{
"role": "user",
"content": [
image_part,
text_part
]
}
],
"response_format": { "type": "json_object" },
"max_tokens": 400
}"model": "gpt-5": Specifies which OpenAI model to use. In this case, it's the latest GPT-5."messages": [...]: This is a list of message objects, forming the conversation.- Each message has a
"role"(e.g.,"user","system","assistant") and"content". "role": "user": Indicates that this message comes from the user."content": [image_part, text_part]: This is the crucial part for multimodal input. Thecontentis a list containing both ourimage_partandtext_partdictionaries. The model will process them together.
- Each message has a
"response_format": { "type": "json_object" }: This parameter explicitly tells the API to constrain the model's output to a valid JSON object. This is essential when you want structured data back from the model, as we requested in ourtext_part."max_tokens": 400: Sets the maximum number of tokens (words or word pieces) the model should generate in its response. This helps control cost and response length.
6. Sending the Request
response = requests.post(API_URL, headers=headers, json=payload)
result = response.json()
- requests.post(...): This function sends an HTTP POST request to the API_URL with our headers and the payload (converted to JSON by requests.post).
- response.json(): The API's reply comes back as a JSON string. This method parses that string into a Python dictionary, making it easy to access the data.
7. Handling and Parsing the Response
The API's response structure is standard, but the actual content we asked GPT-5 to generate is nested within it as a string.
try:
response_content = result['choices'][0]['message']['content']
parsed_output = json.loads(response_content)
print("Summary:", parsed_output.get("summary"))
print("Python code:", parsed_output.get("python_code"))
print("Key points:", parsed_output.get("key_points"))
except (KeyError, IndexError, json.JSONDecodeError) as e:
print("Error parsing the API response:", e)
print("Raw response:", result)try...except: This block is crucial for robust error handling. API calls can fail for many reasons (network issues, incorrect API key, malformed requests, or the model might not return valid JSON).result['choices'][0]['message']['content']: This is the path to extract the actual text generated by GPT-5.result['choices']: The API can return multiplechoices(different possible completions) based on parameters liken. We usually take the first one ([0]).['message']: Within each choice, themessageobject contains therole(e.g., "assistant") and the generatedcontent.
json.loads(response_content): Since we specifically asked the model to format its output as a JSON string within thecontentfield, we need to usejson.loads()to parse this string into a Python dictionary.parsed_output.get("summary"),parsed_output.get("python_code"),parsed_output.get("key_points"): Onceresponse_contentis parsed into a dictionary, we can access the individual fields we requested from GPT-5. Using.get()is safer than direct dictionary access ([]) as it preventsKeyErrorif a key is missing.- The
exceptblock catches potential errors during parsing or if the expected keys are not found, printing both the error and the raw API response for debugging.
Advanced Example: Production-Ready Multimodal Workflow (Responses API with JSON Schema)
A robust example demonstrating best practices for reliability, schema validation, retries, and safe execution of returned code.
"""
Multimodal (image + text) → structured JSON with GPT-5
- Uses the Responses API (recommended)
- Strict JSON schema for reliable structured output
- Optional: safely execute returned Matplotlib code in a subprocess to render a PNG
"""
import os
import json
import time
import base64
import requests
import tempfile
import subprocess
import sys
from textwrap import dedent
from typing import Dict, Any, List, Optional
# =========================
# Configuration
# =========================
API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
API_URL = "https://api.openai.com/v1/responses"
MODEL = "gpt-5" # or: gpt-5-mini / gpt-5-nano
HEADERS = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json",
}
# Use a public image URL OR a local file encoded as a data URL (see helper below).
IMAGE_URL = "https://cdn.example.com/chart.png" # <- replace for your test
# Strict JSON schema for the model’s response
RESPONSE_SCHEMA: Dict[str, Any] = {
"name": "ChartInsight",
"schema": {
"type": "object",
"properties": {
"summary": {"type": "string"},
"python_code": {"type": "string"},
"key_points": {
"type": "array",
"items": {"type": "string"},
"minItems": 3,
"maxItems": 7
}
},
"required": ["summary", "python_code", "key_points"],
"additionalProperties": False
},
"strict": True
}
PROMPT_TEXT = (
"You are a meticulous data analyst.\n"
"Tasks:\n"
"1) Summarize the main trend in the chart.\n"
"2) Generate minimal, runnable Python (matplotlib) code that recreates a similar visualization "
" using inferred placeholder data. Include clear axis labels and a title.\n"
"3) Provide 3–7 bullet key points.\n"
"Return a JSON object that matches the provided JSON schema exactly."
)
# =========================
# Helpers
# =========================
def local_image_to_data_url(path: str, mime: Optional[str] = None) -> str:
"""
Convert a local image file to a data URL usable as an image input.
Example usage:
IMAGE_URL = local_image_to_data_url("chart.png")
"""
if not mime:
# naive mime inference by extension
ext = os.path.splitext(path)[1].lower()
mime = "image/png" if ext in [".png"] else "image/jpeg"
with open(path, "rb") as f:
b64 = base64.b64encode(f.read()).decode("utf-8")
return f"data:{mime};base64,{b64}"
def build_payload(image_url: str) -> Dict[str, Any]:
"""
Build a Responses API payload with multimodal input and JSON schema output.
"""
return {
"model": MODEL,
"input": [
{
"role": "user",
"content": [
{"type": "input_image", "image_url": {"url": image_url}},
{"type": "input_text", "text": PROMPT_TEXT}
]
}
],
"response_format": {
"type": "json_schema",
"json_schema": RESPONSE_SCHEMA
},
"max_output_tokens": 900,
"temperature": 0.2
}
def post_with_retries(
url: str,
headers: Dict[str, str],
json_payload: Dict[str, Any],
retries: int = 3,
backoff: float = 1.5,
timeout: int = 60
) -> Dict[str, Any]:
"""
POST with simple exponential backoff for rate limits / transient errors.
"""
for attempt in range(1, retries + 1):
try:
resp = requests.post(url, headers=headers, json=json_payload, timeout=timeout)
if resp.status_code == 200:
return resp.json()
# Retry on typical transient statuses
if resp.status_code in (429, 500, 502, 503, 504):
time.sleep(backoff ** attempt)
continue
raise RuntimeError(f"HTTP {resp.status_code}: {resp.text}")
except requests.exceptions.Timeout as e:
if attempt == retries:
raise
time.sleep(backoff ** attempt)
except requests.exceptions.RequestException as e:
if attempt == retries:
raise
time.sleep(backoff ** attempt)
raise RuntimeError("Request failed after retries")
def parse_responses_api_json(result: Dict[str, Any]) -> Dict[str, Any]:
"""
Extract the schema-validated JSON text and parse it to a dict.
Responses API returns: output[0].content[0].text for text output.
"""
try:
content_blocks = result["output"][0]["content"]
# Find first text block
for block in content_blocks:
if block.get("type") == "output_text" or block.get("type") == "text":
text = block.get("text", "")
if not text:
continue
# In schema mode, text should be strict JSON
return json.loads(text)
raise KeyError("No text block found in the response output")
except (KeyError, IndexError, json.JSONDecodeError) as e:
debug = json.dumps(result, indent=2)[:2000] # truncate for readability
raise ValueError(f"Failed to parse structured output: {e}\nPartial payload:\n{debug}")
def run_matplotlib_script(py_code: str) -> None:
"""
Safely run returned Matplotlib code in a clean subprocess (not in-process exec).
Saves 'recreated_chart.png' in the current working directory.
"""
safe_prefix = dedent("""
import matplotlib
matplotlib.use('Agg') # headless backend for servers/CI
""")
# Force a save at the end, even if the model code forgets to save
force_save = dedent("""
import os
import matplotlib.pyplot as plt
out = 'recreated_chart.png'
try:
plt.savefig(out, dpi=150, bbox_inches='tight')
except Exception:
# Some scripts call show() only; ensure we still save a figure if present
try:
plt.gcf().savefig(out, dpi=150, bbox_inches='tight')
except Exception:
pass
print(f"[Saved] {os.path.abspath(out)}")
""")
script = safe_prefix + "\n" + py_code + "\n\n" + force_save
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
f.write(script)
tmp_path = f.name
completed = subprocess.run(
[sys.executable, tmp_path],
capture_output=True,
text=True,
timeout=60
)
if completed.stdout:
print(completed.stdout)
if completed.returncode != 0:
print("Script error:\n", completed.stderr)
# =========================
# Main flow
# =========================
def main():
if not API_KEY or API_KEY == "YOUR_OPENAI_API_KEY":
raise EnvironmentError("Set OPENAI_API_KEY environment variable or hardcode API_KEY.")
# If you want to test with a local image:
# IMAGE_URL = local_image_to_data_url("path/to/chart.png")
payload = build_payload(IMAGE_URL)
result = post_with_retries(API_URL, HEADERS, payload)
data = parse_responses_api_json(result)
print("\n=== Summary ===\n", data["summary"])
print("\n=== Key points ===")
for i, kp in enumerate(data["key_points"], 1):
print(f"{i}. {kp}")
print("\n=== Python code (recreate chart) ===\n")
print(data["python_code"])
# Optional: render the returned chart
user_wants_render = True # set to False to skip rendering
if user_wants_render:
run_matplotlib_script(data["python_code"])
if __name__ == "__main__":
main()
Download the chart example image here: https://files.cuantum.tech/images/chart.png
Code breakdown:
- Configuration
API_URL = "https://api.openai.com/v1/responses"uses the Responses API (the current, multimodal-first endpoint).MODEL = "gpt-5"picks the full model; you can swap togpt-5-mini/gpt-5-nanofor cheaper/faster runs.IMAGE_URL: set a public URL or switch to a local file vialocal_image_to_data_url().
- Strict JSON via schema
- RESPONSE_SCHEMA tells the model exactly what keys and types to return.
- This is more reliable than a plain json_object hint because the model is constrained to a schema and will retry internally to satisfy it.
- Building the multimodal prompt
- build_payload() composes input with two blocks: {"type": "input_image", "image_url": {...}} for the image and {"type": "input_text", "text": PROMPT_TEXT} for the instructions.
- The response_format requests schema-validated output; the model returns a single JSON string that parses cleanly.
- Network resilience
- post_with_retries() adds basic retry/backoff on rate limits or transient 5xx errors, plus a timeout so calls don't hang.
- Non-retryable errors raise with the server's message for quick diagnosis.
- Parsing the Responses API
- parse_responses_api_json() extracts result["output"][0]["content"][0]["text"] (the schema-validated JSON) and json.loads() it.
- If the shape changes (e.g., in future versions), the function fails loudly with a helpful snippet of the payload.
- Optional: safe Matplotlib execution
- run_matplotlib_script() runs the code in a separate Python process, not via exec() in your main process.
- It forces a headless backend and ensures a file recreated_chart.png is saved even if the script forgets to.
- This pattern is good enough for demos and CI, but for production you might add further guards (resource limits, containers).
- Main flow
- Build payload → call API with retries → parse JSON → print summary, key_points, and python_code.
- Optionally, render the chart with the sandboxed subprocess.
Tool-Calling Example: “Ask GPT-5 to fetch data with your function, then analyze and plot”
"""
Tool-calling with GPT-5 (Chat Completions API)
- The model asks to call our tool `get_prices` with {symbol, days}
- We run the tool (here: mock data), send results back, then GPT-5 completes:
-> JSON with 'summary', 'key_points', and 'python_code' (Matplotlib)
"""
import os
import json
import time
import math
import requests
from datetime import datetime, timedelta
from typing import Dict, Any, List
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
API_URL = "https://api.openai.com/v1/chat/completions"
MODEL = "gpt-5"
HEADERS = {
"Authorization": f"Bearer {OPENAI_API_KEY}",
"Content-Type": "application/json",
}
# ---------- Tool: mock market data ----------
def get_prices(symbol: str, days: int = 30) -> Dict[str, Any]:
"""
Return mock OHLC data for the past N days.
Replace this with your real data source later (DB/API/cache).
"""
end = datetime.utcnow().date()
dates = [(end - timedelta(days=i)).isoformat() for i in range(days)][::-1]
# Simple deterministic waveform so every run is similar
base = 100.0
prices = []
for i, d in enumerate(dates):
v = base + 10 * math.sin(i / 4.0) + (i * 0.15)
o = round(v + math.sin(i) * 0.3, 2)
c = round(v + math.cos(i) * 0.3, 2)
h = round(max(o, c) + 0.6, 2)
l = round(min(o, c) - 0.6, 2)
prices.append({"date": d, "open": o, "high": h, "low": l, "close": c})
return {"symbol": symbol.upper(), "series": prices}
# ---------- Tool spec for the model ----------
TOOLS = [
{
"type": "function",
"function": {
"name": "get_prices",
"description": "Get recent OHLC data for a ticker symbol.",
"parameters": {
"type": "object",
"properties": {
"symbol": {"type": "string", "description": "Ticker, e.g., AAPL"},
"days": {"type": "integer", "minimum": 5, "maximum": 200, "default": 30}
},
"required": ["symbol"]
}
}
}
]
SYSTEM = (
"You are a quantitative analyst. If needed, call tools to fetch data, "
"then return a structured JSON with keys: summary (string), key_points (array of strings), "
"python_code (string that plots the series with matplotlib)."
)
USER = (
"Analyze the recent trend for the symbol AAPL (last 60 days). "
"If you need prices, use the tool. Then return JSON with summary, key_points, python_code."
)
def chat(payload: Dict[str, Any]) -> Dict[str, Any]:
r = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
if r.status_code != 200:
raise RuntimeError(f"HTTP {r.status_code}: {r.text}")
return r.json()
def main():
# 1) Ask GPT-5; allow tool calling
payload = {
"model": MODEL,
"messages": [
{"role": "system", "content": SYSTEM},
{"role": "user", "content": USER}
],
"tools": TOOLS,
"tool_choice": "auto",
# Ask for JSON if model can comply directly
"response_format": {"type": "json_object"},
"temperature": 0.2,
"max_tokens": 900
}
first = chat(payload)
msg = first["choices"][0]["message"]
# 2) If the model wants to call tools, run them and send results back
tool_messages = []
if "tool_calls" in msg:
for call in msg["tool_calls"]:
name = call["function"]["name"]
args = json.loads(call["function"]["arguments"] or "{}")
if name == "get_prices":
tool_result = get_prices(symbol=args.get("symbol", "AAPL"),
days=int(args.get("days", 60)))
else:
tool_result = {"error": f"Unknown tool {name}"}
tool_messages.append({
"role": "tool",
"tool_call_id": call["id"],
"name": name,
"content": json.dumps(tool_result)
})
# 3) Send a follow-up message containing the tool outputs
follow_payload = {
"model": MODEL,
"messages": [
{"role": "system", "content": SYSTEM},
{"role": "user", "content": USER},
msg, # the assistant message that requested tools
*tool_messages
],
"response_format": {"type": "json_object"},
"temperature": 0.2,
"max_tokens": 1200
}
final = chat(follow_payload)
out = final
else:
out = first # Model answered without tools
# 4) Parse the final JSON
content = out["choices"][0]["message"]["content"]
try:
data = json.loads(content)
except json.JSONDecodeError:
print("Model did not return valid JSON. Raw content:\n", content)
return
print("\n=== Summary ===\n", data.get("summary"))
print("\n=== Key points ===")
for i, kp in enumerate(data.get("key_points", []), 1):
print(f"{i}. {kp}")
print("\n=== Python code (plot) ===\n")
print(data.get("python_code"))
if __name__ == "__main__":
if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
raise SystemExit("Set OPENAI_API_KEY env var first.")
main()
Code breakdown:
Let GPT-5 decide when to call your function (get_prices); you execute it (with mock or real data), feed the results back, and let GPT-5 finish with analysis plus Matplotlib code in JSON.
1) Imports & configuration
- requests handles HTTP calls to OpenAI.
- json, time, math, and datetime are used for parsing, retries (if added), and mock data generation.
- OPENAI_API_KEY is read from the environment; never hardcode secrets in real projects.
- API_URL targets the Chat Completions endpoint (best known for tool calling).
- MODEL = "gpt-5"; you can swap to gpt-5-mini for cheaper experiments.
Tip: In production, wrap network calls with retry/backoff (429/5xx). A simple helper function can centralize that; you can reuse post_with_retries from the Advanced example above.
2) The tool you expose to the model
def get_prices(symbol: str, days: int = 30) -> Dict[str, Any]:
    ...
- This is a mock OHLC generator. Replace it with your real data source:
- A REST call (e.g., Yahoo, Polygon, your own DB/API).
- Caching layer (Redis) to keep latency/costs down.
- Output shape:
{
"symbol": "AAPL",
"series": [
{"date": "2025-07-01", "open": 101.2, "high": 102.0, "low": 100.6, "close": 101.8},
...
]
}
Keep it consistent; the LLM will rely on the keys you return.
3) Advertising the tool (the TOOLS spec)
TOOLS = [
{
"type": "function",
"function": {
"name": "get_prices",
"description": "Get recent OHLC data...",
"parameters": { ... JSON Schema ... }
}
}
]
- You define a JSON Schema (name, required fields, types).
- The model uses this to decide if and how to call your function.
- Keep the schema minimal but precise (e.g., clamp days to a reasonable range).
4) System and User messages
- SYSTEM enforces role & output contract:
  - “You are a quantitative analyst … return JSON with keys: summary, key_points, python_code.”
- USER asks for “Analyze AAPL last 60 days,” nudging the model to use a tool if it needs data.
Tip: Always restate your desired output format in SYSTEM (and/or USER). This increases compliance, especially if you don’t use schema mode.
5) First request: allow tool calling
payload = {
"model": MODEL,
"messages": [system, user],
"tools": TOOLS,
"tool_choice": "auto",
"response_format": {"type": "json_object"},
...
}tool_choice: "auto"lets the model decide if it needs the tool.response_format: "json_object"asks for JSON, but not as strict as schema mode. (That’s okay here; the focus is tool calling.)- Low
temperature(0.2) boosts determinism.
6) Detect and execute tool calls
msg = first["choices"][0]["message"]
if "tool_calls" in msg:
for call in msg["tool_calls"]:
# 1) parse arguments
# 2) run your function
# 3) build a "tool" message with the resultstool_callsis the assistant’s intent to call your function with arguments.- You must parse
call["function"]["arguments"](stringified JSON), run your function, and post results as atoolrole message back to OpenAI.
Security notes:
- Never directly execute arbitrary code sent via tool args.
- Validate inputs (symbols, ranges). Add allowlists/rate limits for external APIs, as in the sketch below.
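As a minimal sketch of that validation step, the helper below clamps days to the range advertised in the tool schema and checks the ticker against a placeholder allowlist before get_prices ever runs (the allowlist contents are illustrative).
import re
ALLOWED_SYMBOLS = {"AAPL", "MSFT", "GOOG"}    # placeholder allowlist; replace with your own universe
SYMBOL_PATTERN = re.compile(r"^[A-Z]{1,5}$")  # basic ticker shape check
def validate_tool_args(args: dict) -> dict:
    """Sanitize arguments the model supplied before calling get_prices."""
    symbol = str(args.get("symbol", "")).upper().strip()
    if not SYMBOL_PATTERN.match(symbol) or symbol not in ALLOWED_SYMBOLS:
        raise ValueError(f"Rejected symbol from tool call: {symbol!r}")
    # Clamp days into the range advertised in the tool schema (5-200)
    days = int(args.get("days", 60))
    days = max(5, min(days, 200))
    return {"symbol": symbol, "days": days}
# Usage inside the tool-call loop:
#     clean = validate_tool_args(args)
#     tool_result = get_prices(**clean)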
7) Second request: provide tool outputs and ask GPT-5 to finish
follow_payload = {
"messages": [
system, user,
msg, # the assistant message that requested tools
*tool_messages # your tool outputs bound to the call IDs
],
"response_format": {"type":"json_object"}, ...
}
- You include:
- The original assistant message that requested tools (so the model keeps context).
- Your tool result messages with the proper tool_call_id.
- GPT-5 now has real data and completes the task (analysis + code).
8) Parse the final JSON
content = out["choices"][0]["message"]["content"]
data = json.loads(content)
- Print summary, key_points, and python_code.
- If parsing fails, dump the raw content; this is often a sign the model deviated (rare at low temperature, but possible).
9) Customization knobs
- Switch to schema mode: If you want stronger guarantees on the final JSON, use:
response_format: { "type": "json_schema", "json_schema": {...} }
- Multiple tools: Add more function specs to TOOLS. GPT-5 will pick the right one.
- Parallel calls: The API can return multiple tool_calls; run them all, then send all the tool messages back in one follow-up.
- Logging: Log both the tool arguments and outputs to audit the agent's steps.
10) Common pitfalls
- Forgetting tool_call_id when sending the tool result message.
- Mismatched schemas: If your returned JSON structure diverges from your documented shape, the model may misinterpret it later.
- Rate limits: Add retry/backoff for 429/5xx errors (especially if your tool triggers third-party APIs).
11) Testing tips
- Start with mock data (like the example) for deterministic outputs.
- Add a unit test that asserts the model returns valid JSON with the required keys.
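A minimal pytest-style sketch of such a test is shown below; it validates a hard-coded sample reply standing in for the model's output, so it runs without any API call (the helper name and sample content are illustrative).
import json
REQUIRED_KEYS = {"summary", "key_points", "python_code"}
def check_model_json(content: str) -> dict:
    """Parse the model's reply and enforce the expected JSON contract."""
    data = json.loads(content)  # raises json.JSONDecodeError if the reply is not valid JSON
    missing = REQUIRED_KEYS - data.keys()
    assert not missing, f"Missing keys in model output: {missing}"
    assert isinstance(data["key_points"], list) and data["key_points"], "key_points must be a non-empty list"
    return data
def test_model_output_contract():
    # Hard-coded sample reply standing in for a real API response.
    sample = json.dumps({
        "summary": "AAPL trended upward over the last 60 days.",
        "key_points": ["Higher highs", "Low volatility", "Volume steady"],
        "python_code": "import matplotlib.pyplot as plt",
    })
    data = check_model_json(sample)
    assert "upward" in data["summary"]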
5.1.4 DeepSeek-VL
DeepSeek-VL is a Chinese open-source multimodal model developed by the DeepSeek team, designed to bridge the gap between vision and language processing. It represents China's significant contribution to the multimodal AI landscape, offering capabilities comparable to proprietary models but with open access for researchers and developers. The model emerged as part of China's growing AI research ecosystem, demonstrating the country's commitment to advancing state-of-the-art AI technologies while ensuring they remain accessible to the broader scientific community.
The model is specifically optimized for efficiency and vision-language reasoning, with architectural choices that prioritize computational performance while maintaining high-quality results. Its streamlined design makes it particularly suitable for deployment in resource-constrained environments, enabling advanced multimodal capabilities on more modest hardware configurations. DeepSeek-VL achieves this efficiency through careful attention to model size, training procedures, and inference optimizations. For example, it employs specialized vision encoders that extract rich visual features while minimizing computational overhead, and leverages knowledge distillation techniques to compress larger models' capabilities into more compact architectures.
In performance evaluations, DeepSeek-VL is often benchmarked against industry leaders like GPT-4V and Flamingo, where it demonstrates competitive results at a fraction of the computational cost. This makes it an attractive option for cost-effective deployments in production environments, particularly for organizations seeking multimodal capabilities without the expense associated with commercial API usage. Benchmark studies have shown that DeepSeek-VL achieves 85-90% of the performance of these larger models on standard vision-language tasks while requiring significantly less computational resources. This performance-to-cost ratio has made it particularly popular among startups, academic institutions, and developers in emerging markets.
The model excels in tasks requiring detailed visual understanding combined with natural language reasoning, such as image captioning, visual question answering, and complex scene interpretation. DeepSeek-VL's architecture incorporates specialized attention mechanisms that allow it to focus on relevant visual elements when answering questions or generating descriptions.
This capability enables applications ranging from assisting visually impaired users to automating content moderation and enhancing e-commerce product discovery through visual search. The model also demonstrates strong performance in cross-cultural visual contexts, making it particularly valuable for applications serving diverse global audiences.
Example: Using DeepSeek-VL for Image Understanding
# Install dependencies first
# pip install transformers torch pillow
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image
import requests
import matplotlib.pyplot as plt
# Download and load an example image
image_url = "https://files.cuantum.tech/images/deep-seek-descriptive.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
# Load DeepSeek-VL model and processor
model_name = "deepseek-ai/deepseek-vl-7b-chat"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
# Create a prompt for the model
prompt = "Describe what you see in this image in detail."
# Process the inputs
inputs = processor(text=prompt, images=image, return_tensors="pt")
# Generate a response
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=512,
do_sample=False
)
# Decode the response
generated_text = processor.decode(outputs[0], skip_special_tokens=True)
# Display the image and response
plt.figure(figsize=(10, 8))
plt.imshow(image)
plt.axis('off')
plt.title('Input Image')
plt.show()
print("DeepSeek-VL's response:")
print(generated_text.split("ASSISTANT:")[-1].strip())
Code Breakdown: Using DeepSeek-VL for Image Understanding
The example above demonstrates how to use DeepSeek-VL for a basic image understanding task. Here's a detailed breakdown of each section:
1. Dependencies and Setup
- Key libraries: The code uses transformers for model access, torch for tensor operations, and PIL for image handling.
- Image acquisition: Fetches a sample image from a URL using requests and opens it with PIL.
2. Model Initialization
- Model selection: Uses the 7B-parameter chat-tuned version of DeepSeek-VL (deepseek-ai/deepseek-vl-7b-chat).
- Processor loading: The AutoProcessor handles both tokenization of text and preprocessing of images.
- Model loading: trust_remote_code=True is required because DeepSeek-VL uses custom code for its implementation.
3. Input Processing
- Prompt creation: A simple prompt asking for image description, but you can use more specific prompts like "What objects are in this image?" or "Explain what's happening in this scene."
- Multimodal processing: The processor combines both text input (prompt) and image input into a format the model can understand.
- Return format: return_tensors="pt" specifies PyTorch tensors as the output format.
4. Response Generation
- Inference with torch.no_grad(): Disables gradient calculation for efficiency during inference.
- Generation parameters:
  - max_new_tokens=512: Limits response length to 512 tokens.
  - do_sample=False: Uses greedy decoding instead of sampling for deterministic outputs.
5. Response Processing and Visualization
- Decoding: Converts token IDs back to human-readable text.
- Response extraction: Splits the output to get only the assistant's response portion.
- Visualization: Displays the input image alongside the generated description.
Advanced Usage Patterns
Beyond this basic example, DeepSeek-VL supports several advanced capabilities:
- Visual reasoning: You can ask complex questions about relationships between objects in the image.
- Multi-image analysis: Process multiple images by passing a list to the processor.
- Fine-tuning: Adapt the model to specific domains using techniques like LoRA or QLoRA (a minimal LoRA sketch follows the quantization snippet below).
- Memory efficiency: For resource-constrained environments, consider using quantization:
# For 8-bit quantization
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_enable_fp32_cpu_offload=True
)
model = AutoModel.from_pretrained(
model_name,
trust_remote_code=True,
quantization_config=quantization_config,
device_map="auto"
)
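For the fine-tuning bullet above, a parameter-efficient LoRA setup with the peft library might look like the sketch below. It reuses the model object loaded earlier; the target_modules names are assumptions, so print(model) and adjust them to the module names actually used in the DeepSeek-VL checkpoint before training.
# Minimal LoRA sketch with the peft library (pip install peft).
# target_modules are assumptions; inspect the model and adjust to the real layer names.
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of the full model
# peft_model can now be passed to your usual Trainer or training loop.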
Implementation Considerations:
- Hardware requirements: DeepSeek-VL 7B requires at least 16GB GPU memory for full precision, but can run on consumer GPUs with quantization.
- Inference speed: First-time inference includes model loading time; subsequent calls are faster.
- Response format: The model follows a chat format with "ASSISTANT:" prefix. For cleaner outputs, always strip this prefix.
- Error handling: In production, add try-except blocks to handle image loading failures and timeout configurations for large images.
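For the error-handling point, a defensive loader might look like this sketch: it adds a request timeout, a status check, and a size guard before handing the image to the processor (the limits shown are illustrative).
import requests
from PIL import Image
from io import BytesIO
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # illustrative 10 MB cap
def load_image_safely(image_url: str, timeout: float = 10.0) -> Image.Image:
    """Download an image with a timeout, status check, and size guard."""
    try:
        response = requests.get(image_url, timeout=timeout)
        response.raise_for_status()
        if len(response.content) > MAX_IMAGE_BYTES:
            raise ValueError(f"Image exceeds {MAX_IMAGE_BYTES} bytes")
        image = Image.open(BytesIO(response.content))
        image.load()  # force decoding now so corrupt files fail here, not mid-inference
        return image.convert("RGB")
    except (requests.RequestException, OSError, ValueError) as err:
        raise RuntimeError(f"Failed to load image from {image_url}: {err}") from err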
DeepSeek-VL represents a significant advancement in making multimodal AI accessible to developers, particularly those seeking open-source alternatives to proprietary models like GPT-4V or Gemini.
Example: Advanced Visual Question Answering with DeepSeek-VL
# Install required libraries
# pip install transformers torch pillow matplotlib requests
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import matplotlib.pyplot as plt
from io import BytesIO
# Function to load and display an image from a URL
def load_and_display_image(image_url, title="Input Image"):
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))
plt.figure(figsize=(10, 8))
plt.imshow(image)
plt.axis('off')
plt.title(title)
plt.show()
return image
# Load DeepSeek-VL model and processor
model_id = "deepseek-ai/deepseek-vl-7b-chat"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16, # Use half precision for efficiency
device_map="auto", # Automatically distribute across available GPUs
trust_remote_code=True
)
# Sample image URLs for visual reasoning tasks
image_urls = [
"https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg", # People at a table
"https://files.cuantum.tech/images/deep-seek-chart.jpg" # Charts/graphs
]
# Load and display the first image
image = load_and_display_image(image_urls[0])
# Function to generate responses for a given image and prompt
def generate_vl_response(image, prompt, max_new_tokens=256):
# Create chat message format
messages = [
{"role": "user", "content": prompt}
]
# Process inputs
inputs = processor(
messages=messages,
images=image,
return_tensors="pt"
).to(model.device)
# Generate response with customized parameters
generated_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True, # Enable sampling for more diverse outputs
temperature=0.7, # Control randomness (higher = more random)
top_p=0.9, # Nucleus sampling parameter
repetition_penalty=1.1 # Discourage repetition
)
# Decode response
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# Extract assistant's response
response = generated_text.split("ASSISTANT:")[-1].strip()
return response
# Example prompts for different visual reasoning tasks
prompts = [
"Describe this image in detail. What are the people doing?",
"Count how many people are in this image and describe what each person is wearing.",
"What emotions can you detect on people's faces in this image?",
"If you had to create a story based on this image, what would it be?"
]
# Generate and display responses
for i, prompt in enumerate(prompts):
print(f"\nPrompt {i+1}: {prompt}")
print("-" * 50)
response = generate_vl_response(image, prompt)
print(response)
print("=" * 80)
# Load the second image (charts/graphs) for technical analysis
technical_image = load_and_display_image(image_urls[1], "Technical Chart")
# Technical analysis prompt
technical_prompt = "Analyze this chart. What patterns do you observe? What conclusions can you draw from this data visualization?"
# Generate and display technical analysis
print(f"\nTechnical Analysis Prompt: {technical_prompt}")
print("-" * 50)
response = generate_vl_response(technical_image, technical_prompt, max_new_tokens=512)
print(response)
Comprehensive Code Breakdown: Advanced DeepSeek-VL Implementation
This code example demonstrates how to leverage DeepSeek-VL for sophisticated visual reasoning tasks. Let's break down each component:
1. Setup and Model Initialization
- Library imports: Beyond basic dependencies, we specifically import AutoModelForCausalLM, which provides a more flexible interface for generative tasks than the basic AutoModel used in the previous example.
- Helper function: load_and_display_image() encapsulates the image loading logic, making the code more modular and reusable.
- Model optimization:
  - torch_dtype=torch.float16 enables half-precision computation, reducing memory usage by approximately 50% with minimal impact on output quality.
  - device_map="auto" intelligently distributes model layers across available GPUs or uses CPU offloading when needed.
2. Multi-image Processing
- Image collection: Stores multiple image URLs for different analysis scenarios, demonstrating DeepSeek-VL's versatility.
- Sequential processing: The code is structured to analyze multiple images with different prompts, showcasing how the model handles diverse visual contexts.
3. Response Generation Function
- Chat-style formatting: Unlike the previous example, this implementation uses DeepSeek-VL's chat interface through the messages parameter, which better aligns with conversational applications.
- Generation parameters:
  - do_sample=True and temperature=0.7: Enable controlled randomness in outputs, producing more natural and diverse responses.
  - top_p=0.9: Implements nucleus sampling, which dynamically filters the token probability distribution.
  - repetition_penalty=1.1: Reduces the likelihood of generating repetitive phrases, improving response quality.
4. Task Diversification
- Multiple prompt types: The example includes different types of visual reasoning tasks:
- Descriptive: "Describe this image in detail..."
- Quantitative: "Count how many people..."
- Emotional analysis: "What emotions can you detect..."
- Creative: "If you had to create a story..."
- Technical analysis: "Analyze this chart..."
5. Performance Considerations
- Memory management: The example uses half-precision (float16) and automatic device mapping to optimize memory usage.
- Response length control: max_new_tokens is adjusted based on the complexity of the task, with technical analysis allowed a longer response (512 tokens vs. 256).
- Prompt engineering: The prompts are carefully crafted to elicit specific types of visual reasoning, demonstrating how prompt design affects model output.
6. Real-world Application Scenarios
- This implementation demonstrates DeepSeek-VL's capabilities in several practical use cases:
- Social media content analysis: Understanding context and relationships in photos.
- Data visualization interpretation: Extracting insights from charts and graphs.
- Content moderation: Detecting emotional content and potentially sensitive material in images.
- Creative assistance: Helping generate stories or content based on visual inspiration.
7. Extension Possibilities
- This code could be extended in several ways:
- Batch processing: Modify to handle multiple images simultaneously for higher throughput.
- Interactive applications: Integrate into a web interface where users can upload images and select analysis types.
- Multi-turn conversations: Expand the messages array to include previous exchanges for contextual understanding (see the sketch after this list).
- Integration with other models: Combine DeepSeek-VL's outputs with specialized models for tasks like object detection or sentiment analysis.
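As a sketch of the multi-turn idea, the chat history simply grows with each exchange before being handed back to the processor, following the same message format used in generate_vl_response above. The assistant turn shown is a placeholder for the model's real previous reply, and the exact chat format accepted by the processor should be checked against the model card.
# Reuses the processor, model, and image objects from the example above.
def continue_conversation(processor, model, image, history, followup, max_new_tokens=256):
    """Append a new user turn to an existing chat history and generate a reply."""
    messages = history + [{"role": "user", "content": followup}]
    inputs = processor(messages=messages, images=image, return_tensors="pt").to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                   do_sample=True, temperature=0.7)
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return text.split("ASSISTANT:")[-1].strip()
history = [
    {"role": "user", "content": "Describe this image in detail. What are the people doing?"},
    # Placeholder assistant turn; in practice, append the model's actual earlier reply.
    {"role": "assistant", "content": "The image shows several people seated around a table, sharing a meal."},
]
print(continue_conversation(processor, model, image, history,
                            "Based on that description, what time of day does this appear to be, and why?"))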
This advanced implementation highlights DeepSeek-VL's flexibility and power for complex visual-language reasoning tasks, making it suitable for both research and production applications where understanding images in context is critical.
5.1.5 Why Text+Image Matters
Accessibility: Helping visually impaired users understand images by providing detailed descriptions of visual content. These models can identify objects, people, scenes, and even interpret spatial relationships, allowing visually impaired individuals to "see" through AI-generated descriptions. They can also assist with navigation by describing surroundings or identifying potential hazards.
For visually impaired individuals, multimodal AI serves as an essential bridge to visual content. These systems go beyond simple object recognition to provide context-rich descriptions that convey the full meaning of images. When a visually impaired person encounters an image online, in a document, or through a specialized device, multimodal models can:
- Generate comprehensive scene descriptions that include not just what objects are present, but their arrangement, colors, lighting, and overall composition
- Identify and describe people in photos, including facial expressions, clothing, actions, and apparent relationships between individuals
- Read and interpret text within images, such as signs, menus, product labels, and instructions
- Recognize landmarks and provide spatial awareness in unfamiliar environments
In real-world applications, these capabilities are being integrated into smartphone apps that can narrate the visual world in real-time, smart glasses that provide audio descriptions of surroundings, and screen readers that can interpret complex visual elements on websites. The technology is particularly valuable for educational materials, allowing visually impaired students to access diagrams, charts, and illustrations that would otherwise be inaccessible without human assistance.
The advancement of these multimodal systems represents a significant step forward in digital inclusivity, empowering visually impaired users with greater independence and access to information that was previously unavailable to them.
Education: Explaining diagrams, charts, or historical photos to enhance learning experiences. Multimodal models can break down complex visualizations into understandable components, clarify scientific diagrams, provide historical context for photographs, and even translate visual mathematical notation into explanations. This makes educational content more accessible and comprehensible across various subjects and learning styles.
In educational contexts, multimodal AI serves as a powerful teaching assistant that bridges visual and textual information:
- For STEM education, these models can analyze complex scientific diagrams and:
- Convert abstract visual concepts into clear, step-by-step explanations
- Identify and label components of biological systems, chemical structures, or engineering schematics
- Translate mathematical expressions and equations into plain language interpretations
- In history and social studies, multimodal models enhance learning by:
- Providing detailed context for historical photographs, including time period, cultural significance, and historical relevance
- Analyzing primary source documents with both textual and visual elements
- Making connections between visual artifacts and broader historical narratives
- For data literacy, these systems help students by:
- Breaking down complex charts and graphs into comprehensible insights
- Explaining statistical visualizations and data trends in accessible language
- Teaching students how to interpret different types of data representations
These capabilities are particularly valuable for students with different learning styles, allowing visual learners to receive verbal explanations and verbal learners to better understand visual content. They also support personalized learning by adapting explanations to different educational levels, from elementary to advanced university courses.
Creative work: Generating captions, stories, or descriptions that can inspire artists, writers, and content creators. These models can suggest creative interpretations of images, develop narratives based on visual scenes, assist with storyboarding by describing sequential images, and help marketers craft compelling visual content with appropriate messaging.
For creative professionals, multimodal AI serves as both muse and collaborator. Writers facing creative blocks can use these systems to generate story prompts from visual inspiration. When shown an image of a misty forest at dawn, for instance, the AI might suggest narrative elements like "a forgotten path leading to an ancient secret" or "the meeting place of two worlds." This capability transforms random visual stimuli into structured creative starting points.
Visual artists and designers benefit from AI-generated descriptions that highlight elements they might otherwise overlook. A photographer reviewing their portfolio might gain new perspective when the AI points out "the interplay of shadow and reflection creates a natural frame around the subject" or "the unexpected color contrast draws attention to the emotional center of the image."
In film and animation, these models streamline the pre-production process. Storyboard artists can quickly generate descriptive text for sequential panels, helping directors and producers visualize narrative flow before committing resources to production. The AI can suggest camera angles, lighting moods, and scene transitions based on visual references, accelerating the creative development cycle.
For content marketers, multimodal models bridge the gap between visual assets and compelling messaging. When analyzing product photography, these systems can generate targeted copy that aligns with both the visual elements and brand voice, ensuring consistent communication across channels. This capability is particularly valuable for social media campaigns where striking visuals must be paired with concise, engaging text in multiple formats and platforms.
Productivity: Extracting structured insights from documents, tables, or screenshots, which saves time and improves efficiency in professional settings. Instead of manually parsing visual data, users can leverage AI to convert tables into spreadsheets, extract key information from receipts or business cards, analyze graphs and charts in reports, and transform handwritten notes into searchable text.
This productivity advantage manifests across numerous professional workflows:
- In financial services, multimodal AI can automatically process invoices and receipts by:
- Identifying vendor information, dates, and payment amounts
- Categorizing expenses according to predefined accounting codes
- Flagging potential discrepancies or unusual charges
- For research and analysis, these systems can:
- Extract precise numerical data from complex charts and graphs
- Convert statistical visualizations into structured datasets
- Summarize key trends and outliers identified in visual data
- In administrative workflows, multimodal AI streamlines:
- Business card digitization for immediate contact database integration
- Form processing without manual data entry
- Meeting note transcription with automatic action item extraction
The time savings are substantial—tasks that would require hours of manual data entry can be completed in seconds, while also reducing human error. For organizations handling large volumes of visual documents, this capability transforms information management by making previously inaccessible data searchable, analyzable, and actionable.
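To make this concrete, here is a minimal sketch of a receipt-extraction workflow using the open-source LLaVA model covered later in this chapter (Section 5.1.1). The checkpoint name, the local receipt.jpg file, and the JSON-oriented prompt wording are illustrative assumptions rather than a production recipe.
import json
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load an open vision-language model (assumption: the LLaVA checkpoint used later in this chapter)
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Ask for structured output instead of free-form prose
prompt = (
    "USER: <image>\n"
    "Extract the vendor name, date, and total amount from this receipt. "
    "Respond only with a JSON object using the keys vendor, date, and total.\n"
    "ASSISTANT:"
)
image = Image.open("receipt.jpg")  # hypothetical local scan of a receipt
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)

answer = processor.decode(output[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()
try:
    record = json.loads(answer)        # structured fields ready for a spreadsheet or database
except json.JSONDecodeError:
    record = {"raw_text": answer}      # fall back gracefully if the model strays from JSON
print(record)
The same pattern of image in, constrained prompt, machine-readable output underlies most of the document-processing workflows listed above.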
Multimodal models bring us closer to AI that interacts with the world as humans do: through multiple senses, not just words. By bridging the gap between visual perception and language understanding, these technologies create more intuitive and natural human-AI interactions that reflect how we naturally process information through multiple channels simultaneously.
5.1 Text+Image Models (LLaVA, Flamingo, GPT-4o, DeepSeek-VL)
So far, we have focused on models that live in the world of words. But human intelligence is multimodal: we learn by reading, seeing, hearing, and interacting with the world. For AI to approach this kind of understanding, language models must also expand beyond text.
This limitation of text-only models becomes evident when we consider how humans perceive and process information. We don't experience the world as isolated streams of text—we integrate visual cues, sounds, and physical interactions to form a comprehensive understanding. Traditional LLMs, despite their impressive capabilities with language, lack this holistic perception that comes naturally to humans.
This is where multimodal LLMs come in. By combining text with images, audio, or video, these models can:
- Describe what they "see" in pictures, recognizing objects, scenes, actions, and even emotional context within visual content.
- Answer questions about charts or diagrams, interpreting visual data representations and translating visual patterns into meaningful insights.
- Connect written descriptions to visual understanding, bridging the gap between abstract concepts described in words and their concrete visual manifestations.
- Support real-world tasks like tutoring, accessibility tools, and robotics, where understanding multiple forms of communication is essential for effective assistance.
Multimodal systems represent a significant leap forward in AI capabilities. Rather than processing each type of data in isolation, these models create connections between different forms of information, much like the human brain integrates signals from our various senses. This cross-modal reasoning allows for richer understanding and more natural interactions with AI systems.
In this chapter, we'll explore how researchers are pushing LLMs beyond text, starting with one of the most active areas: Text+Image models.
Text+Image models extend language models by integrating visual encoders with text-based transformers. This integration represents a significant advancement in AI, allowing models to process and understand both visual and textual information simultaneously. In practice, this integration involves several key components working together:
- An image encoder (like CLIP's vision transformer or a convolutional net) processes an image into embeddings. This encoder analyzes the visual content pixel by pixel, identifying features such as shapes, colors, objects, spatial relationships, and even contextual elements. The encoder works through multiple processing layers, each extracting increasingly complex information:
- Low-level features: First, the encoder detects basic elements like edges, textures, and color patterns across the image. This initial layer of processing works similarly to how our eyes first perceive visual information - identifying contrasts between light and dark, detecting boundaries between colors, and registering texture variations (like smooth vs. rough surfaces).
This stage is computationally intensive as the model must analyze every pixel and its relationship to neighboring pixels. For example, when processing a photograph of a forest, the encoder might identify:
- Vertical lines representing tree trunks
- Irregular patterns of green representing foliage
- Textural differences between rough bark and smooth leaves
- Shadow gradients indicating depth and lighting direction
- Color transitions between sky and terrain
The encoder uses specialized filters that respond to specific patterns - some detect horizontal lines, others vertical lines, while others identify specific color gradients or textural elements. These filters work in parallel across the entire image, creating feature maps that highlight where each pattern appears most strongly.
These fundamental visual elements form the building blocks for all higher-level recognition, much like how letters combine to form words and sentences in language processing. Without accurate detection at this stage, the more complex recognition tasks in subsequent layers would fail.
- Mid-level features: These basic elements are then combined to recognize more complex structures such as specific shapes, object parts, and spatial arrangements. At this stage, the model begins to identify meaningful patterns - recognizing that certain edges form the outline of a face, or that particular textures likely represent fur, fabric, or foliage.
This mid-level processing is crucial because it bridges the gap between raw visual data and semantic understanding. For example, when processing an image of a person walking a dog in a park:
- The model might recognize curved lines and color patterns that form the silhouette of a human figure
- It identifies four-legged shapes with characteristic proportions that indicate "dog"
- It detects textural patterns of grass, trees, and sky that suggest "outdoor environment"
- It recognizes spatial configurations that establish the relationship between person and dog (connected by a leash)
The model also starts to understand spatial relationships, determining when objects are above, below, or inside others. These spatial relationships provide critical context - a cup on a table has different implications than a table on a cup. The model learns to recognize standard spatial arrangements (like furniture in a room) and unusual configurations that might require special attention.
- High-level features: Finally, the encoder identifies complete objects, scenes, actions, and the relationships between elements in the image. This is where true "understanding" emerges, as the model recognizes not just isolated objects but meaningful context - distinguishing between a dog sitting on a sofa versus running through a park, or understanding that a person holding a tennis racket near a net represents a specific activity.
At this highest level of processing, the model performs several sophisticated cognitive tasks:
- Object recognition and classification: The model can identify whole entities (people, animals, vehicles, furniture) and categorize them into specific types or classes (German Shepherd dog, mid-century sofa, professional tennis player).
- Scene understanding: Beyond individual objects, the model comprehends entire environments - recognizing a kitchen from its appliances and layout, or a beach scene from the combination of sand, water, and distinctive lighting.
- Action recognition: The model can interpret dynamic elements - differentiating between someone running versus walking, or throwing versus catching - based on posture, positioning, and contextual cues.
- Relationship detection: Perhaps most impressively, the model identifies how objects relate to each other spatially and functionally - recognizing that a person is walking a dog (connected by a leash), riding a bicycle (positioned on top), or cooking food (performing actions on ingredients).
- Contextual inference: The model makes educated guesses about the broader situation - inferring a birthday celebration from candles on a cake and gathering of people, or a professional meeting from business attire and a conference room setting.
The model can also interpret emotional content, social interactions, and even infer potential narratives within the scene. It might recognize facial expressions indicating happiness or concern, body language suggesting tension or relaxation, or social dynamics like a teacher instructing students or friends enjoying a meal together. Through extensive training on millions of images with corresponding descriptions, the model learns to associate visual patterns with rich semantic concepts, enabling it to "see" at a level that approximates human understanding.
The result is a dense representation of the image's content in a numerical format that the model can process - essentially translating visual information into a "language" that the AI can understand and reason with.
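As a small illustration of what this dense representation looks like in practice, the sketch below runs an image through CLIP's vision transformer using Hugging Face transformers and inspects the resulting embeddings; the checkpoint name and the photo.jpg path are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Load only the vision tower of CLIP (checkpoint name is an example)
checkpoint = "openai/clip-vit-base-patch32"
processor = CLIPImageProcessor.from_pretrained(checkpoint)
encoder = CLIPVisionModel.from_pretrained(checkpoint)

image = Image.open("photo.jpg").convert("RGB")  # hypothetical input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = encoder(pixel_values=pixel_values)

patch_embeddings = outputs.last_hidden_state  # shape [1, 50, 768]: 1 CLS token + 49 patch tokens
pooled_summary = outputs.pooler_output        # shape [1, 768]: a single vector summarizing the image
print(patch_embeddings.shape, pooled_summary.shape)
Each of those 768-dimensional vectors is a numerical "word" that the rest of the multimodal pipeline will translate into the language model's space.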
- A projection layer maps those embeddings into the same space as the language model's tokens. This critical alignment step ensures that visual information and text information can be processed together. Without this projection, the model would struggle to make meaningful connections between what it sees and what it understands through language.
The projection layer essentially translates the "language of images" into a format compatible with the "language of text," allowing both modalities to coexist in the same computational space. This process involves several sophisticated transformations:
Dimensionality alignment: Image embeddings and text embeddings often have different dimensions and structures. The projection layer reshapes visual features to match the exact dimensions expected by the language model, ensuring that every visual concept can be represented in a way the text processing components can interpret. This process involves complex mathematical transformations that convert the high-dimensional tensors from the vision encoder (which might have shapes like [batch_size, sequence_length, vision_dimension]) into the format required by the language model (typically [batch_size, sequence_length, hidden_dimension]).
For example, a vision encoder might output features with 1024 dimensions per token, while the language model might work with 768-dimensional embeddings. The projection layer would then implement a learned linear transformation (essentially a matrix multiplication) that maps each 1024-dimensional vector to a 768-dimensional vector while preserving as much semantic information as possible.
This alignment is not just about matching numbers - it's about preserving the rich semantic relationships captured in the visual domain. The projection parameters are learned during training, allowing the model to discover optimal mappings between visual concepts and their linguistic counterparts. This ensures that when the language model attends to these projected visual features, it can extract meaningful information that corresponds to concepts it understands through language.
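A minimal sketch of this alignment step is shown below: a single learned linear layer that maps the 1024-dimensional vision features from the example above into a 768-dimensional language-model space (LLaVA's original projector is essentially a layer like this; later variants use a small MLP). The tensor shapes are illustrative.
import torch
import torch.nn as nn

class VisionToTextProjection(nn.Module):
    """Learned linear map from vision-encoder space to the LLM's embedding space."""
    def __init__(self, vision_dim=1024, hidden_dim=768):
        super().__init__()
        self.proj = nn.Linear(vision_dim, hidden_dim)

    def forward(self, visual_features):
        # visual_features: [batch_size, num_visual_tokens, vision_dim]
        return self.proj(visual_features)  # -> [batch_size, num_visual_tokens, hidden_dim]

projection = VisionToTextProjection()
dummy_vision_output = torch.randn(1, 257, 1024)  # e.g., 256 patch tokens + 1 CLS token
projected_tokens = projection(dummy_vision_output)
print(projected_tokens.shape)  # torch.Size([1, 257, 768])
During multimodal training, the weights of this projection are updated so that the projected vectors land near the language model's embeddings for the corresponding words.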
Semantic mapping: Beyond simple dimension matching, the projection layer learns to map visual concepts to their linguistic counterparts. For example, the visual features representing "a red apple" must be projected into a space where they can interact meaningfully with the text tokens for "red" and "apple."
This semantic mapping is a sophisticated translation process that bridges two fundamentally different representational systems. When processing an image of a red apple, the vision encoder extracts features capturing its roundness, smooth texture, red coloration, and stem. These visual features exist as abstract numerical patterns distributed across multiple embedding dimensions. The projection layer must transform these distributed visual patterns into representations that align with how language models understand concepts like "red" (a color attribute) and "apple" (a fruit category).
The challenge is significant because visual and linguistic representations are structured differently:
- In vision, concepts are often entangled - the "redness" and "appleness" exist simultaneously in the same pixels and are processed together.
- In language, concepts are more discrete - "red" and "apple" are separate tokens with distinct meanings that compose together.
Through extensive training on paired image-text data, the projection layer learns to disentangle these visual features and map them to their linguistic counterparts. When successful, the projected visual features will activate similar neural patterns as would be activated by the text "red apple" in the language model. This enables the language model to reason about the visual content using its language understanding capabilities - for instance, answering questions like "What color is the apple?" by connecting the visual representation to the appropriate linguistic concept "red".
This semantic alignment is what allows multimodal models to perform cross-modal reasoning tasks, such as describing unseen objects, answering questions about visual content, or generating text that references visual elements in contextually appropriate ways.
Contextual integration: The projection ensures that contextual relationships in the visual domain (like spatial relationships between objects) are preserved in a way that the language model can access and reason about. This allows the model to answer questions about relative positions or interactions between objects in an image.
This contextual integration is particularly crucial because visual scenes contain rich spatial and relational information that must be translated into a format the language model can process. For example, when looking at an image of a dining table, the model needs to understand not just that there are plates, glasses, and utensils, but their arrangement (plates in front of chairs, glasses above plates, forks to the left of plates), their groupings (place settings), and their functional relationships (napkins folded on plates).
The projection layer preserves these spatial hierarchies by maintaining relative positional information between visual features. Through specialized attention mechanisms, it ensures that:
- Proximity relationships ("the book is next to the lamp") are encoded in ways that language models can interpret
- Containment relationships ("the apple is in the bowl") maintain their hierarchical structure
- Directional relationships ("the dog is facing the camera") preserve orientation information
- Scale relationships ("the elephant is larger than the mouse") retain relative size information
This sophisticated mapping enables the model to correctly interpret questions like "What's above the bookshelf?", "Is the child holding the balloon?", or "Which way is the car facing?" - questions that require understanding not just what objects are present but how they relate to one another in physical space.
Without proper contextual integration, a model might recognize all objects in an image but fail to understand their meaningful relationships, severely limiting its ability to reason about scenes as humans naturally do.
- The language model treats visual embeddings as if they were special tokens, allowing it to "attend" to both words and pixels. Through self-attention mechanisms, the model can create connections between visual elements and textual concepts, forming a comprehensive understanding that spans both modalities.
This integration happens through a sophisticated process where the transformer architecture's self-attention mechanism simultaneously processes both text tokens and visual tokens. When a user asks "What color is the car in this image?", the model's attention heads can focus on:
- The visual embeddings representing the car in the image
- The textual tokens related to "color" and "car" in the query
- The contextual relationship between these elements
The self-attention weights form a complex web of connections, allowing information to flow bidirectionally between modalities. For example, when processing an image of a red sports car alongside text mentioning "vehicle," the model can:
- Associate visual features of the car with the word "vehicle" in the text
- Connect color properties from the visual embedding to potential color descriptions
- Link spatial relationships in the image (car on road) to potential scene descriptions
This cross-modal attention enables the model to perform tasks like visual question answering, image captioning, and text-conditional reasoning about visual content. The attention maps themselves reveal how the model distributes focus across different parts of both the image and text when forming its understanding.
This allows the model to reason about relationships between what it "sees" and what it "reads."
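The sketch below makes this "visual embeddings as special tokens" idea concrete: projected visual tokens are simply concatenated with text-token embeddings into one sequence, and a standard self-attention layer then attends across both. The dimensions and the single toy attention layer are illustrative stand-ins for a full transformer stack.
import torch
import torch.nn as nn

hidden_dim = 768
visual_tokens = torch.randn(1, 64, hidden_dim)    # output of the projection layer (64 visual tokens)
text_embeddings = torch.randn(1, 12, hidden_dim)  # embeddings of a tokenized question

# The language model simply sees one longer sequence of "tokens"
sequence = torch.cat([visual_tokens, text_embeddings], dim=1)  # [1, 76, 768]

self_attention = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=12, batch_first=True)
attended, attention_weights = self_attention(sequence, sequence, sequence)

# attention_weights[:, 64:, :64] shows how strongly each text position attends to each visual token
print(attended.shape, attention_weights.shape)  # torch.Size([1, 76, 768]) torch.Size([1, 76, 76])
In a real multimodal LLM this attention happens in every transformer layer, which is what lets a question token like "color" pull information directly from the image regions that encode the car.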
This fusion of visual and textual processing creates a powerful system that can understand context across modalities, enabling it to answer prompts like:
- "What's written on the sign in this photo?" - requiring text recognition within images and understanding of visual context. The model must identify text elements embedded within the visual scene, distinguish them from other visual features, and accurately transcribe the text while maintaining awareness of the sign's context in the broader image (whether it's a street sign, store front, warning notice, etc.).
- "Describe this chart in plain English." - requiring interpretation of data visualizations and translation into natural language. Here, the model must recognize the chart type (bar graph, pie chart, line graph, etc.), identify axes labels, data points, and trends, then synthesize this information into coherent prose that captures the key relationships and insights presented in the visualization.
- "Write a story about this image." - requiring creative generation based on visual stimuli and understanding of narrative elements. This complex task requires the model to recognize not just objects but their relationships, potential emotional content, implied actions or intentions, and then use these elements to create a coherent narrative with characters, setting, plot, and thematic elements that plausibly extend from what's visible in the image.
5.1.1 LLaVA (Large Language and Vision Assistant)
LLaVA is an open-source model that combines CLIP for vision with Vicuna as its LLM. CLIP (Contrastive Language-Image Pre-training) serves as the vision encoder that processes and extracts features from images, while Vicuna, a fine-tuned version of LLaMA, handles the language processing. The architecture leverages CLIP's powerful visual representations, learned from 400 million image-text pairs, and combines them with Vicuna's language understanding and generation capabilities.
LLaVA follows a two-stage training process. First, it's pretrained on a large corpus of image-text pairs to establish basic connections between visual and linguistic information. Then, it's specifically trained on instruction-following data that pairs images with text prompts. This training approach enables LLaVA to understand and respond to specific instructions about visual content, going beyond simple image captioning to more complex reasoning about what it sees. This instruction-tuning is what gives LLaVA its ability to follow nuanced directions when analyzing images, rather than just generating generic descriptions.
The training dataset includes approximately 158,000 image-text instruction pairs, carefully curated to cover a wide range of visual reasoning tasks, from simple object identification to complex scene interpretation. This instruction-tuning phase is crucial as it teaches the model to follow specific directives when analyzing visual content. The dataset incorporates diverse image types including natural photographs, diagrams, charts, screenshots, and artistic images, ensuring the model can handle various visual formats. The text instructions are similarly diverse, ranging from simple requests like "What color is the car?" to more complex ones like "Explain the relationship between the people in this image and what they might be feeling."
Example task: describing an image in detail. LLaVA can generate comprehensive descriptions that include object identification, spatial relationships, attributes, actions, and even infer context or emotions from visual scenes. Its descriptions can range from factual observations to more interpretative analyses depending on the prompt.
For instance, when shown an image of a city street, LLaVA can identify not only the vehicles, pedestrians, and buildings, but also describe their relationships (e.g., "a person crossing the street while cars wait at a red light"), infer weather conditions based on visual cues (e.g., "wet pavement suggests recent rainfall"), and even comment on the likely time of day based on lighting conditions and shadows. The model can also perform more specialized tasks like reading text in images, analyzing charts or graphs, identifying landmarks, and recognizing famous people or artwork, demonstrating its versatility across different visual analysis scenarios.
LLaVA stands out for its efficient architecture that achieves strong performance while requiring relatively modest computational resources compared to proprietary alternatives. Its open-source nature has made it a popular choice for researchers and developers working on vision-language applications. The model's architecture is notably streamlined, using a simple projection layer to connect CLIP's vision embeddings with Vicuna's language processing capabilities. This approach avoids the computational overhead of more complex cross-attention mechanisms while still enabling effective communication between the visual and language components. The smaller variants of LLaVA can run on consumer-grade GPUs with 16GB of memory, making advanced multimodal AI accessible to a much broader range of researchers and developers than closed-source alternatives that may require specialized hardware.
The model achieves competitive performance on benchmarks such as VQAv2 (Visual Question Answering) and GQA (Grounded Question Answering), while being significantly more resource-efficient than closed-source alternatives like GPT-4V. On the VQAv2 benchmark, which evaluates a model's ability to answer questions about images, LLaVA-1.5 achieves scores comparable to much larger proprietary models. Its accessibility allows developers to fine-tune it for specific domains or applications, such as medical image analysis (interpreting X-rays, CT scans, and other medical imaging), retail product recognition (identifying products in shelves or catalog images), or educational content development (explaining scientific diagrams or historical artifacts), fostering a growing ecosystem of specialized multimodal AI applications. The model has inspired numerous derivatives and extensions in the open-source community, including versions optimized for different languages, specialized for particular domains like document understanding, or modified to work with video input rather than static images.
Code Example: Using LLaVA for Multimodal Processing
# Complete LLaVA implementation example
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
# Step 1: Load the pre-trained LLaVA model and processor
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
# Step 2: Prepare the image
image = Image.open("colosseum.jpg")
# Step 3: Define your prompt (LLaVA-1.5 expects the <image> placeholder and a USER/ASSISTANT chat format)
prompt = "USER: <image>\nDescribe this image in detail.\nASSISTANT:"
# Step 4: Process the inputs
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt"
).to(model.device)
# Step 5: Generate the response
with torch.no_grad():
output = model.generate(
**inputs,
max_new_tokens=256,
do_sample=True,
temperature=0.6,
top_p=0.9,
)
# Step 6: Decode and print the response
generated_text = processor.decode(output[0], skip_special_tokens=True)
print(generated_text)
For this example, download the Colosseum image here: https://files.cuantum.tech/images/colosseum.jpg
Code Breakdown: Using LLaVA for Multimodal Processing
This code demonstrates how to use the LLaVA (Large Language and Vision Assistant) model to process images and generate descriptive text. Let's break down each part in detail:
1. Imports and Setup
- torch: The PyTorch library provides tensor computation and neural networks functionality.
- PIL.Image: The Python Imaging Library allows us to open and manipulate image files.
- AutoProcessor: Automatically selects the appropriate processor for the model, handling both text tokenization and image preprocessing.
- LlavaForConditionalGeneration: The main LLaVA model class that combines vision and language capabilities.
2. Model Loading
The code loads the LLaVA 1.5 7B model from Hugging Face, which is a moderate-sized variant balancing performance and resource requirements:
- torch_dtype=torch.float16: Uses half-precision floating-point format to reduce memory usage.
- device_map="auto": Automatically determines the optimal device placement strategy, distributing model components across available GPUs or using CPU as needed.
3. Input Preparation
The code prepares two key inputs:
- An image loaded using PIL's Image.open() function.
- A text prompt that specifies the task ("Describe this image in detail").
The processor then:
- Resizes and normalizes the image to match the CLIP vision encoder's expected input format (336x336 pixels for LLaVA-1.5's ViT-L/14-336 encoder).
- Tokenizes the text prompt into input IDs for the language model component.
- Creates attention masks and other required tensor inputs.
4. Generation Process
The model.generate() method creates the text response with several parameters controlling the generation:
- max_new_tokens=256: Limits the response length to a maximum of 256 new tokens.
- do_sample=True: Enables sampling-based generation rather than greedy decoding.
- temperature=0.6: Controls randomness in the generation (lower values are more deterministic).
- top_p=0.9: Implements nucleus sampling, considering only tokens whose cumulative probability exceeds 90%.
5. Behind the Scenes: How LLaVA Processes the Image
When you run this code, LLaVA performs several sophisticated operations:
- The CLIP vision encoder extracts visual features from the image, creating a high-dimensional representation that captures objects, attributes, spatial relationships, and other visual information.
- The projection layer transforms these visual embeddings into a format compatible with the language model's embedding space, essentially "translating" visual concepts into a language the LLM can understand.
- The Vicuna language model (based on LLaMA) receives both the projected visual embeddings and the tokenized prompt, treating the visual information as special tokens in its context window.
- The self-attention mechanism allows the model to focus on relevant parts of both the image representation and the text prompt when generating each token of the response.
- The decoder generates a coherent, contextually appropriate text response based on both the visual content and the text instruction.
6. Advanced Customization Options
The basic example above can be extended with additional parameters for more control:
# Advanced parameters for more control
output = model.generate(
**inputs,
max_new_tokens=512, # Generate longer responses
do_sample=True, # Enable sampling-based generation
temperature=0.7, # Slightly more creative responses
top_p=0.9, # Nucleus sampling parameter
top_k=50, # Limit vocabulary to top 50 tokens
repetition_penalty=1.2, # Discourage repetition of phrases
length_penalty=1.0, # No penalty based on length
no_repeat_ngram_size=3, # Avoid repeating 3-grams
)
7. Practical Applications
This code structure can be adapted for various multimodal tasks by modifying the prompt:
- Visual question answering: "What color is the car in this image?"
- Image reasoning: "Explain what might happen next in this scene."
- Content extraction: "Extract all text visible in this image."
- Creative generation: "Write a short story inspired by this image."
LLaVA's architecture effectively bridges vision and language, enabling these diverse applications with the same underlying model.
Advanced Example: Interactive Visual Question Answering with LLaVA
The following code demonstrates a more sophisticated use case for LLaVA: building an interactive visual question answering application that can process uploaded images and answer questions about them in real-time.
# Advanced LLaVA application: Interactive Visual QA with Gradio
import torch
import gradio as gr
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
# Load the LLaVA model and processor
model_id = "llava-hf/llava-1.5-13b-hf" # Using larger 13B parameter version
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
def process_image_and_question(image, question, temperature=0.7, max_length=500):
"""Process an image and a question to generate a response using LLaVA."""
# Prepare the prompt with the user's question, using LLaVA's <image> placeholder and chat format
prompt = f"USER: <image>\nAnswer this question about the image: {question}\nASSISTANT:"
# Process inputs
inputs = processor(
prompt,
images=image,
return_tensors="pt"
).to(model.device)
# Generate the response
with torch.no_grad():
output = model.generate(
**inputs,
max_new_tokens=max_length,
do_sample=True,
temperature=temperature,
top_p=0.9,
)
# Decode the response
generated_text = processor.decode(output[0], skip_special_tokens=True)
# Return just the model's answer, removing the prompt portion
response = generated_text.split("ASSISTANT:")[-1].strip()
return response
# Set up the Gradio interface
with gr.Blocks() as demo:
gr.Markdown("# LLaVA Visual Question Answering")
gr.Markdown("Upload an image and ask a question about it.")
with gr.Row():
with gr.Column():
image_input = gr.Image(type="pil", label="Upload Image")
question_input = gr.Textbox(label="Your Question", placeholder="What's happening in this image?")
temperature = gr.Slider(0.1, 1.0, value=0.7, label="Temperature (creativity)")
max_length = gr.Slider(50, 1000, value=500, step=50, label="Maximum response length")
submit_button = gr.Button("Get Answer")
with gr.Column():
output_text = gr.Textbox(label="LLaVA's Answer", lines=10)
# Connect the interface to the processing function
submit_button.click(
fn=process_image_and_question,
inputs=[image_input, question_input, temperature, max_length],
outputs=output_text
)
# Add example images and questions
gr.Examples(
examples=[
["example_street_scene.jpg", "What safety hazards do you see in this image?"],
["example_chart.jpg", "Explain the main trend shown in this chart."],
["example_food.jpg", "What ingredients might be in this dish?"]
],
inputs=[image_input, question_input]
)
# Launch the application
demo.launch()
For this example, download the required images from these links:
Street Scene: files.cuantum.tech/images/example_street_scene.jpg
Chart: https://files.cuantum.tech/images/example_chart.jpg
Food: https://files.cuantum.tech/images/example_food.jpg
Code Breakdown: Interactive Visual QA Application
This advanced example demonstrates how to build a user-friendly application for visual question answering using LLaVA. Let's break down the key components:
1. Model Selection and Setup
- LLaVA 1.5-13B: This code uses the larger 13B parameter version of LLaVA (compared to the 7B in the previous example), which offers improved reasoning capabilities at the cost of requiring more computational resources.
- The same initialization approach is used, with float16 precision and automatic device mapping to optimize for available hardware.
2. Core Processing Function
The process_image_and_question() function handles the core multimodal processing:
- It takes four inputs: an image, a question, and two generation parameters (temperature and max length).
- The question is formatted into a standardized prompt format that helps guide LLaVA's response generation.
- After processing, it extracts just the relevant answer portion, removing the original prompt for a cleaner user experience.
3. Gradio Interface Construction
The code uses Gradio to create an intuitive web interface for the application:
- User inputs: Image upload, question text box, and generation parameter sliders for fine-tuning responses.
- Layout organization: Arranged in a two-column layout for inputs on the left and outputs on the right.
- Examples: Pre-configured example images and questions to demonstrate the system's capabilities.
4. Behind the Scenes: Enhanced Multimodal Processing
When a user interacts with this application, several sophisticated processes occur:
- The uploaded image is automatically preprocessed by the Gradio interface to ensure compatibility with LLaVA's input requirements.
- The LLaVA processor handles both the text tokenization and image preprocessing, ensuring proper alignment between modalities.
- The question is formatted into a directive that helps the model understand the specific visual reasoning task required.
- Generation parameters provide user control over the response style - higher temperature produces more creative but potentially less precise answers.
- Post-processing extracts just the relevant answer, creating a cleaner conversational experience.
5. Potential Applications
This interactive application template could be adapted for numerous real-world use cases:
- Educational tools: Students could upload diagrams or historical images and ask for explanations.
- Accessibility services: Visually impaired users could ask detailed questions about photographs or documents.
- E-commerce: Shoppers could upload product images and ask specific questions about features or compatibility.
- Technical support: Users could share screenshots of error messages or hardware setups and ask for troubleshooting advice.
- Content moderation: Platforms could use a modified version to help analyze uploaded images for policy compliance.
6. Technical Considerations and Limitations
When implementing this type of application, it's important to consider:
- Hardware requirements: The 13B parameter model requires a GPU with at least 24GB VRAM for optimal performance.
- Inference speed: Response generation typically takes 2-10 seconds depending on hardware and response length.
- Image resolution: LLaVA processes images at a fixed resolution (336x336 pixels for LLaVA-1.5), which may limit detailed analysis of very small elements.
- Privacy considerations: For sensitive applications, consider running this locally rather than on cloud infrastructure.
This example illustrates how LLaVA's capabilities can be packaged into user-friendly applications that bring multimodal AI's power to non-technical users. The combination of visual understanding, language generation, and interactive controls creates a flexible system for a wide range of visual reasoning tasks.
5.1.2 Flamingo (DeepMind)
Flamingo is a groundbreaking multimodal model developed by DeepMind, specifically engineered to excel at few-shot learning across text and image domains. Unlike models that require extensive task-specific training, Flamingo can adapt to new visual tasks with minimal examples. This represents a significant advancement in multimodal AI, as most earlier systems required dedicated training datasets for each new type of visual reasoning task they needed to perform.
At its architectural core, Flamingo uses a frozen language model (LLM) as its foundation and introduces specialized cross-attention layers that create bridges between visual representations and textual understanding. These cross-attention mechanisms serve as effective translators, allowing visual information to be meaningfully incorporated into the language model's processing pipeline without disrupting its pre-trained linguistic capabilities. The visual processing component of Flamingo utilizes a vision encoder based on a Normalizer-Free ResNet (NFNet), which transforms images into dense feature representations. These visual features are then processed through a perceiver resampler module that converts the variable-sized visual representations into a fixed number of visual tokens that can be efficiently processed by the language model.
What makes Flamingo particularly impressive is its ability to perform "in-context learning" with visual data. It can answer questions about previously unseen image-text tasks with remarkably little training data - often needing just 1-16 examples to achieve strong performance. This capability allows Flamingo to generalize to novel visual reasoning scenarios without extensive retraining, making it adaptable across domains like visual question answering, image captioning, and visual reasoning with minimal setup time. The model was trained on a massive multimodal dataset comprising hundreds of millions of image-text pairs gathered from diverse web sources, enabling it to develop a rich understanding of the relationships between visual and textual concepts.
During inference, Flamingo can process interleaved sequences of images and text, making it particularly well-suited for conversational interactions about visual content. For example, a user could show Flamingo several images of animals with corresponding descriptions as examples, then present a new animal image and ask for a similar description. The model would leverage its few-shot learning capabilities to generate an appropriate response following the pattern established in the examples. This flexibility extends to complex reasoning tasks as well, such as comparing multiple images, answering questions about specific visual details, or even generating creative content inspired by visual inputs.
The model's architecture has inspired subsequent research in efficient multimodal learning, particularly in how to effectively combine pre-trained unimodal models (like vision-only and language-only systems) into powerful multimodal reasoners without requiring extensive joint training from scratch. This approach has proven valuable for developing more accessible multimodal AI systems while leveraging the strengths of specialized models in each modality.
Flamingo Implementation Example: Multimodal Few-shot Learning
Below is a simplified implementation example of a Flamingo-inspired architecture using PyTorch. This example demonstrates the core components of Flamingo: a vision encoder, a perceiver resampler, and cross-attention layers integrated with a language model.
import torch
import torch.nn as nn
import torchvision.models as models
from transformers import GPT2LMHeadModel, GPT2Tokenizer
class PerceiverResampler(nn.Module):
"""
Perceiver Resampler module that converts variable-sized visual features
to a fixed number of tokens that can be processed by the language model.
"""
def __init__(self, input_dim=2048, latent_dim=768, num_latents=64, num_layers=4):
super().__init__()
self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
self.layers = nn.ModuleList([
nn.MultiheadAttention(embed_dim=latent_dim, num_heads=8, batch_first=True)
for _ in range(num_layers)
])
self.input_proj = nn.Linear(input_dim, latent_dim)
self.norm = nn.LayerNorm(latent_dim)
def forward(self, visual_features):
# Project visual features to latent dimension
visual_features = self.input_proj(visual_features)
# Expand latents to batch size
batch_size = visual_features.shape[0]
latents = self.latents.unsqueeze(0).expand(batch_size, -1, -1)
# Process through cross-attention layers
for layer in self.layers:
latents = latents + layer(
query=latents,
key=visual_features,
value=visual_features,
need_weights=False
)[0]
latents = self.norm(latents)
return latents
class CrossAttentionBlock(nn.Module):
"""
Cross-attention block that integrates visual information into the LLM.
"""
def __init__(self, hidden_size=768, num_heads=12):
super().__init__()
self.cross_attention = nn.MultiheadAttention(
embed_dim=hidden_size,
num_heads=num_heads,
batch_first=True
)
self.layer_norm1 = nn.LayerNorm(hidden_size)
self.layer_norm2 = nn.LayerNorm(hidden_size)
def forward(self, hidden_states, visual_features):
normed_hidden_states = self.layer_norm1(hidden_states)
# Apply cross-attention
attn_output = self.cross_attention(
query=normed_hidden_states,
key=visual_features,
value=visual_features,
need_weights=False
)[0]
# Residual connection and layer norm
hidden_states = hidden_states + attn_output
hidden_states = self.layer_norm2(hidden_states)
return hidden_states
class FlamingoModel(nn.Module):
"""
Simplified Flamingo model combining vision encoder, perceiver resampler,
and a language model with cross-attention layers.
"""
def __init__(self, vision_model_name="resnet50", num_visual_tokens=64):
super().__init__()
# Vision encoder (frozen)
self.vision_encoder = models.__dict__[vision_model_name](pretrained=True)
self.vision_encoder.fc = nn.Identity() # Remove classification head
for param in self.vision_encoder.parameters():
param.requires_grad = False
# Perceiver resampler
self.perceiver = PerceiverResampler(
input_dim=2048, # ResNet50 feature dim
latent_dim=768, # Match GPT2 hidden size
num_latents=num_visual_tokens
)
# Language model (frozen)
self.language_model = GPT2LMHeadModel.from_pretrained("gpt2")
self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
self.tokenizer.pad_token = self.tokenizer.eos_token
for param in self.language_model.parameters():
param.requires_grad = False
# Cross-attention layers (one per transformer block)
self.cross_attentions = nn.ModuleList([
CrossAttentionBlock(hidden_size=768, num_heads=12)
for _ in range(len(self.language_model.transformer.h))
])
# Save original forward methods
self.original_block_forward = self.language_model.transformer.h[0].forward
# Monkey patch the transformer blocks to include cross-attention
for i, block in enumerate(self.language_model.transformer.h):
block.flamingo_cross_attn = self.cross_attentions[i]
block.forward = self._make_new_forward(block, i)
# Visual features buffer for storing current visual context
self.register_buffer("visual_features", None, persistent=False)
def _make_new_forward(self, block, block_index):
"""Creates a new forward method for transformer blocks that includes cross-attention."""
original_forward = block.forward
cross_attn = self.cross_attentions[block_index]
def new_forward(x, **kwargs):
    # Run the original transformer block first
    outputs = original_forward(x, **kwargs)
    # Apply cross-attention with the stored visual features, keeping any extra outputs intact
    if self.visual_features is not None:
        if isinstance(outputs, tuple):
            attended = cross_attn(outputs[0], self.visual_features)
            outputs = (attended,) + outputs[1:]
        else:
            outputs = cross_attn(outputs, self.visual_features)
    return outputs
return new_forward
def process_images(self, images):
"""Extract visual features from images and prepare them for conditioning."""
with torch.no_grad():
# Extract features from vision encoder
features = self.vision_encoder(images) # [batch_size, 2048]
features = features.unsqueeze(1) # Add sequence dimension [batch_size, 1, 2048]
# Process through perceiver resampler
visual_tokens = self.perceiver(features) # [batch_size, num_latents, hidden_size]
# Store visual features for cross-attention
self.visual_features = visual_tokens
def generate(self, prompt, images=None, max_length=100, temperature=0.7):
"""Generate text conditioned on images and text prompt."""
# Process images if provided
if images is not None:
self.process_images(images)
else:
self.visual_features = None
# Tokenize prompt
inputs = self.tokenizer(prompt, return_tensors="pt", padding=True)
input_ids = inputs.input_ids.to(next(self.parameters()).device)
attention_mask = inputs.attention_mask.to(next(self.parameters()).device)
# Generate text
output_ids = self.language_model.generate(
input_ids,
attention_mask=attention_mask,
max_length=max_length,
do_sample=True,
temperature=temperature,
top_p=0.9,
)
# Decode output
generated_text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
return generated_text
# Example usage
def flamingo_example():
from PIL import Image
import torchvision.transforms as transforms
# Initialize model
model = FlamingoModel().to("cuda" if torch.cuda.is_available() else "cpu")
# Prepare image transform
transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Load and process image
image = Image.open("eiffel-tower.jpg").convert("RGB")  # ensure a 3-channel image for the normalization transform
image_tensor = transform(image).unsqueeze(0).to(next(model.parameters()).device)
# Example prompts for few-shot learning
few_shot_prompt = """
Image: [A photo of a busy street in Tokyo]
Description: The image shows a crowded street in Tokyo with neon signs, many pedestrians, and small restaurants.
Image: [A photo of the Grand Canyon]
Description: The image depicts the vast expanse of the Grand Canyon with its layered rock formations and deep ravines.
Image: [Current image]
Description:
"""
# Generate text based on image
output = model.generate(few_shot_prompt, images=image_tensor, max_length=200)
print(output)
if __name__ == "__main__":
flamingo_example()
For this example, download the Eiffel Tower image here: https://files.cuantum.tech/images/eiffel-tower.jpg
Code Breakdown: Flamingo-inspired Multimodal Model
The above implementation represents a simplified version of DeepMind's Flamingo architecture. Let's break down the key components:
1. Architecture Components
- Vision Encoder: A pretrained ResNet50 model that extracts visual features from images. In the full Flamingo model, this would be a more advanced vision model like NFNet.
- Perceiver Resampler: This critical component transforms variable-sized visual features into a fixed number of visual tokens. It uses cross-attention between learned latent vectors and visual features to condense the visual information.
- Language Model: A pretrained GPT-2 model serves as the language foundation. The original Flamingo used a more powerful Chinchilla LLM.
- Cross-Attention Layers: These layers are inserted into each transformer block of the language model, allowing visual information to influence text generation at multiple levels of processing.
2. Key Design Decisions
- Frozen Backbone Models: Both the vision encoder and language model are kept frozen, preserving their pretrained capabilities while only training the connecting components.
- Parameter Efficiency: By only training the perceiver resampler and cross-attention layers, Flamingo achieves multimodal capabilities with relatively few trainable parameters.
- Monkey Patching: The implementation uses a technique called "monkey patching" to insert cross-attention into the language model without modifying its original architecture.
3. How Visual Processing Works
- The image is passed through the vision encoder to extract high-level visual features (2048-dimensional for ResNet50).
- These features are then processed by the perceiver resampler, which condenses them into a fixed set of tokens (64 in this example).
- The resulting visual tokens are stored in a buffer and made available to all cross-attention layers during text generation.
4. How Few-Shot Learning Is Implemented
- The example demonstrates few-shot learning through a carefully formatted prompt containing example image-text pairs.
- Each example follows a pattern of "Image: [description]" followed by "Description: [detailed text]".
- The final prompt ends with "Image: [Current image]" and "Description:", prompting the model to generate a description for the new image following the pattern established by the examples.
- This in-context learning approach allows the model to adapt to specific tasks without parameter updates.
5. Practical Considerations and Limitations
- Computational Efficiency: The real Flamingo model uses sophisticated techniques for handling larger contexts and more efficiently processing visual information.
- Training Requirements: To fully train this model, you would need a large dataset of image-text pairs and significant computational resources.
- Simplified Architecture: This example omits some details of the full Flamingo architecture for clarity, such as gated cross-attention and more advanced visual processing.
6. Real-world Applications
- Visual question answering: Answering specific questions about image content with few or no examples.
- Image captioning: Generating detailed descriptions of images in various styles based on examples.
- Visual reasoning: Performing complex reasoning tasks about visual content, such as comparing images or identifying relationships.
- Multimodal chat: Enabling conversational interactions that seamlessly incorporate visual information.
This implementation provides a starting point for understanding and experimenting with Flamingo-style multimodal architectures. The real power of such models comes from their ability to perform in-context learning across modalities, adapting to new tasks with minimal examples.
Enhanced Flamingo Implementation with In-Context Learning
Let's explore a more comprehensive implementation of the Flamingo architecture that better demonstrates its in-context learning capabilities for visual question answering:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer, ViTModel, ViTImageProcessor
from PIL import Image
import requests
from io import BytesIO
class GatedCrossAttentionBlock(nn.Module):
"""
Enhanced cross-attention block with gating mechanism as used in Flamingo.
"""
def __init__(self, hidden_size=768, num_heads=12):
super().__init__()
self.hidden_size = hidden_size
self.cross_attention = nn.MultiheadAttention(
embed_dim=hidden_size,
num_heads=num_heads,
batch_first=True
)
# Gating mechanism
self.gate = nn.Linear(hidden_size, hidden_size)
self.gate_activation = nn.Sigmoid()
# Layer normalization
self.layer_norm1 = nn.LayerNorm(hidden_size)
self.layer_norm2 = nn.LayerNorm(hidden_size)
def forward(self, hidden_states, visual_features):
normed_hidden_states = self.layer_norm1(hidden_states)
# Apply cross-attention
attn_output, _ = self.cross_attention(
query=normed_hidden_states,
key=visual_features,
value=visual_features
)
# Apply gating mechanism
gate_values = self.gate_activation(self.gate(normed_hidden_states))
attn_output = gate_values * attn_output
# Residual connection and layer norm
hidden_states = hidden_states + attn_output
hidden_states = self.layer_norm2(hidden_states)
return hidden_states
class PerceiverResampler(nn.Module):
"""
Perceiver Resampler that converts variable-length visual features into
a fixed number of tokens through cross-attention with learned queries.
"""
def __init__(self, input_dim=768, latent_dim=768, num_latents=64, num_layers=4):
super().__init__()
self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
self.layers = nn.ModuleList([
nn.MultiheadAttention(
embed_dim=latent_dim,
num_heads=8,
batch_first=True
)
for _ in range(num_layers)
])
self.input_projection = nn.Linear(input_dim, latent_dim)
self.layer_norm = nn.LayerNorm(latent_dim)
def forward(self, x):
batch_size = x.shape[0]
# Project input features to match latent dimension
x = self.input_projection(x)
# Expand latents for each item in the batch
latents = self.latents.unsqueeze(0).expand(batch_size, -1, -1)
# Apply layers of cross-attention
for layer in self.layers:
latents, _ = layer(
query=latents,
key=x,
value=x
)
latents = self.layer_norm(latents)
return latents
class EnhancedFlamingoModel(nn.Module):
"""
Enhanced Flamingo model with improved components for in-context learning
and visual question answering tasks.
"""
def __init__(self, num_visual_tokens=64, vision_model_name="google/vit-base-patch16-224"):
super().__init__()
# Vision encoder (frozen ViT)
self.vision_encoder = ViTModel.from_pretrained(vision_model_name)
self.vision_processor = ViTImageProcessor.from_pretrained(vision_model_name)
for param in self.vision_encoder.parameters():
param.requires_grad = False
# Perceiver resampler
self.perceiver = PerceiverResampler(
input_dim=768, # ViT feature dim
latent_dim=768, # Match GPT2 hidden size
num_latents=num_visual_tokens,
num_layers=4
)
# Language model (frozen GPT-2)
self.language_model = GPT2LMHeadModel.from_pretrained("gpt2")
self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
self.tokenizer.pad_token = self.tokenizer.eos_token
# Keep LM frozen except for final layer norm and unembedding
for name, param in self.language_model.named_parameters():
if "ln_f" in name or "wte" in name:
param.requires_grad = True
else:
param.requires_grad = False
# Special tokens for marking image inputs
self.image_start_token = "<image>"
self.image_end_token = "</image>"
# Add special tokens to vocabulary
special_tokens = {"additional_special_tokens": [self.image_start_token, self.image_end_token]}
num_added = self.tokenizer.add_special_tokens(special_tokens)
self.language_model.resize_token_embeddings(len(self.tokenizer))
# Cross-attention blocks
self.cross_attentions = nn.ModuleList([
GatedCrossAttentionBlock(hidden_size=768, num_heads=12)
for _ in range(len(self.language_model.transformer.h))
])
# Create image token ID
self.image_start_token_id = self.tokenizer.convert_tokens_to_ids(self.image_start_token)
self.image_end_token_id = self.tokenizer.convert_tokens_to_ids(self.image_end_token)
# Register hook to modify the transformer layers
for i, block in enumerate(self.language_model.transformer.h):
block.register_forward_hook(self._make_cross_attention_hook(i))
# Buffer for storing visual features
self.register_buffer("visual_features", None, persistent=False)
def _make_cross_attention_hook(self, block_idx):
"""Create a forward hook for adding cross-attention at specified layer."""
cross_attn = self.cross_attentions[block_idx]
def hook(module, inputs, outputs):
if self.visual_features is None:
return outputs
hidden_states = outputs[0] if isinstance(outputs, tuple) else outputs
modified_hidden_states = cross_attn(hidden_states, self.visual_features)
if isinstance(outputs, tuple):
return (modified_hidden_states,) + outputs[1:]
return modified_hidden_states
return hook
def _encode_image(self, image_tensor):
"""Process a single image through the vision encoder and perceiver."""
with torch.no_grad():
vision_outputs = self.vision_encoder(image_tensor)
hidden_states = vision_outputs.last_hidden_state
# Process through perceiver resampler to get fixed number of tokens
visual_tokens = self.perceiver(hidden_states)
return visual_tokens
def _encode_images_batch(self, image_list):
"""Process a batch of images through the vision pipeline."""
processed_images = []
for image in image_list:
if isinstance(image, str):
# Load from URL if string
response = requests.get(image)
img = Image.open(BytesIO(response.content))
else:
# Assume PIL Image otherwise
img = image
# Preprocess for vision model
processed = self.vision_processor(img, return_tensors="pt")
processed_images.append(processed["pixel_values"])
# Stack into batch
image_tensors = torch.cat(processed_images, dim=0).to(next(self.parameters()).device)
return self._encode_image(image_tensors)
def format_prompt_with_images(self, text_prompt, images):
"""Format a prompt with image placeholders and encode the images."""
# Encode images first
self.visual_features = self._encode_images_batch(images)
# Replace placeholders with special tokens
formatted_prompt = text_prompt.replace("[IMAGE]", f"{self.image_start_token}{self.image_end_token}")
return formatted_prompt
def generate_answer(self, prompt, images=None, max_length=200, temperature=0.7):
"""Generate an answer for a visual question answering prompt with images."""
if images:
prompt = self.format_prompt_with_images(prompt, images)
# Tokenize prompt
inputs = self.tokenizer(prompt, return_tensors="pt").to(next(self.parameters()).device)
# Generate text
with torch.no_grad():
output_ids = self.language_model.generate(
inputs.input_ids,
max_length=max_length,
do_sample=True,
temperature=temperature,
top_p=0.9,
pad_token_id=self.tokenizer.eos_token_id
)
# Get only the generated text (not the prompt)
generated_ids = output_ids[0][inputs.input_ids.shape[1]:]
generated_text = self.tokenizer.decode(generated_ids, skip_special_tokens=True)
# Clear visual features after generation
self.visual_features = None
return generated_text.strip()
def run_visual_qa_demo():
"""Demonstrate visual question answering with the Flamingo model."""
# Initialize model
model = EnhancedFlamingoModel().to("cuda" if torch.cuda.is_available() else "cpu")
# Example images (use URLs for convenience)
example_images = [
"https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg", # Image of a dog on a beach
"https://files.cuantum.tech/images/dog_drawing.jpg" # Drawing of a dog
]
# Few-shot prompt for VQA
few_shot_prompt = """
I will answer questions about images.
[IMAGE]
Question: What animal is in the image?
Answer: The image shows a dog running on the beach. It appears to be a golden retriever enjoying the sand and ocean.
[IMAGE]
Question: What is this a drawing of?
Answer: This is a simple drawing of a dog. It appears to be a cartoon-style sketch with basic lines representing a dog's features.
[IMAGE]
Question: What is shown in this image?
Answer:
"""
# New test image (Eiffel Tower)
test_image = "https://files.cuantum.tech/images/eiffel-tower.jpg"
# Generate answer
answer = model.generate_answer(
few_shot_prompt,
images=example_images + [test_image],
max_length=100
)
print("Model's answer:", answer)
if __name__ == "__main__":
run_visual_qa_demo()
Code Breakdown: Advanced Flamingo Implementation
This enhanced implementation of the Flamingo architecture includes several important improvements that make it more similar to the original DeepMind model:
1. Key Architecture Enhancements
- Gated Cross-Attention: Unlike the basic implementation, this version includes a gating mechanism that controls how much visual information flows into the language model at each layer. This prevents visual information from dominating and allows for more nuanced integration.
- Multi-layer Perceiver Resampler: The perceiver now uses multiple layers of cross-attention to refine the visual tokens, creating a more sophisticated visual representation.
- ViT Vision Encoder: Uses a modern Vision Transformer instead of ResNet, providing better visual feature extraction.
- Special Tokens: Adds special image tokens to the vocabulary, allowing the model to recognize where images appear in the context.
2. In-Context Learning Implementation
- Few-Shot Visual QA: The prompt structure demonstrates how Flamingo enables few-shot learning by showing examples of image-question-answer triplets.
- Image Placeholders: Uses [IMAGE] placeholders in the prompt that get replaced with special tokens, mimicking how the real Flamingo handles multiple images in context.
- Contextual Memory: The model processes multiple images and remembers their features during generation, allowing it to reference different examples.
3. Technical Implementation Details
- Forward Hooks: Uses PyTorch hooks instead of monkey patching to inject cross-attention into the transformer blocks, which is a cleaner implementation.
- Selective Fine-tuning: Only certain parts of the language model are trainable (the final layer norm and the token embedding, which GPT-2 ties to its output head), while most parameters stay frozen; a quick parameter-count check follows this list.
- Batched Image Processing: Handles multiple images efficiently by batching them through the vision pipeline.
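As a quick sanity check on that selective fine-tuning, the short snippet below counts trainable versus frozen parameters after constructing the model. It is a minimal sketch that assumes the EnhancedFlamingoModel class defined in the example above is available in the current scope.

# Sanity check: how many parameters would actually be updated during training?
# Assumes the EnhancedFlamingoModel class from the example above.
model = EnhancedFlamingoModel()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)

print(f"Trainable parameters: {trainable:,}")  # perceiver, cross-attention blocks, ln_f, wte
print(f"Frozen parameters:    {frozen:,}")     # ViT encoder and most of GPT-2
print(f"Trainable fraction:   {trainable / (trainable + frozen):.1%}")

If the frozen count does not dwarf the trainable count, something in the freezing logic has likely been changed.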
4. User-Friendly Features
- URL Image Loading: Supports loading images directly from URLs, making demonstrations easier.
- Structured API: Provides a clean interface for formatting prompts with images and generating answers.
- Memory Management: Clears visual features after generation to free up memory.
5. Real-world Applications
This implementation demonstrates how Flamingo can be used for:
- Visual Question Answering: Answering specific questions about image content.
- Few-Shot Learning: Learning new tasks from just a few examples without parameter updates.
- Multi-image Reasoning: Processing information across multiple images to provide coherent answers.
The enhanced implementation shows how multimodal models can maintain the powerful in-context learning capabilities of large language models while incorporating rich visual information. This approach allows for flexible adaptation to new visual tasks without specialized fine-tuning, making it particularly valuable for real-world applications.
5.1.3 GPT-5 (OpenAI)
Launched on August 7, 2025, GPT-5 marks a new milestone in OpenAI’s large language model lineage. It is the first fully native multimodal model, trained jointly on text, images, and audio from the ground up, with a composed system design that integrates fast responses, deep reasoning, and intelligent routing. More than an incremental upgrade over GPT-4o, GPT-5 represents a paradigm shift: a model architected from the beginning to process and reason across modalities as a unified whole.
Native Multimodal Architecture
Unlike earlier models that retrofitted speech or vision modules onto a text-first transformer, GPT-5 is fundamentally multimodal. Text, image, and audio are processed in the same transformer backbone, creating shared internal representations that seamlessly connect concepts across formats.
This design produces fluid cross-modal reasoning. For example, if a user submits a photo of a math problem, GPT-5 not only recognizes the characters but also interprets the underlying mathematical structure. It then generates a step-by-step solution that references specific symbols in the image, checks for ambiguities, and explains the reasoning in natural language. This integrated comprehension extends to scientific diagrams, financial charts, architectural blueprints, and medical imagery.
By aligning modalities during training, GPT-5 develops deeper semantic coherence—understanding how textual descriptions, visual data, and spoken language reinforce or contradict each other. It can, for instance, highlight inconsistencies between a historical photograph and a written account, or correlate radiology images with patient notes.
Composed System and Intelligent Routing
GPT-5 is not a monolithic model but a composed system:
- A main fast model handles everyday queries with low latency.
- A thinking model engages when complex, multi-step reasoning is required, offering real-time chain-of-thought.
- Mini and nano variants optimize cost and speed for lightweight applications.
- A Pro reasoning variant (API only) extends test-time reasoning for the hardest problems.
An intelligent router automatically decides which component to use, sparing users from manually picking between “light” and “heavy” models. This dynamic composition ensures efficiency for simple prompts and depth for challenging ones.
Reasoning and Context Management
With real-time chain-of-thought reasoning, GPT-5 excels in tasks that require logic, multi-step deduction, or tool use. On external benchmarks, it sets new records: 74.9% accuracy on SWE-bench Verified (software engineering) and 88% on Aider polyglot (code editing).
The model’s expanded context window—up to 400,000 tokens via the API, with output lengths of up to 128,000 tokens—supports the analysis of entire books, multi-hour meetings, or large codebases without losing track of earlier information. This scale makes it suitable for legal discovery, research synthesis, and full-repository debugging.
Voice and Multilingual Capabilities
Through the Realtime API, GPT-5 offers natural speech-in/speech-out interactions with millisecond-level latency. The voice system is robust to accents, can modulate tone on command, and integrates with SIP protocols, enabling real-world phone calls and live agents. Users can now hold fluid conversations where GPT-5 reasons, speaks, and listens in real time.
Multilingual fluency has also advanced, making GPT-5 a practical tool for cross-border communication, customer support, education, and accessibility.
Developer Controls and Tool Integration
Developers gain fine-grained control via new parameters:
- reasoning_effort: from minimal (fast) to extensive (deep reasoning).
- verbosity: low, medium, or high detail in responses.
The API exposes three model families—gpt-5, gpt-5-mini, and gpt-5-nano—to balance accuracy, cost, and latency. Pricing (per million tokens) at launch was $1.25 input / $10 output for GPT-5, with cheaper mini and nano tiers.
GPT-5 also supports custom tools: lightweight, plaintext tool calls with optional grammar constraints, allowing more reliable integration with external APIs. Enterprises can connect GPT-5 directly into Microsoft Copilot, Apple Intelligence, GitLab, Notion, and custom pipelines.
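As a concrete illustration, the hedged sketch below attaches the reasoning_effort and verbosity controls described above to the same Chat Completions request shape used later in this section. The parameter names come from the text; their exact placement and accepted values in the live API may differ, so treat this as a sketch rather than a definitive reference.

# Hedged sketch: adding the reasoning_effort and verbosity controls described above
# to a Chat Completions request. Their exact placement/values are assumptions from the text.
import requests

payload = {
    "model": "gpt-5",
    "messages": [{"role": "user", "content": "Outline a migration plan from REST to gRPC."}],
    "reasoning_effort": "minimal",  # assumed range: minimal ... extensive (see above)
    "verbosity": "low",             # assumed values: low / medium / high
    "max_tokens": 300,
}
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENAI_API_KEY", "Content-Type": "application/json"},
    json=payload,
)
print(response.json()["choices"][0]["message"]["content"])

At the launch prices quoted above, a large request with 400,000 input tokens and 10,000 output tokens would cost roughly (400,000 / 1,000,000) × $1.25 = $0.50 for input plus (10,000 / 1,000,000) × $10 = $0.10 for output, about $0.60 in total.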
Accuracy, Safety, and Bias Reduction
OpenAI introduced safe-completions training in GPT-5. Instead of choosing between over-compliance and refusal, the model aims to generate the safest useful answer. Internal evaluations show:
- Substantially fewer hallucinations than GPT-4o.
- Lower sycophancy (over-agreeableness).
- Reduced deception, meaning the model is less likely to feign success on impossible tasks.
Safety frameworks classify GPT-5 Thinking as High capability in biology and chemistry, with layered safeguards, red-teaming, and monitoring.
Use Cases and Industry Impact
- Coding & Engineering: GPT-5 generates functional front-end code, debugs large repositories, and coordinates multi-tool development workflows.
- Automation & Productivity: From grading and summarizing to document review, it frees human bandwidth for higher-order work.
- Knowledge Work: Enterprises use GPT-5 for legal analysis, financial reporting, and R&D, where its long context and reasoning shine.
- Creative Workflows: Designers, writers, and researchers can mix text, images, and audio in prompts—e.g., analyzing a chart and drafting a report in one go.
- Voice Agents: Customer service and sales teams deploy GPT-5 via Realtime API to deliver human-like support, capturing alphanumeric details and following strict protocols.
The New Standard
GPT-5 establishes a new baseline for large multimodal models. Its unified architecture, dynamic routing, reasoning capabilities, and developer controls make it a versatile foundation for both consumer and enterprise AI. By natively fusing text, vision, and audio, GPT-5 doesn’t just respond across modalities—it reasons through them, enabling a generation of AI systems that operate more like collaborators than tools.
Basic Example: Multimodal Prompt with JSON Output (Chat Completions API)
A beginner-friendly example showing how to send an image and text together and receive a structured JSON response.
import requests
import json # You need this to parse the JSON string from the response
API_KEY = "YOUR_OPENAI_API_KEY"
# Use the correct API endpoint
API_URL = "https://api.openai.com/v1/chat/completions"
headers = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
}
# Example: Provide an image URL and a text query jointly
# Corrected input structure using 'type' and 'image_url' keys
image_part = {
"type": "image_url",
"image_url": {
"url": "https://files.cuantum.tech/images/chart.png" # Can also use a data URL for base64 images
}
}
# Corrected text part structure
text_part = {
"type": "text",
"text": "Summarize the main trend shown in the chart. Also, generate the Python code to recreate this visualization. Format the response as a JSON object with the keys 'summary', 'python_code', and 'key_points'."
}
# Corrected payload
payload = {
"model": "gpt-5",
"messages": [
{
"role": "user",
"content": [
image_part,
text_part
]
}
],
# Correct way to request JSON output
"response_format": { "type": "json_object" },
# The max_tokens parameter is standard
"max_tokens": 400
}
response = requests.post(API_URL, headers=headers, json=payload)
result = response.json()
# Correct way to handle the API response
try:
# The API returns a JSON string inside the message content, so we parse it
response_content = result['choices'][0]['message']['content']
parsed_output = json.loads(response_content)
# Print structured output from the parsed JSON
print("Summary:", parsed_output.get("summary"))
print("Python code:", parsed_output.get("python_code"))
print("Key points:", parsed_output.get("key_points"))
except (KeyError, IndexError, json.JSONDecodeError) as e:
print("Error parsing the API response:", e)
print("Raw response:", result)
Code Breakdown
This example demonstrates how to send a multimodal request to OpenAI's GPT-5 model, combining an image URL with a text query, and specifically asking for a structured JSON response.
1. Import Libraries
import requests
import json

- requests: This library is essential for making HTTP requests in Python. We use it to send our data to the OpenAI API and receive the response.
- json: This library is used for working with JSON (JavaScript Object Notation) data. We'll use it to construct our request payload and, critically, to parse the JSON string that GPT-5 will return to us when we ask for structured output.
2. API Configuration
API_KEY = "YOUR_OPENAI_API_KEY"
API_URL = "https://api.openai.com/v1/chat/completions"

- API_KEY: This is a placeholder for your unique OpenAI API key. You must replace "YOUR_OPENAI_API_KEY" with your actual key, which you can obtain from the OpenAI developer dashboard. This key authenticates your requests.
- API_URL: This is the specific endpoint for OpenAI's chat completion API. All conversational and multimodal requests go to this URL. It's crucial that this is correct.
3. Request Headers
headers = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
}

- headers: This dictionary contains metadata sent with our HTTP request.
- "Authorization": f"Bearer {API_KEY}": This header authenticates your request using your API key. The Bearer token prefix is a standard for OAuth 2.0.
- "Content-Type": "application/json": This header tells the server that the body of our request is formatted as JSON.
4. Defining Multimodal Input Parts
GPT-5 can process different types of input simultaneously. Here, we define an image and a text part.
image_part = {
"type": "image_url",
"image_url": {
"url": "https://files.cuantum.tech/images/chart.png"
}
}

- image_part: This dictionary represents the visual input.
- "type": "image_url": Specifies that this content block is an image provided via a URL.
- "image_url": {"url": "..."}: This nested structure is where the actual image URL is provided. The model will fetch and process the image from this link. You could also provide base64 encoded images here instead of a URL.
text_part = {
"type": "text",
"text": "Summarize the main trend shown in the chart. Also, generate the Python code to recreate this visualization. Format the response as a JSON object with the keys 'summary', 'python_code', and 'key_points'."
}

- text_part: This dictionary holds the textual instruction for the model.
- "type": "text": Indicates this content block is plain text.
- "text": "...": This is the actual prompt to GPT-5. Notice how we explicitly ask for a JSON object with specific keys (summary, python_code, key_points). This is crucial for getting structured output from the model.
5. Constructing the Request Payload
This is the main body of the request, containing all the instructions for the API.
payload = {
"model": "gpt-5",
"messages": [
{
"role": "user",
"content": [
image_part,
text_part
]
}
],
"response_format": { "type": "json_object" },
"max_tokens": 400
}"model": "gpt-5": Specifies which OpenAI model to use. In this case, it's the latest GPT-5."messages": [...]: This is a list of message objects, forming the conversation.- Each message has a
"role"(e.g.,"user","system","assistant") and"content". "role": "user": Indicates that this message comes from the user."content": [image_part, text_part]: This is the crucial part for multimodal input. Thecontentis a list containing both ourimage_partandtext_partdictionaries. The model will process them together.
- Each message has a
"response_format": { "type": "json_object" }: This parameter explicitly tells the API to constrain the model's output to a valid JSON object. This is essential when you want structured data back from the model, as we requested in ourtext_part."max_tokens": 400: Sets the maximum number of tokens (words or word pieces) the model should generate in its response. This helps control cost and response length.
6. Sending the Request
response = requests.post(API_URL, headers=headers, json=payload)
result = response.json()

- requests.post(...): This function sends an HTTP POST request to the API_URL with our headers and the payload (converted to JSON by requests.post).
- response.json(): The API's reply comes back as a JSON string. This method parses that string into a Python dictionary, making it easy to access the data.
7. Handling and Parsing the Response
The API's response structure is standard, but the actual content we asked GPT-5 to generate is nested within it as a string.
try:
response_content = result['choices'][0]['message']['content']
parsed_output = json.loads(response_content)
print("Summary:", parsed_output.get("summary"))
print("Python code:", parsed_output.get("python_code"))
print("Key points:", parsed_output.get("key_points"))
except (KeyError, IndexError, json.JSONDecodeError) as e:
print("Error parsing the API response:", e)
print("Raw response:", result)try...except: This block is crucial for robust error handling. API calls can fail for many reasons (network issues, incorrect API key, malformed requests, or the model might not return valid JSON).result['choices'][0]['message']['content']: This is the path to extract the actual text generated by GPT-5.result['choices']: The API can return multiplechoices(different possible completions) based on parameters liken. We usually take the first one ([0]).['message']: Within each choice, themessageobject contains therole(e.g., "assistant") and the generatedcontent.
json.loads(response_content): Since we specifically asked the model to format its output as a JSON string within thecontentfield, we need to usejson.loads()to parse this string into a Python dictionary.parsed_output.get("summary"),parsed_output.get("python_code"),parsed_output.get("key_points"): Onceresponse_contentis parsed into a dictionary, we can access the individual fields we requested from GPT-5. Using.get()is safer than direct dictionary access ([]) as it preventsKeyErrorif a key is missing.- The
exceptblock catches potential errors during parsing or if the expected keys are not found, printing both the error and the raw API response for debugging.
Advanced Example: Production-Ready Multimodal Workflow (Responses API with JSON Schema)
A robust example demonstrating best practices for reliability, schema validation, retries, and safe execution of returned code.
"""
Multimodal (image + text) → structured JSON with GPT-5
- Uses the Responses API (recommended)
- Strict JSON schema for reliable structured output
- Optional: safely execute returned Matplotlib code in a subprocess to render a PNG
"""
import os
import json
import time
import base64
import requests
import tempfile
import subprocess
import sys
from textwrap import dedent
from typing import Dict, Any, List, Optional
# =========================
# Configuration
# =========================
API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
API_URL = "https://api.openai.com/v1/responses"
MODEL = "gpt-5" # or: gpt-5-mini / gpt-5-nano
HEADERS = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json",
}
# Use a public image URL OR a local file encoded as a data URL (see helper below).
IMAGE_URL = "https://cdn.example.com/chart.png" # <- replace for your test
# Strict JSON schema for the model’s response
RESPONSE_SCHEMA: Dict[str, Any] = {
"name": "ChartInsight",
"schema": {
"type": "object",
"properties": {
"summary": {"type": "string"},
"python_code": {"type": "string"},
"key_points": {
"type": "array",
"items": {"type": "string"},
"minItems": 3,
"maxItems": 7
}
},
"required": ["summary", "python_code", "key_points"],
"additionalProperties": False
},
"strict": True
}
PROMPT_TEXT = (
"You are a meticulous data analyst.\n"
"Tasks:\n"
"1) Summarize the main trend in the chart.\n"
"2) Generate minimal, runnable Python (matplotlib) code that recreates a similar visualization "
" using inferred placeholder data. Include clear axis labels and a title.\n"
"3) Provide 3–7 bullet key points.\n"
"Return a JSON object that matches the provided JSON schema exactly."
)
# =========================
# Helpers
# =========================
def local_image_to_data_url(path: str, mime: Optional[str] = None) -> str:
"""
Convert a local image file to a data URL usable as an image input.
Example usage:
IMAGE_URL = local_image_to_data_url("chart.png")
"""
if not mime:
# naive mime inference by extension
ext = os.path.splitext(path)[1].lower()
mime = "image/png" if ext in [".png"] else "image/jpeg"
with open(path, "rb") as f:
b64 = base64.b64encode(f.read()).decode("utf-8")
return f"data:{mime};base64,{b64}"
def build_payload(image_url: str) -> Dict[str, Any]:
"""
Build a Responses API payload with multimodal input and JSON schema output.
"""
return {
"model": MODEL,
"input": [
{
"role": "user",
"content": [
{"type": "input_image", "image_url": {"url": image_url}},
{"type": "input_text", "text": PROMPT_TEXT}
]
}
],
"response_format": {
"type": "json_schema",
"json_schema": RESPONSE_SCHEMA
},
"max_output_tokens": 900,
"temperature": 0.2
}
def post_with_retries(
url: str,
headers: Dict[str, str],
json_payload: Dict[str, Any],
retries: int = 3,
backoff: float = 1.5,
timeout: int = 60
) -> Dict[str, Any]:
"""
POST with simple exponential backoff for rate limits / transient errors.
"""
for attempt in range(1, retries + 1):
try:
resp = requests.post(url, headers=headers, json=json_payload, timeout=timeout)
if resp.status_code == 200:
return resp.json()
# Retry on typical transient statuses
if resp.status_code in (429, 500, 502, 503, 504):
time.sleep(backoff ** attempt)
continue
raise RuntimeError(f"HTTP {resp.status_code}: {resp.text}")
except requests.exceptions.Timeout as e:
if attempt == retries:
raise
time.sleep(backoff ** attempt)
except requests.exceptions.RequestException as e:
if attempt == retries:
raise
time.sleep(backoff ** attempt)
raise RuntimeError("Request failed after retries")
def parse_responses_api_json(result: Dict[str, Any]) -> Dict[str, Any]:
"""
Extract the schema-validated JSON text and parse it to a dict.
Responses API returns: output[0].content[0].text for text output.
"""
try:
content_blocks = result["output"][0]["content"]
# Find first text block
for block in content_blocks:
if block.get("type") == "output_text" or block.get("type") == "text":
text = block.get("text", "")
if not text:
continue
# In schema mode, text should be strict JSON
return json.loads(text)
raise KeyError("No text block found in the response output")
except (KeyError, IndexError, json.JSONDecodeError) as e:
debug = json.dumps(result, indent=2)[:2000] # truncate for readability
raise ValueError(f"Failed to parse structured output: {e}\nPartial payload:\n{debug}")
def run_matplotlib_script(py_code: str) -> None:
"""
Safely run returned Matplotlib code in a clean subprocess (not in-process exec).
Saves 'recreated_chart.png' in the current working directory.
"""
safe_prefix = dedent("""
import matplotlib
matplotlib.use('Agg') # headless backend for servers/CI
""")
# Force a save at the end, even if the model code forgets to save
force_save = dedent("""
import os
import matplotlib.pyplot as plt
out = 'recreated_chart.png'
try:
plt.savefig(out, dpi=150, bbox_inches='tight')
except Exception:
# Some scripts call show() only; ensure we still save a figure if present
try:
plt.gcf().savefig(out, dpi=150, bbox_inches='tight')
except Exception:
pass
print(f"[Saved] {os.path.abspath(out)}")
""")
script = safe_prefix + "\n" + py_code + "\n\n" + force_save
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
f.write(script)
tmp_path = f.name
completed = subprocess.run(
[sys.executable, tmp_path],
capture_output=True,
text=True,
timeout=60
)
if completed.stdout:
print(completed.stdout)
if completed.returncode != 0:
print("Script error:\n", completed.stderr)
# =========================
# Main flow
# =========================
def main():
if not API_KEY or API_KEY == "YOUR_OPENAI_API_KEY":
raise EnvironmentError("Set OPENAI_API_KEY environment variable or hardcode API_KEY.")
# If you want to test with a local image:
# IMAGE_URL = local_image_to_data_url("path/to/chart.png")
payload = build_payload(IMAGE_URL)
result = post_with_retries(API_URL, HEADERS, payload)
data = parse_responses_api_json(result)
print("\n=== Summary ===\n", data["summary"])
print("\n=== Key points ===")
for i, kp in enumerate(data["key_points"], 1):
print(f"{i}. {kp}")
print("\n=== Python code (recreate chart) ===\n")
print(data["python_code"])
# Optional: render the returned chart
user_wants_render = True # set to False to skip rendering
if user_wants_render:
run_matplotlib_script(data["python_code"])
if __name__ == "__main__":
main()
Download the chart example image here: https://files.cuantum.tech/images/chart.png
Code breakdown:
- Configuration
- API_URL = "https://api.openai.com/v1/responses" uses the Responses API (the current, multimodal-first endpoint).
- MODEL = "gpt-5" picks the full model; you can swap to gpt-5-mini / gpt-5-nano for cheaper/faster runs.
- IMAGE_URL: set a public URL or switch to a local file via local_image_to_data_url().
- Strict JSON via schema
- RESPONSE_SCHEMA tells the model exactly what keys and types to return.
- This is more reliable than a plain json_object hint because the model is constrained to a schema and will retry internally to satisfy it.
- Building the multimodal prompt
- build_payload() composes input with two blocks: {"type": "input_image", "image_url": {...}} for the image and {"type": "input_text", "text": PROMPT_TEXT} for the instructions.
- The response_format requests schema-validated output; the model returns a single JSON string that parses cleanly.
- Network resilience
- post_with_retries() adds basic retry/backoff on rate limits or transient 5xx errors, plus a timeout so calls don't hang.
- Non-retryable errors raise with the server's message for quick diagnosis.
- Parsing the Responses API
- parse_responses_api_json() extracts result["output"][0]["content"][0]["text"] (the schema-validated JSON) and json.loads() it.
- If the shape changes (e.g., in future versions), the function fails loudly with a helpful snippet.
- Optional: safe Matplotlib execution
- run_matplotlib_script() runs the code in a separate Python process, not via exec() in your main process.
- It forces a headless backend and ensures a saved file, recreated_chart.png, even if the script forgets to save one.
- This pattern is good enough for demos and CI, but for production you might put further guards (resource limits, containers).
- Main flow
- Build payload → call API with retries → parse JSON → print summary, key_points, and python_code.
- Optionally, render the chart with the sandboxed subprocess.
Tool-Calling Example: “Ask GPT-5 to fetch data with your function, then analyze and plot”
"""
Tool-calling with GPT-5 (Chat Completions API)
- The model asks to call our tool `get_prices` with {symbol, days}
- We run the tool (here: mock data), send results back, then GPT-5 completes:
-> JSON with 'summary', 'key_points', and 'python_code' (Matplotlib)
"""
import os
import json
import time
import math
import requests
from datetime import datetime, timedelta
from typing import Dict, Any, List
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
API_URL = "https://api.openai.com/v1/chat/completions"
MODEL = "gpt-5"
HEADERS = {
"Authorization": f"Bearer {OPENAI_API_KEY}",
"Content-Type": "application/json",
}
# ---------- Tool: mock market data ----------
def get_prices(symbol: str, days: int = 30) -> Dict[str, Any]:
"""
Return mock OHLC data for the past N days.
Replace this with your real data source later (DB/API/cache).
"""
end = datetime.utcnow().date()
dates = [(end - timedelta(days=i)).isoformat() for i in range(days)][::-1]
# Simple deterministic waveform so every run is similar
base = 100.0
prices = []
for i, d in enumerate(dates):
v = base + 10 * math.sin(i / 4.0) + (i * 0.15)
o = round(v + math.sin(i) * 0.3, 2)
c = round(v + math.cos(i) * 0.3, 2)
h = round(max(o, c) + 0.6, 2)
l = round(min(o, c) - 0.6, 2)
prices.append({"date": d, "open": o, "high": h, "low": l, "close": c})
return {"symbol": symbol.upper(), "series": prices}
# ---------- Tool spec for the model ----------
TOOLS = [
{
"type": "function",
"function": {
"name": "get_prices",
"description": "Get recent OHLC data for a ticker symbol.",
"parameters": {
"type": "object",
"properties": {
"symbol": {"type": "string", "description": "Ticker, e.g., AAPL"},
"days": {"type": "integer", "minimum": 5, "maximum": 200, "default": 30}
},
"required": ["symbol"]
}
}
}
]
SYSTEM = (
"You are a quantitative analyst. If needed, call tools to fetch data, "
"then return a structured JSON with keys: summary (string), key_points (array of strings), "
"python_code (string that plots the series with matplotlib)."
)
USER = (
"Analyze the recent trend for the symbol AAPL (last 60 days). "
"If you need prices, use the tool. Then return JSON with summary, key_points, python_code."
)
def chat(payload: Dict[str, Any]) -> Dict[str, Any]:
r = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
if r.status_code != 200:
raise RuntimeError(f"HTTP {r.status_code}: {r.text}")
return r.json()
def main():
# 1) Ask GPT-5; allow tool calling
payload = {
"model": MODEL,
"messages": [
{"role": "system", "content": SYSTEM},
{"role": "user", "content": USER}
],
"tools": TOOLS,
"tool_choice": "auto",
# Ask for JSON if model can comply directly
"response_format": {"type": "json_object"},
"temperature": 0.2,
"max_tokens": 900
}
first = chat(payload)
msg = first["choices"][0]["message"]
# 2) If the model wants to call tools, run them and send results back
tool_messages = []
if "tool_calls" in msg:
for call in msg["tool_calls"]:
name = call["function"]["name"]
args = json.loads(call["function"]["arguments"] or "{}")
if name == "get_prices":
tool_result = get_prices(symbol=args.get("symbol", "AAPL"),
days=int(args.get("days", 60)))
else:
tool_result = {"error": f"Unknown tool {name}"}
tool_messages.append({
"role": "tool",
"tool_call_id": call["id"],
"name": name,
"content": json.dumps(tool_result)
})
# 3) Send a follow-up message containing the tool outputs
follow_payload = {
"model": MODEL,
"messages": [
{"role": "system", "content": SYSTEM},
{"role": "user", "content": USER},
msg, # the assistant message that requested tools
*tool_messages
],
"response_format": {"type": "json_object"},
"temperature": 0.2,
"max_tokens": 1200
}
final = chat(follow_payload)
out = final
else:
out = first # Model answered without tools
# 4) Parse the final JSON
content = out["choices"][0]["message"]["content"]
try:
data = json.loads(content)
except json.JSONDecodeError:
print("Model did not return valid JSON. Raw content:\n", content)
return
print("\n=== Summary ===\n", data.get("summary"))
print("\n=== Key points ===")
for i, kp in enumerate(data.get("key_points", []), 1):
print(f"{i}. {kp}")
print("\n=== Python code (plot) ===\n")
print(data.get("python_code"))
if __name__ == "__main__":
if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
raise SystemExit("Set OPENAI_API_KEY env var first.")
main()
Code breakdown:
Let GPT-5 decide when to call your function (get_prices), you execute it (mock or real API), feed results back, and let GPT-5 finish with analysis + Matplotlib code in JSON.
1) Imports & configuration
- requests handles HTTP calls to OpenAI.
- json, time, math, and datetime are used for parsing, retries (if added), and mock data generation.
- OPENAI_API_KEY is read from env; never hardcode secrets in real projects.
- API_URL targets the Chat Completions endpoint (best known for tool calling).
- MODEL = "gpt-5"; you can swap to gpt-5-mini for cheaper experiments.
Tip: In production, wrap network calls with retry/backoff (429/5xx). A simple helper function can centralize that (you can reuse the one from your Advanced example).
2) The tool you expose to the model
def get_prices(symbol: str, days: int = 30) -> Dict[str, Any]:
...

- This is a mock OHLC generator. Replace with your real data source:
- A REST call (e.g., Yahoo, Polygon, your own DB/API).
- Caching layer (Redis) to keep latency/costs down.
- Output shape:
{
"symbol": "AAPL",
"series": [
{"date": "2025-07-01", "open": 101.2, "high": 102.0, "low": 100.6, "close": 101.8},
...
]
}

Keep it consistent; the LLM will rely on the keys you return.
3) Advertising the tool (the TOOLS spec)
TOOLS = [
{
"type": "function",
"function": {
"name": "get_prices",
"description": "Get recent OHLC data...",
"parameters": { ... JSON Schema ... }
}
}
]

- You define a JSON Schema (name, required fields, types).
- The model uses this to decide if and how to call your function.
- Keep the schema minimal but precise (e.g., clamp days to a reasonable range).
4) System and User messages
- SYSTEM enforces role & output contract:
- “You are a quantitative analyst … return JSON with keys: summary, key_points, python_code.”
- USER asks for “Analyze AAPL last 60 days,” nudging the model to use a tool if it needs data.
Tip: Always restate your desired output format in SYSTEM (and/or USER). This increases compliance, especially if you don’t use schema mode.
5) First request: allow tool calling
payload = {
"model": MODEL,
"messages": [system, user],
"tools": TOOLS,
"tool_choice": "auto",
"response_format": {"type": "json_object"},
...
}

- tool_choice: "auto" lets the model decide if it needs the tool.
- response_format: "json_object" asks for JSON, but is not as strict as schema mode. (That's okay here; the focus is tool calling.)
- A low temperature (0.2) boosts determinism.
6) Detect and execute tool calls
msg = first["choices"][0]["message"]
if "tool_calls" in msg:
for call in msg["tool_calls"]:
# 1) parse arguments
# 2) run your function
# 3) build a "tool" message with the results

- tool_calls is the assistant's intent to call your function with arguments.
- You must parse call["function"]["arguments"] (stringified JSON), run your function, and post the results as a tool-role message back to OpenAI.
Security notes:
- Never directly execute arbitrary code sent via tool args.
- Validate inputs (symbols, ranges) and add allowlists/rate limits for external APIs; a minimal validation sketch follows below.
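The snippet below is a hedged sketch of that validation step; the allowlist and bounds are illustrative values invented for this example, not part of the original code.

# Hedged sketch: validate tool arguments before executing get_prices.
# ALLOWED_SYMBOLS and the bounds are illustrative, not from the example above.
ALLOWED_SYMBOLS = {"AAPL", "MSFT", "GOOG"}

def validate_get_prices_args(args: dict) -> dict:
    symbol = str(args.get("symbol", "")).upper()
    if symbol not in ALLOWED_SYMBOLS:
        raise ValueError(f"Symbol not allowed: {symbol!r}")
    days = int(args.get("days", 30))
    days = max(5, min(days, 200))  # clamp to the range advertised in the tool schema
    return {"symbol": symbol, "days": days}

# Usage inside the tool-call loop:
# clean = validate_get_prices_args(args)
# tool_result = get_prices(**clean)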
7) Second request: provide tool outputs and ask GPT-5 to finish
follow_payload = {
"messages": [
system, user,
msg, # the assistant message that requested tools
*tool_messages # your tool outputs bound to the call IDs
],
"response_format": {"type":"json_object"}, ...
}

- You include:
- The original assistant message that requested tools (so the model keeps context).
- Your tool result messages with the proper tool_call_id.
- GPT-5 now has real data and completes the task (analysis + code).
8) Parse the final JSON
content = out["choices"][0]["message"]["content"]
data = json.loads(content)

- Print summary, key_points, and python_code.
- If parsing fails, dump the raw content; that is often a sign the model deviated (rare at low temperature, but possible).
9) Customization knobs
- Switch to schema mode: If you want stronger guarantees on the final JSON, use:
response_format: { "type": "json_schema", "json_schema": {...} }
- Multiple tools: Add more function specs to TOOLS; GPT-5 will pick the right one.
- Parallel calls: The API can return multiple tool_calls. Run them all, then send all the tool messages back in one follow-up.
- Logging: Log both the tool args and outputs to audit the agent's steps.
10) Common pitfalls
- Forgetting the tool_call_id when sending the tool result message.
- Mismatched schemas: If your returned JSON structure diverges from your documented shape, the model may misinterpret it later.
- Rate limits: Add retry/backoff for 429/5xx (especially if your tool triggers 3rd-party APIs).
11) Testing tips
- Start with mock data (like the example) for deterministic outputs.
- Add a unit test that asserts the model returns valid JSON with the required keys; a minimal example follows below.
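A minimal sketch of such a test is shown below (pytest style). The FAKE_CONTENT value stands in for the final message content from the example; it is invented test data, not a real model response.

# Hedged sketch of a unit test. FAKE_CONTENT is invented stand-in data for the
# final message content returned by the model.
import json

REQUIRED_KEYS = {"summary", "key_points", "python_code"}

FAKE_CONTENT = json.dumps({
    "summary": "AAPL trended gently upward over the last 60 days.",
    "key_points": ["Higher lows", "Mild upward drift", "Low volatility"],
    "python_code": "import matplotlib.pyplot as plt",
})

def test_final_json_has_required_keys():
    data = json.loads(FAKE_CONTENT)      # raises if the content is not valid JSON
    assert REQUIRED_KEYS.issubset(data)   # all required keys are present
    assert isinstance(data["key_points"], list) and len(data["key_points"]) >= 1

In a real test suite you would capture the content from a recorded API response or a mocked chat() call rather than a hardcoded string.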
5.1.4 DeepSeek-VL
DeepSeek-VL is a Chinese open-source multimodal model developed by the DeepSeek team, designed to bridge the gap between vision and language processing. It represents China's significant contribution to the multimodal AI landscape, offering capabilities comparable to proprietary models but with open access for researchers and developers. The model emerged as part of China's growing AI research ecosystem, demonstrating the country's commitment to advancing state-of-the-art AI technologies while ensuring they remain accessible to the broader scientific community.
The model is specifically optimized for efficiency and vision-language reasoning, with architectural choices that prioritize computational performance while maintaining high-quality results. Its streamlined design makes it particularly suitable for deployment in resource-constrained environments, enabling advanced multimodal capabilities on more modest hardware configurations. DeepSeek-VL achieves this efficiency through careful attention to model size, training procedures, and inference optimizations. For example, it employs specialized vision encoders that extract rich visual features while minimizing computational overhead, and leverages knowledge distillation techniques to compress larger models' capabilities into more compact architectures.
In performance evaluations, DeepSeek-VL is often benchmarked against industry leaders like GPT-4V and Flamingo, where it demonstrates competitive results at a fraction of the computational cost. This makes it an attractive option for cost-effective deployments in production environments, particularly for organizations seeking multimodal capabilities without the expense associated with commercial API usage. Benchmark studies have shown that DeepSeek-VL achieves 85-90% of the performance of these larger models on standard vision-language tasks while requiring significantly less computational resources. This performance-to-cost ratio has made it particularly popular among startups, academic institutions, and developers in emerging markets.
The model excels in tasks requiring detailed visual understanding combined with natural language reasoning, such as image captioning, visual question answering, and complex scene interpretation. DeepSeek-VL's architecture incorporates specialized attention mechanisms that allow it to focus on relevant visual elements when answering questions or generating descriptions.
This capability enables applications ranging from assisting visually impaired users to automating content moderation and enhancing e-commerce product discovery through visual search. The model also demonstrates strong performance in cross-cultural visual contexts, making it particularly valuable for applications serving diverse global audiences.
Example: Using DeepSeek-VL for Image Understanding
# Install dependencies first
# pip install transformers torch pillow
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image
import requests
import matplotlib.pyplot as plt
# Download and load an example image
image_url = "https://files.cuantum.tech/images/deep-seek-descriptive.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
# Load DeepSeek-VL model and processor
model_name = "deepseek-ai/deepseek-vl-7b-chat"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
# Create a prompt for the model
prompt = "Describe what you see in this image in detail."
# Process the inputs
inputs = processor(text=prompt, images=image, return_tensors="pt")
# Generate a response
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=512,
do_sample=False
)
# Decode the response
generated_text = processor.decode(outputs[0], skip_special_tokens=True)
# Display the image and response
plt.figure(figsize=(10, 8))
plt.imshow(image)
plt.axis('off')
plt.title('Input Image')
plt.show()
print("DeepSeek-VL's response:")
print(generated_text.split("ASSISTANT:")[-1].strip())
Code Breakdown: Using DeepSeek-VL for Image Understanding
The example above demonstrates how to use DeepSeek-VL for a basic image understanding task. Here's a detailed breakdown of each section:
1. Dependencies and Setup
- Key libraries: The code uses
transformersfor model access,torchfor tensor operations, andPILfor image handling. - Image acquisition: Fetches a sample image from a URL using
requestsand opens it with PIL.
2. Model Initialization
- Model selection: Uses the 7B parameter chat-tuned version of DeepSeek-VL (
deepseek-ai/deepseek-vl-7b-chat). - Processor loading: The
AutoProcessorhandles both tokenization of text and preprocessing of images. - Model loading:
trust_remote_code=Trueis required as DeepSeek-VL uses custom code for its implementation.
3. Input Processing
- Prompt creation: A simple prompt asking for image description, but you can use more specific prompts like "What objects are in this image?" or "Explain what's happening in this scene."
- Multimodal processing: The processor combines both text input (prompt) and image input into a format the model can understand.
- Return format:
return_tensors="pt"specifies PyTorch tensors as the output format.
4. Response Generation
- Inference with
torch.no_grad(): Disables gradient calculation for efficiency during inference. - Generation parameters:
max_new_tokens=512: Limits response length to 512 tokens.do_sample=False: Uses greedy decoding instead of sampling for deterministic outputs.
5. Response Processing and Visualization
- Decoding: Converts token IDs back to human-readable text.
- Response extraction: Splits the output to get only the assistant's response portion.
- Visualization: Displays the input image alongside the generated description.
Advanced Usage Patterns
Beyond this basic example, DeepSeek-VL supports several advanced capabilities:
- Visual reasoning: You can ask complex questions about relationships between objects in the image.
- Multi-image analysis: Process multiple images by passing a list to the processor.
- Fine-tuning: Adapt the model to specific domains using techniques like LoRA or QLoRA.
- Memory efficiency: For resource-constrained environments, consider using quantization:
# For 8-bit quantization
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_enable_fp32_cpu_offload=True
)
model = AutoModel.from_pretrained(
model_name,
trust_remote_code=True,
quantization_config=quantization_config,
device_map="auto"
)

Implementation Considerations:
- Hardware requirements: DeepSeek-VL 7B requires at least 16GB GPU memory for full precision, but can run on consumer GPUs with quantization.
- Inference speed: First-time inference includes model loading time; subsequent calls are faster.
- Response format: The model follows a chat format with "ASSISTANT:" prefix. For cleaner outputs, always strip this prefix.
- Error handling: In production, add try/except blocks and request timeouts to handle image-loading failures and slow downloads of large images (see the sketch below).
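The snippet below is a minimal sketch of that defensive loading; the timeout value and the decision to return None on failure are illustrative choices, not part of the example above.

# Hedged sketch: robust image loading with a timeout and basic error handling.
# The timeout value and None-on-failure behavior are illustrative choices.
from io import BytesIO
from typing import Optional

import requests
from PIL import Image

def load_image_safely(image_url: str, timeout: int = 15) -> Optional[Image.Image]:
    try:
        response = requests.get(image_url, timeout=timeout)
        response.raise_for_status()                    # surface HTTP errors (404, 500, ...)
        return Image.open(BytesIO(response.content)).convert("RGB")
    except (requests.RequestException, OSError) as e:  # network errors or unreadable image data
        print(f"Could not load image from {image_url}: {e}")
        return None

# Usage: skip inference when the image fails to load
# image = load_image_safely(image_url)
# if image is not None:
#     inputs = processor(text=prompt, images=image, return_tensors="pt")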
DeepSeek-VL represents a significant advancement in making multimodal AI accessible to developers, particularly those seeking open-source alternatives to proprietary models like GPT-4V or Gemini.
Example: Advanced Visual Question Answering with DeepSeek-VL
# Install required libraries
# pip install transformers torch pillow matplotlib requests
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import matplotlib.pyplot as plt
from io import BytesIO
# Function to load and display an image from a URL
def load_and_display_image(image_url, title="Input Image"):
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))
plt.figure(figsize=(10, 8))
plt.imshow(image)
plt.axis('off')
plt.title(title)
plt.show()
return image
# Load DeepSeek-VL model and processor
model_id = "deepseek-ai/deepseek-vl-7b-chat"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16, # Use half precision for efficiency
device_map="auto", # Automatically distribute across available GPUs
trust_remote_code=True
)
# Sample image URLs for visual reasoning tasks
image_urls = [
"https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg", # People at a table
"https://files.cuantum.tech/images/deep-seek-chart.jpg" # Charts/graphs
]
# Load and display the first image
image = load_and_display_image(image_urls[0])
# Function to generate responses for a given image and prompt
def generate_vl_response(image, prompt, max_new_tokens=256):
# Create chat message format
messages = [
{"role": "user", "content": prompt}
]
# Process inputs
inputs = processor(
messages=messages,
images=image,
return_tensors="pt"
).to(model.device)
# Generate response with customized parameters
generated_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True, # Enable sampling for more diverse outputs
temperature=0.7, # Control randomness (higher = more random)
top_p=0.9, # Nucleus sampling parameter
repetition_penalty=1.1 # Discourage repetition
)
# Decode response
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# Extract assistant's response
response = generated_text.split("ASSISTANT:")[-1].strip()
return response
# Example prompts for different visual reasoning tasks
prompts = [
"Describe this image in detail. What are the people doing?",
"Count how many people are in this image and describe what each person is wearing.",
"What emotions can you detect on people's faces in this image?",
"If you had to create a story based on this image, what would it be?"
]
# Generate and display responses
for i, prompt in enumerate(prompts):
print(f"\nPrompt {i+1}: {prompt}")
print("-" * 50)
response = generate_vl_response(image, prompt)
print(response)
print("=" * 80)
# Load the second image (charts/graphs) for technical analysis
technical_image = load_and_display_image(image_urls[1], "Technical Chart")
# Technical analysis prompt
technical_prompt = "Analyze this chart. What patterns do you observe? What conclusions can you draw from this data visualization?"
# Generate and display technical analysis
print(f"\nTechnical Analysis Prompt: {technical_prompt}")
print("-" * 50)
response = generate_vl_response(technical_image, technical_prompt, max_new_tokens=512)
print(response)
Comprehensive Code Breakdown: Advanced DeepSeek-VL Implementation
This code example demonstrates how to leverage DeepSeek-VL for sophisticated visual reasoning tasks. Let's break down each component:
1. Setup and Model Initialization
- Library imports: Beyond basic dependencies, we specifically import AutoModelForCausalLM, which provides a more flexible interface for generative tasks than the basic AutoModel used in the previous example.
- Helper function: load_and_display_image() encapsulates image loading logic, making the code more modular and reusable.
- Model optimization:
- torch_dtype=torch.float16 enables half-precision computation, reducing memory usage by approximately 50% with minimal impact on output quality.
- device_map="auto" intelligently distributes model layers across available GPUs or uses CPU offloading when needed.
2. Multi-image Processing
- Image collection: Stores multiple image URLs for different analysis scenarios, demonstrating DeepSeek-VL's versatility.
- Sequential processing: The code is structured to analyze multiple images with different prompts, showcasing how the model handles diverse visual contexts.
3. Response Generation Function
- Chat-style formatting: Unlike the previous example, this implementation uses DeepSeek-VL's chat interface through the messages parameter, which better aligns with conversational applications.
- Generation parameters:
- do_sample=True and temperature=0.7: Enable controlled randomness in outputs, producing more natural and diverse responses.
- top_p=0.9: Implements nucleus sampling, which dynamically filters the token probability distribution.
- repetition_penalty=1.1: Reduces the likelihood of generating repetitive phrases, improving response quality.
4. Task Diversification
- Multiple prompt types: The example includes different types of visual reasoning tasks:
- Descriptive: "Describe this image in detail..."
- Quantitative: "Count how many people..."
- Emotional analysis: "What emotions can you detect..."
- Creative: "If you had to create a story..."
- Technical analysis: "Analyze this chart..."
5. Performance Considerations
- Memory management: The example uses half precision (float16) and automatic device mapping to optimize memory usage.
- Response length control: max_new_tokens is adjusted based on the complexity of the task, with technical analysis allowed a longer response (512 tokens vs. 256).
6. Real-world Application Scenarios
- This implementation demonstrates DeepSeek-VL's capabilities in several practical use cases:
- Social media content analysis: Understanding context and relationships in photos.
- Data visualization interpretation: Extracting insights from charts and graphs.
- Content moderation: Detecting emotional content and potentially sensitive material in images.
- Creative assistance: Helping generate stories or content based on visual inspiration.
7. Extension Possibilities
- This code could be extended in several ways:
- Batch processing: Modify to handle multiple images simultaneously for higher throughput.
- Interactive applications: Integrate into a web interface where users can upload images and select analysis types.
- Multi-turn conversations: Expand the messages array to include previous exchanges for contextual understanding.
- Integration with other models: Combine DeepSeek-VL's outputs with specialized models for tasks like object detection or sentiment analysis.
This advanced implementation highlights DeepSeek-VL's flexibility and power for complex visual-language reasoning tasks, making it suitable for both research and production applications where understanding images in context is critical.
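As a sketch of the multi-turn idea mentioned in the extension list above, the snippet below simply appends the previous exchange to the messages list before asking a follow-up question. It assumes the chat-style processor interface, model, image, and generate_vl_response function from the example above.

# Hedged sketch: a multi-turn follow-up, reusing the chat-style interface shown above.
# Assumes `image`, `processor`, `model`, and `generate_vl_response` are already defined.
first_prompt = "Describe this image in detail. What are the people doing?"
first_answer = generate_vl_response(image, first_prompt)

# Include the previous exchange so the follow-up question has context.
messages = [
    {"role": "user", "content": first_prompt},
    {"role": "assistant", "content": first_answer},
    {"role": "user", "content": "Based on your description, what might happen next in this scene?"},
]
inputs = processor(messages=messages, images=image, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
follow_up = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(follow_up.split("ASSISTANT:")[-1].strip())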
5.1.5 Why Text+Image Matters
Accessibility: Helping visually impaired users understand images by providing detailed descriptions of visual content. These models can identify objects, people, scenes, and even interpret spatial relationships, allowing visually impaired individuals to "see" through AI-generated descriptions. They can also assist with navigation by describing surroundings or identifying potential hazards.
For visually impaired individuals, multimodal AI serves as an essential bridge to visual content. These systems go beyond simple object recognition to provide context-rich descriptions that convey the full meaning of images. When a visually impaired person encounters an image online, in a document, or through a specialized device, multimodal models can:
- Generate comprehensive scene descriptions that include not just what objects are present, but their arrangement, colors, lighting, and overall composition
- Identify and describe people in photos, including facial expressions, clothing, actions, and apparent relationships between individuals
- Read and interpret text within images, such as signs, menus, product labels, and instructions
- Recognize landmarks and provide spatial awareness in unfamiliar environments
In real-world applications, these capabilities are being integrated into smartphone apps that can narrate the visual world in real-time, smart glasses that provide audio descriptions of surroundings, and screen readers that can interpret complex visual elements on websites. The technology is particularly valuable for educational materials, allowing visually impaired students to access diagrams, charts, and illustrations that would otherwise be inaccessible without human assistance.
The advancement of these multimodal systems represents a significant step forward in digital inclusivity, empowering visually impaired users with greater independence and access to information that was previously unavailable to them.
Education: Explaining diagrams, charts, or historical photos to enhance learning experiences. Multimodal models can break down complex visualizations into understandable components, clarify scientific diagrams, provide historical context for photographs, and even translate visual mathematical notation into explanations. This makes educational content more accessible and comprehensible across various subjects and learning styles.
In educational contexts, multimodal AI serves as a powerful teaching assistant that bridges visual and textual information:
- For STEM education, these models can analyze complex scientific diagrams and:
- Convert abstract visual concepts into clear, step-by-step explanations
- Identify and label components of biological systems, chemical structures, or engineering schematics
- Translate mathematical expressions and equations into plain language interpretations
- In history and social studies, multimodal models enhance learning by:
- Providing detailed context for historical photographs, including time period, cultural significance, and historical relevance
- Analyzing primary source documents with both textual and visual elements
- Making connections between visual artifacts and broader historical narratives
- For data literacy, these systems help students by:
- Breaking down complex charts and graphs into comprehensible insights
- Explaining statistical visualizations and data trends in accessible language
- Teaching students how to interpret different types of data representations
These capabilities are particularly valuable for students with different learning styles, allowing visual learners to receive verbal explanations and verbal learners to better understand visual content. They also support personalized learning by adapting explanations to different educational levels, from elementary to advanced university courses.
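To make the data-literacy use case concrete, here is a small sketch that asks a hosted multimodal model to explain a chart at a middle-school level. It assumes the OpenAI Python SDK and the gpt-4o model; the file name and the exact prompt are placeholders, and a comparable open or hosted vision-language model could be substituted.

```python
# A minimal sketch: ask a hosted multimodal model (GPT-4o via the OpenAI SDK)
# to explain a chart image for a student. The image file and prompt are
# illustrative; swap in any chart and adjust the target reading level.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("rainfall_chart.png", "rb") as f:  # hypothetical chart image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": ("Explain this chart to a middle-school student: what is "
                      "being measured, what the axes mean, and the main trend, "
                      "in three short sentences.")},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Changing a single line of the prompt (the target audience) is usually enough to re-pitch the same chart for an elementary, high-school, or university reader, which is what makes this pattern useful for personalized learning.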
Creative work: Generating captions, stories, or descriptions that can inspire artists, writers, and content creators. These models can suggest creative interpretations of images, develop narratives based on visual scenes, assist with storyboarding by describing sequential images, and help marketers craft compelling visual content with appropriate messaging.
For creative professionals, multimodal AI serves as both muse and collaborator. Writers facing creative blocks can use these systems to generate story prompts from visual inspiration. When shown an image of a misty forest at dawn, for instance, the AI might suggest narrative elements like "a forgotten path leading to an ancient secret" or "the meeting place of two worlds." This capability transforms random visual stimuli into structured creative starting points.
Visual artists and designers benefit from AI-generated descriptions that highlight elements they might otherwise overlook. A photographer reviewing their portfolio might gain new perspective when the AI points out "the interplay of shadow and reflection creates a natural frame around the subject" or "the unexpected color contrast draws attention to the emotional center of the image."
In film and animation, these models streamline the pre-production process. Storyboard artists can quickly generate descriptive text for sequential panels, helping directors and producers visualize narrative flow before committing resources to production. The AI can suggest camera angles, lighting moods, and scene transitions based on visual references, accelerating the creative development cycle.
For content marketers, multimodal models bridge the gap between visual assets and compelling messaging. When analyzing product photography, these systems can generate targeted copy that aligns with both the visual elements and brand voice, ensuring consistent communication across channels. This capability is particularly valuable for social media campaigns where striking visuals must be paired with concise, engaging text in multiple formats and platforms.
Productivity: Extracting structured insights from documents, tables, or screenshots, which saves time and improves efficiency in professional settings. Instead of manually parsing visual data, users can leverage AI to convert tables into spreadsheets, extract key information from receipts or business cards, analyze graphs and charts in reports, and transform handwritten notes into searchable text.
This productivity advantage manifests across numerous professional workflows:
- In financial services, multimodal AI can automatically process invoices and receipts by:
  - Identifying vendor information, dates, and payment amounts
  - Categorizing expenses according to predefined accounting codes
  - Flagging potential discrepancies or unusual charges
- For research and analysis, these systems can:
  - Extract precise numerical data from complex charts and graphs
  - Convert statistical visualizations into structured datasets
  - Summarize key trends and outliers identified in visual data
- In administrative workflows, multimodal AI streamlines:
  - Business card digitization for immediate contact database integration
  - Form processing without manual data entry
  - Meeting note transcription with automatic action item extraction
The time savings are substantial—tasks that would require hours of manual data entry can be completed in seconds, while also reducing human error. For organizations handling large volumes of visual documents, this capability transforms information management by making previously inaccessible data searchable, analyzable, and actionable.
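As a sketch of the receipt workflow described above, the example below sends a receipt photo to a multimodal model, asks for a JSON object with fixed keys, and parses the reply. It assumes the OpenAI Python SDK and the gpt-4o model; the key names and file name are illustrative, and a real pipeline would validate the extracted fields before loading them into a spreadsheet or accounting system.

```python
# A minimal sketch of receipt extraction with a multimodal model.
# Assumptions: OpenAI Python SDK, gpt-4o, and a local file "receipt.jpg";
# the JSON key names below are arbitrary illustrative choices.
import base64
import json
from openai import OpenAI

client = OpenAI()

with open("receipt.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},  # request machine-readable output
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": ("Extract data from this receipt and respond with a single "
                      "JSON object using exactly these keys: vendor, date, "
                      "currency, total, and line_items (a list of objects with "
                      "description and price).")},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

record = json.loads(response.choices[0].message.content)
print(record.get("vendor"), record.get("date"), record.get("total"))
```

From here, appending the parsed record to a CSV file or posting it to an accounting API is ordinary plumbing; the model's job ends once the visual document has become structured data.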
Multimodal models bring us closer to AI that interacts with the world as humans do: through multiple senses, not just words. By bridging visual perception and language understanding, they make human-AI interaction more intuitive, mirroring the way we integrate information from multiple channels at once.
