Under the Hood of Large Language Models

Chapter 5: Beyond Text: Multimodal LLMs

Practical Exercises – Chapter 5

These exercises help you practice text+image alignment (CLIP/LLaVA-style), captioning (BLIP), speech transcription (Whisper), speech embeddings (wav2vec2), video features (VideoMAE), and simple cross-modal pipelines.

Notes:
  • You’ll need Python 3.9+, transformers, torch, and task-specific libs (e.g., sentencepiece, soundfile, openai-whisper, av, Pillow); see the consolidated install sketch after these notes.
  • Model downloads happen automatically the first time you run each snippet.
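
For convenience, a consolidated one-time setup is sketched below; the package list is taken from the notes above and no specific versions are pinned:

# pip install torch transformers sentencepiece soundfile av Pillow
# pip install git+https://github.com/openai/whisper.git
# Note: openai-whisper also needs the ffmpeg binary available on your PATH.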

Exercise 1 — Text+Image Matching with CLIP

Task: Given an image (dog.jpg) and two candidate captions, compute which caption best matches the image.

Solution:

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # replace with your image path
texts = ["a photo of a dog", "a photo of a cat"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=1).squeeze(0)

for t, p in zip(texts, probs.tolist()):
    print(f"{t} -> {p:.4f}")

print("Prediction:", texts[int(probs.argmax())])

Exercise 2 — Zero-Shot Image Classification with CLIP

Task: Use CLIP to pick a label for bird.jpg among ["sparrow", "eagle", "penguin"].

Solution:

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

labels = ["a photo of a sparrow", "a photo of an eagle", "a photo of a penguin"]
image = Image.open("bird.jpg")

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=1).squeeze(0)

for lab, p in zip(labels, probs.tolist()):
    print(f"{lab} -> {p:.4f}")
print("Predicted label:", labels[int(probs.argmax())])

Exercise 3 — Image Captioning with BLIP

Task: Generate a caption for scene.jpg using BLIP.

Solution:

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(out[0], skip_special_tokens=True)
print("Caption:", caption)

Exercise 4 — Whisper Transcription (Speech → Text)

Task: Transcribe speech_sample.mp3.

Solution:

# pip install git+https://github.com/openai/whisper.git
import whisper

model = whisper.load_model("base")  # "tiny", "small", "medium", "large" also available
result = model.transcribe("speech_sample.mp3")
print("Transcription:", result["text"])

Exercise 5 — Speech Embeddings with wav2vec2

Task: Extract frame-level embeddings from speech_sample.wav.

Solution:

from transformers import Wav2Vec2Processor, Wav2Vec2Model
import soundfile as sf
import torch

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# Note: facebook/wav2vec2-base-960h expects mono 16 kHz audio; resample/downmix if yours differs
wave, sr = sf.read("speech_sample.wav")
if wave.ndim > 1:  # downmix stereo to mono
    wave = wave.mean(axis=1)
inputs = processor(wave, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # [batch, time, hidden_dim]
print("Embeddings:", hidden.shape)

Exercise 6 — Video Frame Sampling & Features (VideoMAE)

Task: Sample 16 frames (the clip length videomae-base expects) from clip.mp4 and extract embeddings with VideoMAE.

Solution:

# pip install av
from transformers import VideoMAEFeatureExtractor, VideoMAEModel
import av, torch

feature_extractor = VideoMAEFeatureExtractor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

container = av.open("clip.mp4")
frames = []
for i, frame in enumerate(container.decode(video=0)):
    if i % 30 == 0:  # ~1 frame per second at 30 fps; use a smaller stride for short clips
        frames.append(frame.to_ndarray(format="rgb24"))
    if len(frames) == 16:  # videomae-base expects exactly 16 frames per clip
        break

inputs = feature_extractor(frames, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
video_emb = out.last_hidden_state  # [batch=1, tokens, hidden_dim]
print("Video embeddings:", video_emb.shape)

Exercise 7 — Cross-Modal Mini-Pipeline (ASR → Caption → Summary)

Task:

  1. Transcribe audio (lecture.mp3) with Whisper.
  2. Caption a slide image (slide.png) with BLIP.
  3. Concatenate both texts and produce a short summary with a small text model (e.g., DistilGPT-2).

Solution:

# Cross-modal mini-pipeline: Whisper (ASR) + BLIP (caption) -> small LM summary
import whisper
from transformers import BlipProcessor, BlipForConditionalGeneration, AutoTokenizer, AutoModelForCausalLM
from PIL import Image
import torch

# 1) Whisper transcription
whisper_model = whisper.load_model("small")
asr = whisper_model.transcribe("lecture.mp3")["text"]

# 2) BLIP caption
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
with torch.no_grad():
    cap = blip.generate(**blip_proc(images=Image.open("slide.png").convert("RGB"), return_tensors="pt"), max_new_tokens=40)
caption = blip_proc.decode(cap[0], skip_special_tokens=True)

# 3) Summarize with a tiny LM (toy)
tok = AutoTokenizer.from_pretrained("distilgpt2")
lm = AutoModelForCausalLM.from_pretrained("distilgpt2")
prompt = f"Lecture transcript: {asr}\nSlide: {caption}\n\nTL;DR summary:"
ids = tok.encode(prompt, return_tensors="pt", truncation=True, max_length=900)  # leave room for generation within GPT-2's 1024-token context
with torch.no_grad():
    out = lm.generate(ids, max_new_tokens=80, do_sample=True, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))

(For higher-quality summarization, swap in a stronger seq2seq model like facebook/bart-large-cnn.)
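
For instance, a drop-in upgrade using the transformers summarization pipeline might look like the sketch below (facebook/bart-large-cnn as suggested above; the length limits are illustrative):

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = f"Lecture transcript: {asr}\nSlide: {caption}"
summary = summarizer(text, max_length=80, min_length=20, truncation=True)[0]["summary_text"]
print("Summary:", summary)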

Exercise 8 — Visual Question Answering (VQA) with LLaVA-style Interface (Demo)

Task: Use a community VQA checkpoint to answer a question about chart.png (e.g., “What trend is increasing after 2020?”).

Note: Full LLaVA weights are large; the snippet shows the interface pattern using a lighter VQA model (dandelin/vilt-b32-finetuned-vqa).

Solution:

from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image
import torch

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("chart.png").convert("RGB")
question = "What trend is increasing after 2020?"
inputs = processor(image, question, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
answer_idx = logits.argmax(-1).item()
print("Answer:", model.config.id2label[answer_idx])

Exercise 9 — Robustness Check: Add Noise to Audio Before ASR

Task: Add white noise to speech_clean.wav and compare Whisper transcriptions (clean vs noisy).

Solution:

import numpy as np, soundfile as sf, whisper

# Load clean audio
wave, sr = sf.read("speech_clean.wav")
if wave.ndim > 1:  # downmix stereo to mono so the noise shape matches
    wave = wave.mean(axis=1)
rng = np.random.default_rng(0)
noise = rng.normal(scale=0.02, size=len(wave))  # mild white noise
noisy = (wave + noise).astype(np.float32)
sf.write("speech_noisy.wav", noisy, sr)

model = whisper.load_model("base")
clean_txt = model.transcribe("speech_clean.wav")["text"]
noisy_txt = model.transcribe("speech_noisy.wav")["text"]
print("CLEAN:", clean_txt)
print("NOISY:", noisy_txt)

Exercise 10 — Batch Captioning & CLIP Rerank

Task: Caption a set of images, then rerank candidate captions with CLIP to pick the most image-relevant one.

Solution:

from transformers import BlipProcessor, BlipForConditionalGeneration, CLIPProcessor, CLIPModel
from PIL import Image
import torch, glob

# Load models
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p).convert("RGB") for p in glob.glob("imgs/*.jpg")]

for img in images:
    # Generate a few variants by sampling
    caps = []
    for _ in range(3):
        with torch.no_grad():
            out = blip.generate(**blip_proc(images=img, return_tensors="pt"),
                                max_new_tokens=30, do_sample=True, top_p=0.9)
        caps.append(blip_proc.decode(out[0], skip_special_tokens=True))

    # Rerank with CLIP
    inputs = clip_proc(text=caps, images=img, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip_model(**inputs).logits_per_image
    best = caps[int(logits.softmax(dim=1).argmax())]
    print("Chosen caption:", best)

What You Practiced

  • Text+image alignment (CLIP), captioning (BLIP), and VQA patterns.
  • Speech transcription (Whisper) and speech embeddings (wav2vec2).
  • Video feature extraction (VideoMAE) and cross-modal pipelines that combine ASR and vision.
  • Simple robustness checks (noise) and reranking strategies to boost quality.
