Chapter 5: Beyond Text: Multimodal LLMs
Practical Exercises – Chapter 5
These exercises help you practice text+image alignment (CLIP/LLaVA-style), captioning (BLIP), speech transcription (Whisper), speech embeddings (wav2vec2), video features (VideoMAE), and simple cross-modal pipelines.
Notes:
- You’ll need Python 3.9+, transformers, torch, and task-specific libs (e.g., sentencepiece, soundfile, openai-whisper, av, Pillow).
- Model downloads happen automatically the first time you run each snippet.
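If you are starting from a clean environment, a single install pulls in everything used below (package names as published on PyPI; pin versions to match your setup):
# pip install torch transformers pillow sentencepiece soundfile openai-whisper av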
Exercise 1 — Text+Image Matching with CLIP
Task: Given an image (dog.jpg) and two candidate captions, compute which caption best matches the image.
Solution:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("dog.jpg") # replace with your image path
texts = ["a photo of a dog", "a photo of a cat"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=1).squeeze(0)
for t, p in zip(texts, probs.tolist()):
    print(f"{t} -> {p:.4f}")
print("Prediction:", texts[int(probs.argmax())])
Exercise 2 — Zero-Shot Image Classification with CLIP
Task: Use CLIP to pick a label for bird.jpg among ["sparrow", "eagle", "penguin"].
Solution:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
labels = ["a photo of a sparrow", "a photo of an eagle", "a photo of a penguin"]
image = Image.open("bird.jpg")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=1).squeeze(0)
for lab, p in zip(labels, probs.tolist()):
    print(f"{lab} -> {p:.4f}")
print("Predicted label:", labels[int(probs.argmax())])
Exercise 3 — Image Captioning with BLIP
Task: Generate a caption for scene.jpg using BLIP.
Solution:
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
image = Image.open("scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(out[0], skip_special_tokens=True)
print("Caption:", caption)
Exercise 4 — Whisper Transcription (Speech → Text)
Task: Transcribe speech_sample.mp3.
Solution:
# pip install git+https://github.com/openai/whisper.git
import whisper
model = whisper.load_model("base") # "tiny", "small", "medium", "large" also available
result = model.transcribe("speech_sample.mp3")
print("Transcription:", result["text"])
Exercise 5 — Speech Embeddings with wav2vec2
Task: Extract frame-level embeddings from speech_sample.wav.
Solution:
from transformers import Wav2Vec2Processor, Wav2Vec2Model
import soundfile as sf
import torch
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
wave, sr = sf.read("speech_sample.wav")  # wav2vec2-base-960h expects 16 kHz mono audio
inputs = processor(wave, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # [batch, time, hidden_dim]
print("Embeddings:", hidden.shape)
Exercise 6 — Video Frame Sampling & Features (VideoMAE)
Task: Sample 16 frames from clip.mp4 (the clip length videomae-base expects) and extract embeddings with VideoMAE.
Solution:
# pip install av
from transformers import VideoMAEFeatureExtractor, VideoMAEModel
import av, torch
feature_extractor = VideoMAEFeatureExtractor.from_pretrained("MCG-NJU/videomae-base")  # newer transformers versions also expose this as VideoMAEImageProcessor
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")
container = av.open("clip.mp4")
frames = []
for i, frame in enumerate(container.decode(video=0)):
    if i % 30 == 0:  # sample 1 frame per second at 30 fps; lower the stride for clips shorter than ~16 s
        frames.append(frame.to_ndarray(format="rgb24"))
    if len(frames) == 16:  # videomae-base expects 16 frames per clip
        break
inputs = feature_extractor(frames, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
video_emb = out.last_hidden_state # [batch=1, tokens, hidden_dim]
print("Video embeddings:", video_emb.shape)
Exercise 7 — Cross-Modal Mini-Pipeline (ASR → Caption → Summary)
Task:
- Transcribe audio (lecture.mp3) with Whisper.
- Caption a slide image (slide.png) with BLIP.
- Concatenate both texts and produce a short summary with a small text model (e.g., DistilGPT-2).
Solution:
# ASR
import whisper
from transformers import BlipProcessor, BlipForConditionalGeneration, AutoTokenizer, AutoModelForCausalLM
from PIL import Image
import torch
# 1) Whisper transcription
whisper_model = whisper.load_model("small")
asr = whisper_model.transcribe("lecture.mp3")["text"]
# 2) BLIP caption
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
with torch.no_grad():
    cap = blip.generate(**blip_proc(images=Image.open("slide.png").convert("RGB"), return_tensors="pt"),
                        max_new_tokens=40)
caption = blip_proc.decode(cap[0], skip_special_tokens=True)
# 3) Summarize with a tiny LM (toy)
tok = AutoTokenizer.from_pretrained("distilgpt2")
lm = AutoModelForCausalLM.from_pretrained("distilgpt2")
prompt = f"Lecture transcript: {asr}\nSlide: {caption}\n\nTL;DR summary:"
ids = tok.encode(prompt, return_tensors="pt")  # note: distilgpt2 has a 1,024-token context, so very long transcripts must be truncated
with torch.no_grad():
    out = lm.generate(ids, max_new_tokens=80, do_sample=True, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
(For higher-quality summarization, swap in a stronger seq2seq model like facebook/bart-large-cnn.)
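A minimal sketch of that swap using the summarization pipeline, reusing the asr and caption strings from above (the length settings are assumptions to tune):
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = f"Lecture transcript: {asr}\nSlide: {caption}"
# truncation=True guards against transcripts longer than BART's 1,024-token input limit
summary = summarizer(text, max_length=80, min_length=20, truncation=True)[0]["summary_text"]
print("Summary:", summary)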
Exercise 8 — Visual Question Answering (VQA) with LLaVA-style Interface (Demo)
Task: Use a community VQA checkpoint to answer a question about chart.png (e.g., “What trend is increasing after 2020?”).
Note: Full LLaVA weights are large; the snippet shows the interface pattern using a lighter VQA model (dandelin/vilt-b32-finetuned-vqa).
Solution:
from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image
import torch
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
image = Image.open("chart.png").convert("RGB")
question = "What trend is increasing after 2020?"
inputs = processor(image, question, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
answer_idx = logits.argmax(-1).item()
print("Answer:", model.config.id2label[answer_idx])
Exercise 9 — Robustness Check: Add Noise to Audio Before ASR
Task: Add white noise to speech_clean.wav and compare Whisper transcriptions (clean vs noisy).
Solution:
import numpy as np, soundfile as sf, whisper
# Load clean audio
wave, sr = sf.read("speech_clean.wav")
rng = np.random.default_rng(0)
noise = rng.normal(scale=0.02, size=wave.shape)  # mild white noise; wave.shape handles mono or stereo files
noisy = (wave + noise).astype(np.float32)
sf.write("speech_noisy.wav", noisy, sr)
model = whisper.load_model("base")
clean_txt = model.transcribe("speech_clean.wav")["text"]
noisy_txt = model.transcribe("speech_noisy.wav")["text"]
print("CLEAN:", clean_txt)
print("NOISY:", noisy_txt)
Exercise 10 — Batch Captioning & CLIP Rerank
Task: Caption a set of images, then rerank candidate captions with CLIP to pick the most image-relevant one.
Solution:
from transformers import BlipProcessor, BlipForConditionalGeneration, CLIPProcessor, CLIPModel
from PIL import Image
import torch, glob
# Load models
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
images = [Image.open(p).convert("RGB") for p in glob.glob("imgs/*.jpg")]
for img in images:
    # Generate a few caption variants by sampling
    caps = []
    for _ in range(3):
        with torch.no_grad():
            out = blip.generate(**blip_proc(images=img, return_tensors="pt"),
                                max_new_tokens=30, do_sample=True, top_p=0.9)
        caps.append(blip_proc.decode(out[0], skip_special_tokens=True))
    # Rerank with CLIP
    inputs = clip_proc(text=caps, images=img, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip_model(**inputs).logits_per_image
    best = caps[int(logits.softmax(dim=1).argmax())]
    print("Chosen caption:", best)
What You Practiced
- Text+image alignment (CLIP), captioning (BLIP), and VQA patterns.
- Speech transcription (Whisper) and speech embeddings (wav2vec2).
- Video feature extraction (VideoMAE) and cross-modal pipelines that combine ASR and vision.
- Simple robustness checks (noise) and reranking strategies to boost quality.