Chapter 6: Multimodal Applications of Transformers
6.4 Practical Exercises
This section provides hands-on tasks to deepen your understanding of multimodal AI. Each exercise shows how to integrate text, image, and video data using state-of-the-art models, and each solution includes detailed code to guide you through the implementation.
Exercise 1: Image-Text Matching with CLIP
Task: Use CLIP to match a text query with the most relevant image from a set of candidates.
Instructions:
- Load several images and a text query.
- Use CLIP to compute similarity scores between the text and each image.
- Identify the most relevant image.
Solution:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
# Load CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# Load images
images = [Image.open(f"image_{i}.jpg") for i in range(1, 4)] # Replace with actual image paths
text_query = "A beautiful sunset over the ocean."
# Preprocess inputs
inputs = processor(text=[text_query], images=images, return_tensors="pt", padding=True)
# Compute similarity scores
outputs = model(**inputs)
logits_per_text = outputs.logits_per_text  # Text-to-image scores, shape (1, num_images)
probs = logits_per_text.softmax(dim=1)  # Probability of each image matching the query
# Identify the most relevant image
best_match_index = probs.argmax().item()
print(f"Most relevant image is image_{best_match_index + 1}.jpg")
Exercise 2: Video Classification with VideoMAE
Task: Classify actions in a video using VideoMAE.
Instructions:
- Extract frames from a video.
- Use VideoMAE to classify the video content.
- Display the predicted action.
Solution:
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import cv2
# Extract frames from video (keep every `frame_rate`-th frame)
def extract_frames(video_path, frame_rate=10):
    cap = cv2.VideoCapture(video_path)
    frames = []
    count = 0
    success = True
    while success:
        success, frame = cap.read()
        if success and count % frame_rate == 0:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV reads BGR; the models expect RGB
            frames.append(cv2.resize(frame, (224, 224)))  # Resize to model input size
        count += 1
    cap.release()
    return frames
video_path = "example_video.mp4" # Replace with your video path
frames = extract_frames(video_path)
# Load VideoMAE model and processor (a checkpoint fine-tuned on Kinetics-400 for action recognition)
model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
# Preprocess frames (the base model expects a clip of 16 frames)
inputs = processor(frames[:16], return_tensors="pt")
# Classify video
outputs = model(**inputs)
predicted_class = outputs.logits.argmax(dim=-1).item()
print(f"Predicted action: {model.config.id2label[predicted_class]}")
Exercise 3: Generate Captions for Video Frames
Task: Generate captions for each frame of a video using a vision-language model.
Instructions:
- Extract frames from a video.
- Use an image-captioning model (here, a ViT-GPT2 vision encoder-decoder) to generate captions for each frame.
- Display the generated captions.
Solution:
from transformers import VisionEncoderDecoderModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
# Load vision-language model and processors
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
processor = AutoImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
# Example: reuse the frames extracted in Exercise 2
captions = []
for frame in frames[:5]:  # Limit to the first 5 frames for the demo
    pil_image = Image.fromarray(frame)
    inputs = processor(images=pil_image, return_tensors="pt")
    pixel_values = inputs.pixel_values
    generated_ids = model.generate(pixel_values, max_length=16, num_beams=4)
    caption = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    captions.append(caption)
print("Generated Captions:")
for i, caption in enumerate(captions):
    print(f"Frame {i + 1}: {caption}")
Exercise 4: Multimodal Retrieval
Task: Implement a simple text-to-video retrieval system using a pretrained multimodal model.
Instructions:
- Define a text query and extract frames from multiple videos.
- Compute similarity scores between the query and each video.
- Rank the videos based on relevance.
Solution:
from transformers import CLIPProcessor, CLIPModel
# Reload CLIP for retrieval (the `model`/`processor` variables above now point to the captioning model)
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# Assuming frames are extracted from multiple videos as lists of images
video_frames = [extract_frames(f"video_{i}.mp4") for i in range(1, 4)]  # Replace with video paths
text_query = "A person running on the beach."
# Score each video by the average similarity between the query and its frames
video_scores = []
for i, frames in enumerate(video_frames):
    inputs = clip_processor(text=[text_query], images=frames, return_tensors="pt", padding=True)
    outputs = clip_model(**inputs)
    logits_per_image = outputs.logits_per_image  # shape (num_frames, 1)
    avg_score = logits_per_image.mean().item()  # Average raw logits; softmax over a single text is uninformative
    video_scores.append((f"video_{i + 1}.mp4", avg_score))
# Rank videos by relevance
ranked_videos = sorted(video_scores, key=lambda x: x[1], reverse=True)
print("Ranked Videos:")
for video, score in ranked_videos:
    print(f"{video}: {score:.2f}")
These exercises demonstrate the capabilities of multimodal AI by integrating text, image, and video data. From matching text queries with images to generating captions for video frames and building retrieval systems, these tasks provide hands-on experience in developing powerful and intuitive multimodal applications. Experiment further with these models and datasets to unlock their full potential in real-world scenarios.