Project 6: Multimodal Video Analysis and Summarization
Step 3: Transcribe Audio from Video
Extract audio from the video and transcribe it using Whisper, an advanced speech recognition model developed by OpenAI. Whisper excels at converting spoken words into text with high accuracy across multiple languages and accents.
The model handles varying audio quality and background noise well, making it a good fit for video content. In this step, we'll separate the audio track from the video file and feed it through Whisper to produce a transcription that captures not just the words but also punctuation and casing. Note that Whisper itself does not attribute text to individual speakers; speaker diarization would require a separate tool.
import subprocess
import cv2
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load Whisper model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Extract audio from video
def extract_audio(video_path, output_audio_path="audio.wav"):
    # Read basic video metadata (handy for logging and sanity checks)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    duration = total_frames / fps if fps else 0.0
    print(f"Video duration: {duration:.1f} s")
    # Pull out the audio track with ffmpeg (requires ffmpeg installed on the PATH)
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-q:a", "0", "-map", "a", output_audio_path],
        check=True,
    )

# Transcribe audio
audio_path = "audio.wav"
extract_audio(video_path, audio_path)  # video_path is the input video from the earlier steps
audio, rate = librosa.load(audio_path, sr=16000)  # Whisper expects 16 kHz mono audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
generated_ids = model.generate(inputs["input_features"])
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")
Let me break down this code that handles audio extraction and transcription from video:
1. Library Imports and Model Setup
- Uses librosa for audio processing and the Whisper model from the transformers library
- Initializes the Whisper model and processor, specifically using the "whisper-small" variant for efficient processing
2. Audio Extraction Function
- The extract_audio function handles converting video to audio:
- Captures video metadata (fps and frame count)
- Uses ffmpeg to extract the audio track and save it as a WAV file
3. Audio Transcription Process
- The transcription workflow includes:
- Loading the extracted audio file with librosa at a 16 kHz sampling rate
- Processing the audio through the Whisper model (see the note on pinning the language below)
- Decoding the model's output into readable text (for clips longer than about 30 seconds, see the chunking sketch at the end of this section)
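A quick note on the generation call: by default Whisper auto-detects the spoken language from the audio. If you already know the videos are in English, you can pin the language and task so the model doesn't spend effort on detection. The exact interface varies across transformers versions; the snippet below uses the classic decoder-prompt-ID approach and is a sketch, not part of the original project code.

# Optional: pin the transcription language and task (classic Whisper API in transformers;
# newer versions also accept language="english", task="transcribe" directly in generate())
forced_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
generated_ids = model.generate(inputs["input_features"], forced_decoder_ids=forced_ids)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]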
The Whisper model is robust to varying audio quality and produces well-punctuated, properly cased text. It does not, however, identify who is speaking; speaker attribution would come from a separate diarization step.
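One more practical caveat: Whisper's feature extractor pads or truncates its input to a 30-second window, so the single generate() call above effectively transcribes only the first half minute of audio. For longer videos, a simple workaround is to split the waveform into chunks and stitch the results. The sketch below assumes the processor, model, and 16 kHz audio array from the code above; the 30-second chunk size mirrors Whisper's window, and the helper name is ours, not part of the original code.

# Transcribe audio longer than Whisper's 30-second window by chunking the waveform
def transcribe_long_audio(audio, processor, model, chunk_seconds=30, sr=16000):
    chunk_size = chunk_seconds * sr
    pieces = []
    for start in range(0, len(audio), chunk_size):
        chunk = audio[start:start + chunk_size]
        inputs = processor(chunk, sampling_rate=sr, return_tensors="pt")
        ids = model.generate(inputs["input_features"])
        pieces.append(processor.batch_decode(ids, skip_special_tokens=True)[0].strip())
    return " ".join(pieces)

full_transcription = transcribe_long_audio(audio, processor, model)
print(f"Transcription: {full_transcription}")

For larger projects, the transformers automatic-speech-recognition pipeline with chunk_length_s set (and a stride for overlap) handles the splitting and stitching for you, which avoids cutting words at hard chunk boundaries.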