NLP con Transformers, técnicas avanzadas y aplicaciones multimodales

Project 6: Multimodal Video Analysis and Summarization

Step 3: Transcribe Audio from Video

Extract audio from the video and transcribe it using Whisper, an advanced speech recognition model developed by OpenAI. Whisper excels at converting spoken words into text with high accuracy across multiple languages and accents.

The model can handle various audio qualities and background noise levels, making it ideal for processing video content. During this step, we'll separate the audio track from the video file and feed it through Whisper to generate a detailed transcription that captures not just the words but also proper punctuation and capitalization (note that Whisper itself does not attribute speech to individual speakers; that would require a separate diarization step).

import subprocess

import cv2
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load Whisper model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Extract audio from video
def extract_audio(video_path, output_audio_path="audio.wav"):
    # Read basic metadata (fps, frame count, duration) with OpenCV
    cap = cv2.VideoCapture(video_path)
    fps = int(cap.get(cv2.CAP_PROP_FPS))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    duration = total_frames / fps
    cap.release()

    # Extract the audio track with ffmpeg (requires ffmpeg installed);
    # -y overwrites an existing output file without prompting
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-q:a", "0", "-map", "a", output_audio_path],
        check=True,
    )
    return duration

# Transcribe audio (video_path is assumed to be defined in the earlier steps)
audio_path = "audio.wav"
extract_audio(video_path, audio_path)

# Whisper expects 16 kHz audio; librosa resamples on load
audio, rate = librosa.load(audio_path, sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
generated_ids = model.generate(inputs["input_features"])
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(f"Transcription: {transcription}")

Let me break down this code that handles audio extraction and transcription from video:

1. Library Imports and Model Setup

  • Uses librosa for audio loading, OpenCV and ffmpeg for handling the video file, and the Whisper model from the transformers library
  • Initializes the Whisper model and processor, specifically using the "whisper-small" variant for efficient processing (a sketch for pinning the language follows below)
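
By default, Whisper auto-detects the spoken language from the audio. If the language is known in advance, one common refinement is to pass forced decoder ids. The snippet below is a minimal sketch that reuses processor, model, and inputs from the code above; the "spanish" value is only an illustrative assumption, so replace it with the language your video actually uses.

# Sketch: pin Whisper to a known language and to the transcription task
# instead of letting it auto-detect. "spanish" is an assumed example value.
forced_decoder_ids = processor.get_decoder_prompt_ids(language="spanish", task="transcribe")
generated_ids = model.generate(inputs["input_features"], forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]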

2. Audio Extraction Function

  • The extract_audio function handles converting video to audio:
    • Captures video metadata (fps, frame count, and duration)
    • Uses ffmpeg to extract the audio track and save it as a WAV file (a leaner variant is sketched below)
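
Since Whisper only needs 16 kHz mono input, a slightly leaner option is to have ffmpeg produce that format directly instead of a full-quality WAV. The helper below is a sketch; the name extract_audio_16k is made up for this example, and it assumes the same ffmpeg installation as above.

import subprocess

def extract_audio_16k(video_path, output_audio_path="audio.wav"):
    # Sketch of a leaner extraction: -vn drops the video stream, -ac 1 and
    # -ar 16000 give mono 16 kHz audio, which is what Whisper expects.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", output_audio_path],
        check=True,
    )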

3. Audio Transcription Process

  • The transcription workflow includes:
    • Loading the extracted audio file with librosa at a 16 kHz sampling rate
    • Processing the audio through the Whisper model
    • Decoding the model's output into readable text (a note on longer audio follows this list)
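
One practical caveat: Whisper's feature extractor pads or truncates its input to about 30 seconds, so the generate call above only transcribes the start of longer videos. The sketch below works around this with the transformers ASR pipeline and chunked inference; chunk_length_s=30 is a common choice, not something the original code prescribes.

from transformers import pipeline

# Sketch: chunked transcription for audio longer than ~30 seconds.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,
)
long_transcription = asr("audio.wav")["text"]
print(long_transcription)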

The Whisper model is particularly powerful because it copes well with varied audio quality and background noise while producing properly punctuated, readable text.
