Chapter 2: Audio Understanding and Generation with Whisper and GPT-4o
2.2 Transcription and Translation with Whisper API
The Whisper API represents a significant advancement in automated speech recognition and translation technology. This section explores the fundamental aspects of working with Whisper, including its capabilities, implementation methods, and practical applications. We'll examine how to effectively utilize the API for both transcription and translation tasks, covering everything from basic setup to advanced features.
Whether you're building applications for content creators, developing educational tools, or creating accessibility solutions, understanding Whisper's functionality is crucial. We'll walk through detailed examples and best practices that demonstrate how to integrate this powerful tool into your projects, while highlighting important considerations for optimal performance.
Throughout this section, you'll learn not just the technical implementation details, but also the strategic considerations for choosing appropriate response formats and handling various audio inputs. This knowledge will enable you to build robust, scalable solutions for audio processing needs.
2.2.1 What Is Whisper?
Whisper is OpenAI's groundbreaking open-source automatic speech recognition (ASR) model and represents a significant advancement in audio processing technology. This sophisticated system excels at handling diverse audio inputs and can process various file formats, including `.mp3`, `.mp4`, `.wav`, and `.m4a`. What sets Whisper apart is its dual functionality: it not only converts spoken content into highly accurate text transcriptions but also offers powerful translation capabilities, seamlessly converting non-English audio content into fluent English output. The model's robust architecture ensures high accuracy across different accents, speaking styles, and background noise conditions.
Whisper demonstrates exceptional versatility across numerous applications:
Meeting or podcast transcriptions
Converts lengthy discussions and presentations into searchable, shareable text documents with remarkable accuracy. This functionality is particularly valuable for businesses and content creators who need to:
- Archive important meetings for future reference
- Create accessible versions of audio content
- Enable quick searching through hours of recorded content
- Generate written documentation from verbal discussions
- Support compliance requirements for record-keeping
The high accuracy rate ensures that technical terms, proper names, and complex discussions are captured correctly while maintaining the natural flow of conversation.
Example:
Download the audio sample here: https://files.cuantum.tech/audio/meeting_snippet.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt
current_location = "Houston, Texas, United States"
print(f"Running Whisper transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local audio file
# IMPORTANT: Replace 'meeting_snippet.mp3' with the actual filename.
audio_file_path = "meeting_snippet.mp3"
# --- Optional Parameters for potentially better accuracy ---
# Specify language (ISO-639-1 code) if known, otherwise Whisper auto-detects.
# Example: "en" for English, "es" for Spanish, "de" for German
known_language = "en" # Set to None to auto-detect
# Provide a prompt with context, names, or jargon expected in the audio.
# This helps Whisper recognize specific terms accurately.
transcription_prompt = "The discussion involves Project Phoenix, stakeholders like Dr. Evelyn Reed and ACME Corp, and technical terms such as multi-threaded processing and cloud-native architecture." # Set to None if no prompt needed
# --- Function to Transcribe Audio ---
def transcribe_audio(client, file_path, language=None, prompt=None):
"""
Transcribes the given audio file using the Whisper API.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str, optional): The language code (ISO-639-1). Defaults to None (auto-detect).
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
str: The transcription text, or None if an error occurs.
"""
print(f"\nAttempting to transcribe audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Check file size (optional but good practice)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
print("Consider splitting the file into smaller chunks.")
# You might choose to exit here or attempt anyway
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
return None
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for transcription...")
# --- Make the API Call ---
response = client.audio.transcriptions.create(
model="whisper-1", # Specify the Whisper model
file=audio_file, # The audio file object
language=language, # Optional: Specify language
prompt=prompt, # Optional: Provide context prompt
response_format="text" # Request plain text output
# Other options: 'json', 'verbose_json', 'srt', 'vtt'
)
# The response object for "text" format is directly the string
transcription_text = response
print("Transcription successful.")
return transcription_text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
# Provide hints for common errors
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported (mp3, mp4, mpeg, mpga, m4a, wav, webm).")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit. Please split the file.")
return None
except FileNotFoundError: # Already handled above, but good practice
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
transcription = transcribe_audio(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if transcription:
print("\n--- Transcription Result ---")
print(transcription)
print("----------------------------\n")
# How this helps the use case:
print("This plain text transcription can now be:")
print("- Saved as a text document (.txt) for archiving.")
print("- Indexed and searched easily for specific keywords or names.")
print("- Used to generate meeting minutes or documentation.")
print("- Copied into accessibility tools or documents.")
print("- Stored for compliance record-keeping.")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_transcription.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(transcription)
print(f"Transcription saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving transcription to file: {e}")
else:
print("\nTranscription failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates transcribing audio files like meeting recordings or podcast episodes using OpenAI's Whisper API. The goal is to convert spoken content into accurate, searchable text, addressing needs like archiving, accessibility, and documentation.
- Prerequisites: It requires the `openai` and `python-dotenv` libraries, an OpenAI API key configured in a `.env` file, and a sample audio file (`meeting_snippet.mp3`).
- Audio File Requirements: The script highlights the supported audio formats (MP3, WAV, M4A, etc.) and the crucial 25MB file size limit per API request. It includes a warning that longer files must be segmented (chunked) before processing, although the code itself handles a single file within the limit (a chunking sketch follows this breakdown).
- Initialization: It sets up the standard `OpenAI` client using the API key.
- Transcription Function (`transcribe_audio`):
  - Takes the `client`, `file_path`, and optional `language` and `prompt` arguments.
  - Includes a check for file existence and size.
  - Opens the audio file in binary read mode (`"rb"`).
  - API Call: Uses `client.audio.transcriptions.create` with:
    - `model="whisper-1"`: Specifies the model, known for high accuracy.
    - `file=audio_file`: Passes the opened file object.
    - `language`: Optionally provides the language code (e.g., `"en"`) to potentially improve accuracy if the language is known. If `None`, Whisper auto-detects.
    - `prompt`: Optionally provides contextual keywords, names (like "Project Phoenix" or "Dr. Evelyn Reed"), or jargon. This significantly helps Whisper transcribe the specialized terms often found in meetings and technical podcasts, directly addressing the need to capture complex discussions correctly.
    - `response_format="text"`: Requests the output as a plain text string, which is ideal for immediate use in documents, search indexing, and so on. Other formats like `verbose_json` (for timestamps) or `srt`/`vtt` (for subtitles) could be requested if needed.
  - Error Handling: Includes `try...except` blocks for API errors (providing hints for common issues like file size or format) and file system errors.
- Output and Usefulness:
  - The resulting transcription text is printed.
  - The code explicitly connects this output back to the use case benefits: creating searchable archives, generating documentation, supporting accessibility, and enabling compliance.
  - It includes an optional step to save the transcription directly to a `.txt` file.
This example provides a practical implementation of Whisper for the described use case, emphasizing accuracy features (prompting, language specification) and explaining how the text output facilitates the desired downstream tasks like searching and archiving. Remember to use an actual audio file (within the size limit) for testing.
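As the breakdown notes, recordings over the 25MB limit must be split before they are sent to the API. Below is a minimal sketch of one way to do that chunking with the third-party pydub library (an assumption; any tool that can slice and re-export audio would work). The chunk_audio_file function, the ten-minute chunk length, and the output naming are illustrative choices rather than part of the script above.

import os
from pydub import AudioSegment  # Assumption: pydub is installed (pip install pydub) and ffmpeg is available

def chunk_audio_file(file_path, chunk_length_ms=10 * 60 * 1000, out_dir="chunks"):
    """Split an audio file into fixed-length chunks so each piece stays under the 25MB API limit."""
    os.makedirs(out_dir, exist_ok=True)
    audio = AudioSegment.from_file(file_path)
    base_name = os.path.splitext(os.path.basename(file_path))[0]
    chunk_paths = []
    for index, start_ms in enumerate(range(0, len(audio), chunk_length_ms)):
        chunk = audio[start_ms:start_ms + chunk_length_ms]
        chunk_path = os.path.join(out_dir, f"{base_name}_part{index:03d}.mp3")
        # Exporting as mono 64 kbps MP3 keeps ten-minute chunks comfortably small
        chunk.set_channels(1).export(chunk_path, format="mp3", bitrate="64k")
        chunk_paths.append(chunk_path)
    return chunk_paths

# Each chunk path can then be passed to transcribe_audio() and the results joined in order.

The exact size of each chunk depends on the source encoding, so check the exported files against the limit rather than trusting the duration alone.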
Voice note conversion
Transforms quick voice memos into organized text notes, making reviewing and archiving spoken thoughts easier. This functionality is particularly valuable for:
- Creating quick reminders and to-do lists while on the go
- Capturing creative ideas or brainstorming sessions without interrupting the flow of thought
- Taking notes during field work or site visits where typing is impractical
- Documenting observations or research findings in real-time
The system maintains the natural flow of speech while organizing the content into clear, readable text that can be easily searched, shared, or integrated into other documents.
Example:
This script focuses on taking a typical voice memo audio file and converting it into searchable, usable text, ideal for capturing ideas, reminders, or field notes.
Download the sample audio here: https://files.cuantum.tech/audio/voice_memo.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Miami, Florida, United States"
print(f"Running Whisper voice note transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local voice note audio file
# IMPORTANT: Replace 'voice_memo.mp3' with the actual filename.
audio_file_path = "voice_memo.mp3"
# --- Optional Parameters (Often not needed for simple voice notes) ---
# Language auto-detection is usually sufficient for voice notes.
known_language = None # Set to "en", "es", etc. if needed
# Prompt is useful if your notes contain specific jargon/names, otherwise leave as None.
transcription_prompt = None # Example: "Remember to mention Project Chimera and the client ZetaCorp."
# --- Function to Transcribe Voice Note ---
def transcribe_voice_note(client, file_path, language=None, prompt=None):
"""
Transcribes the given voice note audio file using the Whisper API.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str, optional): The language code (ISO-639-1). Defaults to None (auto-detect).
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
str: The transcription text, or None if an error occurs.
"""
print(f"\nAttempting to transcribe voice note: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: Check file size (less likely to be an issue for voice notes)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
# return None # Decide if you want to stop for large files
except OSError as e:
print(f"Error accessing file properties: {e}")
# Continue attempt even if size check fails
pass
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending voice note to Whisper API for transcription...")
# --- Make the API Call ---
response = client.audio.transcriptions.create(
model="whisper-1", # Specify the Whisper model
file=audio_file, # The audio file object
language=language, # Defaults to None (auto-detect)
prompt=prompt, # Defaults to None
response_format="text" # Request plain text output for easy use as notes
)
# The response object for "text" format is directly the string
note_text = response
print("Transcription successful.")
return note_text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported (m4a, mp3, wav, etc.).")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
transcribed_note = transcribe_voice_note(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if transcribed_note:
print("\n--- Transcribed Voice Note Text ---")
print(transcribed_note)
print("-----------------------------------\n")
# How this helps the use case:
print("This transcribed text from your voice note can be easily:")
print("- Copied into reminder apps or to-do lists.")
print("- Saved as a text file for archiving creative ideas or brainstorms.")
print("- Searched later for specific keywords or topics.")
print("- Shared via email or messaging apps.")
print("- Integrated into reports or documentation (e.g., field notes).")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_transcription.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(transcribed_note)
print(f"Transcription saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving transcription to file: {e}")
else:
print("\nVoice note transcription failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates using the Whisper API to transcribe short audio recordings like voice notes or memos. This is ideal for quickly capturing thoughts, ideas, reminders, or field observations and converting them into easily manageable text.
- Prerequisites: Requires the standard `openai` and `python-dotenv` setup, an API key, and a sample voice note audio file (common formats like `.m4a`, `.mp3`, and `.wav` are supported).
- Audio File Handling: The script emphasizes that voice notes are typically well under the 25MB API limit. It opens the file in binary read mode (`"rb"`).
- Initialization: Sets up the `OpenAI` client.
- Transcription Function (`transcribe_voice_note`):
  - Similar to the meeting transcription function but tailored for voice notes.
  - Defaults: It defaults `language` to `None` (auto-detect) and `prompt` to `None`, as these are often sufficient for typical voice memos where specialized jargon is less common. The user can still provide either if needed.
  - API Call: Uses `client.audio.transcriptions.create` with `model="whisper-1"`.
  - `response_format="text"`: Explicitly requests plain text output, which is the most practical format for notes – easy to read, search, copy, and share.
  - Error Handling: Includes standard `try...except` blocks for API and file errors.
- Output and Usefulness:
  - Prints the resulting text transcription.
  - Explicitly connects the output to the benefits mentioned in the use case description: creating reminders and to-do lists, capturing ideas, documenting observations, enabling search, and sharing.
  - Includes an optional step to save the transcribed note to a `.txt` file.

This example provides a clear and practical implementation for converting voice notes to text using Whisper, highlighting its convenience for capturing information on the go. Remember to use an actual voice note audio file for testing and update the `audio_file_path` variable accordingly.
Multilingual speech translation
Breaks down language barriers by providing accurate translations while preserving the original context and meaning. This powerful feature enables:
- Real-time communication across language barriers in international meetings and conferences
- Translation of educational content for global audiences while maintaining academic integrity
- Cross-cultural business negotiations with precise translation of technical terms and cultural nuances
- Documentation translation for multinational organizations with consistent terminology
The system can detect the source language automatically and provides translations that maintain the speaker's original tone, intent, and professional context, making it invaluable for global collaboration.
Example:
This example takes an audio file containing speech in a language other than English and translates it directly into English text.
Download the sample audio here: https://files.cuantum.tech/audio/spanish_speech.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Denver, Colorado, United States"
print(f"Running Whisper speech translation example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local non-English audio file
# IMPORTANT: Replace 'spanish_speech.mp3' with the actual filename.
audio_file_path = "spanish_speech.mp3"
# --- Optional Parameters ---
# Prompt can help guide recognition of specific names/terms in the SOURCE language
# before translation, potentially improving accuracy of the final English text.
translation_prompt = None # Example: "The discussion mentions La Sagrada Familia and Parc Güell."
# --- Function to Translate Speech to English ---
def translate_speech_to_english(client, file_path, prompt=None):
"""
Translates speech from the given audio file into English using the Whisper API.
The source language is automatically detected.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file (non-English speech).
prompt (str, optional): A text prompt to guide source language recognition. Defaults to None.
Returns:
str: The translated English text, or None if an error occurs.
"""
print(f"\nAttempting to translate speech from audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: Check file size
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Translation may fail.")
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
pass # Continue attempt
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for translation to English...")
# --- Make the API Call for Translation ---
# Note: Using the 'translations' endpoint, not 'transcriptions'
response = client.audio.translations.create(
model="whisper-1", # Specify the Whisper model
file=audio_file, # The audio file object
prompt=prompt, # Optional: Provide context prompt
response_format="text" # Request plain text English output
# Other options: 'json', 'verbose_json', 'srt', 'vtt'
)
# The response object for "text" format is directly the translated string
translated_text = response
print("Translation successful.")
return translated_text
except OpenAIError as e:
print(f"OpenAI API Error during translation: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
# Whisper might error if it cannot detect speech or supported language
elif "language could not be detected" in str(e).lower():
print("Hint: Ensure the audio contains detectable speech in a supported language.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during translation: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
english_translation = translate_speech_to_english(
client,
audio_file_path,
prompt=translation_prompt
)
if english_translation:
print("\n--- Translated English Text ---")
print(english_translation)
print("-------------------------------\n")
# How this helps the use case:
print("This English translation enables:")
print("- Understanding discussions from international meetings.")
print("- Making educational content accessible to English-speaking audiences.")
print("- Facilitating cross-cultural business communication.")
print("- Creating English versions of documentation originally recorded in other languages.")
print("- Quick communication across language barriers using voice input.")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_translation_en.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(english_translation)
print(f"Translation saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving translation to file: {e}")
else:
print("\nSpeech translation failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates the Whisper API's capability to perform speech-to-text translation, specifically translating audio from various source languages directly into English text. This addresses the need for breaking down language barriers in global communication scenarios.
- Prerequisites: Requires the standard `openai` and `python-dotenv` setup, an API key, and, crucially, a sample audio file containing speech in a language other than English (e.g., Spanish, French, German).
- Key API Endpoint: This example uses `client.audio.translations.create`, which is distinct from the `transcriptions` endpoint used previously. The `translations` endpoint is specifically designed to output English text.
- Automatic Language Detection: A key feature highlighted is that the source language of the audio file does not need to be specified; Whisper automatically detects it before translating to English.
- Initialization: Sets up the `OpenAI` client.
- Translation Function (`translate_speech_to_english`):
  - Takes the `client`, `file_path`, and an optional `prompt`.
  - Opens the non-English audio file in binary read mode (`"rb"`).
  - API Call: Uses `client.audio.translations.create` with:
    - `model="whisper-1"`: The standard Whisper model.
    - `file=audio_file`: The audio file object.
    - `prompt`: Optionally provides context (in the source language or English; it often helps with names and terms) to aid accurate recognition before translation, helping to preserve nuances and technical terms.
    - `response_format="text"`: Requests plain English text output.
  - Error Handling: Includes `try...except` blocks, noting potential errors if speech or a supported language isn't detected in the audio.
- Output and Usefulness:
  - Prints the resulting English translation text.
  - Explicitly connects this output to the benefits described in the use case: enabling understanding in international meetings, translating educational and business content for global audiences, and facilitating cross-language documentation and communication.
  - Shows how to optionally save the English translation to a `.txt` file.

This example effectively showcases Whisper's powerful translation feature, making it invaluable for scenarios requiring communication or content understanding across different languages. Remember to use an audio file with non-English speech for testing and update the `audio_file_path` variable accordingly.
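For organizations translating many recordings at once, such as a folder of meeting audio from regional offices, the same function can simply be applied in a loop. A minimal sketch, assuming the translate_speech_to_english function and client defined above and a hypothetical recordings/ directory:

import os

recordings_dir = "recordings"  # Hypothetical folder containing non-English audio files
supported_extensions = (".mp3", ".mp4", ".m4a", ".wav", ".webm")

for file_name in sorted(os.listdir(recordings_dir)):
    if not file_name.lower().endswith(supported_extensions):
        continue  # Skip anything that is not a supported audio format
    file_path = os.path.join(recordings_dir, file_name)
    translation = translate_speech_to_english(client, file_path)
    if translation:
        out_path = os.path.splitext(file_path)[0] + "_translation_en.txt"
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(translation)
        print(f"Saved English translation for {file_name} to {out_path}")

Because each file is an independent API call, the loop can later be parallelized or wrapped in retry logic without changing the translation function itself.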
Accessibility tools
Enables better digital inclusion by providing real-time transcription services for hearing-impaired users, offering several key benefits:
- Empowers deaf and hard-of-hearing individuals to participate fully in audio-based content
- Provides instant access to spoken information in professional settings like meetings and conferences
- Supports educational environments by making lectures and discussions accessible to all students
- Enhances media consumption by enabling accurate, real-time captioning for videos and live streams
Example:
This example focuses on using the `verbose_json` output format to get segment-level timestamps, which are essential for syncing text with audio or video.
Download the sample audio here: https://files.cuantum.tech/audio/lecture_snippet.mp3
import os
import json # To potentially parse the verbose_json output nicely
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Atlanta, Georgia, United States" # User location context
print(f"Running Whisper accessibility transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local audio file
# IMPORTANT: Replace 'lecture_snippet.mp3' with the actual filename.
audio_file_path = "lecture_snippet.mp3"
# --- Optional Parameters ---
# Specifying language is often good for accuracy in accessibility contexts
known_language = "en"
# Prompt can help with specific terminology in lectures or meetings
transcription_prompt = "The lecture discusses quantum entanglement, superposition, and Bell's theorem." # Set to None if not needed
# --- Function to Transcribe Audio with Timestamps ---
def transcribe_for_accessibility(client, file_path, language="en", prompt=None):
"""
Transcribes the audio file using Whisper, requesting timestamped output
suitable for accessibility applications (e.g., captioning).
Note on 'Real-time': This function processes a complete file. True real-time
captioning requires capturing audio in chunks, sending each chunk to the API
quickly, and displaying the results sequentially. This example generates the
*type* of data needed for such applications from a pre-recorded file.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str): The language code (ISO-639-1). Defaults to "en".
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
dict: The parsed verbose_json response containing text and segments
with timestamps, or None if an error occurs.
"""
print(f"\nAttempting to transcribe for accessibility: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: File size check
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
pass # Continue attempt
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for timestamped transcription...")
# --- Make the API Call for Transcription ---
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
language=language,
prompt=prompt,
# Request detailed JSON output including timestamps
response_format="verbose_json",
# Explicitly request segment-level timestamps
timestamp_granularities=["segment"]
)
# The response object is already a Pydantic model behaving like a dict
timestamped_data = response
print("Timestamped transcription successful.")
# You can access response.text, response.segments etc directly
return timestamped_data
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Function to Format Timestamps ---
def format_timestamp(seconds):
"""Converts seconds to HH:MM:SS.fff format."""
td = datetime.timedelta(seconds=seconds)
total_milliseconds = int(td.total_seconds() * 1000)
hours, remainder = divmod(total_milliseconds, 3600000)
minutes, remainder = divmod(remainder, 60000)
seconds, milliseconds = divmod(remainder, 1000)
return f"{hours:02}:{minutes:02}:{seconds:02}.{milliseconds:03}"
# --- Main Execution ---
if __name__ == "__main__":
transcription_data = transcribe_for_accessibility(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if transcription_data:
print("\n--- Full Transcription Text ---")
# Access the full text directly from the response object
print(transcription_data.text)
print("-------------------------------\n")
print("--- Transcription Segments with Timestamps ---")
# Iterate through segments for timestamped data
for segment in transcription_data.segments:
start_time = format_timestamp(segment['start'])
end_time = format_timestamp(segment['end'])
segment_text = segment['text']
print(f"[{start_time} --> {end_time}] {segment_text}")
print("---------------------------------------------\n")
# How this helps the use case:
print("This timestamped data enables:")
print("- Displaying captions synchronized with video or audio streams.")
print("- Highlighting text in real-time as it's spoken in meetings or lectures.")
print("- Creating accessible versions of educational/media content.")
print("- Allowing users to navigate audio by clicking on text segments.")
print("- Fuller participation for hearing-impaired individuals.")
# Optional: Save the detailed JSON output
output_json_file = os.path.splitext(audio_file_path)[0] + "_timestamps.json"
try:
# The response object can be converted to dict for JSON serialization
with open(output_json_file, "w", encoding="utf-8") as f:
# Use .model_dump_json() for Pydantic V2 models from openai>=1.0.0
# or .dict() for older versions/models
try:
f.write(transcription_data.model_dump_json(indent=2))
except AttributeError:
# Fallback for older versions or different object types
import json
f.write(json.dumps(transcription_data, default=lambda o: o.__dict__, indent=2))
print(f"Detailed timestamp data saved to '{output_json_file}'")
except (IOError, TypeError, AttributeError) as e:
print(f"Error saving timestamp data to JSON file: {e}")
else:
print("\nTranscription for accessibility failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates using the Whisper API to generate highly accurate transcriptions with segment-level timestamps. This output is crucial for accessibility applications, enabling features like real-time captioning, synchronized text highlighting, and navigable transcripts for hearing-impaired users.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file (`lecture_snippet.mp3`).
- Key API Parameters:
  - Endpoint: Uses `client.audio.transcriptions.create`.
  - `response_format="verbose_json"`: This is essential. It requests a detailed JSON object containing not only the full transcription text but also a list of segments, each with start/end times and the corresponding text.
  - `timestamp_granularities=["segment"]`: Explicitly requests segment-level timing information (though it is often included by default with `verbose_json`). Word-level timestamps can also be requested (`["word"]`), but segments are typically used for captioning.
  - `language`/`prompt`: Specifying the language (`en`) and providing a prompt can enhance accuracy, which is vital for accessibility.
- "Real-Time" Consideration: The explanation clarifies that while this code processes a file, the timestamped output it generates is what's needed for real-time applications. True live captioning would involve feeding audio chunks to the API rapidly.
- Initialization & Function (`transcribe_for_accessibility`): Standard client setup. The function encapsulates the API call requesting `verbose_json`.
- Output Processing:
  - The code first prints the full concatenated text (`transcription_data.text`).
  - It then iterates through the `transcription_data.segments` list. Note that, depending on your `openai` library version, each segment may be a plain dictionary (`segment['start']`, as written here) or a Pydantic object (`segment.start`); adjust the access style to match.
  - For each segment, it extracts the `start` time, `end` time, and `text`.
  - A helper function (`format_timestamp`) converts the times (in seconds) into a standard `HH:MM:SS.fff` format.
  - It prints each segment with its timing information (e.g., `[00:00:01.234 --> 00:00:05.678] This is the first segment text.`).
- Use Case Relevance: The output clearly shows how this timestamped data directly enables the benefits described: synchronizing text with audio/video for captions, allowing participation in meetings and lectures, making educational content accessible, and enhancing media consumption.
- Saving Output: Includes an option to save the complete `verbose_json` response to a file for later use or more complex processing. It handles potential differences in serializing the response object across `openai` library versions.
This example effectively demonstrates how to obtain the necessary timestamped data from Whisper to power various accessibility features, making audio content more inclusive. Remember to use a relevant audio file for testing.
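When a caption format the API does not return directly is needed, or when segments are post-processed first, the same verbose_json data can be reshaped by hand. The sketch below writes the segments to a WebVTT file; it assumes the transcription_data object and format_timestamp helper from the example above, and it reads each field with a dictionary fallback because the segment type differs between openai library versions.

def segments_to_vtt(segments, output_path):
    """Write Whisper segments to a WebVTT caption file (cue timings in HH:MM:SS.mmm)."""
    def field(segment, name):
        # Segments may be dicts or Pydantic objects depending on the library version
        return segment[name] if isinstance(segment, dict) else getattr(segment, name)

    with open(output_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for segment in segments:
            start = format_timestamp(field(segment, "start"))
            end = format_timestamp(field(segment, "end"))
            text = field(segment, "text").strip()
            f.write(f"{start} --> {end}\n{text}\n\n")

# Example usage with the transcription generated above:
# segments_to_vtt(transcription_data.segments, "lecture_snippet.vtt")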
Video captioning workflows
Streamlines the creation of accurate subtitles and closed captions for video content, supporting multiple output formats including SRT, WebVTT, and other industry-standard caption formats. This capability is essential for:
- Content creators who need to make their videos accessible across multiple platforms
- Broadcasting companies requiring accurate closed captioning for regulatory compliance
- Educational institutions creating accessible video content for diverse student populations
- Social media managers who want to increase video engagement through auto-captioning
The system can automatically detect speaker changes, handle timing synchronization, and format captions according to industry best practices, making it an invaluable tool for professional video production workflows.
Example:
This code example takes an audio file (which you would typically extract from your video first) and uses Whisper to create accurately timestamped captions in the standard `.srt` format.
Download the audio sample here: https://files.cuantum.tech/audio/video_audio.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Austin, Texas, United States" # User location context
print(f"Running Whisper caption generation (SRT) example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local audio file (extracted from video)
# IMPORTANT: Replace 'video_audio.mp3' with the actual filename.
audio_file_path = "video_audio.mp3"
# --- Optional Parameters ---
# Specifying language can improve caption accuracy
known_language = "en"
# Prompt can help with names, brands, or specific terminology in the video
transcription_prompt = "The video features interviews with Dr. Anya Sharma about sustainable agriculture and mentions the company 'TerraGrow'." # Set to None if not needed
# --- Function to Generate SRT Captions ---
def generate_captions_srt(client, file_path, language="en", prompt=None):
"""
Transcribes the audio file using Whisper and returns captions in SRT format.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file (extracted from video).
language (str): The language code (ISO-639-1). Defaults to "en".
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
str: The caption data in SRT format string, or None if an error occurs.
"""
print(f"\nAttempting to generate SRT captions for: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: File size check
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
pass # Continue attempt
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for SRT caption generation...")
# --- Make the API Call for Transcription ---
# Request 'srt' format directly
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
language=language,
prompt=prompt,
response_format="srt" # Request SRT format output
# Other options include "vtt"
)
# The response object for "srt" format is directly the SRT string
srt_content = response
print("SRT caption generation successful.")
return srt_content
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
srt_captions = generate_captions_srt(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if srt_captions:
# --- Save the SRT Content to a File ---
output_srt_file = os.path.splitext(audio_file_path)[0] + ".srt"
try:
with open(output_srt_file, "w", encoding="utf-8") as f:
f.write(srt_captions)
print(f"\nSRT captions saved successfully to '{output_srt_file}'")
# How this helps the use case:
print("\nThis SRT file can be used to:")
print("- Add closed captions or subtitles to video players (like YouTube, Vimeo, VLC).")
print("- Import into video editing software for caption integration.")
print("- Meet accessibility requirements and regulatory compliance (e.g., broadcasting).")
print("- Improve video engagement on social media platforms.")
print("- Make educational video content accessible to more students.")
except IOError as e:
print(f"Error saving SRT captions to file: {e}")
else:
print("\nSRT caption generation failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates how to use the Whisper API to streamline video captioning workflows by directly generating industry-standard SRT (SubRip Text) subtitle files from audio.
- Prerequisites: Requires the standard `openai` and `python-dotenv` setup and API key. Crucially, it requires an audio file that has already been extracted from the target video; this necessary pre-processing step is covered by a minimal extraction sketch after this breakdown.
- Key API Parameter (`response_format="srt"`):
  - The core of this use case is requesting the `srt` format directly from the `client.audio.transcriptions.create` endpoint.
  - This tells Whisper to format the output according to SRT specifications, including sequential numbering, start/end timestamps (e.g., `00:00:20,000 --> 00:00:24,400`), and the corresponding text chunk. The API handles the timing synchronization automatically. `vtt` (WebVTT) is another common format that can be requested similarly.
- Initialization & Function (`generate_captions_srt`): Standard client setup. The function encapsulates the API call specifically requesting SRT format. Optional `language` and `prompt` parameters can be used to enhance accuracy.
- Output Handling:
  - The `response` from the API call (when `response_format="srt"`) is directly the complete content of the SRT file as a single string.
  - The code then saves this string directly into a file with the `.srt` extension.
- Use Case Relevance:
  - The generated `.srt` file directly serves the needs of content creators, broadcasters, educators, and social media managers.
  - It can be easily uploaded to video platforms, imported into editing software, or used to meet accessibility and compliance standards, significantly simplifying the traditionally time-consuming process of manual caption creation.
- Error Handling: Includes standard checks for API and file system errors.
This example provides a highly practical demonstration of using Whisper for a common video production task, showcasing how to get accurately timestamped captions in a ready-to-use format. Remember the essential first step is to extract the audio from the video you wish to caption.
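The one step the script leaves out is pulling the audio track from the source video. A minimal sketch of that pre-processing step, assuming the widely used ffmpeg command-line tool is installed and using placeholder file names:

import subprocess

def extract_audio(video_path, audio_path="video_audio.mp3"):
    """Extract the audio track from a video as a compact mono MP3 suitable for Whisper."""
    command = [
        "ffmpeg",
        "-y",              # Overwrite the output file if it already exists
        "-i", video_path,  # Input video
        "-vn",             # Drop the video stream
        "-ac", "1",        # Downmix to mono
        "-b:a", "64k",     # A modest bitrate keeps typical recordings under the 25MB limit
        audio_path,
    ]
    subprocess.run(command, check=True)
    return audio_path

# Example usage (hypothetical video file name):
# extract_audio("product_demo.mp4", "video_audio.mp3")

The extracted file is what audio_file_path should point to before running generate_captions_srt.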
💡 Tip: Whisper works best when audio is clear, and it supports input files up to 25MB in size per request.
2.2.2 Response Formats You Can Use
Whether you're building a simple transcription service or developing a complex video captioning system, understanding these formats is crucial for effectively implementing Whisper in your applications.
The flexibility of Whisper's response formats allows developers to seamlessly integrate transcription and translation capabilities into different types of applications, from basic text output to more sophisticated JSON-based processing pipelines. Each format serves distinct purposes and offers unique advantages depending on your use case.
Whisper supports multiple response formats to match these needs:
- `text`: plain transcription or translation text, ready to drop into documents, notes, or search indexes.
- `json`: the default; the same text wrapped in a simple JSON object, convenient for programmatic pipelines.
- `verbose_json`: detailed JSON that adds metadata such as the detected language, duration, and timestamped segments (used in the accessibility example above).
- `srt`: SubRip subtitles with numbered, timestamped caption blocks, ready for video players and editing software.
- `vtt`: WebVTT captions for web-based video players.
2.2.3 Practical Applications of Whisper
Whisper's versatility extends far beyond basic transcription, offering numerous practical applications across different industries and use cases. From streamlining business operations to enhancing educational experiences, Whisper's capabilities can be leveraged in innovative ways to solve real-world challenges. Let's explore some of the most impactful applications of this technology and understand how they can benefit different user groups.
These applications demonstrate how Whisper can be integrated into various workflows to improve efficiency, accessibility, and communication across different sectors. Each use case showcases unique advantages and potential implementations that can transform how we handle audio content in professional and personal contexts.
Note-taking Tools
Transform spoken content into written text automatically through advanced speech recognition, enabling quick and accurate documentation of verbal communications. This technology excels at capturing natural speech patterns, technical terminology, and multiple speaker interactions, making it easier to document and review lectures, interviews, and meetings. The automated transcription process maintains context and speaker attribution while converting audio to easily searchable and editable text formats.
This tool is particularly valuable for:
- Students capturing detailed lecture notes while staying engaged in class discussions
- Journalists documenting interviews and press conferences with precise quotations
- Business professionals keeping accurate records of client meetings and team brainstorming sessions
- Researchers conducting and transcribing qualitative interviews
The automated notes can be further enhanced with features like timestamp markers, speaker identification, and keyword highlighting, making it easier to navigate and reference specific parts of the discussion later. This dramatically improves productivity by eliminating the need for manual transcription while ensuring comprehensive documentation of important conversations.
Multilingual Applications
Enable seamless communication across language barriers by converting spoken words from one language into another. Whisper itself translates speech from nearly 100 source languages directly into English text, detecting the source language automatically and handling a wide range of accents and dialects. This capability is particularly valuable for:
- International Business: Facilitating real-time communication in multinational meetings, negotiations, and presentations without the need for human interpreters
- Travel and Tourism: Enabling travelers to communicate effectively with locals, understand announcements, and navigate foreign environments
- Global Education: Supporting distance learning programs and international student exchanges by breaking down language barriers
- Customer Service: Allowing support teams to assist customers in their preferred language, improving service quality and satisfaction
The technology works across different audio environments and can handle multiple speakers, making it ideal for international business meetings, travel applications, global communication platforms, and cross-cultural exchanges.
Accessibility Solutions
Create real-time captions and transcripts for hearing-impaired individuals, ensuring equal access to audio content. This essential technology serves multiple accessibility purposes:
- Real-time Captioning: Provides immediate text representation of spoken words during live events, meetings, and presentations
- Accurate Transcription: Generates detailed written records of audio content for later reference and study
- Multi-format Support: Offers content in various formats including closed captions, subtitles, and downloadable transcripts
This technology can be integrated into:
- Video Conferencing Platforms: Enabling real-time captioning during virtual meetings and webinars
- Educational Platforms: Supporting students with hearing impairments in both online and traditional classroom settings
- Media Players: Providing synchronized captions for videos, podcasts, and other multimedia content
- Live Events: Offering real-time text displays during conferences, performances, and public speaking events
These accessibility features not only support individuals with hearing impairments but also benefit:
- Non-native speakers who prefer reading along while listening
- People in sound-sensitive environments where audio isn't practical
- Visual learners who process information better through text
- Organizations aiming to comply with accessibility regulations and standards
Podcast Enhancement
Convert audio content into searchable text formats through comprehensive transcription, enabling multiple benefits:
- Content Discovery: Listeners can easily search through episode transcripts to find specific topics, quotes, or discussions of interest (a small search sketch follows this list)
- Content Repurposing: Creators can extract key quotes and insights for social media, blog posts, or newsletters
- Accessibility: Makes content available to hearing-impaired audiences and those who prefer reading
- SEO Benefits: Search engines can index the full content of episodes, improving discoverability and organic traffic
- Enhanced Engagement: Readers can skim content before listening, bookmark important sections, and reference material later
- Analytics Insights: Analyze most-searched terms and popular segments to inform future content strategy
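To illustrate the content-discovery point from the list above, the sketch below searches the segments of a verbose_json transcription for a keyword so a listener or producer can jump straight to the matching moments. The episode file name and search term are hypothetical, and the dict-or-object handling mirrors the accessibility example earlier in this section.

def search_transcript(segments, keyword):
    """Return (timestamp, text) pairs for every segment that mentions the keyword."""
    matches = []
    for segment in segments:
        # Segments may be dicts or Pydantic objects depending on the openai library version
        text = segment["text"] if isinstance(segment, dict) else segment.text
        start = segment["start"] if isinstance(segment, dict) else segment.start
        if keyword.lower() in text.lower():
            minutes, seconds = divmod(int(start), 60)
            matches.append((f"{minutes:02d}:{seconds:02d}", text.strip()))
    return matches

# Example usage (hypothetical episode file and search term):
# episode = client.audio.transcriptions.create(
#     model="whisper-1",
#     file=open("episode_042.mp3", "rb"),
#     response_format="verbose_json",
# )
# for timestamp, text in search_transcript(episode.segments, "sponsorship"):
#     print(f"[{timestamp}] {text}")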
Language Learning Support
Provide immediate feedback to language learners by converting their spoken practice into text, allowing them to check pronunciation, grammar, and vocabulary usage. This technology creates an interactive and effective learning experience through several key mechanisms:
- Real-time Pronunciation Feedback: Learners can compare their spoken words with the transcribed text to identify pronunciation errors and areas for improvement
- Grammar Analysis: The system can highlight grammatical structures and potential errors in the transcribed text, helping learners understand their mistakes
- Vocabulary Enhancement: Students can track their active vocabulary usage and receive suggestions for more varied word choices
The technology particularly benefits:
- Self-directed learners practicing independently
- Language teachers tracking student progress
- Online language learning platforms offering speaking practice
- Language exchange participants wanting to verify their communication
When integrated with language learning applications, this feature can provide structured practice sessions, progress tracking, and personalized feedback that helps learners build confidence in their speaking abilities.
Example:
This example takes an audio file of a learner speaking a specific target language and transcribes it, providing the essential text output needed for pronunciation comparison, grammar analysis, or vocabulary review.
Download the sample audio file here: https://files.cuantum.tech/audio/language_practice_fr.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Orlando, Florida, United States" # User location context
print(f"Running Whisper language learning transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# --- Language Learning Specific Configuration ---
# Define the path to the language practice audio file
# IMPORTANT: Replace 'language_practice_fr.mp3' with the actual filename.
audio_file_path = "language_practice_fr.mp3"
# ** CRUCIAL: Specify the language being spoken in the audio file **
# Use the correct ISO-639-1 code (e.g., "en", "es", "fr", "de", "ja", "zh")
# This tells Whisper how to interpret the sounds.
target_language = "fr" # Example: French
# Optional: Provide context about the practice session (e.g., the expected phrase)
# This can help improve accuracy, especially for specific vocabulary or sentences.
practice_prompt = "The student is practicing ordering food at a French cafe, mentioning croissants and coffee." # Set to None if not needed
# --- Function to Transcribe Language Practice Audio ---
def transcribe_language_practice(client, file_path, language, prompt=None):
"""
Transcribes audio of language practice using the Whisper API.
Specifying the target language is crucial for this use case.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str): The ISO-639-1 code of the language being spoken.
prompt (str, optional): A text prompt providing context. Defaults to None.
Returns:
str: The transcribed text in the target language, or None if an error occurs.
"""
print(f"\nAttempting to transcribe language practice ({language}) from: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
if not language:
print("Error: Target language must be specified for language learning transcription.")
return None
# Optional: File size check
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
except OSError as e:
print(f"Error accessing file properties: {e}")
pass
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print(f"Sending audio ({language}) to Whisper API for transcription...")
# --- Make the API Call for Transcription ---
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
language=language, # Pass the specified target language
prompt=prompt,
response_format="text" # Get plain text for easy comparison/analysis
)
# The response object for "text" format is directly the string
practice_transcription = response
print("Transcription successful.")
return practice_transcription
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
elif "invalid language code" in str(e).lower():
print(f"Hint: Check if '{language}' is a valid ISO-639-1 language code supported by Whisper.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
transcribed_text = transcribe_language_practice(
client,
audio_file_path,
language=target_language,
prompt=practice_prompt
)
if transcribed_text:
print("\n--- Transcribed Practice Text ---")
print(f"(Language: {target_language})")
print(transcribed_text)
print("---------------------------------\n")
# How this helps the use case:
print("This transcription provides the basis for language learning feedback:")
print("- **Pronunciation Check:** Learner compares their speech to the text, identifying discrepancies.")
print("- **Grammar/Vocabulary Analysis:** This text can be compared against expected sentences or analyzed (potentially by another AI like GPT-4o, or specific tools) for grammatical correctness and vocabulary usage.")
print("- **Progress Tracking:** Teachers or platforms can store transcriptions to monitor improvement over time.")
print("- **Self-Correction:** Learners get immediate textual representation of their speech for review.")
print("\nNote: Further analysis (grammar checking, pronunciation scoring, etc.) requires additional logic beyond this transcription step.")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_transcription.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(transcribed_text)
print(f"Transcription saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving transcription to file: {e}")
else:
print("\nLanguage practice transcription failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates using the Whisper API as a tool for language learning support. It transcribes audio recordings of a learner speaking a target language, providing the essential text output needed for various feedback mechanisms.
- Prerequisites: Requires the standard openai and python-dotenv setup, an API key, and an audio file containing the learner's spoken practice in the target language (e.g., language_practice_fr.mp3).
- Key Parameter (language):
  - Unlike general transcription, where auto-detect might suffice, for language learning, explicitly specifying the language the learner is attempting to speak is crucial. This is set using the target_language variable (e.g., "fr" for French).
  - This ensures Whisper interprets the audio using the correct phonetic and vocabulary model, providing a more accurate transcription for feedback.
- Optional Prompt: The prompt parameter can be used to give Whisper context about the practice session (e.g., the specific phrase being practiced), which can improve recognition accuracy.
- Initialization & Function (transcribe_language_practice): Standard client setup. The function requires the language parameter and performs the transcription using client.audio.transcriptions.create.
- Output (response_format="text"): Plain text output is requested as it's the most direct format for learners or systems to compare against expected text, analyze grammar, or review vocabulary.
- Feedback Mechanism (Important Note): This script only provides the transcription. The actual feedback (pronunciation scoring, grammar correction, vocabulary suggestions) requires additional processing. The transcribed text serves as the input for those subsequent analysis steps, which could involve rule-based checks, comparison algorithms, or even feeding the text to another LLM like GPT-4o for analysis (a minimal sketch of that follow-up step appears after this breakdown).
- Use Case Relevance: The output section explains how this transcription enables the described benefits: allowing learners to check their pronunciation against text, providing material for grammar and vocabulary analysis, facilitating progress tracking, and supporting self-correction.
This example provides a practical starting point for integrating Whisper into language learning applications, focusing on generating the core textual data needed for effective feedback loops. Remember to use an audio file of someone speaking the specified target_language for testing.
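As noted in the feedback mechanism point above, the transcription is only the input to the actual feedback step. The sketch below shows one possible, simplified follow-up (it is not part of the Whisper call itself): it assumes the client, transcribed_text, and target_language variables from the script above, and asks GPT-4o for basic grammar and vocabulary feedback. Prompt wording, model choice, and output handling would all need refinement for a real language learning product.

def get_language_feedback(client, transcribed_text, target_language):
    """Asks GPT-4o to review the learner's transcribed speech (illustrative only)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a supportive language tutor. The student spoke in "
                    f"'{target_language}'. Point out grammar or vocabulary issues in "
                    "their transcribed speech and suggest a corrected version."
                )
            },
            {"role": "user", "content": transcribed_text}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content

# Example usage, after the transcription step above succeeds:
# feedback = get_language_feedback(client, transcribed_text, target_language)
# print(feedback)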
Summary
In this section, you learned several key aspects of audio processing and understanding:
- Transcribe spoken language into text using Whisper
  - Convert various audio formats into accurate text transcriptions
  - Handle multiple languages and accents with high accuracy
  - Process both short clips and longer recordings effectively
- Translate foreign-language audio to English
  - Convert non-English speech directly to English text
  - Maintain context and meaning across language barriers
  - Support multilingual content processing
- Choose between plain text, JSON, and subtitle outputs (see the short sketch after this list)
  - Select the most appropriate format for your specific use case
  - Generate subtitles with precise timestamps
  - Structure data in machine-readable JSON format
- Apply these tools in real-world use cases like accessibility, education, and content creation
  - Create accessible content with accurate transcriptions
  - Support language learning and educational initiatives
  - Streamline content production workflows
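For quick reference, the output format is controlled by a single parameter on the transcription call. The following minimal sketch assumes an initialized client and a placeholder audio file named clip.mp3; swapping the response_format value between "text", "json", "verbose_json", "srt", and "vtt" is all that changes between the formats covered in this section.

import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# "clip.mp3" is a placeholder; use any supported audio file under 25MB.
with open("clip.mp3", "rb") as audio_file:
    captions = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="srt"  # alternatives: "text", "json", "verbose_json", "vtt"
    )

print(captions)  # SRT output: numbered cues with start/end timestamps and text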
Whisper is incredibly fast, easy to use, and works across languages—making it one of the most valuable tools in the OpenAI ecosystem. Its versatility and accuracy make it suitable for both individual users and enterprise-scale applications, while its open-source nature allows for custom implementations and modifications to suit specific needs.
2.2 Transcription and Translation with Whisper API
The Whisper API represents a significant advancement in automated speech recognition and translation technology. This section explores the fundamental aspects of working with Whisper, including its capabilities, implementation methods, and practical applications. We'll examine how to effectively utilize the API for both transcription and translation tasks, covering everything from basic setup to advanced features.
Whether you're building applications for content creators, developing educational tools, or creating accessibility solutions, understanding Whisper's functionality is crucial. We'll walk through detailed examples and best practices that demonstrate how to integrate this powerful tool into your projects, while highlighting important considerations for optimal performance.
Throughout this section, you'll learn not just the technical implementation details, but also the strategic considerations for choosing appropriate response formats and handling various audio inputs. This knowledge will enable you to build robust, scalable solutions for audio processing needs.
2.2.1 What Is Whisper?
Whisper is OpenAI's groundbreaking open-source automatic speech recognition (ASR) model that represents a significant advancement in audio processing technology. This sophisticated system excels at handling diverse audio inputs, capable of processing various file formats including .mp3
, .mp4
, .wav
, and .m4a
. What sets Whisper apart is its dual functionality: it not only converts spoken content into highly accurate text transcriptions but also offers powerful translation capabilities, seamlessly converting non-English audio content into fluent English output. The model's robust architecture ensures high accuracy across different accents, speaking styles, and background noise conditions.
Whisper demonstrates exceptional versatility across numerous applications:
Meeting or podcast transcriptions
Converts lengthy discussions and presentations into searchable, shareable text documents with remarkable accuracy. This functionality is particularly valuable for businesses and content creators who need to:
- Archive important meetings for future reference
- Create accessible versions of audio content
- Enable quick searching through hours of recorded content
- Generate written documentation from verbal discussions
- Support compliance requirements for record-keeping
The high accuracy rate ensures that technical terms, proper names, and complex discussions are captured correctly while maintaining the natural flow of conversation.
Example:
Download the audio sample here: https://files.cuantum.tech/audio/meeting_snippet.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt
current_location = "Houston, Texas, United States"
print(f"Running Whisper transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local audio file
# IMPORTANT: Replace 'meeting_snippet.mp3' with the actual filename.
audio_file_path = "meeting_snippet.mp3"
# --- Optional Parameters for potentially better accuracy ---
# Specify language (ISO-639-1 code) if known, otherwise Whisper auto-detects.
# Example: "en" for English, "es" for Spanish, "de" for German
known_language = "en" # Set to None to auto-detect
# Provide a prompt with context, names, or jargon expected in the audio.
# This helps Whisper recognize specific terms accurately.
transcription_prompt = "The discussion involves Project Phoenix, stakeholders like Dr. Evelyn Reed and ACME Corp, and technical terms such as multi-threaded processing and cloud-native architecture." # Set to None if no prompt needed
# --- Function to Transcribe Audio ---
def transcribe_audio(client, file_path, language=None, prompt=None):
"""
Transcribes the given audio file using the Whisper API.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str, optional): The language code (ISO-639-1). Defaults to None (auto-detect).
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
str: The transcription text, or None if an error occurs.
"""
print(f"\nAttempting to transcribe audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Check file size (optional but good practice)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
print("Consider splitting the file into smaller chunks.")
# You might choose to exit here or attempt anyway
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
return None
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for transcription...")
# --- Make the API Call ---
response = client.audio.transcriptions.create(
model="whisper-1", # Specify the Whisper model
file=audio_file, # The audio file object
language=language, # Optional: Specify language
prompt=prompt, # Optional: Provide context prompt
response_format="text" # Request plain text output
# Other options: 'json', 'verbose_json', 'srt', 'vtt'
)
# The response object for "text" format is directly the string
transcription_text = response
print("Transcription successful.")
return transcription_text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
# Provide hints for common errors
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported (mp3, mp4, mpeg, mpga, m4a, wav, webm).")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit. Please split the file.")
return None
except FileNotFoundError: # Already handled above, but good practice
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
transcription = transcribe_audio(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if transcription:
print("\n--- Transcription Result ---")
print(transcription)
print("----------------------------\n")
# How this helps the use case:
print("This plain text transcription can now be:")
print("- Saved as a text document (.txt) for archiving.")
print("- Indexed and searched easily for specific keywords or names.")
print("- Used to generate meeting minutes or documentation.")
print("- Copied into accessibility tools or documents.")
print("- Stored for compliance record-keeping.")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_transcription.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(transcription)
print(f"Transcription saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving transcription to file: {e}")
else:
print("\nTranscription failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates transcribing audio files like meeting recordings or podcast episodes using OpenAI's Whisper API. The goal is to convert spoken content into accurate, searchable text, addressing needs like archiving, accessibility, and documentation.
- Prerequisites: It requires the
openai
andpython-dotenv
libraries, an OpenAI API key configured in a.env
file, and a sample audio file (meeting_snippet.mp3
). - Audio File Requirements: The script highlights the supported audio formats (MP3, WAV, M4A, etc.) and the crucial 25MB file size limit per API request. It includes a warning and explanation that longer files must be segmented (chunked) before processing, although the code itself handles a single file within the limit.
- Initialization: It sets up the standard
OpenAI
client using the API key. - Transcription Function (
transcribe_audio
):- Takes the
client
,file_path
, and optionallanguage
andprompt
arguments. - Includes a check for file existence and size.
- Opens the audio file in binary read mode (
"rb"
). - API Call: Uses
client.audio.transcriptions.create
with:model="whisper-1"
: Specifies the model known for high accuracy.file=audio_file
: Passes the opened file object.language
: Optionally provides the language code (e.g.,"en"
) to potentially improve accuracy if the language is known. IfNone
, Whisper auto-detects.prompt
: Optionally provides contextual keywords, names (like "Project Phoenix", "Dr. Evelyn Reed"), or jargon. This significantly helps Whisper accurately transcribe specialized terms often found in meetings or technical podcasts, directly addressing the need to capture complex discussions correctly.response_format="text"
: Requests the output directly as a plain text string, which is ideal for immediate use in documents, search indexing, etc. Other formats likeverbose_json
(for timestamps) orsrt
/vtt
(for subtitles) could be requested if needed.
- Error Handling: Includes
try...except
blocks for API errors (providing hints for common issues like file size or format) and file system errors.
- Takes the
- Output and Usefulness:
- The resulting transcription text is printed.
- The code explicitly connects this output back to the use case benefits: creating searchable archives, generating documentation, supporting accessibility, and enabling compliance.
- It includes an optional step to save the transcription directly to a
.txt
file.
This example provides a practical implementation of Whisper for the described use case, emphasizing accuracy features (prompting, language specification) and explaining how the text output facilitates the desired downstream tasks like searching and archiving. Remember to use an actual audio file (within the size limit) for testing.
Voice note conversion
Transforms quick voice memos into organized text notes, making reviewing and archiving spoken thoughts easier. This functionality is particularly valuable for:
- Creating quick reminders and to-do lists while on the go
- Capturing creative ideas or brainstorming sessions without interrupting the flow of thought
- Taking notes during field work or site visits where typing is impractical
- Documenting observations or research findings in real-time
The system maintains the natural flow of speech while organizing the content into clear, readable text that can be easily searched, shared, or integrated into other documents.
Example:
This script focuses on taking a typical voice memo audio file and converting it into searchable, usable text, ideal for capturing ideas, reminders, or field notes.
Download the sample audio here: https://files.cuantum.tech/audio/voice_memo.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Miami, Florida, United States"
print(f"Running Whisper voice note transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local voice note audio file
# IMPORTANT: Replace 'voice_memo.m4a' with the actual filename.
audio_file_path = "voice_memo.mp3"
# --- Optional Parameters (Often not needed for simple voice notes) ---
# Language auto-detection is usually sufficient for voice notes.
known_language = None # Set to "en", "es", etc. if needed
# Prompt is useful if your notes contain specific jargon/names, otherwise leave as None.
transcription_prompt = None # Example: "Remember to mention Project Chimera and the client ZetaCorp."
# --- Function to Transcribe Voice Note ---
def transcribe_voice_note(client, file_path, language=None, prompt=None):
"""
Transcribes the given voice note audio file using the Whisper API.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str, optional): The language code (ISO-639-1). Defaults to None (auto-detect).
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
str: The transcription text, or None if an error occurs.
"""
print(f"\nAttempting to transcribe voice note: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: Check file size (less likely to be an issue for voice notes)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
# return None # Decide if you want to stop for large files
except OSError as e:
print(f"Error accessing file properties: {e}")
# Continue attempt even if size check fails
pass
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending voice note to Whisper API for transcription...")
# --- Make the API Call ---
response = client.audio.transcriptions.create(
model="whisper-1", # Specify the Whisper model
file=audio_file, # The audio file object
language=language, # Defaults to None (auto-detect)
prompt=prompt, # Defaults to None
response_format="text" # Request plain text output for easy use as notes
)
# The response object for "text" format is directly the string
note_text = response
print("Transcription successful.")
return note_text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported (m4a, mp3, wav, etc.).")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
transcribed_note = transcribe_voice_note(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if transcribed_note:
print("\n--- Transcribed Voice Note Text ---")
print(transcribed_note)
print("-----------------------------------\n")
# How this helps the use case:
print("This transcribed text from your voice note can be easily:")
print("- Copied into reminder apps or to-do lists.")
print("- Saved as a text file for archiving creative ideas or brainstorms.")
print("- Searched later for specific keywords or topics.")
print("- Shared via email or messaging apps.")
print("- Integrated into reports or documentation (e.g., field notes).")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_transcription.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(transcribed_note)
print(f"Transcription saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving transcription to file: {e}")
else:
print("\nVoice note transcription failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates using the Whisper API to transcribe short audio recordings like voice notes or memos. This is ideal for quickly capturing thoughts, ideas, reminders, or field observations and converting them into easily manageable text.
- Prerequisites: Requires the standard
openai
andpython-dotenv
setup, API key, and a sample voice note audio file (common formats like.m4a
,.mp3
,.wav
are supported). - Audio File Handling: The script emphasizes that voice notes are typically well under the 25MB API limit. It opens the file in binary read mode (
"rb"
). - Initialization: Sets up the
OpenAI
client. - Transcription Function (
transcribe_voice_note
):- Similar to the meeting transcription function but tailored for voice notes.
- Defaults: It defaults
language
toNone
(auto-detect) andprompt
toNone
, as these are often sufficient for typical voice memos where specific jargon might be less common. The user can still provide these if needed. - API Call: Uses
client.audio.transcriptions.create
withmodel="whisper-1"
. response_format="text"
: Explicitly requests plain text output, which is the most practical format for notes – easy to read, search, copy, and share.- Error Handling: Includes standard
try...except
blocks for API and file errors.
- Output and Usefulness:
- Prints the resulting text transcription.
- Explicitly connects the output to the benefits mentioned in the use case description: creating reminders/to-dos, capturing ideas, documenting observations, enabling search, and sharing.
- Includes an optional step to save the transcribed note to a
.txt
file.
This example provides a clear and practical implementation for converting voice notes to text using Whisper, highlighting its convenience for capturing information on the go. Remember to use an actual voice note audio file for testing and update the audio_file_path
variable accordingly.
Multilingual speech translation
Breaks down language barriers by providing accurate translations while preserving the original context and meaning. This powerful feature enables:
- Real-time communication across language barriers in international meetings and conferences
- Translation of educational content for global audiences while maintaining academic integrity
- Cross-cultural business negotiations with precise translation of technical terms and cultural nuances
- Documentation translation for multinational organizations with consistent terminology
The system can detect the source language automatically and provides translations that maintain the speaker's original tone, intent, and professional context, making it invaluable for global collaboration.
Example:
This example takes an audio file containing speech in a language other than English and translates it directly into English text.
Download the sample audio here: https://files.cuantum.tech/audio/spanish_speech.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Denver, Colorado, United States"
print(f"Running Whisper speech translation example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local non-English audio file
# IMPORTANT: Replace 'spanish_speech.mp3' with the actual filename.
audio_file_path = "spanish_speech.mp3"
# --- Optional Parameters ---
# Prompt can help guide recognition of specific names/terms in the SOURCE language
# before translation, potentially improving accuracy of the final English text.
translation_prompt = None # Example: "The discussion mentions La Sagrada Familia and Parc Güell."
# --- Function to Translate Speech to English ---
def translate_speech_to_english(client, file_path, prompt=None):
"""
Translates speech from the given audio file into English using the Whisper API.
The source language is automatically detected.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file (non-English speech).
prompt (str, optional): A text prompt to guide source language recognition. Defaults to None.
Returns:
str: The translated English text, or None if an error occurs.
"""
print(f"\nAttempting to translate speech from audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: Check file size
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Translation may fail.")
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
pass # Continue attempt
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for translation to English...")
# --- Make the API Call for Translation ---
# Note: Using the 'translations' endpoint, not 'transcriptions'
response = client.audio.translations.create(
model="whisper-1", # Specify the Whisper model
file=audio_file, # The audio file object
prompt=prompt, # Optional: Provide context prompt
response_format="text" # Request plain text English output
# Other options: 'json', 'verbose_json', 'srt', 'vtt'
)
# The response object for "text" format is directly the translated string
translated_text = response
print("Translation successful.")
return translated_text
except OpenAIError as e:
print(f"OpenAI API Error during translation: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
# Whisper might error if it cannot detect speech or supported language
elif "language could not be detected" in str(e).lower():
print("Hint: Ensure the audio contains detectable speech in a supported language.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during translation: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
english_translation = translate_speech_to_english(
client,
audio_file_path,
prompt=translation_prompt
)
if english_translation:
print("\n--- Translated English Text ---")
print(english_translation)
print("-------------------------------\n")
# How this helps the use case:
print("This English translation enables:")
print("- Understanding discussions from international meetings.")
print("- Making educational content accessible to English-speaking audiences.")
print("- Facilitating cross-cultural business communication.")
print("- Creating English versions of documentation originally recorded in other languages.")
print("- Quick communication across language barriers using voice input.")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_translation_en.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(english_translation)
print(f"Translation saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving translation to file: {e}")
else:
print("\nSpeech translation failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates the Whisper API's capability to perform speech-to-text translation, specifically translating audio from various source languages directly into English text. This addresses the need for breaking down language barriers in global communication scenarios.
- Prerequisites: Requires the standard
openai
andpython-dotenv
setup, API key, and crucially, a sample audio file containing speech in a language other than English (e.g., Spanish, French, German). - Key API Endpoint: This example uses
client.audio.translations.create
, which is distinct from thetranscriptions
endpoint used previously. Thetranslations
endpoint is specifically designed to output English text. - Automatic Language Detection: A key feature highlighted is that the source language of the audio file does not need to be specified; Whisper automatically detects it before translating to English.
- Initialization: Sets up the
OpenAI
client. - Translation Function (
translate_speech_to_english
):- Takes the
client
,file_path
, and an optionalprompt
. - Opens the non-English audio file in binary read mode (
"rb"
). - API Call: Uses
client.audio.translations.create
with:model="whisper-1"
: The standard Whisper model.file=audio_file
: The audio file object.prompt
: Optionally provides context (in the source language or English, often helps with names/terms) to aid accurate recognition before translation, helping to preserve nuances and technical terms.response_format="text"
: Requests plain English text output.
- Error Handling: Includes
try...except
blocks, noting potential errors if speech or a supported language isn't detected in the audio.
- Takes the
- Output and Usefulness:
- Prints the resulting English translation text.
- Explicitly connects this output to the benefits described in the use case: enabling understanding in international meetings, translating educational/business content for global audiences, and facilitating cross-language documentation and communication.
- Shows how to optionally save the English translation to a
.txt
file.
This example effectively showcases Whisper's powerful translation feature, making it invaluable for scenarios requiring communication or content understanding across different languages. Remember to use an audio file with non-English speech for testing and update the audio_file_path
accordingly.
Accessibility tools
Enables better digital inclusion by providing real-time transcription services for hearing-impaired users, offering several key benefits:
- Empowers deaf and hard-of-hearing individuals to participate fully in audio-based content
- Provides instant access to spoken information in professional settings like meetings and conferences
- Supports educational environments by making lectures and discussions accessible to all students
- Enhances media consumption by enabling accurate, real-time captioning for videos and live streams
Example:
This example focuses on using the verbose_json
output format to get segment-level timestamps, which are essential for syncing text with audio or video.
Download the sample audio here: https://files.cuantum.tech/audio/lecture_snippet.mp3
import os
import json # To potentially parse the verbose_json output nicely
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Atlanta, Georgia, United States" # User location context
print(f"Running Whisper accessibility transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local audio file
# IMPORTANT: Replace 'lecture_snippet.mp3' with the actual filename.
audio_file_path = "lecture_snippet.mp3"
# --- Optional Parameters ---
# Specifying language is often good for accuracy in accessibility contexts
known_language = "en"
# Prompt can help with specific terminology in lectures or meetings
transcription_prompt = "The lecture discusses quantum entanglement, superposition, and Bell's theorem." # Set to None if not needed
# --- Function to Transcribe Audio with Timestamps ---
def transcribe_for_accessibility(client, file_path, language="en", prompt=None):
"""
Transcribes the audio file using Whisper, requesting timestamped output
suitable for accessibility applications (e.g., captioning).
Note on 'Real-time': This function processes a complete file. True real-time
captioning requires capturing audio in chunks, sending each chunk to the API
quickly, and displaying the results sequentially. This example generates the
*type* of data needed for such applications from a pre-recorded file.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str): The language code (ISO-639-1). Defaults to "en".
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
dict: The parsed verbose_json response containing text and segments
with timestamps, or None if an error occurs.
"""
print(f"\nAttempting to transcribe for accessibility: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: File size check
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
pass # Continue attempt
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for timestamped transcription...")
# --- Make the API Call for Transcription ---
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
language=language,
prompt=prompt,
# Request detailed JSON output including timestamps
response_format="verbose_json",
# Explicitly request segment-level timestamps
timestamp_granularities=["segment"]
)
# The response object is already a Pydantic model behaving like a dict
timestamped_data = response
print("Timestamped transcription successful.")
# You can access response.text, response.segments etc directly
return timestamped_data
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Function to Format Timestamps ---
def format_timestamp(seconds):
"""Converts seconds to HH:MM:SS.fff format."""
td = datetime.timedelta(seconds=seconds)
total_milliseconds = int(td.total_seconds() * 1000)
hours, remainder = divmod(total_milliseconds, 3600000)
minutes, remainder = divmod(remainder, 60000)
seconds, milliseconds = divmod(remainder, 1000)
return f"{hours:02}:{minutes:02}:{seconds:02}.{milliseconds:03}"
# --- Main Execution ---
if __name__ == "__main__":
transcription_data = transcribe_for_accessibility(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if transcription_data:
print("\n--- Full Transcription Text ---")
# Access the full text directly from the response object
print(transcription_data.text)
print("-------------------------------\n")
print("--- Transcription Segments with Timestamps ---")
# Iterate through segments for timestamped data
for segment in transcription_data.segments:
start_time = format_timestamp(segment['start'])
end_time = format_timestamp(segment['end'])
segment_text = segment['text']
print(f"[{start_time} --> {end_time}] {segment_text}")
print("---------------------------------------------\n")
# How this helps the use case:
print("This timestamped data enables:")
print("- Displaying captions synchronized with video or audio streams.")
print("- Highlighting text in real-time as it's spoken in meetings or lectures.")
print("- Creating accessible versions of educational/media content.")
print("- Allowing users to navigate audio by clicking on text segments.")
print("- Fuller participation for hearing-impaired individuals.")
# Optional: Save the detailed JSON output
output_json_file = os.path.splitext(audio_file_path)[0] + "_timestamps.json"
try:
# The response object can be converted to dict for JSON serialization
with open(output_json_file, "w", encoding="utf-8") as f:
# Use .model_dump_json() for Pydantic V2 models from openai>=1.0.0
# or .dict() for older versions/models
try:
f.write(transcription_data.model_dump_json(indent=2))
except AttributeError:
# Fallback for older versions or different object types
import json
f.write(json.dumps(transcription_data, default=lambda o: o.__dict__, indent=2))
print(f"Detailed timestamp data saved to '{output_json_file}'")
except (IOError, TypeError, AttributeError) as e:
print(f"Error saving timestamp data to JSON file: {e}")
else:
print("\nTranscription for accessibility failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates using the Whisper API to generate highly accurate transcriptions with segment-level timestamps. This output is crucial for accessibility applications, enabling features like real-time captioning, synchronized text highlighting, and navigable transcripts for hearing-impaired users.
- Prerequisites: Standard setup (
openai
,python-dotenv
, API key) and an audio file (lecture_snippet.mp3
). - Key API Parameters:
- Endpoint: Uses
client.audio.transcriptions.create
. response_format="verbose_json"
: This is essential. It requests a detailed JSON object containing not only the full transcription text but also a list of segments, each with start/end times and the corresponding text.timestamp_granularities=["segment"]
: Explicitly requests segment-level timing information (though often included by default withverbose_json
). Word-level timestamps can also be requested if needed (["word"]
), but segments are typically used for captioning.language
/prompt
: Specifying the language (en
) and providing a prompt can enhance accuracy, which is vital for accessibility.
- Endpoint: Uses
- "Real-Time" Consideration: The explanation clarifies that while this code processes a file, the timestamped output it generates is what's needed for real-time applications. True live captioning would involve feeding audio chunks to the API rapidly.
- Initialization & Function (
transcribe_for_accessibility
): Standard client setup. The function encapsulates the API call requestingverbose_json
. - Output Processing:
- The code first prints the full concatenated text (
transcription_data.text
). - It then iterates through the
transcription_data.segments
list. - For each
segment
, it extracts thestart
time,end
time, andtext
. - A helper function (
format_timestamp
) converts the times (in seconds) into a standardHH:MM:SS.fff
format. - It prints each segment with its timing information (e.g.,
[00:00:01.234 --> 00:00:05.678] This is the first segment text.
).
- The code first prints the full concatenated text (
- Use Case Relevance: The output clearly shows how this timestamped data directly enables the benefits described: synchronizing text with audio/video for captions, allowing participation in meetings/lectures, making educational content accessible, and enhancing media consumption.
- Saving Output: Includes an option to save the complete
verbose_json
response to a file for later use or more complex processing. It handles potential differences in serializing the response object from theopenai
library.
This example effectively demonstrates how to obtain the necessary timestamped data from Whisper to power various accessibility features, making audio content more inclusive. Remember to use a relevant audio file for testing.
Video captioning workflows
Streamlines the creation of accurate subtitles and closed captions for video content, supporting multiple output formats including SRT, WebVTT, and other industry-standard caption formats. This capability is essential for:
- Content creators who need to make their videos accessible across multiple platforms
- Broadcasting companies requiring accurate closed captioning for regulatory compliance
- Educational institutions creating accessible video content for diverse student populations
- Social media managers who want to increase video engagement through auto-captioning
The system can automatically detect speaker changes, handle timing synchronization, and format captions according to industry best practices, making it an invaluable tool for professional video production workflows.
Example:
This code example takes an audio file (which you would typically extract from your video first) and uses Whisper to create accurately timestamped captions in the standard .srt
format.
Download the audio sample here: https://files.cuantum.tech/audio/video_audio.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Austin, Texas, United States" # User location context
print(f"Running Whisper caption generation (SRT) example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local audio file (extracted from video)
# IMPORTANT: Replace 'video_audio.mp3' with the actual filename.
audio_file_path = "video_audio.mp3"
# --- Optional Parameters ---
# Specifying language can improve caption accuracy
known_language = "en"
# Prompt can help with names, brands, or specific terminology in the video
transcription_prompt = "The video features interviews with Dr. Anya Sharma about sustainable agriculture and mentions the company 'TerraGrow'." # Set to None if not needed
# --- Function to Generate SRT Captions ---
def generate_captions_srt(client, file_path, language="en", prompt=None):
"""
Transcribes the audio file using Whisper and returns captions in SRT format.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file (extracted from video).
language (str): The language code (ISO-639-1). Defaults to "en".
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
str: The caption data in SRT format string, or None if an error occurs.
"""
print(f"\nAttempting to generate SRT captions for: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: File size check
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
pass # Continue attempt
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for SRT caption generation...")
# --- Make the API Call for Transcription ---
# Request 'srt' format directly
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
language=language,
prompt=prompt,
response_format="srt" # Request SRT format output
# Other options include "vtt"
)
# The response object for "srt" format is directly the SRT string
srt_content = response
print("SRT caption generation successful.")
return srt_content
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
srt_captions = generate_captions_srt(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if srt_captions:
# --- Save the SRT Content to a File ---
output_srt_file = os.path.splitext(audio_file_path)[0] + ".srt"
try:
with open(output_srt_file, "w", encoding="utf-8") as f:
f.write(srt_captions)
print(f"\nSRT captions saved successfully to '{output_srt_file}'")
# How this helps the use case:
print("\nThis SRT file can be used to:")
print("- Add closed captions or subtitles to video players (like YouTube, Vimeo, VLC).")
print("- Import into video editing software for caption integration.")
print("- Meet accessibility requirements and regulatory compliance (e.g., broadcasting).")
print("- Improve video engagement on social media platforms.")
print("- Make educational video content accessible to more students.")
except IOError as e:
print(f"Error saving SRT captions to file: {e}")
else:
print("\nSRT caption generation failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates how to use the Whisper API to streamline video captioning workflows by directly generating industry-standard SRT (SubRip Text) subtitle files from audio.
- Prerequisites: Requires the standard
openai
andpython-dotenv
setup and API key. Crucially, it requires an audio file that has been extracted from the target video. The script includes a note explaining this necessary pre-processing step. - Key API Parameter (
response_format="srt"
):- The core of this use case is requesting the
srt
format directly from theclient.audio.transcriptions.create
endpoint. - This tells Whisper to format the output according to SRT specifications, including sequential numbering, start/end timestamps (e.g.,
00:00:20,000 --> 00:00:24,400
), and the corresponding text chunk. The API handles the timing synchronization automatically.vtt
(WebVTT) is another common format that can be requested similarly.
- The core of this use case is requesting the
- Initialization & Function (
generate_captions_srt
): Standard client setup. The function encapsulates the API call specifically requesting SRT format. Optionallanguage
andprompt
parameters can be used to enhance accuracy. - Output Handling:
- The
response
from the API call (whenresponse_format="srt"
) is directly the complete content of the SRT file as a single string. - The code then saves this string directly into a file with the
.srt
extension.
- The
- Use Case Relevance:
- The explanation highlights how this generated
.srt
file directly serves the needs of content creators, broadcasters, educators, and social media managers. - It can be easily uploaded to video platforms, imported into editing software, or used to meet accessibility and compliance standards. It significantly simplifies the traditionally time-consuming process of manual caption creation.
- The explanation highlights how this generated
- Error Handling: Includes standard checks for API and file system errors.
This example provides a highly practical demonstration of using Whisper for a common video production task, showcasing how to get accurately timestamped captions in a ready-to-use format. Remember the essential first step is to extract the audio from the video you wish to caption.
💡 Tip: Whisper works best when audio is clear, and it supports input files up to 25MB in size per request.
2.2.2 Response Formats You Can Use
Whether you're building a simple transcription service or developing a complex video captioning system, understanding these formats is crucial for effectively implementing Whisper in your applications.
The flexibility of Whisper's response formats allows developers to seamlessly integrate transcription and translation capabilities into different types of applications, from basic text output to more sophisticated JSON-based processing pipelines. Each format serves distinct purposes and offers unique advantages depending on your use case.
Whisper supports multiple formats for your needs:
2.2.3 Practical Applications of Whisper
Whisper's versatility extends far beyond basic transcription, offering numerous practical applications across different industries and use cases. From streamlining business operations to enhancing educational experiences, Whisper's capabilities can be leveraged in innovative ways to solve real-world challenges. Let's explore some of the most impactful applications of this technology and understand how they can benefit different user groups.
These applications demonstrate how Whisper can be integrated into various workflows to improve efficiency, accessibility, and communication across different sectors. Each use case showcases unique advantages and potential implementations that can transform how we handle audio content in professional and personal contexts.
Note-taking Tools
Transform spoken content into written text automatically through advanced speech recognition, enabling quick and accurate documentation of verbal communications. This technology excels at capturing natural speech patterns, technical terminology, and multiple speaker interactions, making it easier to document and review lectures, interviews, and meetings. The automated transcription process maintains context and speaker attribution while converting audio to easily searchable and editable text formats.
This tool is particularly valuable for:
- Students capturing detailed lecture notes while staying engaged in class discussions
- Journalists documenting interviews and press conferences with precise quotations
- Business professionals keeping accurate records of client meetings and team brainstorming sessions
- Researchers conducting and transcribing qualitative interviews
The automated notes can be further enhanced with features like timestamp markers, speaker identification, and keyword highlighting, making it easier to navigate and reference specific parts of the discussion later. This dramatically improves productivity by eliminating the need for manual transcription while ensuring comprehensive documentation of important conversations.
Multilingual Applications
Enable seamless communication across language barriers by converting spoken words from one language to another in real-time. This powerful capability allows for instant translation of spoken content, supporting over 100 languages with high accuracy. The system can detect language automatically and handle various accents and dialects, making it particularly valuable for:
- International Business: Facilitating real-time communication in multinational meetings, negotiations, and presentations without the need for human interpreters
- Travel and Tourism: Enabling travelers to communicate effectively with locals, understand announcements, and navigate foreign environments
- Global Education: Supporting distance learning programs and international student exchanges by breaking down language barriers
- Customer Service: Allowing support teams to assist customers in their preferred language, improving service quality and satisfaction
The technology works across different audio environments and can handle multiple speakers, making it ideal for international business meetings, travel applications, global communication platforms, and cross-cultural exchanges.
Accessibility Solutions
Create real-time captions and transcripts for hearing-impaired individuals, ensuring equal access to audio content. This essential technology serves multiple accessibility purposes:
- Real-time Captioning: Provides immediate text representation of spoken words during live events, meetings, and presentations
- Accurate Transcription: Generates detailed written records of audio content for later reference and study
- Multi-format Support: Offers content in various formats including closed captions, subtitles, and downloadable transcripts
This technology can be integrated into:
- Video Conferencing Platforms: Enabling real-time captioning during virtual meetings and webinars
- Educational Platforms: Supporting students with hearing impairments in both online and traditional classroom settings
- Media Players: Providing synchronized captions for videos, podcasts, and other multimedia content
- Live Events: Offering real-time text displays during conferences, performances, and public speaking events
These accessibility features not only support individuals with hearing impairments but also benefit:
- Non-native speakers who prefer reading along while listening
- People in sound-sensitive environments where audio isn't practical
- Visual learners who process information better through text
- Organizations aiming to comply with accessibility regulations and standards
Podcast Enhancement
Convert audio content into searchable text formats through comprehensive transcription, enabling multiple benefits:
- Content Discovery: Listeners can easily search through episode transcripts to find specific topics, quotes, or discussions of interest
- Content Repurposing: Creators can extract key quotes and insights for social media, blog posts, or newsletters
- Accessibility: Makes content available to hearing-impaired audiences and those who prefer reading
- SEO Benefits: Search engines can index the full content of episodes, improving discoverability and organic traffic
- Enhanced Engagement: Readers can skim content before listening, bookmark important sections, and reference material later
- Analytics Insights: Analyze most-searched terms and popular segments to inform future content strategy
Language Learning Support
Provide immediate feedback to language learners by converting their spoken practice into text, allowing them to check pronunciation, grammar, and vocabulary usage. This technology creates an interactive and effective learning experience through several key mechanisms:
- Real-time Pronunciation Feedback: Learners can compare their spoken words with the transcribed text to identify pronunciation errors and areas for improvement
- Grammar Analysis: The system can highlight grammatical structures and potential errors in the transcribed text, helping learners understand their mistakes
- Vocabulary Enhancement: Students can track their active vocabulary usage and receive suggestions for more varied word choices
The technology particularly benefits:
- Self-directed learners practicing independently
- Language teachers tracking student progress
- Online language learning platforms offering speaking practice
- Language exchange participants wanting to verify their communication
When integrated with language learning applications, this feature can provide structured practice sessions, progress tracking, and personalized feedback that helps learners build confidence in their speaking abilities.
Example:
This example takes an audio file of a learner speaking a specific target language and transcribes it, providing the essential text output needed for pronunciation comparison, grammar analysis, or vocabulary review.
Download the sample audio file here: https://files.cuantum.tech/audio/language_practice_fr.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Optional location context for the run log
current_location = "Orlando, Florida, United States"
print(f"Running Whisper language learning transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# --- Language Learning Specific Configuration ---
# Define the path to the language practice audio file
# IMPORTANT: Replace 'language_practice_fr.mp3' with the actual filename.
audio_file_path = "language_practice_fr.mp3"
# ** CRUCIAL: Specify the language being spoken in the audio file **
# Use the correct ISO-639-1 code (e.g., "en", "es", "fr", "de", "ja", "zh")
# This tells Whisper how to interpret the sounds.
target_language = "fr" # Example: French
# Optional: Provide context about the practice session (e.g., the expected phrase)
# This can help improve accuracy, especially for specific vocabulary or sentences.
practice_prompt = "The student is practicing ordering food at a French cafe, mentioning croissants and coffee." # Set to None if not needed
# --- Function to Transcribe Language Practice Audio ---
def transcribe_language_practice(client, file_path, language, prompt=None):
"""
Transcribes audio of language practice using the Whisper API.
Specifying the target language is crucial for this use case.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str): The ISO-639-1 code of the language being spoken.
prompt (str, optional): A text prompt providing context. Defaults to None.
Returns:
str: The transcribed text in the target language, or None if an error occurs.
"""
print(f"\nAttempting to transcribe language practice ({language}) from: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
if not language:
print("Error: Target language must be specified for language learning transcription.")
return None
# Optional: File size check
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
except OSError as e:
print(f"Error accessing file properties: {e}")
pass
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print(f"Sending audio ({language}) to Whisper API for transcription...")
# --- Make the API Call for Transcription ---
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
language=language, # Pass the specified target language
prompt=prompt,
response_format="text" # Get plain text for easy comparison/analysis
)
# The response object for "text" format is directly the string
practice_transcription = response
print("Transcription successful.")
return practice_transcription
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
elif "invalid language code" in str(e).lower():
print(f"Hint: Check if '{language}' is a valid ISO-639-1 language code supported by Whisper.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
transcribed_text = transcribe_language_practice(
client,
audio_file_path,
language=target_language,
prompt=practice_prompt
)
if transcribed_text:
print("\n--- Transcribed Practice Text ---")
print(f"(Language: {target_language})")
print(transcribed_text)
print("---------------------------------\n")
# How this helps the use case:
print("This transcription provides the basis for language learning feedback:")
print("- **Pronunciation Check:** Learner compares their speech to the text, identifying discrepancies.")
print("- **Grammar/Vocabulary Analysis:** This text can be compared against expected sentences or analyzed (potentially by another AI like GPT-4o, or specific tools) for grammatical correctness and vocabulary usage.")
print("- **Progress Tracking:** Teachers or platforms can store transcriptions to monitor improvement over time.")
print("- **Self-Correction:** Learners get immediate textual representation of their speech for review.")
print("\nNote: Further analysis (grammar checking, pronunciation scoring, etc.) requires additional logic beyond this transcription step.")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_transcription.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(transcribed_text)
print(f"Transcription saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving transcription to file: {e}")
else:
print("\nLanguage practice transcription failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates using the Whisper API as a tool for language learning support. It transcribes audio recordings of a learner speaking a target language, providing the essential text output needed for various feedback mechanisms.
- Prerequisites: Requires the standard openai and python-dotenv setup, an API key, and an audio file containing the learner's spoken practice in the target language (e.g., language_practice_fr.mp3).
- Key Parameter (language):
  - Unlike general transcription, where auto-detect might suffice, for language learning explicitly specifying the language the learner is attempting to speak is crucial. This is set using the target_language variable (e.g., "fr" for French).
  - This ensures Whisper interprets the audio using the correct phonetic and vocabulary model, providing a more accurate transcription for feedback.
- Optional Prompt: The prompt parameter can be used to give Whisper context about the practice session (e.g., the specific phrase being practiced), which can improve recognition accuracy.
- Initialization & Function (transcribe_language_practice): Standard client setup. The function requires the language parameter and performs the transcription using client.audio.transcriptions.create.
- Output (response_format="text"): Plain text output is requested as it's the most direct format for learners or systems to compare against expected text, analyze grammar, or review vocabulary.
- Feedback Mechanism (Important Note): This script only provides the transcription. The actual feedback (pronunciation scoring, grammar correction, vocabulary suggestions) requires additional processing. The transcribed text serves as the input for those subsequent analysis steps, which could involve rule-based checks, comparison algorithms, or even feeding the text to another LLM like GPT-4o for analysis.
- Use Case Relevance: The output section explains how this transcription enables the described benefits: allowing learners to check their pronunciation against text, providing material for grammar and vocabulary analysis, facilitating progress tracking, and supporting self-correction.
This example provides a practical starting point for integrating Whisper into language learning applications, focusing on generating the core textual data needed for effective feedback loops. Remember to use an audio file of someone speaking the specified target_language for testing.
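As the breakdown notes, the transcription is only the input to the feedback step. One possible way to close the loop, assuming access to a chat model such as GPT-4o, is sketched below: the transcribed practice text is sent to the model with an illustrative tutoring prompt. The function name and prompt wording are examples, not a prescribed design.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def get_language_feedback(transcribed_text: str, language: str = "fr") -> str:
    """Ask a chat model to review the learner's transcribed speech (illustrative prompt)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a supportive language tutor. Review the learner's transcribed "
                    f"speech in '{language}'. Point out grammar mistakes, suggest more natural "
                    "vocabulary, and keep the feedback brief and encouraging."
                ),
            },
            {"role": "user", "content": transcribed_text},
        ],
    )
    return response.choices[0].message.content

# Example usage with the transcription produced by the script above:
# feedback = get_language_feedback(transcribed_text, language=target_language)
# print(feedback)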
Summary
In this section, you learned several key aspects of audio processing and understanding:
- Transcribe spoken language into text using Whisper
  - Convert various audio formats into accurate text transcriptions
  - Handle multiple languages and accents with high accuracy
  - Process both short clips and longer recordings effectively
- Translate foreign-language audio to English
  - Convert non-English speech directly to English text
  - Maintain context and meaning across language barriers
  - Support multilingual content processing
- Choose between plain text, JSON, and subtitle outputs
  - Select the most appropriate format for your specific use case
  - Generate subtitles with precise timestamps
  - Structure data in machine-readable JSON format
- Apply these tools in real-world use cases like accessibility, education, and content creation
  - Create accessible content with accurate transcriptions
  - Support language learning and educational initiatives
  - Streamline content production workflows
Whisper is incredibly fast, easy to use, and works across languages—making it one of the most valuable tools in the OpenAI ecosystem. Its versatility and accuracy make it suitable for both individual users and enterprise-scale applications, while its open-source nature allows for custom implementations and modifications to suit specific needs.
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local audio file
# IMPORTANT: Replace 'meeting_snippet.mp3' with the actual filename.
audio_file_path = "meeting_snippet.mp3"
# --- Optional Parameters for potentially better accuracy ---
# Specify language (ISO-639-1 code) if known, otherwise Whisper auto-detects.
# Example: "en" for English, "es" for Spanish, "de" for German
known_language = "en" # Set to None to auto-detect
# Provide a prompt with context, names, or jargon expected in the audio.
# This helps Whisper recognize specific terms accurately.
transcription_prompt = "The discussion involves Project Phoenix, stakeholders like Dr. Evelyn Reed and ACME Corp, and technical terms such as multi-threaded processing and cloud-native architecture." # Set to None if no prompt needed
# --- Function to Transcribe Audio ---
def transcribe_audio(client, file_path, language=None, prompt=None):
"""
Transcribes the given audio file using the Whisper API.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str, optional): The language code (ISO-639-1). Defaults to None (auto-detect).
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
str: The transcription text, or None if an error occurs.
"""
print(f"\nAttempting to transcribe audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Check file size (optional but good practice)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
print("Consider splitting the file into smaller chunks.")
# You might choose to exit here or attempt anyway
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
return None
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for transcription...")
# --- Make the API Call ---
response = client.audio.transcriptions.create(
model="whisper-1", # Specify the Whisper model
file=audio_file, # The audio file object
language=language, # Optional: Specify language
prompt=prompt, # Optional: Provide context prompt
response_format="text" # Request plain text output
# Other options: 'json', 'verbose_json', 'srt', 'vtt'
)
# The response object for "text" format is directly the string
transcription_text = response
print("Transcription successful.")
return transcription_text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
# Provide hints for common errors
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported (mp3, mp4, mpeg, mpga, m4a, wav, webm).")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit. Please split the file.")
return None
except FileNotFoundError: # Already handled above, but good practice
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
transcription = transcribe_audio(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if transcription:
print("\n--- Transcription Result ---")
print(transcription)
print("----------------------------\n")
# How this helps the use case:
print("This plain text transcription can now be:")
print("- Saved as a text document (.txt) for archiving.")
print("- Indexed and searched easily for specific keywords or names.")
print("- Used to generate meeting minutes or documentation.")
print("- Copied into accessibility tools or documents.")
print("- Stored for compliance record-keeping.")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_transcription.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(transcription)
print(f"Transcription saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving transcription to file: {e}")
else:
print("\nTranscription failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates transcribing audio files like meeting recordings or podcast episodes using OpenAI's Whisper API. The goal is to convert spoken content into accurate, searchable text, addressing needs like archiving, accessibility, and documentation.
- Prerequisites: It requires the
openai
andpython-dotenv
libraries, an OpenAI API key configured in a.env
file, and a sample audio file (meeting_snippet.mp3
). - Audio File Requirements: The script highlights the supported audio formats (MP3, WAV, M4A, etc.) and the crucial 25MB file size limit per API request. It includes a warning and explanation that longer files must be segmented (chunked) before processing, although the code itself handles a single file within the limit.
- Initialization: It sets up the standard
OpenAI
client using the API key. - Transcription Function (
transcribe_audio
):- Takes the
client
,file_path
, and optionallanguage
andprompt
arguments. - Includes a check for file existence and size.
- Opens the audio file in binary read mode (
"rb"
). - API Call: Uses
client.audio.transcriptions.create
with:model="whisper-1"
: Specifies the model known for high accuracy.file=audio_file
: Passes the opened file object.language
: Optionally provides the language code (e.g.,"en"
) to potentially improve accuracy if the language is known. IfNone
, Whisper auto-detects.prompt
: Optionally provides contextual keywords, names (like "Project Phoenix", "Dr. Evelyn Reed"), or jargon. This significantly helps Whisper accurately transcribe specialized terms often found in meetings or technical podcasts, directly addressing the need to capture complex discussions correctly.response_format="text"
: Requests the output directly as a plain text string, which is ideal for immediate use in documents, search indexing, etc. Other formats likeverbose_json
(for timestamps) orsrt
/vtt
(for subtitles) could be requested if needed.
- Error Handling: Includes
try...except
blocks for API errors (providing hints for common issues like file size or format) and file system errors.
- Takes the
- Output and Usefulness:
- The resulting transcription text is printed.
- The code explicitly connects this output back to the use case benefits: creating searchable archives, generating documentation, supporting accessibility, and enabling compliance.
- It includes an optional step to save the transcription directly to a
.txt
file.
This example provides a practical implementation of Whisper for the described use case, emphasizing accuracy features (prompting, language specification) and explaining how the text output facilitates the desired downstream tasks like searching and archiving. Remember to use an actual audio file (within the size limit) for testing.
Voice note conversion
Transforms quick voice memos into organized text notes, making reviewing and archiving spoken thoughts easier. This functionality is particularly valuable for:
- Creating quick reminders and to-do lists while on the go
- Capturing creative ideas or brainstorming sessions without interrupting the flow of thought
- Taking notes during field work or site visits where typing is impractical
- Documenting observations or research findings in real-time
The system maintains the natural flow of speech while organizing the content into clear, readable text that can be easily searched, shared, or integrated into other documents.
Example:
This script focuses on taking a typical voice memo audio file and converting it into searchable, usable text, ideal for capturing ideas, reminders, or field notes.
Download the sample audio here: https://files.cuantum.tech/audio/voice_memo.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Miami, Florida, United States"
print(f"Running Whisper voice note transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local voice note audio file
# IMPORTANT: Replace 'voice_memo.m4a' with the actual filename.
audio_file_path = "voice_memo.mp3"
# --- Optional Parameters (Often not needed for simple voice notes) ---
# Language auto-detection is usually sufficient for voice notes.
known_language = None # Set to "en", "es", etc. if needed
# Prompt is useful if your notes contain specific jargon/names, otherwise leave as None.
transcription_prompt = None # Example: "Remember to mention Project Chimera and the client ZetaCorp."
# --- Function to Transcribe Voice Note ---
def transcribe_voice_note(client, file_path, language=None, prompt=None):
"""
Transcribes the given voice note audio file using the Whisper API.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str, optional): The language code (ISO-639-1). Defaults to None (auto-detect).
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
str: The transcription text, or None if an error occurs.
"""
print(f"\nAttempting to transcribe voice note: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: Check file size (less likely to be an issue for voice notes)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
# return None # Decide if you want to stop for large files
except OSError as e:
print(f"Error accessing file properties: {e}")
# Continue attempt even if size check fails
pass
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending voice note to Whisper API for transcription...")
# --- Make the API Call ---
response = client.audio.transcriptions.create(
model="whisper-1", # Specify the Whisper model
file=audio_file, # The audio file object
language=language, # Defaults to None (auto-detect)
prompt=prompt, # Defaults to None
response_format="text" # Request plain text output for easy use as notes
)
# The response object for "text" format is directly the string
note_text = response
print("Transcription successful.")
return note_text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported (m4a, mp3, wav, etc.).")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
transcribed_note = transcribe_voice_note(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if transcribed_note:
print("\n--- Transcribed Voice Note Text ---")
print(transcribed_note)
print("-----------------------------------\n")
# How this helps the use case:
print("This transcribed text from your voice note can be easily:")
print("- Copied into reminder apps or to-do lists.")
print("- Saved as a text file for archiving creative ideas or brainstorms.")
print("- Searched later for specific keywords or topics.")
print("- Shared via email or messaging apps.")
print("- Integrated into reports or documentation (e.g., field notes).")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_transcription.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(transcribed_note)
print(f"Transcription saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving transcription to file: {e}")
else:
print("\nVoice note transcription failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates using the Whisper API to transcribe short audio recordings like voice notes or memos. This is ideal for quickly capturing thoughts, ideas, reminders, or field observations and converting them into easily manageable text.
- Prerequisites: Requires the standard
openai
andpython-dotenv
setup, API key, and a sample voice note audio file (common formats like.m4a
,.mp3
,.wav
are supported). - Audio File Handling: The script emphasizes that voice notes are typically well under the 25MB API limit. It opens the file in binary read mode (
"rb"
). - Initialization: Sets up the
OpenAI
client. - Transcription Function (
transcribe_voice_note
):- Similar to the meeting transcription function but tailored for voice notes.
- Defaults: It defaults
language
toNone
(auto-detect) andprompt
toNone
, as these are often sufficient for typical voice memos where specific jargon might be less common. The user can still provide these if needed. - API Call: Uses
client.audio.transcriptions.create
withmodel="whisper-1"
. response_format="text"
: Explicitly requests plain text output, which is the most practical format for notes – easy to read, search, copy, and share.- Error Handling: Includes standard
try...except
blocks for API and file errors.
- Output and Usefulness:
- Prints the resulting text transcription.
- Explicitly connects the output to the benefits mentioned in the use case description: creating reminders/to-dos, capturing ideas, documenting observations, enabling search, and sharing.
- Includes an optional step to save the transcribed note to a
.txt
file.
This example provides a clear and practical implementation for converting voice notes to text using Whisper, highlighting its convenience for capturing information on the go. Remember to use an actual voice note audio file for testing and update the audio_file_path
variable accordingly.
Multilingual speech translation
Breaks down language barriers by providing accurate translations while preserving the original context and meaning. This powerful feature enables:
- Real-time communication across language barriers in international meetings and conferences
- Translation of educational content for global audiences while maintaining academic integrity
- Cross-cultural business negotiations with precise translation of technical terms and cultural nuances
- Documentation translation for multinational organizations with consistent terminology
The system can detect the source language automatically and provides translations that maintain the speaker's original tone, intent, and professional context, making it invaluable for global collaboration.
Example:
This example takes an audio file containing speech in a language other than English and translates it directly into English text.
Download the sample audio here: https://files.cuantum.tech/audio/spanish_speech.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Denver, Colorado, United States"
print(f"Running Whisper speech translation example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local non-English audio file
# IMPORTANT: Replace 'spanish_speech.mp3' with the actual filename.
audio_file_path = "spanish_speech.mp3"
# --- Optional Parameters ---
# Prompt can help guide recognition of specific names/terms in the SOURCE language
# before translation, potentially improving accuracy of the final English text.
translation_prompt = None # Example: "The discussion mentions La Sagrada Familia and Parc Güell."
# --- Function to Translate Speech to English ---
def translate_speech_to_english(client, file_path, prompt=None):
"""
Translates speech from the given audio file into English using the Whisper API.
The source language is automatically detected.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file (non-English speech).
prompt (str, optional): A text prompt to guide source language recognition. Defaults to None.
Returns:
str: The translated English text, or None if an error occurs.
"""
print(f"\nAttempting to translate speech from audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: Check file size
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Translation may fail.")
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
pass # Continue attempt
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for translation to English...")
# --- Make the API Call for Translation ---
# Note: Using the 'translations' endpoint, not 'transcriptions'
response = client.audio.translations.create(
model="whisper-1", # Specify the Whisper model
file=audio_file, # The audio file object
prompt=prompt, # Optional: Provide context prompt
response_format="text" # Request plain text English output
# Other options: 'json', 'verbose_json', 'srt', 'vtt'
)
# The response object for "text" format is directly the translated string
translated_text = response
print("Translation successful.")
return translated_text
except OpenAIError as e:
print(f"OpenAI API Error during translation: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
# Whisper might error if it cannot detect speech or supported language
elif "language could not be detected" in str(e).lower():
print("Hint: Ensure the audio contains detectable speech in a supported language.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during translation: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
english_translation = translate_speech_to_english(
client,
audio_file_path,
prompt=translation_prompt
)
if english_translation:
print("\n--- Translated English Text ---")
print(english_translation)
print("-------------------------------\n")
# How this helps the use case:
print("This English translation enables:")
print("- Understanding discussions from international meetings.")
print("- Making educational content accessible to English-speaking audiences.")
print("- Facilitating cross-cultural business communication.")
print("- Creating English versions of documentation originally recorded in other languages.")
print("- Quick communication across language barriers using voice input.")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_translation_en.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(english_translation)
print(f"Translation saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving translation to file: {e}")
else:
print("\nSpeech translation failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates the Whisper API's capability to perform speech-to-text translation, specifically translating audio from various source languages directly into English text. This addresses the need for breaking down language barriers in global communication scenarios.
- Prerequisites: Requires the standard
openai
andpython-dotenv
setup, API key, and crucially, a sample audio file containing speech in a language other than English (e.g., Spanish, French, German). - Key API Endpoint: This example uses
client.audio.translations.create
, which is distinct from thetranscriptions
endpoint used previously. Thetranslations
endpoint is specifically designed to output English text. - Automatic Language Detection: A key feature highlighted is that the source language of the audio file does not need to be specified; Whisper automatically detects it before translating to English.
- Initialization: Sets up the
OpenAI
client. - Translation Function (
translate_speech_to_english
):- Takes the
client
,file_path
, and an optionalprompt
. - Opens the non-English audio file in binary read mode (
"rb"
). - API Call: Uses
client.audio.translations.create
with:model="whisper-1"
: The standard Whisper model.file=audio_file
: The audio file object.prompt
: Optionally provides context (in the source language or English, often helps with names/terms) to aid accurate recognition before translation, helping to preserve nuances and technical terms.response_format="text"
: Requests plain English text output.
- Error Handling: Includes
try...except
blocks, noting potential errors if speech or a supported language isn't detected in the audio.
- Takes the
- Output and Usefulness:
- Prints the resulting English translation text.
- Explicitly connects this output to the benefits described in the use case: enabling understanding in international meetings, translating educational/business content for global audiences, and facilitating cross-language documentation and communication.
- Shows how to optionally save the English translation to a
.txt
file.
This example effectively showcases Whisper's powerful translation feature, making it invaluable for scenarios requiring communication or content understanding across different languages. Remember to use an audio file with non-English speech for testing and update the audio_file_path
accordingly.
Accessibility tools
Enables better digital inclusion by providing real-time transcription services for hearing-impaired users, offering several key benefits:
- Empowers deaf and hard-of-hearing individuals to participate fully in audio-based content
- Provides instant access to spoken information in professional settings like meetings and conferences
- Supports educational environments by making lectures and discussions accessible to all students
- Enhances media consumption by enabling accurate, real-time captioning for videos and live streams
Example:
This example focuses on using the verbose_json
output format to get segment-level timestamps, which are essential for syncing text with audio or video.
Download the sample audio here: https://files.cuantum.tech/audio/lecture_snippet.mp3
import os
import json # To potentially parse the verbose_json output nicely
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Atlanta, Georgia, United States" # User location context
print(f"Running Whisper accessibility transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local audio file
# IMPORTANT: Replace 'lecture_snippet.mp3' with the actual filename.
audio_file_path = "lecture_snippet.mp3"
# --- Optional Parameters ---
# Specifying language is often good for accuracy in accessibility contexts
known_language = "en"
# Prompt can help with specific terminology in lectures or meetings
transcription_prompt = "The lecture discusses quantum entanglement, superposition, and Bell's theorem." # Set to None if not needed
# --- Function to Transcribe Audio with Timestamps ---
def transcribe_for_accessibility(client, file_path, language="en", prompt=None):
"""
Transcribes the audio file using Whisper, requesting timestamped output
suitable for accessibility applications (e.g., captioning).
Note on 'Real-time': This function processes a complete file. True real-time
captioning requires capturing audio in chunks, sending each chunk to the API
quickly, and displaying the results sequentially. This example generates the
*type* of data needed for such applications from a pre-recorded file.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str): The language code (ISO-639-1). Defaults to "en".
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
dict: The parsed verbose_json response containing text and segments
with timestamps, or None if an error occurs.
"""
print(f"\nAttempting to transcribe for accessibility: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: File size check
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
pass # Continue attempt
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for timestamped transcription...")
# --- Make the API Call for Transcription ---
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
language=language,
prompt=prompt,
# Request detailed JSON output including timestamps
response_format="verbose_json",
# Explicitly request segment-level timestamps
timestamp_granularities=["segment"]
)
# The response object is already a Pydantic model behaving like a dict
timestamped_data = response
print("Timestamped transcription successful.")
# You can access response.text, response.segments etc directly
return timestamped_data
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Function to Format Timestamps ---
def format_timestamp(seconds):
"""Converts seconds to HH:MM:SS.fff format."""
td = datetime.timedelta(seconds=seconds)
total_milliseconds = int(td.total_seconds() * 1000)
hours, remainder = divmod(total_milliseconds, 3600000)
minutes, remainder = divmod(remainder, 60000)
seconds, milliseconds = divmod(remainder, 1000)
return f"{hours:02}:{minutes:02}:{seconds:02}.{milliseconds:03}"
# --- Main Execution ---
if __name__ == "__main__":
transcription_data = transcribe_for_accessibility(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if transcription_data:
print("\n--- Full Transcription Text ---")
# Access the full text directly from the response object
print(transcription_data.text)
print("-------------------------------\n")
print("--- Transcription Segments with Timestamps ---")
# Iterate through segments for timestamped data
for segment in transcription_data.segments:
start_time = format_timestamp(segment['start'])
end_time = format_timestamp(segment['end'])
segment_text = segment['text']
print(f"[{start_time} --> {end_time}] {segment_text}")
print("---------------------------------------------\n")
# How this helps the use case:
print("This timestamped data enables:")
print("- Displaying captions synchronized with video or audio streams.")
print("- Highlighting text in real-time as it's spoken in meetings or lectures.")
print("- Creating accessible versions of educational/media content.")
print("- Allowing users to navigate audio by clicking on text segments.")
print("- Fuller participation for hearing-impaired individuals.")
# Optional: Save the detailed JSON output
output_json_file = os.path.splitext(audio_file_path)[0] + "_timestamps.json"
try:
# The response object can be converted to dict for JSON serialization
with open(output_json_file, "w", encoding="utf-8") as f:
# Use .model_dump_json() for Pydantic V2 models from openai>=1.0.0
# or .dict() for older versions/models
try:
f.write(transcription_data.model_dump_json(indent=2))
except AttributeError:
# Fallback for older versions or different object types
import json
f.write(json.dumps(transcription_data, default=lambda o: o.__dict__, indent=2))
print(f"Detailed timestamp data saved to '{output_json_file}'")
except (IOError, TypeError, AttributeError) as e:
print(f"Error saving timestamp data to JSON file: {e}")
else:
print("\nTranscription for accessibility failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates using the Whisper API to generate highly accurate transcriptions with segment-level timestamps. This output is crucial for accessibility applications, enabling features like real-time captioning, synchronized text highlighting, and navigable transcripts for hearing-impaired users.
- Prerequisites: Standard setup (
openai
,python-dotenv
, API key) and an audio file (lecture_snippet.mp3
). - Key API Parameters:
- Endpoint: Uses
client.audio.transcriptions.create
. response_format="verbose_json"
: This is essential. It requests a detailed JSON object containing not only the full transcription text but also a list of segments, each with start/end times and the corresponding text.timestamp_granularities=["segment"]
: Explicitly requests segment-level timing information (though often included by default withverbose_json
). Word-level timestamps can also be requested if needed (["word"]
), but segments are typically used for captioning.language
/prompt
: Specifying the language (en
) and providing a prompt can enhance accuracy, which is vital for accessibility.
- Endpoint: Uses
- "Real-Time" Consideration: The explanation clarifies that while this code processes a file, the timestamped output it generates is what's needed for real-time applications. True live captioning would involve feeding audio chunks to the API rapidly.
- Initialization & Function (
transcribe_for_accessibility
): Standard client setup. The function encapsulates the API call requestingverbose_json
. - Output Processing:
- The code first prints the full concatenated text (
transcription_data.text
). - It then iterates through the
transcription_data.segments
list. - For each
segment
, it extracts thestart
time,end
time, andtext
. - A helper function (
format_timestamp
) converts the times (in seconds) into a standardHH:MM:SS.fff
format. - It prints each segment with its timing information (e.g.,
[00:00:01.234 --> 00:00:05.678] This is the first segment text.
).
- The code first prints the full concatenated text (
- Use Case Relevance: The output clearly shows how this timestamped data directly enables the benefits described: synchronizing text with audio/video for captions, allowing participation in meetings/lectures, making educational content accessible, and enhancing media consumption.
- Saving Output: Includes an option to save the complete
verbose_json
response to a file for later use or more complex processing. It handles potential differences in serializing the response object from theopenai
library.
This example effectively demonstrates how to obtain the necessary timestamped data from Whisper to power various accessibility features, making audio content more inclusive. Remember to use a relevant audio file for testing.
Video captioning workflows
Streamlines the creation of accurate subtitles and closed captions for video content, supporting multiple output formats including SRT, WebVTT, and other industry-standard caption formats. This capability is essential for:
- Content creators who need to make their videos accessible across multiple platforms
- Broadcasting companies requiring accurate closed captioning for regulatory compliance
- Educational institutions creating accessible video content for diverse student populations
- Social media managers who want to increase video engagement through auto-captioning
The system can automatically detect speaker changes, handle timing synchronization, and format captions according to industry best practices, making it an invaluable tool for professional video production workflows.
Example:
This code example takes an audio file (which you would typically extract from your video first) and uses Whisper to create accurately timestamped captions in the standard .srt
format.
Download the audio sample here: https://files.cuantum.tech/audio/video_audio.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Austin, Texas, United States" # User location context
print(f"Running Whisper caption generation (SRT) example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local audio file (extracted from video)
# IMPORTANT: Replace 'video_audio.mp3' with the actual filename.
audio_file_path = "video_audio.mp3"
# --- Optional Parameters ---
# Specifying language can improve caption accuracy
known_language = "en"
# Prompt can help with names, brands, or specific terminology in the video
transcription_prompt = "The video features interviews with Dr. Anya Sharma about sustainable agriculture and mentions the company 'TerraGrow'." # Set to None if not needed
# --- Function to Generate SRT Captions ---
def generate_captions_srt(client, file_path, language="en", prompt=None):
"""
Transcribes the audio file using Whisper and returns captions in SRT format.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file (extracted from video).
language (str): The language code (ISO-639-1). Defaults to "en".
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
str: The caption data in SRT format string, or None if an error occurs.
"""
print(f"\nAttempting to generate SRT captions for: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: File size check
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
pass # Continue attempt
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for SRT caption generation...")
# --- Make the API Call for Transcription ---
# Request 'srt' format directly
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
language=language,
prompt=prompt,
response_format="srt" # Request SRT format output
# Other options include "vtt"
)
# The response object for "srt" format is directly the SRT string
srt_content = response
print("SRT caption generation successful.")
return srt_content
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
srt_captions = generate_captions_srt(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if srt_captions:
# --- Save the SRT Content to a File ---
output_srt_file = os.path.splitext(audio_file_path)[0] + ".srt"
try:
with open(output_srt_file, "w", encoding="utf-8") as f:
f.write(srt_captions)
print(f"\nSRT captions saved successfully to '{output_srt_file}'")
# How this helps the use case:
print("\nThis SRT file can be used to:")
print("- Add closed captions or subtitles to video players (like YouTube, Vimeo, VLC).")
print("- Import into video editing software for caption integration.")
print("- Meet accessibility requirements and regulatory compliance (e.g., broadcasting).")
print("- Improve video engagement on social media platforms.")
print("- Make educational video content accessible to more students.")
except IOError as e:
print(f"Error saving SRT captions to file: {e}")
else:
print("\nSRT caption generation failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates how to use the Whisper API to streamline video captioning workflows by directly generating industry-standard SRT (SubRip Text) subtitle files from audio.
- Prerequisites: Requires the standard
openai
andpython-dotenv
setup and API key. Crucially, it requires an audio file that has been extracted from the target video. The script includes a note explaining this necessary pre-processing step. - Key API Parameter (
response_format="srt"
):- The core of this use case is requesting the
srt
format directly from theclient.audio.transcriptions.create
endpoint. - This tells Whisper to format the output according to SRT specifications, including sequential numbering, start/end timestamps (e.g.,
00:00:20,000 --> 00:00:24,400
), and the corresponding text chunk. The API handles the timing synchronization automatically.vtt
(WebVTT) is another common format that can be requested similarly.
- The core of this use case is requesting the
- Initialization & Function (
generate_captions_srt
): Standard client setup. The function encapsulates the API call specifically requesting SRT format. Optionallanguage
andprompt
parameters can be used to enhance accuracy. - Output Handling:
- The
response
from the API call (whenresponse_format="srt"
) is directly the complete content of the SRT file as a single string. - The code then saves this string directly into a file with the
.srt
extension.
- The
- Use Case Relevance:
- The explanation highlights how this generated
.srt
file directly serves the needs of content creators, broadcasters, educators, and social media managers. - It can be easily uploaded to video platforms, imported into editing software, or used to meet accessibility and compliance standards. It significantly simplifies the traditionally time-consuming process of manual caption creation.
- The explanation highlights how this generated
- Error Handling: Includes standard checks for API and file system errors.
This example provides a highly practical demonstration of using Whisper for a common video production task, showcasing how to get accurately timestamped captions in a ready-to-use format. Remember the essential first step is to extract the audio from the video you wish to caption.
💡 Tip: Whisper works best when audio is clear, and it supports input files up to 25MB in size per request.
2.2.2 Response Formats You Can Use
Whether you're building a simple transcription service or developing a complex video captioning system, understanding these formats is crucial for effectively implementing Whisper in your applications.
The flexibility of Whisper's response formats allows developers to seamlessly integrate transcription and translation capabilities into different types of applications, from basic text output to more sophisticated JSON-based processing pipelines. Each format serves distinct purposes and offers unique advantages depending on your use case.
Whisper supports multiple formats for your needs:
2.2.3 Practical Applications of Whisper
Whisper's versatility extends far beyond basic transcription, offering numerous practical applications across different industries and use cases. From streamlining business operations to enhancing educational experiences, Whisper's capabilities can be leveraged in innovative ways to solve real-world challenges. Let's explore some of the most impactful applications of this technology and understand how they can benefit different user groups.
These applications demonstrate how Whisper can be integrated into various workflows to improve efficiency, accessibility, and communication across different sectors. Each use case showcases unique advantages and potential implementations that can transform how we handle audio content in professional and personal contexts.
Note-taking Tools
Transform spoken content into written text automatically through advanced speech recognition, enabling quick and accurate documentation of verbal communications. This technology excels at capturing natural speech patterns, technical terminology, and multiple speaker interactions, making it easier to document and review lectures, interviews, and meetings. The automated transcription process maintains context and speaker attribution while converting audio to easily searchable and editable text formats.
This tool is particularly valuable for:
- Students capturing detailed lecture notes while staying engaged in class discussions
- Journalists documenting interviews and press conferences with precise quotations
- Business professionals keeping accurate records of client meetings and team brainstorming sessions
- Researchers conducting and transcribing qualitative interviews
The automated notes can be further enhanced with features like timestamp markers, speaker identification, and keyword highlighting, making it easier to navigate and reference specific parts of the discussion later. This dramatically improves productivity by eliminating the need for manual transcription while ensuring comprehensive documentation of important conversations.
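One simple way to add the timestamp markers mentioned above is to request `verbose_json` output and render each segment as a note line. This is a minimal sketch that assumes the `client` from the earlier examples and a hypothetical recording name; depending on your openai library version, segments may be plain dictionaries or typed objects, so both cases are handled.

```python
def transcribe_to_timestamped_notes(client, file_path: str) -> str:
    """Transcribe audio and format each segment as a [MM:SS] note line."""
    with open(file_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
        )
    lines = []
    for segment in result.segments:
        # Segments may be dicts or objects depending on the SDK version.
        start = segment["start"] if isinstance(segment, dict) else segment.start
        text = segment["text"] if isinstance(segment, dict) else segment.text
        minutes, seconds = divmod(int(start), 60)
        lines.append(f"[{minutes:02}:{seconds:02}] {text.strip()}")
    return "\n".join(lines)

# Hypothetical usage:
# notes = transcribe_to_timestamped_notes(client, "team_meeting.m4a")
# print(notes)
```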
Multilingual Applications
Enable seamless communication across language barriers by converting speech in one language into text that others can read, with the Whisper API translating non-English audio directly into English. Because short recordings are processed in seconds, near real-time translation workflows can be built by sending audio as it is captured. This capability supports close to 100 source languages with high accuracy. The system detects the source language automatically and handles a wide range of accents and dialects, making it particularly valuable for:
- International Business: Facilitating real-time communication in multinational meetings, negotiations, and presentations without the need for human interpreters
- Travel and Tourism: Enabling travelers to communicate effectively with locals, understand announcements, and navigate foreign environments
- Global Education: Supporting distance learning programs and international student exchanges by breaking down language barriers
- Customer Service: Allowing support teams to assist customers in their preferred language, improving service quality and satisfaction
The technology works across different audio environments and can handle multiple speakers, making it ideal for international business meetings, travel applications, global communication platforms, and cross-cultural exchanges.
Accessibility Solutions
Create real-time captions and transcripts for hearing-impaired individuals, ensuring equal access to audio content. This essential technology serves multiple accessibility purposes:
- Real-time Captioning: Provides immediate text representation of spoken words during live events, meetings, and presentations
- Accurate Transcription: Generates detailed written records of audio content for later reference and study
- Multi-format Support: Offers content in various formats including closed captions, subtitles, and downloadable transcripts
This technology can be integrated into:
- Video Conferencing Platforms: Enabling real-time captioning during virtual meetings and webinars
- Educational Platforms: Supporting students with hearing impairments in both online and traditional classroom settings
- Media Players: Providing synchronized captions for videos, podcasts, and other multimedia content
- Live Events: Offering real-time text displays during conferences, performances, and public speaking events
These accessibility features not only support individuals with hearing impairments but also benefit:
- Non-native speakers who prefer reading along while listening
- People in sound-sensitive environments where audio isn't practical
- Visual learners who process information better through text
- Organizations aiming to comply with accessibility regulations and standards
Podcast Enhancement
Convert audio content into searchable text formats through comprehensive transcription, enabling multiple benefits:
- Content Discovery: Listeners can easily search through episode transcripts to find specific topics, quotes, or discussions of interest (a small search sketch follows this list)
- Content Repurposing: Creators can extract key quotes and insights for social media, blog posts, or newsletters
- Accessibility: Makes content available to hearing-impaired audiences and those who prefer reading
- SEO Benefits: Search engines can index the full content of episodes, improving discoverability and organic traffic
- Enhanced Engagement: Readers can skim content before listening, bookmark important sections, and reference material later
- Analytics Insights: Analyze most-searched terms and popular segments to inform future content strategy
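As a small illustration of the content-discovery point above, segment-level timestamps make it straightforward to locate every mention of a keyword in an episode. A minimal sketch, assuming `segments` is the list returned by a `verbose_json` transcription (as in the note-taking sketch earlier):

```python
def find_keyword_mentions(segments, keyword: str):
    """Return (start_seconds, text) for every transcript segment mentioning the keyword."""
    matches = []
    for segment in segments:
        # Segments may be dicts or objects depending on the SDK version.
        start = segment["start"] if isinstance(segment, dict) else segment.start
        text = segment["text"] if isinstance(segment, dict) else segment.text
        if keyword.lower() in text.lower():
            matches.append((start, text.strip()))
    return matches

# Hypothetical usage:
# for start, text in find_keyword_mentions(result.segments, "sponsorship"):
#     minutes, seconds = divmod(int(start), 60)
#     print(f"{minutes:02}:{seconds:02}  {text}")
```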
Language Learning Support
Provide immediate feedback to language learners by converting their spoken practice into text, allowing them to check pronunciation, grammar, and vocabulary usage. This technology creates an interactive and effective learning experience through several key mechanisms:
- Real-time Pronunciation Feedback: Learners can compare their spoken words with the transcribed text to identify pronunciation errors and areas for improvement
- Grammar Analysis: The system can highlight grammatical structures and potential errors in the transcribed text, helping learners understand their mistakes
- Vocabulary Enhancement: Students can track their active vocabulary usage and receive suggestions for more varied word choices
The technology particularly benefits:
- Self-directed learners practicing independently
- Language teachers tracking student progress
- Online language learning platforms offering speaking practice
- Language exchange participants wanting to verify their communication
When integrated with language learning applications, this feature can provide structured practice sessions, progress tracking, and personalized feedback that helps learners build confidence in their speaking abilities.
Example:
This example takes an audio file of a learner speaking a specific target language and transcribes it, providing the essential text output needed for pronunciation comparison, grammar analysis, or vocabulary review.
Download the sample audio file here: https://files.cuantum.tech/audio/language_practice_fr.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Orlando, Florida, United States" # User location context
print(f"Running Whisper language learning transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# --- Language Learning Specific Configuration ---
# Define the path to the language practice audio file
# IMPORTANT: Replace 'language_practice_fr.mp3' with the actual filename.
audio_file_path = "language_practice_fr.mp3"
# ** CRUCIAL: Specify the language being spoken in the audio file **
# Use the correct ISO-639-1 code (e.g., "en", "es", "fr", "de", "ja", "zh")
# This tells Whisper how to interpret the sounds.
target_language = "fr" # Example: French
# Optional: Provide context about the practice session (e.g., the expected phrase)
# This can help improve accuracy, especially for specific vocabulary or sentences.
practice_prompt = "The student is practicing ordering food at a French cafe, mentioning croissants and coffee." # Set to None if not needed
# --- Function to Transcribe Language Practice Audio ---
def transcribe_language_practice(client, file_path, language, prompt=None):
"""
Transcribes audio of language practice using the Whisper API.
Specifying the target language is crucial for this use case.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str): The ISO-639-1 code of the language being spoken.
prompt (str, optional): A text prompt providing context. Defaults to None.
Returns:
str: The transcribed text in the target language, or None if an error occurs.
"""
print(f"\nAttempting to transcribe language practice ({language}) from: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
if not language:
print("Error: Target language must be specified for language learning transcription.")
return None
# Optional: File size check
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
except OSError as e:
print(f"Error accessing file properties: {e}")
pass
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print(f"Sending audio ({language}) to Whisper API for transcription...")
# --- Make the API Call for Transcription ---
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
language=language, # Pass the specified target language
prompt=prompt,
response_format="text" # Get plain text for easy comparison/analysis
)
# The response object for "text" format is directly the string
practice_transcription = response
print("Transcription successful.")
return practice_transcription
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
elif "invalid language code" in str(e).lower():
print(f"Hint: Check if '{language}' is a valid ISO-639-1 language code supported by Whisper.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
transcribed_text = transcribe_language_practice(
client,
audio_file_path,
language=target_language,
prompt=practice_prompt
)
if transcribed_text:
print("\n--- Transcribed Practice Text ---")
print(f"(Language: {target_language})")
print(transcribed_text)
print("---------------------------------\n")
# How this helps the use case:
print("This transcription provides the basis for language learning feedback:")
print("- **Pronunciation Check:** Learner compares their speech to the text, identifying discrepancies.")
print("- **Grammar/Vocabulary Analysis:** This text can be compared against expected sentences or analyzed (potentially by another AI like GPT-4o, or specific tools) for grammatical correctness and vocabulary usage.")
print("- **Progress Tracking:** Teachers or platforms can store transcriptions to monitor improvement over time.")
print("- **Self-Correction:** Learners get immediate textual representation of their speech for review.")
print("\nNote: Further analysis (grammar checking, pronunciation scoring, etc.) requires additional logic beyond this transcription step.")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_transcription.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(transcribed_text)
print(f"Transcription saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving transcription to file: {e}")
else:
print("\nLanguage practice transcription failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates using the Whisper API as a tool for language learning support. It transcribes audio recordings of a learner speaking a target language, providing the essential text output needed for various feedback mechanisms.
- Prerequisites: Requires the standard `openai` and `python-dotenv` setup, API key, and an audio file containing the learner's spoken practice in the target language (e.g., `language_practice_fr.mp3`).
- Key Parameter (`language`):
  - Unlike general transcription, where auto-detect might suffice, for language learning it is crucial to explicitly specify the `language` the learner is attempting to speak. This is set using the `target_language` variable (e.g., `"fr"` for French).
  - This ensures Whisper interprets the audio using the correct phonetic and vocabulary model, providing a more accurate transcription for feedback.
- Optional Prompt: The `prompt` parameter can be used to give Whisper context about the practice session (e.g., the specific phrase being practiced), which can improve recognition accuracy.
- Initialization & Function (`transcribe_language_practice`): Standard client setup. The function requires the `language` parameter and performs the transcription using `client.audio.transcriptions.create`.
- Output (`response_format="text"`): Plain text output is requested as it's the most direct format for learners or systems to compare against expected text, analyze grammar, or review vocabulary.
- Feedback Mechanism (Important Note): This script only provides the transcription. The actual feedback (pronunciation scoring, grammar correction, vocabulary suggestions) requires additional processing. The transcribed text serves as the input for those subsequent analysis steps, which could involve rule-based checks, comparison algorithms, or even feeding the text to another LLM like GPT-4o for analysis; a minimal sketch of this follow-up step appears after this breakdown.
- Use Case Relevance: The output section explains how this transcription enables the described benefits: allowing learners to check their pronunciation against text, providing material for grammar/vocabulary analysis, facilitating progress tracking, and supporting self-correction.
This example provides a practical starting point for integrating Whisper into language learning applications, focusing on generating the core textual data needed for effective feedback loops. Remember to use an audio file of someone speaking the specified `target_language` for testing.
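As noted above, the transcription is only the input for feedback. A minimal sketch of one possible follow-up step is shown below; it passes the transcription and an expected phrase to a chat model for comparison. The model choice, prompt wording, and expected phrase are illustrative assumptions, not part of the original example.

```python
def get_language_feedback(client, transcription: str, expected_phrase: str,
                          language_name: str = "French") -> str:
    """Ask a chat model to compare a learner's transcription with the expected phrase."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {
                "role": "system",
                "content": (
                    f"You are a friendly {language_name} tutor. Compare the learner's "
                    "transcribed speech with the expected phrase and give brief feedback "
                    "on grammar and vocabulary, ending with one suggestion to improve."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Expected phrase: {expected_phrase}\n"
                    f"Learner's transcription: {transcription}"
                ),
            },
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage with the transcription produced above:
# feedback = get_language_feedback(
#     client,
#     transcribed_text,
#     "Je voudrais un croissant et un café, s'il vous plaît.",
# )
# print(feedback)
```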
Summary
In this section, you learned several key aspects of audio processing and understanding:
- Transcribe spoken language into text using Whisper
  - Convert various audio formats into accurate text transcriptions
  - Handle multiple languages and accents with high accuracy
  - Process both short clips and longer recordings effectively
- Translate foreign-language audio to English
  - Convert non-English speech directly to English text
  - Maintain context and meaning across language barriers
  - Support multilingual content processing
- Choose between plain text, JSON, and subtitle outputs
  - Select the most appropriate format for your specific use case
  - Generate subtitles with precise timestamps
  - Structure data in machine-readable JSON format
- Apply these tools in real-world use cases like accessibility, education, and content creation
  - Create accessible content with accurate transcriptions
  - Support language learning and educational initiatives
  - Streamline content production workflows
Whisper is incredibly fast, easy to use, and works across languages—making it one of the most valuable tools in the OpenAI ecosystem. Its versatility and accuracy make it suitable for both individual users and enterprise-scale applications, while its open-source nature allows for custom implementations and modifications to suit specific needs.
2.2 Transcription and Translation with Whisper API
The Whisper API represents a significant advancement in automated speech recognition and translation technology. This section explores the fundamental aspects of working with Whisper, including its capabilities, implementation methods, and practical applications. We'll examine how to effectively utilize the API for both transcription and translation tasks, covering everything from basic setup to advanced features.
Whether you're building applications for content creators, developing educational tools, or creating accessibility solutions, understanding Whisper's functionality is crucial. We'll walk through detailed examples and best practices that demonstrate how to integrate this powerful tool into your projects, while highlighting important considerations for optimal performance.
Throughout this section, you'll learn not just the technical implementation details, but also the strategic considerations for choosing appropriate response formats and handling various audio inputs. This knowledge will enable you to build robust, scalable solutions for audio processing needs.
2.2.1 What Is Whisper?
Whisper is OpenAI's groundbreaking open-source automatic speech recognition (ASR) model that represents a significant advancement in audio processing technology. This sophisticated system excels at handling diverse audio inputs, capable of processing various file formats including .mp3
, .mp4
, .wav
, and .m4a
. What sets Whisper apart is its dual functionality: it not only converts spoken content into highly accurate text transcriptions but also offers powerful translation capabilities, seamlessly converting non-English audio content into fluent English output. The model's robust architecture ensures high accuracy across different accents, speaking styles, and background noise conditions.
Whisper demonstrates exceptional versatility across numerous applications:
Meeting or podcast transcriptions
Converts lengthy discussions and presentations into searchable, shareable text documents with remarkable accuracy. This functionality is particularly valuable for businesses and content creators who need to:
- Archive important meetings for future reference
- Create accessible versions of audio content
- Enable quick searching through hours of recorded content
- Generate written documentation from verbal discussions
- Support compliance requirements for record-keeping
The high accuracy rate ensures that technical terms, proper names, and complex discussions are captured correctly while maintaining the natural flow of conversation.
Example:
Download the audio sample here: https://files.cuantum.tech/audio/meeting_snippet.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt
current_location = "Houston, Texas, United States"
print(f"Running Whisper transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local audio file
# IMPORTANT: Replace 'meeting_snippet.mp3' with the actual filename.
audio_file_path = "meeting_snippet.mp3"
# --- Optional Parameters for potentially better accuracy ---
# Specify language (ISO-639-1 code) if known, otherwise Whisper auto-detects.
# Example: "en" for English, "es" for Spanish, "de" for German
known_language = "en" # Set to None to auto-detect
# Provide a prompt with context, names, or jargon expected in the audio.
# This helps Whisper recognize specific terms accurately.
transcription_prompt = "The discussion involves Project Phoenix, stakeholders like Dr. Evelyn Reed and ACME Corp, and technical terms such as multi-threaded processing and cloud-native architecture." # Set to None if no prompt needed
# --- Function to Transcribe Audio ---
def transcribe_audio(client, file_path, language=None, prompt=None):
"""
Transcribes the given audio file using the Whisper API.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str, optional): The language code (ISO-639-1). Defaults to None (auto-detect).
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
str: The transcription text, or None if an error occurs.
"""
print(f"\nAttempting to transcribe audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Check file size (optional but good practice)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
print("Consider splitting the file into smaller chunks.")
# You might choose to exit here or attempt anyway
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
return None
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for transcription...")
# --- Make the API Call ---
response = client.audio.transcriptions.create(
model="whisper-1", # Specify the Whisper model
file=audio_file, # The audio file object
language=language, # Optional: Specify language
prompt=prompt, # Optional: Provide context prompt
response_format="text" # Request plain text output
# Other options: 'json', 'verbose_json', 'srt', 'vtt'
)
# The response object for "text" format is directly the string
transcription_text = response
print("Transcription successful.")
return transcription_text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
# Provide hints for common errors
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported (mp3, mp4, mpeg, mpga, m4a, wav, webm).")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit. Please split the file.")
return None
except FileNotFoundError: # Already handled above, but good practice
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
transcription = transcribe_audio(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if transcription:
print("\n--- Transcription Result ---")
print(transcription)
print("----------------------------\n")
# How this helps the use case:
print("This plain text transcription can now be:")
print("- Saved as a text document (.txt) for archiving.")
print("- Indexed and searched easily for specific keywords or names.")
print("- Used to generate meeting minutes or documentation.")
print("- Copied into accessibility tools or documents.")
print("- Stored for compliance record-keeping.")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_transcription.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(transcription)
print(f"Transcription saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving transcription to file: {e}")
else:
print("\nTranscription failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates transcribing audio files like meeting recordings or podcast episodes using OpenAI's Whisper API. The goal is to convert spoken content into accurate, searchable text, addressing needs like archiving, accessibility, and documentation.
- Prerequisites: It requires the
openai
andpython-dotenv
libraries, an OpenAI API key configured in a.env
file, and a sample audio file (meeting_snippet.mp3
). - Audio File Requirements: The script highlights the supported audio formats (MP3, WAV, M4A, etc.) and the crucial 25MB file size limit per API request. It includes a warning and explanation that longer files must be segmented (chunked) before processing, although the code itself handles a single file within the limit.
- Initialization: It sets up the standard
OpenAI
client using the API key. - Transcription Function (
transcribe_audio
):- Takes the
client
,file_path
, and optionallanguage
andprompt
arguments. - Includes a check for file existence and size.
- Opens the audio file in binary read mode (
"rb"
). - API Call: Uses
client.audio.transcriptions.create
with:model="whisper-1"
: Specifies the model known for high accuracy.file=audio_file
: Passes the opened file object.language
: Optionally provides the language code (e.g.,"en"
) to potentially improve accuracy if the language is known. IfNone
, Whisper auto-detects.prompt
: Optionally provides contextual keywords, names (like "Project Phoenix", "Dr. Evelyn Reed"), or jargon. This significantly helps Whisper accurately transcribe specialized terms often found in meetings or technical podcasts, directly addressing the need to capture complex discussions correctly.response_format="text"
: Requests the output directly as a plain text string, which is ideal for immediate use in documents, search indexing, etc. Other formats likeverbose_json
(for timestamps) orsrt
/vtt
(for subtitles) could be requested if needed.
- Error Handling: Includes
try...except
blocks for API errors (providing hints for common issues like file size or format) and file system errors.
- Takes the
- Output and Usefulness:
- The resulting transcription text is printed.
- The code explicitly connects this output back to the use case benefits: creating searchable archives, generating documentation, supporting accessibility, and enabling compliance.
- It includes an optional step to save the transcription directly to a
.txt
file.
This example provides a practical implementation of Whisper for the described use case, emphasizing accuracy features (prompting, language specification) and explaining how the text output facilitates the desired downstream tasks like searching and archiving. Remember to use an actual audio file (within the size limit) for testing.
Voice note conversion
Transforms quick voice memos into organized text notes, making reviewing and archiving spoken thoughts easier. This functionality is particularly valuable for:
- Creating quick reminders and to-do lists while on the go
- Capturing creative ideas or brainstorming sessions without interrupting the flow of thought
- Taking notes during field work or site visits where typing is impractical
- Documenting observations or research findings in real-time
The system maintains the natural flow of speech while organizing the content into clear, readable text that can be easily searched, shared, or integrated into other documents.
Example:
This script focuses on taking a typical voice memo audio file and converting it into searchable, usable text, ideal for capturing ideas, reminders, or field notes.
Download the sample audio here: https://files.cuantum.tech/audio/voice_memo.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Miami, Florida, United States"
print(f"Running Whisper voice note transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local voice note audio file
# IMPORTANT: Replace 'voice_memo.m4a' with the actual filename.
audio_file_path = "voice_memo.mp3"
# --- Optional Parameters (Often not needed for simple voice notes) ---
# Language auto-detection is usually sufficient for voice notes.
known_language = None # Set to "en", "es", etc. if needed
# Prompt is useful if your notes contain specific jargon/names, otherwise leave as None.
transcription_prompt = None # Example: "Remember to mention Project Chimera and the client ZetaCorp."
# --- Function to Transcribe Voice Note ---
def transcribe_voice_note(client, file_path, language=None, prompt=None):
"""
Transcribes the given voice note audio file using the Whisper API.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str, optional): The language code (ISO-639-1). Defaults to None (auto-detect).
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
str: The transcription text, or None if an error occurs.
"""
print(f"\nAttempting to transcribe voice note: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: Check file size (less likely to be an issue for voice notes)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
# return None # Decide if you want to stop for large files
except OSError as e:
print(f"Error accessing file properties: {e}")
# Continue attempt even if size check fails
pass
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending voice note to Whisper API for transcription...")
# --- Make the API Call ---
response = client.audio.transcriptions.create(
model="whisper-1", # Specify the Whisper model
file=audio_file, # The audio file object
language=language, # Defaults to None (auto-detect)
prompt=prompt, # Defaults to None
response_format="text" # Request plain text output for easy use as notes
)
# The response object for "text" format is directly the string
note_text = response
print("Transcription successful.")
return note_text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported (m4a, mp3, wav, etc.).")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
transcribed_note = transcribe_voice_note(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if transcribed_note:
print("\n--- Transcribed Voice Note Text ---")
print(transcribed_note)
print("-----------------------------------\n")
# How this helps the use case:
print("This transcribed text from your voice note can be easily:")
print("- Copied into reminder apps or to-do lists.")
print("- Saved as a text file for archiving creative ideas or brainstorms.")
print("- Searched later for specific keywords or topics.")
print("- Shared via email or messaging apps.")
print("- Integrated into reports or documentation (e.g., field notes).")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_transcription.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(transcribed_note)
print(f"Transcription saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving transcription to file: {e}")
else:
print("\nVoice note transcription failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates using the Whisper API to transcribe short audio recordings like voice notes or memos. This is ideal for quickly capturing thoughts, ideas, reminders, or field observations and converting them into easily manageable text.
- Prerequisites: Requires the standard
openai
andpython-dotenv
setup, API key, and a sample voice note audio file (common formats like.m4a
,.mp3
,.wav
are supported). - Audio File Handling: The script emphasizes that voice notes are typically well under the 25MB API limit. It opens the file in binary read mode (
"rb"
). - Initialization: Sets up the
OpenAI
client. - Transcription Function (
transcribe_voice_note
):- Similar to the meeting transcription function but tailored for voice notes.
- Defaults: It defaults
language
toNone
(auto-detect) andprompt
toNone
, as these are often sufficient for typical voice memos where specific jargon might be less common. The user can still provide these if needed. - API Call: Uses
client.audio.transcriptions.create
withmodel="whisper-1"
. response_format="text"
: Explicitly requests plain text output, which is the most practical format for notes – easy to read, search, copy, and share.- Error Handling: Includes standard
try...except
blocks for API and file errors.
- Output and Usefulness:
- Prints the resulting text transcription.
- Explicitly connects the output to the benefits mentioned in the use case description: creating reminders/to-dos, capturing ideas, documenting observations, enabling search, and sharing.
- Includes an optional step to save the transcribed note to a
.txt
file.
This example provides a clear and practical implementation for converting voice notes to text using Whisper, highlighting its convenience for capturing information on the go. Remember to use an actual voice note audio file for testing and update the audio_file_path
variable accordingly.
Multilingual speech translation
Breaks down language barriers by providing accurate translations while preserving the original context and meaning. This powerful feature enables:
- Real-time communication across language barriers in international meetings and conferences
- Translation of educational content for global audiences while maintaining academic integrity
- Cross-cultural business negotiations with precise translation of technical terms and cultural nuances
- Documentation translation for multinational organizations with consistent terminology
The system can detect the source language automatically and provides translations that maintain the speaker's original tone, intent, and professional context, making it invaluable for global collaboration.
Example:
This example takes an audio file containing speech in a language other than English and translates it directly into English text.
Download the sample audio here: https://files.cuantum.tech/audio/spanish_speech.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Denver, Colorado, United States"
print(f"Running Whisper speech translation example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local non-English audio file
# IMPORTANT: Replace 'spanish_speech.mp3' with the actual filename.
audio_file_path = "spanish_speech.mp3"
# --- Optional Parameters ---
# Prompt can help guide recognition of specific names/terms in the SOURCE language
# before translation, potentially improving accuracy of the final English text.
translation_prompt = None # Example: "The discussion mentions La Sagrada Familia and Parc Güell."
# --- Function to Translate Speech to English ---
def translate_speech_to_english(client, file_path, prompt=None):
"""
Translates speech from the given audio file into English using the Whisper API.
The source language is automatically detected.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file (non-English speech).
prompt (str, optional): A text prompt to guide source language recognition. Defaults to None.
Returns:
str: The translated English text, or None if an error occurs.
"""
print(f"\nAttempting to translate speech from audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: Check file size
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Translation may fail.")
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
pass # Continue attempt
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for translation to English...")
# --- Make the API Call for Translation ---
# Note: Using the 'translations' endpoint, not 'transcriptions'
response = client.audio.translations.create(
model="whisper-1", # Specify the Whisper model
file=audio_file, # The audio file object
prompt=prompt, # Optional: Provide context prompt
response_format="text" # Request plain text English output
# Other options: 'json', 'verbose_json', 'srt', 'vtt'
)
# The response object for "text" format is directly the translated string
translated_text = response
print("Translation successful.")
return translated_text
except OpenAIError as e:
print(f"OpenAI API Error during translation: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
# Whisper might error if it cannot detect speech or supported language
elif "language could not be detected" in str(e).lower():
print("Hint: Ensure the audio contains detectable speech in a supported language.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during translation: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
english_translation = translate_speech_to_english(
client,
audio_file_path,
prompt=translation_prompt
)
if english_translation:
print("\n--- Translated English Text ---")
print(english_translation)
print("-------------------------------\n")
# How this helps the use case:
print("This English translation enables:")
print("- Understanding discussions from international meetings.")
print("- Making educational content accessible to English-speaking audiences.")
print("- Facilitating cross-cultural business communication.")
print("- Creating English versions of documentation originally recorded in other languages.")
print("- Quick communication across language barriers using voice input.")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_translation_en.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(english_translation)
print(f"Translation saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving translation to file: {e}")
else:
print("\nSpeech translation failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates the Whisper API's capability to perform speech-to-text translation, specifically translating audio from various source languages directly into English text. This addresses the need for breaking down language barriers in global communication scenarios.
- Prerequisites: Requires the standard
openai
andpython-dotenv
setup, API key, and crucially, a sample audio file containing speech in a language other than English (e.g., Spanish, French, German). - Key API Endpoint: This example uses
client.audio.translations.create
, which is distinct from thetranscriptions
endpoint used previously. Thetranslations
endpoint is specifically designed to output English text. - Automatic Language Detection: A key feature highlighted is that the source language of the audio file does not need to be specified; Whisper automatically detects it before translating to English.
- Initialization: Sets up the
OpenAI
client. - Translation Function (
translate_speech_to_english
):- Takes the
client
,file_path
, and an optionalprompt
. - Opens the non-English audio file in binary read mode (
"rb"
). - API Call: Uses
client.audio.translations.create
with:model="whisper-1"
: The standard Whisper model.file=audio_file
: The audio file object.prompt
: Optionally provides context (in the source language or English, often helps with names/terms) to aid accurate recognition before translation, helping to preserve nuances and technical terms.response_format="text"
: Requests plain English text output.
- Error Handling: Includes
try...except
blocks, noting potential errors if speech or a supported language isn't detected in the audio.
- Takes the
- Output and Usefulness:
- Prints the resulting English translation text.
- Explicitly connects this output to the benefits described in the use case: enabling understanding in international meetings, translating educational/business content for global audiences, and facilitating cross-language documentation and communication.
- Shows how to optionally save the English translation to a
.txt
file.
This example effectively showcases Whisper's powerful translation feature, making it invaluable for scenarios requiring communication or content understanding across different languages. Remember to use an audio file with non-English speech for testing and update the audio_file_path
accordingly.
Accessibility tools
Enables better digital inclusion by providing real-time transcription services for hearing-impaired users, offering several key benefits:
- Empowers deaf and hard-of-hearing individuals to participate fully in audio-based content
- Provides instant access to spoken information in professional settings like meetings and conferences
- Supports educational environments by making lectures and discussions accessible to all students
- Enhances media consumption by enabling accurate, real-time captioning for videos and live streams
Example:
This example focuses on using the verbose_json
output format to get segment-level timestamps, which are essential for syncing text with audio or video.
Download the sample audio here: https://files.cuantum.tech/audio/lecture_snippet.mp3
import os
import json # To potentially parse the verbose_json output nicely
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Atlanta, Georgia, United States" # User location context
print(f"Running Whisper accessibility transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local audio file
# IMPORTANT: Replace 'lecture_snippet.mp3' with the actual filename.
audio_file_path = "lecture_snippet.mp3"
# --- Optional Parameters ---
# Specifying language is often good for accuracy in accessibility contexts
known_language = "en"
# Prompt can help with specific terminology in lectures or meetings
transcription_prompt = "The lecture discusses quantum entanglement, superposition, and Bell's theorem." # Set to None if not needed
# --- Function to Transcribe Audio with Timestamps ---
def transcribe_for_accessibility(client, file_path, language="en", prompt=None):
"""
Transcribes the audio file using Whisper, requesting timestamped output
suitable for accessibility applications (e.g., captioning).
Note on 'Real-time': This function processes a complete file. True real-time
captioning requires capturing audio in chunks, sending each chunk to the API
quickly, and displaying the results sequentially. This example generates the
*type* of data needed for such applications from a pre-recorded file.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str): The language code (ISO-639-1). Defaults to "en".
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
dict: The parsed verbose_json response containing text and segments
with timestamps, or None if an error occurs.
"""
print(f"\nAttempting to transcribe for accessibility: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: File size check
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
pass # Continue attempt
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for timestamped transcription...")
# --- Make the API Call for Transcription ---
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
language=language,
prompt=prompt,
# Request detailed JSON output including timestamps
response_format="verbose_json",
# Explicitly request segment-level timestamps
timestamp_granularities=["segment"]
)
# The response object is already a Pydantic model behaving like a dict
timestamped_data = response
print("Timestamped transcription successful.")
# You can access response.text, response.segments etc directly
return timestamped_data
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Function to Format Timestamps ---
def format_timestamp(seconds):
"""Converts seconds to HH:MM:SS.fff format."""
td = datetime.timedelta(seconds=seconds)
total_milliseconds = int(td.total_seconds() * 1000)
hours, remainder = divmod(total_milliseconds, 3600000)
minutes, remainder = divmod(remainder, 60000)
seconds, milliseconds = divmod(remainder, 1000)
return f"{hours:02}:{minutes:02}:{seconds:02}.{milliseconds:03}"
# --- Main Execution ---
if __name__ == "__main__":
transcription_data = transcribe_for_accessibility(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if transcription_data:
print("\n--- Full Transcription Text ---")
# Access the full text directly from the response object
print(transcription_data.text)
print("-------------------------------\n")
print("--- Transcription Segments with Timestamps ---")
# Iterate through segments for timestamped data
for segment in transcription_data.segments:
start_time = format_timestamp(segment['start'])
end_time = format_timestamp(segment['end'])
segment_text = segment['text']
print(f"[{start_time} --> {end_time}] {segment_text}")
print("---------------------------------------------\n")
# How this helps the use case:
print("This timestamped data enables:")
print("- Displaying captions synchronized with video or audio streams.")
print("- Highlighting text in real-time as it's spoken in meetings or lectures.")
print("- Creating accessible versions of educational/media content.")
print("- Allowing users to navigate audio by clicking on text segments.")
print("- Fuller participation for hearing-impaired individuals.")
# Optional: Save the detailed JSON output
output_json_file = os.path.splitext(audio_file_path)[0] + "_timestamps.json"
try:
# The response object can be converted to dict for JSON serialization
with open(output_json_file, "w", encoding="utf-8") as f:
# Use .model_dump_json() for Pydantic V2 models from openai>=1.0.0
# or .dict() for older versions/models
try:
f.write(transcription_data.model_dump_json(indent=2))
except AttributeError:
# Fallback for older versions or different object types
import json
f.write(json.dumps(transcription_data, default=lambda o: o.__dict__, indent=2))
print(f"Detailed timestamp data saved to '{output_json_file}'")
except (IOError, TypeError, AttributeError) as e:
print(f"Error saving timestamp data to JSON file: {e}")
else:
print("\nTranscription for accessibility failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates using the Whisper API to generate highly accurate transcriptions with segment-level timestamps. This output is crucial for accessibility applications, enabling features like real-time captioning, synchronized text highlighting, and navigable transcripts for hearing-impaired users.
- Prerequisites: Standard setup (
openai
,python-dotenv
, API key) and an audio file (lecture_snippet.mp3
). - Key API Parameters:
- Endpoint: Uses
client.audio.transcriptions.create
. response_format="verbose_json"
: This is essential. It requests a detailed JSON object containing not only the full transcription text but also a list of segments, each with start/end times and the corresponding text.timestamp_granularities=["segment"]
: Explicitly requests segment-level timing information (though often included by default withverbose_json
). Word-level timestamps can also be requested if needed (["word"]
), but segments are typically used for captioning.language
/prompt
: Specifying the language (en
) and providing a prompt can enhance accuracy, which is vital for accessibility.
- Endpoint: Uses
- "Real-Time" Consideration: The explanation clarifies that while this code processes a file, the timestamped output it generates is what's needed for real-time applications. True live captioning would involve feeding audio chunks to the API rapidly.
- Initialization & Function (
transcribe_for_accessibility
): Standard client setup. The function encapsulates the API call requestingverbose_json
. - Output Processing:
- The code first prints the full concatenated text (
transcription_data.text
). - It then iterates through the
transcription_data.segments
list. - For each
segment
, it extracts thestart
time,end
time, andtext
. - A helper function (
format_timestamp
) converts the times (in seconds) into a standardHH:MM:SS.fff
format. - It prints each segment with its timing information (e.g.,
[00:00:01.234 --> 00:00:05.678] This is the first segment text.
).
- The code first prints the full concatenated text (
- Use Case Relevance: The output clearly shows how this timestamped data directly enables the benefits described: synchronizing text with audio/video for captions, allowing participation in meetings/lectures, making educational content accessible, and enhancing media consumption.
- Saving Output: Includes an option to save the complete
verbose_json
response to a file for later use or more complex processing. It handles potential differences in serializing the response object from theopenai
library.
This example effectively demonstrates how to obtain the necessary timestamped data from Whisper to power various accessibility features, making audio content more inclusive. Remember to use a relevant audio file for testing.
Video captioning workflows
Streamlines the creation of accurate subtitles and closed captions for video content, supporting multiple output formats including SRT, WebVTT, and other industry-standard caption formats. This capability is essential for:
- Content creators who need to make their videos accessible across multiple platforms
- Broadcasting companies requiring accurate closed captioning for regulatory compliance
- Educational institutions creating accessible video content for diverse student populations
- Social media managers who want to increase video engagement through auto-captioning
The system can automatically detect speaker changes, handle timing synchronization, and format captions according to industry best practices, making it an invaluable tool for professional video production workflows.
Example:
This code example takes an audio file (which you would typically extract from your video first) and uses Whisper to create accurately timestamped captions in the standard .srt
format.
Download the audio sample here: https://files.cuantum.tech/audio/video_audio.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Austin, Texas, United States" # User location context
print(f"Running Whisper caption generation (SRT) example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your local audio file (extracted from video)
# IMPORTANT: Replace 'video_audio.mp3' with the actual filename.
audio_file_path = "video_audio.mp3"
# --- Optional Parameters ---
# Specifying language can improve caption accuracy
known_language = "en"
# Prompt can help with names, brands, or specific terminology in the video
transcription_prompt = "The video features interviews with Dr. Anya Sharma about sustainable agriculture and mentions the company 'TerraGrow'." # Set to None if not needed
# --- Function to Generate SRT Captions ---
def generate_captions_srt(client, file_path, language="en", prompt=None):
"""
Transcribes the audio file using Whisper and returns captions in SRT format.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file (extracted from video).
language (str): The language code (ISO-639-1). Defaults to "en".
prompt (str, optional): A text prompt to guide transcription. Defaults to None.
Returns:
str: The caption data in SRT format string, or None if an error occurs.
"""
print(f"\nAttempting to generate SRT captions for: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Optional: File size check
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
# return None
except OSError as e:
print(f"Error accessing file properties: {e}")
pass # Continue attempt
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print("Sending audio file to Whisper API for SRT caption generation...")
# --- Make the API Call for Transcription ---
# Request 'srt' format directly
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
language=language,
prompt=prompt,
response_format="srt" # Request SRT format output
# Other options include "vtt"
)
# The response object for "srt" format is directly the SRT string
srt_content = response
print("SRT caption generation successful.")
return srt_content
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
srt_captions = generate_captions_srt(
client,
audio_file_path,
language=known_language,
prompt=transcription_prompt
)
if srt_captions:
# --- Save the SRT Content to a File ---
output_srt_file = os.path.splitext(audio_file_path)[0] + ".srt"
try:
with open(output_srt_file, "w", encoding="utf-8") as f:
f.write(srt_captions)
print(f"\nSRT captions saved successfully to '{output_srt_file}'")
# How this helps the use case:
print("\nThis SRT file can be used to:")
print("- Add closed captions or subtitles to video players (like YouTube, Vimeo, VLC).")
print("- Import into video editing software for caption integration.")
print("- Meet accessibility requirements and regulatory compliance (e.g., broadcasting).")
print("- Improve video engagement on social media platforms.")
print("- Make educational video content accessible to more students.")
except IOError as e:
print(f"Error saving SRT captions to file: {e}")
else:
print("\nSRT caption generation failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates how to use the Whisper API to streamline video captioning workflows by directly generating industry-standard SRT (SubRip Text) subtitle files from audio.
- Prerequisites: Requires the standard openai and python-dotenv setup and API key. Crucially, it requires an audio file that has been extracted from the target video. The script includes a note explaining this necessary pre-processing step.
- Key API Parameter (response_format="srt"):
  - The core of this use case is requesting the srt format directly from the client.audio.transcriptions.create endpoint.
  - This tells Whisper to format the output according to SRT specifications, including sequential numbering, start/end timestamps (e.g., 00:00:20,000 --> 00:00:24,400), and the corresponding text chunk. The API handles the timing synchronization automatically. vtt (WebVTT) is another common format that can be requested similarly.
- Initialization & Function (generate_captions_srt): Standard client setup. The function encapsulates the API call specifically requesting SRT format. Optional language and prompt parameters can be used to enhance accuracy.
- Output Handling:
  - The response from the API call (when response_format="srt") is directly the complete content of the SRT file as a single string.
  - The code then saves this string directly into a file with the .srt extension.
- Use Case Relevance:
  - The explanation highlights how this generated .srt file directly serves the needs of content creators, broadcasters, educators, and social media managers.
  - It can be easily uploaded to video platforms, imported into editing software, or used to meet accessibility and compliance standards. It significantly simplifies the traditionally time-consuming process of manual caption creation.
- Error Handling: Includes standard checks for API and file system errors.
This example provides a highly practical demonstration of using Whisper for a common video production task, showcasing how to get accurately timestamped captions in a ready-to-use format. Remember the essential first step is to extract the audio from the video you wish to caption.
💡 Tip: Whisper works best when audio is clear, and it supports input files up to 25MB in size per request.
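If your source is a video or a long recording, a small pre-processing step usually covers both points above: extract the audio track first, then split anything larger than 25MB into smaller chunks before sending each piece to Whisper. The following is a minimal sketch, not part of the earlier scripts; it assumes ffmpeg and the pydub package are installed, and the filenames interview.mp4 and interview_audio.mp3 are placeholders.
import os
import subprocess
from openai import OpenAI
from pydub import AudioSegment  # pip install pydub (needs ffmpeg available on the system)

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

video_path = "interview.mp4"        # hypothetical source video
audio_path = "interview_audio.mp3"  # extracted audio track

# 1. Extract the audio track from the video with ffmpeg (-vn drops the video stream).
subprocess.run(
    ["ffmpeg", "-y", "-i", video_path, "-vn", "-acodec", "libmp3lame", "-q:a", "4", audio_path],
    check=True,
)

# 2. If the file is under the 25MB limit, send it in one request;
#    otherwise split it into ~10-minute chunks (usually well under 25MB as MP3).
#    Note: naive time-based splitting can cut a word at a chunk boundary.
max_bytes = 25 * 1024 * 1024
chunk_ms = 10 * 60 * 1000
transcript_parts = []

if os.path.getsize(audio_path) <= max_bytes:
    with open(audio_path, "rb") as f:
        transcript_parts.append(
            client.audio.transcriptions.create(model="whisper-1", file=f, response_format="text")
        )
else:
    audio = AudioSegment.from_file(audio_path)
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk_path = f"chunk_{i}.mp3"
        audio[start:start + chunk_ms].export(chunk_path, format="mp3")
        with open(chunk_path, "rb") as f:
            transcript_parts.append(
                client.audio.transcriptions.create(model="whisper-1", file=f, response_format="text")
            )

print(" ".join(transcript_parts))
For caption formats such as SRT, chunking also requires shifting each chunk's timestamps by its offset, so for subtitle work it is usually simpler to keep files under the limit.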
2.2.2 Response Formats You Can Use
Whether you're building a simple transcription service or developing a complex video captioning system, understanding these formats is crucial for effectively implementing Whisper in your applications.
The flexibility of Whisper's response formats allows developers to seamlessly integrate transcription and translation capabilities into different types of applications, from basic text output to more sophisticated JSON-based processing pipelines. Each format serves distinct purposes and offers unique advantages depending on your use case.
Whisper supports multiple response formats, selected with the response_format parameter, so you can match the output to your needs:
- json (the default): a JSON object containing the transcribed text
- text: the transcription as a plain string, with no extra metadata
- srt: SubRip subtitles with sequence numbers and start/end timestamps
- verbose_json: JSON that adds metadata such as the detected language, audio duration, and per-segment timestamps
- vtt: WebVTT captions for web and HTML5 video players
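As a quick illustration of how the response_format parameter changes what the API returns, here is a minimal sketch (separate from the earlier scripts); it assumes the client is configured as before and uses a placeholder filename, meeting_snippet.mp3.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
audio_path = "meeting_snippet.mp3"  # placeholder audio file

# "text": the return value is just the transcription string.
with open(audio_path, "rb") as f:
    plain = client.audio.transcriptions.create(
        model="whisper-1", file=f, response_format="text"
    )
print(plain[:120])

# "verbose_json": the return value is an object with metadata and timed segments.
with open(audio_path, "rb") as f:
    verbose = client.audio.transcriptions.create(
        model="whisper-1", file=f, response_format="verbose_json"
    )
print(verbose.language, verbose.duration)   # detected language, length in seconds
for seg in verbose.segments[:3]:            # first few segments with timestamps
    print(f"{seg.start:6.2f}s - {seg.end:6.2f}s  {seg.text}")
The srt and vtt formats, shown in the captioning example above, also come back as plain strings ready to be written to .srt or .vtt files.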
2.2.3 Practical Applications of Whisper
Whisper's versatility extends far beyond basic transcription, offering numerous practical applications across different industries and use cases. From streamlining business operations to enhancing educational experiences, Whisper's capabilities can be leveraged in innovative ways to solve real-world challenges. Let's explore some of the most impactful applications of this technology and understand how they can benefit different user groups.
These applications demonstrate how Whisper can be integrated into various workflows to improve efficiency, accessibility, and communication across different sectors. Each use case showcases unique advantages and potential implementations that can transform how we handle audio content in professional and personal contexts.
Note-taking Tools
Transform spoken content into written text automatically through advanced speech recognition, enabling quick and accurate documentation of verbal communications. This technology excels at capturing natural speech patterns, technical terminology, and multi-speaker discussions, making it easier to document and review lectures, interviews, and meetings. The automated transcription process preserves the context and flow of the conversation while converting audio into easily searchable and editable text; attributing lines to individual speakers requires pairing Whisper with a separate diarization tool.
This tool is particularly valuable for:
- Students capturing detailed lecture notes while staying engaged in class discussions
- Journalists documenting interviews and press conferences with precise quotations
- Business professionals keeping accurate records of client meetings and team brainstorming sessions
- Researchers conducting and transcribing qualitative interviews
The automated notes can be further enhanced with features like timestamp markers, speaker identification, and keyword highlighting, making it easier to navigate and reference specific parts of the discussion later. This dramatically improves productivity by eliminating the need for manual transcription while ensuring comprehensive documentation of important conversations.
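To make notes like these easy to navigate, one option is to request verbose_json and turn each timed segment into a timestamped line. The following is a minimal sketch under those assumptions; lecture.mp3 and the output filename are placeholders, and speaker labels would still require a separate diarization step.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

with open("lecture.mp3", "rb") as f:  # placeholder recording
    result = client.audio.transcriptions.create(
        model="whisper-1", file=f, response_format="verbose_json"
    )

# Write searchable notes with [MM:SS] markers taken from each segment.
with open("lecture_notes.txt", "w", encoding="utf-8") as notes:
    for seg in result.segments:
        minutes, seconds = divmod(int(seg.start), 60)
        notes.write(f"[{minutes:02d}:{seconds:02d}] {seg.text.strip()}\n")

print(f"Wrote {len(result.segments)} timestamped note lines to lecture_notes.txt")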
Multilingual Applications
Enable seamless communication across language barriers by quickly converting speech in other languages into English text. Whisper's translation capability supports a wide range of source languages with high accuracy, detects the spoken language automatically, and handles varied accents and dialects, making it particularly valuable for:
- International Business: Facilitating real-time communication in multinational meetings, negotiations, and presentations without the need for human interpreters
- Travel and Tourism: Enabling travelers to communicate effectively with locals, understand announcements, and navigate foreign environments
- Global Education: Supporting distance learning programs and international student exchanges by breaking down language barriers
- Customer Service: Allowing support teams to assist customers in their preferred language, improving service quality and satisfaction
The technology works across different audio environments and can handle multiple speakers, making it ideal for international business meetings, travel applications, global communication platforms, and cross-cultural exchanges.
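For the translation direction specifically, Whisper exposes a separate translations endpoint that takes non-English audio and returns English text. Below is a minimal sketch, assuming the same client setup as the earlier examples and a placeholder file, spanish_voicemail.mp3.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

with open("spanish_voicemail.mp3", "rb") as f:  # placeholder non-English audio
    english_text = client.audio.translations.create(
        model="whisper-1",
        file=f,
        response_format="text",  # return the English translation as a plain string
    )

print(english_text)
Note that this endpoint always translates into English; translating between two non-English languages requires a second step, for example passing the transcription to a chat model.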
Accessibility Solutions
Create real-time captions and transcripts for hearing-impaired individuals, ensuring equal access to audio content. This essential technology serves multiple accessibility purposes:
- Real-time Captioning: Provides immediate text representation of spoken words during live events, meetings, and presentations
- Accurate Transcription: Generates detailed written records of audio content for later reference and study
- Multi-format Support: Offers content in various formats including closed captions, subtitles, and downloadable transcripts
This technology can be integrated into:
- Video Conferencing Platforms: Enabling real-time captioning during virtual meetings and webinars
- Educational Platforms: Supporting students with hearing impairments in both online and traditional classroom settings
- Media Players: Providing synchronized captions for videos, podcasts, and other multimedia content (a short WebVTT sketch follows at the end of this subsection)
- Live Events: Offering real-time text displays during conferences, performances, and public speaking events
These accessibility features not only support individuals with hearing impairments but also benefit:
- Non-native speakers who prefer reading along while listening
- People in sound-sensitive environments where audio isn't practical
- Visual learners who process information better through text
- Organizations aiming to comply with accessibility regulations and standards
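For the media-player integration mentioned above, WebVTT is usually the most convenient target, since HTML5 video players load .vtt files through a track element. The following is a minimal sketch, assuming the usual client setup and a placeholder file, webinar_audio.mp3.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

with open("webinar_audio.mp3", "rb") as f:  # placeholder audio from a recorded webinar
    vtt_captions = client.audio.transcriptions.create(
        model="whisper-1", file=f, response_format="vtt"  # WebVTT caption output
    )

with open("webinar_captions.vtt", "w", encoding="utf-8") as out:
    out.write(vtt_captions)

# In an HTML page the captions can then be attached to the video, e.g.:
# <video controls src="webinar.mp4">
#   <track kind="captions" src="webinar_captions.vtt" srclang="en" label="English" default>
# </video>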
Podcast Enhancement
Convert audio content into searchable text formats through comprehensive transcription, enabling multiple benefits (a minimal keyword-search sketch follows this list):
- Content Discovery: Listeners can easily search through episode transcripts to find specific topics, quotes, or discussions of interest
- Content Repurposing: Creators can extract key quotes and insights for social media, blog posts, or newsletters
- Accessibility: Makes content available to hearing-impaired audiences and those who prefer reading
- SEO Benefits: Search engines can index the full content of episodes, improving discoverability and organic traffic
- Enhanced Engagement: Readers can skim content before listening, bookmark important sections, and reference material later
- Analytics Insights: Analyze most-searched terms and popular segments to inform future content strategy
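Here is a minimal sketch of the content-discovery idea: transcribe an episode with verbose_json and scan the timed segments for a keyword, so listeners (or your show-notes tooling) can jump straight to the right moment. The episode filename and search term below are placeholders.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

with open("episode_42.mp3", "rb") as f:  # placeholder podcast episode
    episode = client.audio.transcriptions.create(
        model="whisper-1", file=f, response_format="verbose_json"
    )

keyword = "open source"  # placeholder search term
for seg in episode.segments:
    if keyword.lower() in seg.text.lower():
        minutes, seconds = divmod(int(seg.start), 60)
        print(f"Mentioned at {minutes:02d}:{seconds:02d} - {seg.text.strip()}")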
Language Learning Support
Provide immediate feedback to language learners by converting their spoken practice into text, allowing them to check pronunciation, grammar, and vocabulary usage. This technology creates an interactive and effective learning experience through several key mechanisms:
- Real-time Pronunciation Feedback: Learners can compare their spoken words with the transcribed text to identify pronunciation errors and areas for improvement
- Grammar Analysis: The system can highlight grammatical structures and potential errors in the transcribed text, helping learners understand their mistakes
- Vocabulary Enhancement: Students can track their active vocabulary usage and receive suggestions for more varied word choices
The technology particularly benefits:
- Self-directed learners practicing independently
- Language teachers tracking student progress
- Online language learning platforms offering speaking practice
- Language exchange participants wanting to verify their communication
When integrated with language learning applications, this feature can provide structured practice sessions, progress tracking, and personalized feedback that helps learners build confidence in their speaking abilities.
Example:
This example takes an audio file of a learner speaking a specific target language and transcribes it, providing the essential text output needed for pronunciation comparison, grammar analysis, or vocabulary review.
Download the sample audio file here: https://files.cuantum.tech/audio/language_practice_fr.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
# Location context from user prompt history/context block
current_location = "Orlando, Florida, United States" # User location context
print(f"Running Whisper language learning transcription example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in your .env file or environment.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# --- Language Learning Specific Configuration ---
# Define the path to the language practice audio file
# IMPORTANT: Replace 'language_practice_fr.mp3' with the actual filename.
audio_file_path = "language_practice_fr.mp3"
# ** CRUCIAL: Specify the language being spoken in the audio file **
# Use the correct ISO-639-1 code (e.g., "en", "es", "fr", "de", "ja", "zh")
# This tells Whisper how to interpret the sounds.
target_language = "fr" # Example: French
# Optional: Provide context about the practice session (e.g., the expected phrase)
# This can help improve accuracy, especially for specific vocabulary or sentences.
practice_prompt = "The student is practicing ordering food at a French cafe, mentioning croissants and coffee." # Set to None if not needed
# --- Function to Transcribe Language Practice Audio ---
def transcribe_language_practice(client, file_path, language, prompt=None):
"""
Transcribes audio of language practice using the Whisper API.
Specifying the target language is crucial for this use case.
Args:
client: The initialized OpenAI client.
file_path (str): The path to the audio file.
language (str): The ISO-639-1 code of the language being spoken.
prompt (str, optional): A text prompt providing context. Defaults to None.
Returns:
str: The transcribed text in the target language, or None if an error occurs.
"""
print(f"\nAttempting to transcribe language practice ({language}) from: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
if not language:
print("Error: Target language must be specified for language learning transcription.")
return None
# Optional: File size check
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")
if file_size_mb > 25:
print("Warning: File size exceeds 25MB limit. Transcription may fail.")
except OSError as e:
print(f"Error accessing file properties: {e}")
pass
try:
# Open the audio file in binary read mode
with open(file_path, "rb") as audio_file:
print(f"Sending audio ({language}) to Whisper API for transcription...")
# --- Make the API Call for Transcription ---
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
language=language, # Pass the specified target language
prompt=prompt,
response_format="text" # Get plain text for easy comparison/analysis
)
# The response object for "text" format is directly the string
practice_transcription = response
print("Transcription successful.")
return practice_transcription
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
if "invalid_file_format" in str(e):
print("Hint: Ensure the audio file format is supported.")
elif "maximum file size" in str(e).lower():
print("Hint: The audio file exceeds the 25MB size limit.")
elif "invalid language code" in str(e).lower():
print(f"Hint: Check if '{language}' is a valid ISO-639-1 language code supported by Whisper.")
return None
except FileNotFoundError:
print(f"Error: Audio file not found at '{file_path}'")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
transcribed_text = transcribe_language_practice(
client,
audio_file_path,
language=target_language,
prompt=practice_prompt
)
if transcribed_text:
print("\n--- Transcribed Practice Text ---")
print(f"(Language: {target_language})")
print(transcribed_text)
print("---------------------------------\n")
# How this helps the use case:
print("This transcription provides the basis for language learning feedback:")
print("- **Pronunciation Check:** Learner compares their speech to the text, identifying discrepancies.")
print("- **Grammar/Vocabulary Analysis:** This text can be compared against expected sentences or analyzed (potentially by another AI like GPT-4o, or specific tools) for grammatical correctness and vocabulary usage.")
print("- **Progress Tracking:** Teachers or platforms can store transcriptions to monitor improvement over time.")
print("- **Self-Correction:** Learners get immediate textual representation of their speech for review.")
print("\nNote: Further analysis (grammar checking, pronunciation scoring, etc.) requires additional logic beyond this transcription step.")
# Optional: Save to a file
output_txt_file = os.path.splitext(audio_file_path)[0] + "_transcription.txt"
try:
with open(output_txt_file, "w", encoding="utf-8") as f:
f.write(transcribed_text)
print(f"Transcription saved to '{output_txt_file}'")
except IOError as e:
print(f"Error saving transcription to file: {e}")
else:
print("\nLanguage practice transcription failed. Please check error messages above.")
Code breakdown:
- Context: This code demonstrates using the Whisper API as a tool for language learning support. It transcribes audio recordings of a learner speaking a target language, providing the essential text output needed for various feedback mechanisms.
- Prerequisites: Requires the standard openai and python-dotenv setup, API key, and an audio file containing the learner's spoken practice in the target language (e.g., language_practice_fr.mp3).
- Key Parameter (language):
  - Unlike general transcription, where auto-detect might suffice, for language learning explicitly specifying the language the learner is attempting to speak is crucial. This is set using the target_language variable (e.g., "fr" for French).
  - This ensures Whisper interprets the audio using the correct phonetic and vocabulary model, providing a more accurate transcription for feedback.
- Optional Prompt: The prompt parameter can be used to give Whisper context about the practice session (e.g., the specific phrase being practiced), which can improve recognition accuracy.
- Initialization & Function (transcribe_language_practice): Standard client setup. The function requires the language parameter and performs the transcription using client.audio.transcriptions.create.
- Output (response_format="text"): Plain text output is requested as it's the most direct format for learners or systems to compare against expected text, analyze grammar, or review vocabulary.
- Feedback Mechanism (Important Note): The explanation clearly states that this script only provides the transcription. The actual feedback (pronunciation scoring, grammar correction, vocabulary suggestions) requires additional processing. This transcribed text serves as the input for those subsequent analysis steps, which could involve rule-based checks, comparison algorithms, or even feeding the text to another LLM like GPT-4o for analysis.
- Use Case Relevance: The output section explains how this transcription enables the described benefits: allowing learners to check their pronunciation against text, providing material for grammar/vocabulary analysis, facilitating progress tracking, and supporting self-correction.
This example provides a practical starting point for integrating Whisper into language learning applications, focusing on generating the core textual data needed for effective feedback loops. Remember to use an audio file of someone speaking the specified target_language for testing.
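As noted in the breakdown, the transcription itself is only the input to the feedback step. One common follow-up, sketched below on the assumption that you have access to a chat model such as gpt-4o, is to pass the learner's transcribed sentence to the model and ask for corrections; the transcribed_text and target_language values here are illustrative stand-ins for the outputs of the script above.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

transcribed_text = "Je voudrais un croissant et une café, s'il vous plaît."  # example Whisper output
target_language = "fr"

feedback = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a friendly language tutor. The student spoke in the target language "
                "and their speech was transcribed automatically. Point out grammar or vocabulary "
                "mistakes, suggest a corrected sentence, and keep the feedback brief."
            ),
        },
        {
            "role": "user",
            "content": f"Target language: {target_language}\nTranscribed sentence: {transcribed_text}",
        },
    ],
)

print(feedback.choices[0].message.content)
Pronunciation scoring is a separate problem: comparing the transcription against the sentence the learner intended to say is a reasonable lightweight proxy, but dedicated pronunciation tools work directly on the audio.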
Summary
In this section, you learned how to:
- Transcribe spoken language into text using Whisper
  - Convert various audio formats into accurate text transcriptions
  - Handle multiple languages and accents with high accuracy
  - Process both short clips and longer recordings effectively
- Translate foreign-language audio to English
  - Convert non-English speech directly to English text
  - Maintain context and meaning across language barriers
  - Support multilingual content processing
- Choose between plain text, JSON, and subtitle outputs
  - Select the most appropriate format for your specific use case
  - Generate subtitles with precise timestamps
  - Structure data in machine-readable JSON format
- Apply these tools in real-world use cases like accessibility, education, and content creation
  - Create accessible content with accurate transcriptions
  - Support language learning and educational initiatives
  - Streamline content production workflows
Whisper is incredibly fast, easy to use, and works across languages—making it one of the most valuable tools in the OpenAI ecosystem. Its versatility and accuracy make it suitable for both individual users and enterprise-scale applications, while its open-source nature allows for custom implementations and modifications to suit specific needs.