Chapter 2: Audio Understanding and Generation with Whisper and GPT-4o
2.3 Speech Understanding in GPT-4o
In this section, you'll learn how to work with audio processing capabilities that go beyond basic transcription. GPT-4o changes how audio is handled by allowing audio files to be combined with textual prompts in a single multimodal interaction, so audio and text inputs are processed together. The system can analyze multiple aspects of speech, including tone, context, and semantic meaning, enabling you to build assistants that listen, understand, and respond naturally to the situation at hand.
The technology represents a significant advancement in audio processing by combining Whisper-style transcription with GPT-4o's advanced reasoning capabilities. While Whisper excels at converting speech to text, GPT-4o takes this further by performing deep analysis of the transcribed content. This integration happens in one fluid interaction, eliminating the need for separate processing steps. For example, when processing a business meeting recording, GPT-4o can simultaneously transcribe the speech, identify speakers, extract action items, and generate summaries - all while maintaining context and understanding subtle nuances in communication.
This powerful combination opens up unprecedented possibilities for creating more intuitive and responsive AI applications. These applications can not only process and understand spoken language but can also interpret context, emotion, and intent in ways that were previously not possible. Whether it's analyzing customer service calls, processing educational lectures, or facilitating multilingual communication, the system provides a comprehensive understanding of spoken content that goes far beyond simple transcription.
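The hands-on examples in this section use a two-step pipeline, with Whisper handling transcription and GPT-4o handling analysis, because it works with the standard Chat Completions setup. If your account has access to an audio-capable GPT-4o model, audio can also be sent directly alongside a text prompt in a single request. The sketch below illustrates that path; the model name gpt-4o-audio-preview and the file name speech_sample.mp3 are assumptions, so check the current model list before relying on it.
import base64
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

# Read and base64-encode a local audio file (assumed name: speech_sample.mp3)
with open("speech_sample.mp3", "rb") as f:
    encoded_audio = base64.b64encode(f.read()).decode("utf-8")

# Send the audio and a text instruction together in one multimodal request
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",   # assumed audio-capable model name
    modalities=["text"],            # ask for a text answer
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Summarize what the speaker says and describe their tone."},
                {"type": "input_audio",
                 "input_audio": {"data": encoded_audio, "format": "mp3"}},
            ],
        }
    ],
)

print(completion.choices[0].message.content)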
2.3.1 Why GPT-4o for Speech?
While the Whisper API excels at converting spoken language into written text, GPT-4o represents a revolutionary leap forward in audio processing capabilities. To understand the distinction, imagine Whisper as a highly skilled transcriptionist who can accurately write down every word spoken, while GPT-4o functions more like an experienced analyst with deep contextual understanding.
GPT-4o's capabilities extend far beyond basic transcription. It can understand and process speech at multiple levels simultaneously:
Semantic Understanding
Comprehends the actual meaning behind the words, going beyond simple word-for-word translation. This advanced capability allows GPT-4o to process language at multiple levels simultaneously, understanding not only the literal meaning but also the deeper semantic layers, cultural context, and intended message. This includes understanding idioms, metaphors, cultural references, and regional expressions within the speech, as well as detecting subtle nuances in communication that might be lost in simple transcription.
For example, when someone says "it's raining cats and dogs," GPT-4o understands this means heavy rainfall rather than literally interpreting animals falling from the sky. Similarly, when processing phrases like "break a leg" before a performance or "piece of cake" to describe an easy task, the system correctly interprets these idiomatic expressions within their cultural context.
It can also grasp complex concepts like sarcasm ("Oh, great, another meeting"), humor ("Why did the GPT model cross the road?"), and rhetorical questions ("Who wouldn't want that?"), making it capable of truly understanding human communication in its full context. This sophisticated understanding extends to cultural-specific references, professional jargon, and even regional dialectical variations, ensuring accurate interpretation regardless of the speaker's background or communication style.
Example:
Since the standard OpenAI API interaction for this typically involves first converting speech to text (using Whisper) and then analyzing that text for deeper meaning (using GPT-4o), the code example will demonstrate this two-step process.
This script will:
- Transcribe an audio file containing potentially nuanced language using Whisper.
- Send the transcribed text to GPT-4o with a prompt asking for semantic interpretation.
Download the audio sample: https://files.cuantum.tech/audio/idiom_speech.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Record when this example run starts (uses the datetime import above)
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"Running GPT-4o semantic speech understanding example at: {current_timestamp}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with nuanced speech
# IMPORTANT: Replace 'idiom_speech.mp3' with the actual filename.
# Good examples for audio content: "Wow, that presentation just knocked my socks off!",
# "Sure, I'd LOVE to attend another three-hour meeting.", "He really spilled the beans."
audio_file_path = "idiom_speech.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Text for Semantic Meaning using GPT-4o ---
def analyze_text_meaning(client, text_to_analyze):
"""Sends transcribed text to GPT-4o for semantic analysis."""
print(f"\nStep 2: Analyzing text for semantic meaning: \"{text_to_analyze}\"")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Construct prompt to ask for deeper meaning
system_prompt = "You are an expert in linguistics and communication."
user_prompt = (
f"Analyze the following phrase or sentence:\n\n'{text_to_analyze}'\n\n"
"Explain its likely intended meaning, considering context, idioms, "
"metaphors, sarcasm, humor, cultural references, or other nuances. "
"Go beyond a literal, word-for-word interpretation."
)
try:
print("Sending text to GPT-4o for analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for its strong understanding capabilities
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=250, # Adjust as needed
temperature=0.5 # Lower temperature for more focused analysis
)
analysis = response.choices[0].message.content
print("Semantic analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio
transcribed_text = transcribe_speech(client, audio_file_path)
if transcribed_text:
print(f"\nTranscription Result: {transcribed_text}")
# Step 2: Analyze the transcription for meaning
semantic_analysis = analyze_text_meaning(client, transcribed_text)
if semantic_analysis:
print("\n--- Semantic Analysis Result ---")
print(semantic_analysis)
print("--------------------------------\n")
print("This demonstrates GPT-4o understanding nuances beyond literal text.")
else:
print("\nSemantic analysis failed.")
else:
print("\nTranscription failed, cannot proceed to analysis.")
Code breakdown:
- Context: This code demonstrates GPT-4o's advanced semantic understanding of speech. It goes beyond simple transcription by interpreting the meaning, including nuances like idioms, sarcasm, or context-dependent phrases.
- Two-Step Process: The example uses a standard two-step API approach:
  - Step 1 (Whisper): The audio file is first converted into text using the Whisper API (`client.audio.transcriptions.create`). This captures the spoken words accurately.
  - Step 2 (GPT-4o): The transcribed text is then sent to the GPT-4o model (`client.chat.completions.create`) with a specific prompt asking it to analyze the meaning behind the words, considering non-literal interpretations.
- Prerequisites: Requires the standard `openai` and `python-dotenv` setup, an API key, and, crucially, an audio file containing speech with some nuance (e.g., an idiom like "spill the beans", a sarcastic remark like "Oh great, another meeting", or a culturally specific phrase).
- Transcription Function (`transcribe_speech`): Handles Step 1, taking the audio file path and returning the plain text transcription from Whisper.
- Semantic Analysis Function (`analyze_text_meaning`):
  - Handles Step 2, taking the transcribed text.
  - Prompt Design: It constructs a prompt specifically asking GPT-4o to act as a linguistic expert and explain the intended meaning, considering idioms, sarcasm, context, etc., explicitly requesting analysis beyond the literal interpretation.
  - Uses `gpt-4o` as the model for its strong reasoning and understanding capabilities.
  - Returns the analysis provided by GPT-4o.
- Main Execution: The script first transcribes the audio. If successful, it passes the text to the analysis function. Finally, it prints both the literal transcription and GPT-4o's semantic interpretation.
- Use Case Relevance: This example clearly shows how combining Whisper and GPT-4o allows for a deeper understanding of spoken language than transcription alone. It demonstrates the capability described – comprehending idioms ("raining cats and dogs"), sarcasm, humor, and context – making AI interaction more aligned with human communication.
Remember to use an audio file containing non-literal language for testing to best showcase the semantic analysis step. Replace `'idiom_speech.mp3'` with your actual file path.
Contextual Analysis
Interprets statements within their broader context, taking into account surrounding information, previous discussions, cultural references, and situational factors. This includes understanding how time, place, speaker relationships, and prior conversations influence meaning. The analysis considers multiple layers of context:
- Temporal Context: When something is said (time of day, day of week, season, or historical period)
- Social Context: The relationships between speakers, power dynamics, and social norms
- Physical Context: The location and environment where communication occurs
- Cultural Context: Shared knowledge, beliefs, and customs that influence interpretation
For example, the phrase "it's getting late" could mean different things in different contexts:
- During a workday meeting: A polite suggestion to wrap up the discussion
- At a social gathering: An indication that someone needs to leave
- From a parent to a child: A reminder about bedtime
- In a project discussion: Concern about approaching deadlines
GPT-4o analyzes these contextual clues along with additional factors such as tone of voice, speech patterns, and conversation history to provide more accurate and nuanced interpretations of spoken communication. This deep contextual understanding allows the system to capture the true intended meaning behind words, rather than just their literal interpretation.
Example:
This use case focuses on GPT-4o's ability to interpret transcribed speech within its broader context (temporal, social, physical, cultural). Like the semantic understanding example, this typically involves a two-step process: transcribing the speech with Whisper, then analyzing the text with GPT-4o, but this time explicitly providing contextual information to GPT-4o.
This code example will:
- Transcribe a simple, context-dependent phrase from an audio file using Whisper.
- Send the transcribed text to GPT-4o multiple times, each time providing a different context description.
- Show how GPT-4o's interpretation of the same phrase changes based on the provided context.
Download the sample audio: https://files.cuantum.tech/audio/context_phrase.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Record when this example run starts (uses the datetime import above)
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"Running GPT-4o contextual speech analysis example at: {current_timestamp}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with the context-dependent phrase
# IMPORTANT: Replace 'context_phrase.mp3' with the actual filename.
# The audio content should ideally be just "It's getting late."
audio_file_path = "context_phrase.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from the previous example (gpt4o_speech_semantic_py)
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Text for Meaning WITHIN a Given Context using GPT-4o ---
def analyze_text_with_context(client, text_to_analyze, context_description):
"""Sends transcribed text and context description to GPT-4o for analysis."""
print(f"\nStep 2: Analyzing text \"{text_to_analyze}\" within context...")
print(f"Context Provided: {context_description}")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
if not context_description:
print("Error: Context description must be provided for this analysis.")
return None
# Construct prompt asking for interpretation based on context
system_prompt = "You are an expert in analyzing communication and understanding context."
user_prompt = (
f"Consider the phrase: '{text_to_analyze}'\n\n"
f"Now, consider the specific context in which it was said: '{context_description}'\n\n"
"Based *only* on this context, explain the likely intended meaning, implication, "
"or function of the phrase in this situation."
)
try:
print("Sending text and context to GPT-4o for analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong contextual reasoning
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=200, # Adjust as needed
temperature=0.3 # Lower temperature for more focused contextual interpretation
)
analysis = response.choices[0].message.content
print("Contextual analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio phrase
transcribed_phrase = transcribe_speech(client, audio_file_path)
if transcribed_phrase:
print(f"\nTranscription Result: \"{transcribed_phrase}\"")
# Define different contexts for the same phrase
contexts = [
"Said during a business meeting scheduled to end at 5:00 PM, spoken at 4:55 PM.",
"Said by a guest at a social party around 1:00 AM.",
"Said by a parent to a young child at 9:00 PM on a school night.",
"Said during a critical project discussion about an upcoming deadline, spoken late in the evening.",
"Said by someone looking out the window on a short winter afternoon."
]
print("\n--- Analyzing Phrase in Different Contexts ---")
# Step 2: Analyze the phrase within each context
for i, context in enumerate(contexts):
print(f"\n--- Analysis for Context {i+1} ---")
contextual_meaning = analyze_text_with_context(
client,
transcribed_phrase,
context
)
if contextual_meaning:
print(f"Meaning in Context: {contextual_meaning}")
else:
print("Contextual analysis failed for this context.")
print("------------------------------------")
print("\nThis demonstrates how GPT-4o interprets the same phrase differently based on provided context.")
else:
print("\nTranscription failed, cannot proceed to contextual analysis.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability for contextual analysis of speech. It shows how the interpretation of a spoken phrase can change dramatically depending on the surrounding situation (temporal, social, situational factors).
- Two-Step Process with Context Injection:
  - Step 1 (Whisper): The audio file containing a context-dependent phrase (e.g., "It's getting late.") is transcribed into text using `client.audio.transcriptions.create`.
  - Step 2 (GPT-4o): The transcribed text is then sent to GPT-4o (`client.chat.completions.create`), but crucially, the prompt now includes a description of the specific context in which the phrase was spoken.
- Prerequisites: Requires the standard `openai` and `python-dotenv` setup, an API key, and an audio file containing a simple phrase whose meaning heavily depends on context (the example uses "It's getting late.").
- Transcription Function (`transcribe_speech`): This function (reused from the previous example) handles Step 1.
- Contextual Analysis Function (`analyze_text_with_context`):
  - Handles Step 2 and now accepts an additional argument: `context_description`.
  - Prompt Design: The prompt explicitly provides both the transcribed phrase and the `context_description` to GPT-4o, asking it to interpret the phrase within that specific situation.
  - Uses `gpt-4o` for its ability to reason based on provided context.
- Demonstrating Context Dependency (Main Execution):
  - The script first transcribes the phrase (e.g., "It's getting late.").
  - It then defines a list of different context descriptions (meeting ending, late-night party, bedtime, project deadline, short winter day).
  - It calls the `analyze_text_with_context` function repeatedly, using the same transcribed phrase but providing a different context description each time.
  - By printing the analysis result for each context, the script clearly shows how GPT-4o's interpretation shifts based on the context provided (e.g., suggesting wrapping up vs. indicating tiredness vs. noting dwindling daylight).
- Use Case Relevance: This highlights GPT-4o's sophisticated understanding, moving beyond literal words to grasp intended meaning influenced by temporal, social, and situational factors. This is vital for applications needing accurate interpretation of real-world communication in business, social interactions, or any context-rich environment. It shows how developers can provide relevant context alongside transcribed text to get more accurate and nuanced interpretations from the AI.
For testing this code effectively, either create an audio file containing just the phrase "It's getting late" (or another context-dependent phrase), or download the provided sample file. Remember to update the `'context_phrase.mp3'` path to match your file location.
Summary Generation
GPT-4o's summary generation capabilities represent a significant advancement in AI-powered content analysis. The system creates concise, meaningful summaries of complex discussions by intelligently distilling key information from lengthy conversations, meetings, or presentations. Using advanced natural language processing and contextual understanding, GPT-4o can identify main themes, critical points, and essential takeaways while maintaining the core meaning and context of the original discussion.
The system employs several sophisticated techniques:
- Pattern Recognition: Identifies recurring themes and important discussion points across long conversations
- Contextual Analysis: Understands the broader context and relationships between different parts of the discussion
- Priority Detection: Automatically determines which information is most crucial for the summary
- Semantic Understanding: Captures underlying meanings and implications beyond just surface-level content
The generated summaries can be customized for different purposes and audiences:
- Executive Briefings: Focused on strategic insights and high-level decisions
- Meeting Minutes: Detailed documentation of discussions and action items
- Quick Overviews: Condensed highlights for rapid information consumption
- Technical Summaries: Emphasis on specific technical details and specifications
What sets GPT-4o apart is its ability to preserve important details while significantly reducing information overload, making it an invaluable tool for modern business communication and knowledge management.
Example:
This example focuses on GPT-4o's ability to generate concise and meaningful summaries from potentially lengthy spoken content obtained via Whisper.
This involves the familiar two-step process: first, transcribing the audio with Whisper to get the full text, and second, using GPT-4o's language understanding capabilities to analyze and summarize that text according to specific needs. This example will demonstrate generating different types of summaries from the same transcription.
Download the sample audio: https://files.cuantum.tech/audio/discussion_audio.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Record when this example run starts (uses the datetime import above)
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"Running GPT-4o speech summarization example at: {current_timestamp}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'discussion_audio.mp3' with the actual filename.
audio_file_path = "discussion_audio.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Add note about chunking for long files if size check implemented
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before summarization.")
except OSError:
pass # Ignore size check error, proceed with transcription attempt
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Generate Summary from Text using GPT-4o ---
def summarize_text(client, text_to_summarize, summary_style="concise overview"):
"""Sends transcribed text to GPT-4o for summarization."""
print(f"\nStep 2: Generating '{summary_style}' summary...")
if not text_to_summarize:
print("Error: No text provided for summarization.")
return None
# Tailor the prompt based on the desired summary style
system_prompt = "You are an expert meeting summarizer and information distiller."
user_prompt = f"""Please generate a {summary_style} of the following discussion transcription.
Focus on accurately capturing the key information relevant to a {summary_style}. For example:
- For an 'executive briefing', focus on strategic points, decisions, and outcomes.
- For 'detailed meeting minutes', include main topics, key arguments, decisions, and action items.
- For a 'concise overview', provide the absolute main points and purpose.
- For a 'technical summary', emphasize technical details, specifications, or findings.
Transcription Text:
---
{text_to_summarize}
---
Generate the {summary_style}:
"""
try:
print(f"Sending text to GPT-4o for {summary_style}...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong summarization
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=400, # Adjust based on expected summary length
temperature=0.5 # Balance creativity and focus
)
summary = response.choices[0].message.content
print(f"'{summary_style}' generation successful.")
return summary.strip()
except OpenAIError as e:
print(f"OpenAI API Error during summarization: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during summarization: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("--------------------------")
# Step 2: Generate summaries in different styles
summary_styles_to_generate = [
"concise overview",
"detailed meeting minutes with action items",
"executive briefing focusing on decisions",
# "technical summary" # Add if relevant to your audio content
]
print("\n--- Generating Summaries ---")
for style in summary_styles_to_generate:
print(f"\n--- Summary Style: {style} ---")
summary_result = summarize_text(
client,
full_transcription,
summary_style=style
)
if summary_result:
print(summary_result)
else:
print(f"Failed to generate '{style}'.")
print("------------------------------------")
print("\nThis demonstrates GPT-4o generating different summaries from the same transcription based on the prompt.")
else:
print("\nTranscription failed, cannot proceed to summarization.")
Code breakdown:
- Context: This code demonstrates GPT-4o's advanced capability for summary generation from spoken content. It leverages the two-step process: transcribing audio with Whisper and then using GPT-4o to intelligently distill the key information from the transcription into a concise summary.
- Handling Lengthy Audio (Crucial Note): The prerequisites and code comments explicitly address the 25MB limit of the Whisper API. For real-world long meetings or presentations, the audio must be chunked, each chunk transcribed separately, and the resulting texts concatenated before being passed to the summarization step. The code example itself processes a single audio file for simplicity, but this workflow is essential for longer content; a minimal chunking sketch follows this breakdown.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file representing the discussion to be summarized (`discussion_audio.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1, converting the input audio (or audio chunk) into plain text using Whisper.
- Summarization Function (`summarize_text`):
  - Handles Step 2, taking the full transcribed text as input.
  - Customizable Summaries: Accepts a `summary_style` argument (e.g., "executive briefing", "detailed meeting minutes").
  - Prompt Engineering: The prompt sent to GPT-4o is dynamically constructed based on the requested `summary_style`. It instructs GPT-4o to act as an expert summarizer and tailor the output (focusing on strategic points, action items, technical details, etc.) according to the desired style.
  - Uses `gpt-4o` for its advanced understanding and summarization skills.
- Demonstrating Different Summary Types (Main Execution):
  - The script first gets the full transcription.
  - It then defines a list of different `summary_styles_to_generate`.
  - It calls the `summarize_text` function multiple times, passing the same full transcription each time but varying the `summary_style` argument.
  - By printing each resulting summary, the script clearly shows how GPT-4o adapts the level of detail and focus based on the prompt, generating distinct outputs (e.g., a brief overview vs. detailed minutes) from the identical source text.
- Use Case Relevance: This directly addresses the "Summary Generation" capability. It shows how combining Whisper and GPT-4o can transform lengthy spoken discussions into various useful formats (executive briefings, meeting minutes, quick overviews), saving time and improving knowledge management in business, education, and content creation.
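As noted in the breakdown above, recordings larger than the 25MB request limit must be split before transcription. The sketch below shows one way to do that; it assumes the pydub library (which requires ffmpeg) is installed, reuses the client and discussion_audio.mp3 names from the example, and treats the 10-minute chunk length as illustrative.
from pydub import AudioSegment  # assumes pydub + ffmpeg are installed

def transcribe_long_audio(client, file_path, chunk_minutes=10):
    """Split a long recording into chunks, transcribe each, and join the text."""
    audio = AudioSegment.from_file(file_path)
    chunk_ms = chunk_minutes * 60 * 1000
    transcripts = []

    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk = audio[start:start + chunk_ms]
        chunk_path = f"chunk_{i}.mp3"
        chunk.export(chunk_path, format="mp3")  # keeps each chunk well under 25MB at typical bitrates

        with open(chunk_path, "rb") as f:
            text = client.audio.transcriptions.create(
                model="whisper-1",
                file=f,
                response_format="text"
            )
        transcripts.append(text)

    return " ".join(transcripts)

# full_text = transcribe_long_audio(client, "discussion_audio.mp3")
# The concatenated transcript can then be passed to summarize_text() as usual.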
Key Point Extraction
Identifies and highlights crucial information by leveraging GPT-4o's advanced natural language processing capabilities. Through sophisticated algorithms and contextual understanding, the model analyzes spoken content to extract meaningful insights. The model can:
- Extract core concepts and main arguments from spoken content - This involves identifying the fundamental ideas, key messages, and supporting evidence presented in conversations, presentations, or discussions. The model distinguishes between primary and secondary points, ensuring that essential information is captured.
- Identify critical decision points and action items - By analyzing conversation flow and context, GPT-4o recognizes moments when decisions are made, commitments are established, or tasks are assigned. This includes detecting both explicit assignments ("John will handle this") and implicit ones ("We should look into this further").
- Prioritize information based on context and relevance - The model evaluates the significance of different pieces of information within their specific context, considering factors such as urgency, impact, and relationship to overall objectives. This helps in creating hierarchical summaries that emphasize what matters most.
- Track key themes and recurring topics across conversations - GPT-4o maintains awareness of discussion patterns, identifying when certain subjects resurface and how they evolve over time. This capability is particularly valuable for long-term project monitoring or tracking ongoing concerns across multiple meetings.
Example:
This example focuses on using GPT-4o to extract specific, crucial information—key points, decisions, action items—from transcribed speech, going beyond a general summary.
This again uses the two-step approach: Whisper transcribes the audio, and then GPT-4o analyzes the text based on a prompt designed for extraction.
Download the audio sample: https://files.cuantum.tech/audio/meeting_for_extraction.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Record when this example run starts (uses the datetime import above)
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"Running GPT-4o key point extraction from speech example at: {current_timestamp}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'meeting_for_extraction.mp3' with the actual filename.
audio_file_path = "meeting_for_extraction.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Add note about chunking for long files if size check implemented
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before extraction.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Extract Key Points, Decisions, Actions using GPT-4o ---
def extract_key_points(client, text_to_analyze):
"""Sends transcribed text to GPT-4o for key point extraction."""
print("\nStep 2: Extracting key points, decisions, and actions...")
if not text_to_analyze:
print("Error: No text provided for extraction.")
return None
# Prompt designed specifically for extraction
system_prompt = "You are an expert meeting analyst. Your task is to carefully read the provided transcript and extract specific types of information."
user_prompt = f"""Analyze the following meeting or discussion transcription. Identify and extract the following information, presenting each under a clear heading:
1. **Key Points / Core Concepts:** List the main topics, arguments, or fundamental ideas discussed.
2. **Decisions Made:** List any clear decisions that were reached during the discussion.
3. **Action Items:** List specific tasks assigned to individuals or the group. If possible, note who is responsible and any mentioned deadlines.
If any category has no relevant items, state "None identified".
Transcription Text:
---
{text_to_analyze}
---
Extracted Information:
"""
try:
print("Sending text to GPT-4o for extraction...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong analytical capabilities
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=600, # Adjust based on expected length of extracted info
temperature=0.2 # Lower temperature for more factual extraction
)
extracted_info = response.choices[0].message.content
print("Extraction successful.")
return extracted_info.strip()
except OpenAIError as e:
print(f"OpenAI API Error during extraction: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during extraction: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("------------------------------------")
# Step 2: Extract Key Information
extracted_details = extract_key_points(
client,
full_transcription
)
if extracted_details:
print("\n--- Extracted Key Information ---")
print(extracted_details)
print("---------------------------------")
print("\nThis demonstrates GPT-4o identifying and structuring key takeaways from the discussion.")
else:
print("\nFailed to extract key information.")
else:
print("\nTranscription failed, cannot proceed to key point extraction.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability for Key Point Extraction from spoken content. After transcribing audio using Whisper, GPT-4o analyzes the text to identify and isolate crucial information like core concepts, decisions made, and action items assigned.
- Two-Step Process: Like summarization, this relies on:
  - Step 1 (Whisper): Transcribing the audio (`client.audio.transcriptions.create`) to get the full text. The critical note about handling audio files larger than 25MB via chunking and concatenation still applies.
  - Step 2 (GPT-4o): Analyzing the complete transcription using `client.chat.completions.create` with a prompt specifically designed for extraction.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file from a meeting or discussion where key information is likely present (`meeting_for_extraction.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1, returning the plain text transcription.
- Extraction Function (`extract_key_points`):
  - Handles Step 2, taking the full transcription text.
  - Prompt Engineering for Extraction: This is key. The prompt explicitly instructs GPT-4o to act as an analyst and extract information under specific headings: "Key Points / Core Concepts," "Decisions Made," and "Action Items." This structured request guides GPT-4o to identify and categorize the relevant information accurately. A lower `temperature` (e.g., 0.2) is suggested to encourage more factual, less creative output suitable for extraction.
  - Uses `gpt-4o` for its advanced analytical skills.
- Output: The function returns a text string containing the extracted information, ideally structured under the requested headings.
- Main Execution: The script transcribes the audio, then passes the text to the extraction function, and finally prints the structured output.
- Use Case Relevance: This directly addresses the "Key Point Extraction" capability. It shows how AI can automatically process lengthy discussions to pull out the most important concepts, track decisions, and list actionable tasks, saving significant time in reviewing recordings or generating meeting follow-ups. It highlights GPT-4o's ability to understand conversational flow and identify significant moments (decisions, assignments) within the text.
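A useful variation on the extraction prompt above is to request machine-readable output instead of headed text, so the results can be fed directly into other tools. The sketch below is one way to do that with the same client and transcription; the JSON field names are illustrative choices, not part of the API.
import json

def extract_key_points_json(client, text_to_analyze):
    """Ask GPT-4o to return the extracted items as a JSON object."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # request valid JSON output
        messages=[
            {"role": "system",
             "content": "You are an expert meeting analyst. Respond only with JSON."},
            {"role": "user",
             "content": (
                 "From the transcript below, return a JSON object with the keys "
                 "'key_points', 'decisions', and 'action_items' (each a list of strings).\n\n"
                 f"Transcript:\n{text_to_analyze}"
             )},
        ],
        temperature=0.2,
    )
    return json.loads(response.choices[0].message.content)

# extracted = extract_key_points_json(client, full_transcription)
# print(extracted["action_items"])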
Emotional Intelligence
Detects tone, sentiment, and emotional undertones in spoken communication through GPT-4o's advanced natural language processing capabilities. This sophisticated system performs deep analysis of speech patterns and contextual elements to understand the emotional layers of communication. The model can identify subtle emotional cues such as:
- Voice inflections and patterns that indicate excitement, hesitation, or concern - Including pitch variations, speech rhythm changes, and vocal stress patterns that humans naturally use to convey emotions
- Changes in speaking tempo and volume that suggest emotional states - For example, rapid speech might indicate excitement or anxiety, while slower speech could suggest thoughtfulness or uncertainty
- Contextual emotional markers like laughter, sighs, or pauses - The model recognizes non-verbal sounds and silence that carry significant emotional meaning in conversation
- Cultural and situational nuances that affect emotional expression - Understanding how different cultures express emotions differently and how context influences emotional interpretation
This emotional awareness enables GPT-4o to provide more nuanced and context-appropriate responses, making it particularly valuable for applications in customer service (where understanding customer frustration or satisfaction is crucial), therapeutic conversations (where emotional support and understanding are paramount), and personal coaching (where motivation and emotional growth are key objectives). The system's ability to detect these subtle emotional signals allows for more empathetic and effective communication across various professional and personal contexts.
Example:
This example explores using GPT-4o for "Emotional Intelligence" – detecting tone, sentiment, and emotional undertones in speech.
It's important to understand how this works with the current standard OpenAI APIs. GPT-4o excels at inferring emotion from text, but neither the standard Whisper transcription endpoint nor the Chat Completions endpoint directly analyzes acoustic features such as pitch, tone variance, tempo, sighs, or laughter; they work from the words that get transcribed.
Therefore, the most practical way to demonstrate this concept using these APIs is a two-step process:
- Transcribe Speech to Text: Use Whisper to get the words spoken.
- Analyze Text for Emotion: Use GPT-4o to analyze the transcribed text for indicators of emotion, sentiment, or tone based on word choice, phrasing, and context described in the text.
Download the sample audio: https://files.cuantum.tech/audio/emotional_speech.mp3
This code example implements this two-step, text-based analysis approach.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Record when this example run starts (uses the datetime import above)
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"Running GPT-4o speech emotion analysis (text-based) example at: {current_timestamp}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with potentially emotional speech
# IMPORTANT: Replace 'emotional_speech.mp3' with the actual filename.
audio_file_path = "emotional_speech.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Transcribed Text for Emotion/Sentiment using GPT-4o ---
def analyze_text_emotion(client, text_to_analyze):
"""
Sends transcribed text to GPT-4o for emotion and sentiment analysis.
Note: This analyzes the text content, not acoustic features of the original audio.
"""
print("\nStep 2: Analyzing transcribed text for emotion/sentiment...")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Prompt designed for text-based emotion/sentiment analysis
system_prompt = "You are an expert in communication analysis, skilled at detecting sentiment, tone, and potential underlying emotions from text."
user_prompt = f"""Analyze the following text for emotional indicators:
Text:
---
{text_to_analyze}
---
Based *only* on the words, phrasing, and punctuation in the text provided:
1. What is the overall sentiment (e.g., Positive, Negative, Neutral, Mixed)?
2. What is the likely emotional tone (e.g., Frustrated, Excited, Calm, Anxious, Sarcastic, Happy, Sad)?
3. Are there specific words or phrases that indicate these emotions? Explain briefly.
Provide the analysis:
"""
try:
print("Sending text to GPT-4o for emotion analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for nuanced understanding
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=300, # Adjust as needed
temperature=0.4 # Slightly lower temp for more grounded analysis
)
analysis = response.choices[0].message.content
print("Emotion analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio
transcribed_text = transcribe_speech(client, audio_file_path)
if transcribed_text:
print(f"\n--- Transcription Result ---")
print(transcribed_text)
print("----------------------------")
# Step 2: Analyze the transcription for emotion/sentiment
emotion_analysis = analyze_text_emotion(
client,
transcribed_text
)
if emotion_analysis:
print("\n--- Emotion/Sentiment Analysis (from Text) ---")
print(emotion_analysis)
print("----------------------------------------------")
print("\nNote: This analysis is based on the transcribed text content. It does not directly analyze acoustic features like tone of voice from the original audio.")
else:
print("\nEmotion analysis failed.")
else:
print("\nTranscription failed, cannot proceed to emotion analysis.")
# --- End of Code Example ---
Code breakdown:
- Context: This code demonstrates how GPT-4o can be used to infer emotional tone and sentiment from spoken language. It utilizes a two-step process common for this type of analysis with current APIs.
- Two-Step Process & Limitation:
  - Step 1 (Whisper): The audio is first transcribed into text using `client.audio.transcriptions.create`.
  - Step 2 (GPT-4o): The resulting text is then analyzed by GPT-4o (`client.chat.completions.create`) using a prompt specifically designed to identify sentiment and emotional indicators within the text.
  - Important Limitation: This method analyzes the linguistic content (words, phrasing) provided by Whisper. It does not directly analyze acoustic features of the original audio like pitch, tempo, or specific non-verbal sounds (sighs, laughter) unless those happen to be transcribed by Whisper (which is often not the case for subtle cues). True acoustic emotion detection would require different tools or APIs; a brief acoustic-feature sketch follows this breakdown.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file where the speaker's words might suggest an emotion (`emotional_speech.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1, returning plain text.
- Emotion Analysis Function (`analyze_text_emotion`):
  - Handles Step 2, taking the transcribed text.
  - Prompt Design: The prompt explicitly asks GPT-4o to analyze the provided text for overall sentiment (Positive/Negative/Neutral), likely emotional tone (Frustrated, Excited, etc.), and supporting textual evidence. It clarifies that the analysis should be based only on the text.
  - Uses `gpt-4o` for its sophisticated language understanding.
- Output: The function returns GPT-4o's textual analysis of the inferred emotion and sentiment.
- Main Execution: The script transcribes the audio, passes the text for analysis, prints both results, and reiterates the limitation regarding acoustic features.
- Use Case Relevance: While not analyzing acoustics directly, this text-based approach is still valuable for applications like customer service (detecting frustration/satisfaction from word choice), analyzing feedback, or getting a general sense of sentiment from spoken interactions, complementing other forms of analysis. It showcases GPT-4o's ability to interpret emotional language.
Remember to use an audio file where the spoken words convey some emotion for this example to be effective. Replace `'emotional_speech.mp3'` with your file path.
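Because the text-based approach cannot see acoustic cues, one complementary option is to compute coarse audio features locally and pass them to GPT-4o as extra context alongside the transcription. The sketch below assumes the librosa library is installed and reuses the emotional_speech.mp3 file name; the chosen features are illustrative, not a full emotion-recognition system.
import librosa
import numpy as np

def basic_acoustic_features(file_path):
    """Compute a few coarse acoustic features that hint at vocal emotion."""
    y, sr = librosa.load(file_path, sr=None)

    # Fundamental frequency (pitch) estimate over time
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7")
    )

    # Overall loudness (RMS energy) and a rough tempo estimate
    rms = librosa.feature.rms(y=y)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)

    return {
        "mean_pitch_hz": float(np.nanmean(f0)),
        "pitch_variability_hz": float(np.nanstd(f0)),
        "mean_rms_energy": float(np.mean(rms)),
        "estimated_tempo_bpm": float(tempo),
    }

# features = basic_acoustic_features("emotional_speech.mp3")
# These numbers could be appended to the GPT-4o prompt alongside the transcription,
# e.g. "Acoustic context: mean pitch 210 Hz, high pitch variability, fast tempo."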
Implicit Understanding
GPT-4o demonstrates remarkable capabilities in understanding the deeper layers of human communication, going far beyond simple word recognition to grasp the intricate nuances of speech. The model's sophisticated comprehension abilities include:
- Detect underlying context and assumptions
- Understands implicit knowledge shared between speakers
- Recognizes unstated but commonly accepted facts within specific domains
- Identifies hidden premises in conversations
- Understand cultural references and idiomatic expressions
- Processes region-specific sayings and colloquialisms
- Recognizes cultural-specific metaphors and analogies
- Adapts understanding based on cultural context
- Interpret rhetorical devices
- Recognizes rhetorical questions and identifies the point the speaker is actually making, rather than treating them as literal requests for information
Example:
Similar to the previous examples involving deeper understanding (Semantic, Contextual, Emotional), this typically uses the two-step approach: Whisper transcribes the words, and then GPT-4o analyzes the resulting text, this time specifically prompted to look for implicit layers.
Download the sample audio: https://files.cuantum.tech/audio/implicit_speech.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Record when this example run starts (uses the datetime import above)
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"Running GPT-4o implicit speech understanding example at: {current_timestamp}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with implicit meaning
# IMPORTANT: Replace 'implicit_speech.mp3' with the actual filename.
audio_file_path = "implicit_speech.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Transcribed Text for Implicit Meaning using GPT-4o ---
def analyze_implicit_meaning(client, text_to_analyze):
"""
Sends transcribed text to GPT-4o to analyze implicit meanings,
assumptions, references, or rhetorical devices.
"""
print("\nStep 2: Analyzing transcribed text for implicit meaning...")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Prompt designed for identifying implicit communication layers
system_prompt = "You are an expert analyst of human communication, skilled at identifying meaning that is implied but not explicitly stated."
user_prompt = f"""Analyze the following statement or question:
Statement/Question:
---
{text_to_analyze}
---
Based on common knowledge, cultural context, and conversational patterns, please explain:
1. Any underlying assumptions the speaker might be making.
2. Any implicit meanings or suggestions conveyed beyond the literal words.
3. Any cultural references, idioms, or sayings being used or alluded to.
4. If it's a rhetorical question, what point is likely being made?
Provide a breakdown of the implicit layers of communication present:
"""
try:
print("Sending text to GPT-4o for implicit meaning analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for deep understanding
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=400, # Adjust as needed
temperature=0.5 # Allow for some interpretation
)
analysis = response.choices[0].message.content
print("Implicit meaning analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio
transcribed_text = transcribe_speech(client, audio_file_path)
if transcribed_text:
print(f"\n--- Transcription Result ---")
print(transcribed_text)
print("----------------------------")
# Step 2: Analyze the transcription for implicit meaning
implicit_analysis = analyze_implicit_meaning(
client,
transcribed_text
)
if implicit_analysis:
print("\n--- Implicit Meaning Analysis ---")
print(implicit_analysis)
print("-------------------------------")
print("\nThis demonstrates GPT-4o identifying meaning beyond the literal text, based on common knowledge and context.")
else:
print("\nImplicit meaning analysis failed.")
else:
print("\nTranscription failed, cannot proceed to implicit meaning analysis.")
Code breakdown:
- Context: This code example demonstrates GPT-4o's capability for Implicit Understanding – grasping the unstated assumptions, references, and meanings embedded within spoken language.
- Two-Step Process: It follows the established pattern:
  - Step 1 (Whisper): Transcribe the audio containing the implicitly meaningful speech into text using `client.audio.transcriptions.create`.
  - Step 2 (GPT-4o): Analyze the transcribed text using `client.chat.completions.create`, with a prompt specifically designed to uncover hidden layers of meaning.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file where the meaning relies on shared knowledge, cultural context, or isn't fully literal (e.g., using an idiom, a rhetorical question, or making an assumption clear only through context). `implicit_speech.mp3` is used as the placeholder.
- Transcription Function (`transcribe_speech`): Handles Step 1, returning the plain text transcription.
- Implicit Analysis Function (`analyze_implicit_meaning`):
  - Handles Step 2, taking the transcribed text.
  - Prompt Engineering for Implicit Meaning: The prompt is key here. It instructs GPT-4o to look beyond the literal words and identify underlying assumptions, implied suggestions, cultural references/idioms, and the purpose behind rhetorical questions.
  - Uses `gpt-4o` for its extensive knowledge base and the reasoning ability needed to infer these implicit elements.
- Output: The function returns GPT-4o's textual analysis of the unstated meanings detected in the input text.
- Main Execution: The script transcribes the audio, passes the text for implicit analysis, and prints both the literal transcription and GPT-4o's interpretation of the hidden meanings.
- Use Case Relevance: This demonstrates how GPT-4o can process communication more like a human, understanding not just what was said, but also what was meant or assumed. This is crucial for applications requiring deep comprehension, such as analyzing user feedback, understanding nuanced dialogue in meetings, or interpreting culturally rich content.
For effective testing, use an audio file containing speech that requires some inference or background knowledge to fully understand. Replace 'implicit_speech.mp3' with your file path.
From Transcription to Comprehensive Understanding
This advance marks a revolutionary transformation in AI's ability to process human speech. While traditional systems like Whisper excel at transcription - the mechanical process of converting spoken words into written text - modern AI systems like GPT-4o achieve true comprehension, understanding not just the words themselves but their deeper meaning, context, and implications. This leap forward enables AI to process human communication in ways that are remarkably similar to how humans naturally understand conversation, including subtle nuances, implied meanings, and contextual relevance.
To illustrate this transformative evolution in capability, let's examine a detailed example that highlights the stark contrast between simple transcription and advanced comprehension:
- Consider this statement: "I think we should delay the product launch until next quarter." A traditional transcription system like Whisper would perfectly capture these words, but that's where its understanding ends - it simply converts speech to text with high accuracy.
- GPT-4o, however, demonstrates a sophisticated level of understanding that mirrors human comprehension:
- Primary Message Analysis: Beyond just identifying the suggestion to reschedule, it understands this as a strategic proposal that requires careful consideration
- Business Impact Evaluation: Comprehensively assesses how this delay would affect various aspects of the business, from resource allocation to team scheduling to budget implications
- Strategic Market Analysis: Examines the broader market context, including competitor movements, market trends, and potential windows of opportunity
- Comprehensive Risk Assessment: Evaluates both immediate and long-term consequences, considering everything from technical readiness to market positioning
What makes GPT-4o truly remarkable is its ability to engage in nuanced analytical discussions about the content, addressing complex strategic questions that require deep understanding (a short sketch of such a follow-up exchange appears after this list):
- External Factors: What specific market conditions, competitive pressures, or industry trends might have motivated this delay suggestion?
- Stakeholder Impact: How would this timeline adjustment affect relationships with investors, partners, and customers? What communication strategies might be needed?
- Strategic Opportunities: What potential advantages could emerge from this delay, such as additional feature development or market timing optimization?
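To make this kind of follow-up analysis concrete, here is a minimal sketch of how a transcribed statement could be examined in a multi-turn exchange, with each strategic question building on the answers that came before. The transcript string, system prompt, and follow-up questions are illustrative placeholders, not part of any earlier example.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Illustrative transcribed statement; in practice this would come from Whisper.
transcript = "I think we should delay the product launch until next quarter."

messages = [
    {"role": "system", "content": "You are a strategy analyst. Base your answers on the transcribed statement and the ongoing discussion."},
    {"role": "user", "content": f"Transcribed statement:\n{transcript}\n\nSummarize the primary message and its likely business implications."},
]

follow_up_questions = [
    "What market conditions might have motivated this delay suggestion?",
    "How could the delay affect investors, partners, and customers?",
    "What advantages could the team gain from the extra time?",
]

# Initial analysis of the statement.
response = client.chat.completions.create(model="gpt-4o", messages=messages, max_tokens=300)
messages.append({"role": "assistant", "content": response.choices[0].message.content})
print(response.choices[0].message.content)

# Each follow-up reuses the accumulated history, so answers can refer back
# to the original statement and to earlier analysis.
for question in follow_up_questions:
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-4o", messages=messages, max_tokens=300)
    messages.append({"role": "assistant", "content": response.choices[0].message.content})
    print(f"\nQ: {question}\nA: {response.choices[0].message.content}")
Because the full history is resent on every call, a production version would cap or summarize older turns to control token usage.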
2.3.2 What Can GPT-4o Do with Speech Input?
GPT-4o represents a significant advancement in audio processing technology, offering a comprehensive suite of capabilities that transform how we interact with and understand spoken content. As a cutting-edge language model with multimodal processing abilities, it combines sophisticated speech recognition with deep contextual understanding to deliver powerful audio analysis features. Let's explore some of GPT-4o's other functions and capabilities:
Action Item Extraction
Prompt example: "List all the tasks mentioned in this voice note."
GPT-4o excels at identifying and extracting action items from spoken content through sophisticated natural language processing. The model can:
- Parse complex conversations to detect both explicit ("Please do X") and implicit ("We should consider Y") tasks
- Distinguish between hypothetical discussions and actual commitments
- Categorize tasks by priority, deadline, and assignee
- Identify dependencies between different action items
- Flag follow-up requirements and recurring tasks
This capability transforms unstructured audio discussions into structured, actionable task lists, significantly improving meeting productivity and follow-through. By automatically maintaining a comprehensive record of commitments, it ensures accountability while reducing the cognitive load on participants who would otherwise need to manually track these items. The system can also integrate with popular task management tools, making it seamless to convert spoken assignments into trackable tickets or to-dos.
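Before the full walkthrough below, here is a minimal sketch of how the extracted tasks could be requested as machine-readable JSON, which is the form a task-management integration would typically expect. The json_object response format is a documented chat-completions option; the transcript snippet and the field names task, assignee, and deadline are illustrative assumptions, not a fixed schema.
import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Illustrative transcript snippet; in practice this would come from Whisper.
transcript = (
    "Sarah will send the revised budget to finance by Friday. "
    "We should probably also look into a new vendor at some point."
)

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},  # ask for a JSON object back
    messages=[
        {"role": "system", "content": "You extract action items from meeting transcripts."},
        {"role": "user", "content": (
            "Return a JSON object with a key 'action_items' containing a list of objects "
            "with 'task', 'assignee', and 'deadline' fields (use null when unknown). "
            "Only include firm commitments.\n\n" + transcript
        )},
    ],
    temperature=0.1,
)

data = json.loads(response.choices[0].message.content)
for item in data.get("action_items", []):
    # Each dict could now be posted to a project-management tool's API.
    print(item)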
Example:
This script uses the familiar two-step process: first transcribing the audio with Whisper, then analyzing the text with GPT-4o using a prompt specifically designed to identify and structure action items.
Download the audio sample: https://files.cuantum.tech/audio/meeting_tasks.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location used for logging context
current_timestamp = "2025-03-24 10:29:00 CDT"
current_location = "Plano, Texas, United States"
print(f"Running GPT-4o action item extraction from speech example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'meeting_tasks.mp3' with the actual filename.
audio_file_path = "meeting_tasks.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
    # Warn if the file exceeds the 25MB per-request limit (chunking would be required)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before extraction.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Extract Action Items from Text using GPT-4o ---
def extract_action_items(client, text_to_analyze):
"""Sends transcribed text to GPT-4o for action item extraction."""
print("\nStep 2: Extracting action items...")
if not text_to_analyze:
print("Error: No text provided for extraction.")
return None
# Prompt designed specifically for extracting structured action items
system_prompt = "You are an expert meeting analyst focused on identifying actionable tasks."
user_prompt = f"""Analyze the following meeting or discussion transcription. Identify and extract all specific action items mentioned.
For each action item, provide:
- A clear description of the task.
- The person assigned (if mentioned, otherwise state 'Unassigned' or 'Group').
- Any deadline mentioned (if mentioned, otherwise state 'No deadline mentioned').
Distinguish between definite commitments/tasks and mere suggestions or hypothetical possibilities. Only list items that sound like actual tasks or commitments.
Format the output as a numbered list.
Transcription Text:
---
{text_to_analyze}
---
Extracted Action Items:
"""
try:
print("Sending text to GPT-4o for action item extraction...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong analytical capabilities
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=500, # Adjust based on expected number of action items
temperature=0.1 # Very low temperature for factual extraction
)
extracted_actions = response.choices[0].message.content
print("Action item extraction successful.")
return extracted_actions.strip()
except OpenAIError as e:
print(f"OpenAI API Error during extraction: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during extraction: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("------------------------------------")
# Step 2: Extract Action Items
action_items_list = extract_action_items(
client,
full_transcription
)
if action_items_list:
print("\n--- Extracted Action Items ---")
print(action_items_list)
print("------------------------------")
print("\nThis demonstrates GPT-4o identifying and structuring actionable tasks from the discussion.")
else:
print("\nFailed to extract action items.")
else:
print("\nTranscription failed, cannot proceed to action item extraction.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability for Action Item Extraction from spoken content. After transcribing audio with Whisper, GPT-4o analyzes the text to identify specific tasks, assignments, and deadlines discussed.
- Two-Step Process: It uses the standard workflow:
  - Step 1 (Whisper): Transcribe the meeting/discussion audio (client.audio.transcriptions.create) into text. The note about handling audio files > 25MB via chunking/concatenation remains critical for real-world use (a brief chunking sketch follows this breakdown).
  - Step 2 (GPT-4o): Analyze the complete transcription using client.chat.completions.create with a prompt tailored for task extraction.
- Prerequisites: Standard setup (openai, python-dotenv, API key) and an audio file from a meeting where tasks were assigned (meeting_tasks.mp3).
- Transcription Function (transcribe_speech): Handles Step 1.
- Action Item Extraction Function (extract_action_items):
  - Handles Step 2, taking the full transcription text.
  - Prompt Engineering for Tasks: This is the core. The prompt explicitly instructs GPT-4o to identify action items, distinguish them from mere suggestions, and extract the task description, assigned person (if mentioned), and deadline (if mentioned). It requests a structured, numbered list format. A very low temperature (e.g., 0.1) is recommended to keep the output focused on factual extraction.
  - Uses gpt-4o for its ability to understand conversational context and identify commitments.
- Output: The function returns a text string containing the structured list of extracted action items.
- Main Execution: The script transcribes the audio, passes the text to the extraction function, and prints the resulting list of tasks.
- Use Case Relevance: This directly addresses the "Action Item Extraction" capability. It shows how AI can automatically convert unstructured verbal discussions into organized, actionable task lists. This significantly boosts productivity by ensuring follow-through, clarifying responsibilities, and reducing the manual effort of tracking commitments made during meetings. It highlights GPT-4o's ability to parse complex conversations and identify both explicit and implicit task assignments.
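As the breakdown notes, recordings over the 25MB request limit must be split before transcription. The sketch below shows one possible approach using the pydub library (which requires ffmpeg); the ten-minute chunk length, the temporary chunk filenames, and the long_meeting.mp3 path are assumptions, so verify that each exported chunk stays under the limit for your audio format and bitrate.
import os
from openai import OpenAI
from pydub import AudioSegment  # third-party library; needs ffmpeg installed

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def transcribe_long_audio(file_path, chunk_minutes=10):
    """Split a long recording into chunks, transcribe each, and join the text."""
    audio = AudioSegment.from_file(file_path)
    chunk_ms = chunk_minutes * 60 * 1000
    texts = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk_path = f"chunk_{i}.mp3"
        audio[start:start + chunk_ms].export(chunk_path, format="mp3")
        with open(chunk_path, "rb") as chunk_file:
            text = client.audio.transcriptions.create(
                model="whisper-1", file=chunk_file, response_format="text"
            )
        texts.append(text)
        os.remove(chunk_path)  # remove the temporary chunk once transcribed
    return " ".join(texts)

# full_text = transcribe_long_audio("long_meeting.mp3")
# action_items = extract_action_items(client, full_text)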
Q&A about the Audio
Prompt Example: "What did the speaker say about the budget?"
GPT-4o's advanced query capabilities allow for natural conversations about audio content, enabling users to ask specific questions and receive contextually relevant answers. The model can:
- Extract precise information from specific segments
- Understand context and references across the entire audio
- Handle follow-up questions about previously discussed topics
- Provide time-stamped references to relevant portions
- Cross-reference information from multiple parts of the recording
This functionality transforms how we interact with audio content, making it as searchable and queryable as text documents. Instead of manually scrubbing through recordings, users can simply ask questions in natural language and receive accurate, concise responses. The system is particularly valuable for:
- Meeting participants who need to verify specific details
- Researchers analyzing interview recordings
- Students reviewing lecture content
- Professionals fact-checking client conversations
- Teams seeking to understand historical discussions
Example:
This script first transcribes an audio file using Whisper and then uses GPT-4o to answer a specific question asked by the user about the content of that transcription.
Download the audio sample: https://files.cuantum.tech/audio/meeting_for_qa.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location used for logging context
current_timestamp = "2025-01-11 11:47:00 CDT"
current_location = "Orlando, Florida, United States"
print(f"Running GPT-4o Q&A about audio example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'meeting_for_qa.mp3' with the actual filename.
audio_file_path = "meeting_for_qa.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
    # Warn if the file exceeds the 25MB per-request limit (chunking would be required)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before Q&A.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Answer Question Based on Text using GPT-4o ---
def answer_question_about_text(client, full_text, question):
"""Sends transcribed text and a question to GPT-4o to get an answer."""
print(f"\nStep 2: Answering question about the transcription...")
print(f"Question: \"{question}\"")
if not full_text:
print("Error: No transcription text provided to answer questions about.")
return None
if not question:
print("Error: No question provided.")
return None
# Prompt designed specifically for answering questions based on provided text
system_prompt = "You are an AI assistant specialized in answering questions based *only* on the provided text transcription. Do not use outside knowledge."
user_prompt = f"""Based *solely* on the following transcription text, please answer the question below. If the answer is not found in the text, state that clearly.
Transcription Text:
---
{full_text}
---
Question: {question}
Answer:
"""
try:
print("Sending transcription and question to GPT-4o...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong comprehension and answering
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=300, # Adjust based on expected answer length
temperature=0.1 # Low temperature for factual answers based on text
)
answer = response.choices[0].message.content
print("Answer generation successful.")
return answer.strip()
except OpenAIError as e:
print(f"OpenAI API Error during Q&A: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during Q&A: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
transcription = transcribe_speech(client, audio_file_path)
if transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(transcription[:1000] + "..." if len(transcription) > 1000 else transcription)
print("------------------------------------")
# --- Ask Questions about the Transcription ---
# Define the question(s) you want to ask
user_question = "What was decided about the email marketing CTA button?"
# user_question = "Who is responsible for the A/B test on Platform B?"
# user_question = "What was the engagement increase on Platform A?"
print(f"\n--- Answering Question ---")
# Step 2: Get the answer from GPT-4o
answer = answer_question_about_text(
client,
transcription,
user_question
)
if answer:
print(f"\nAnswer to '{user_question}':")
print(answer)
print("------------------------------")
print("\nThis demonstrates GPT-4o answering specific questions based on the transcribed audio content.")
else:
print(f"\nFailed to get an answer for the question: '{user_question}'")
else:
print("\nTranscription failed, cannot proceed to Q&A.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability to function as a Q&A system for audio content. After transcribing speech with Whisper, users can ask specific questions in natural language, and GPT-4o will provide answers based on the information contained within the transcription.
- Two-Step Process: The workflow involves:
  - Step 1 (Whisper): Transcribe the relevant audio file (or concatenated text from chunks of a longer file) using client.audio.transcriptions.create.
  - Step 2 (GPT-4o): Send the complete transcription along with the user's specific question to client.chat.completions.create.
- Prerequisites: Standard setup (openai, python-dotenv, API key) and an audio file containing the discussion or information the user might ask questions about (meeting_for_qa.mp3). The critical note about handling audio > 25MB via chunking/concatenation before the Q&A step remains essential.
- Transcription Function (transcribe_speech): Handles Step 1.
- Q&A Function (answer_question_about_text):
  - Handles Step 2, taking both the full_text transcription and the question as input.
  - Prompt Engineering for Q&A: The prompt is crucial. It instructs GPT-4o to act as a specialized assistant that answers questions based only on the provided transcription text, explicitly telling it not to use external knowledge and to state if the answer isn't found in the text. This grounding is important for accuracy. A low temperature (e.g., 0.1) helps ensure factual answers derived directly from the source text.
  - Uses gpt-4o for its excellent reading comprehension and question-answering abilities.
- Output: The function returns GPT-4o's answer to the specific question asked.
- Main Execution: The script transcribes the audio, defines a sample user_question, passes the transcription and question to the Q&A function, and prints the resulting answer.
- Use Case Relevance: This directly addresses the "Q&A about the Audio" capability. It transforms audio recordings from passive archives into interactive knowledge sources. Users can quickly find specific details, verify facts, or understand parts of a discussion without manually searching through the audio, making it invaluable for reviewing meetings, lectures, interviews, or any recorded conversation.
Remember to use an audio file containing information relevant to potential questions for testing (you can use the sample audio provided). Modify the user_question variable to test different queries against the transcribed content. A sketch showing how to handle follow-up questions with conversation history appears below.
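The example above answers one question per run. To support the follow-up questions described earlier, one approach is to keep the transcription in the system message and accumulate the conversation history turn by turn, as in the sketch below. It assumes the text returned by transcribe_speech; the interactive input loop is purely illustrative.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def interactive_qa(transcription):
    """Answer a series of questions about one transcription, keeping history."""
    messages = [{
        "role": "system",
        "content": ("Answer questions using only the transcription below. "
                    "If the answer is not in the text, say so.\n\n" + transcription),
    }]
    while True:
        question = input("Ask about the audio (or press Enter to quit): ").strip()
        if not question:
            break
        messages.append({"role": "user", "content": question})
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, temperature=0.1, max_tokens=300
        )
        answer = response.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        print(answer)

# transcription = transcribe_speech(client, "meeting_for_qa.mp3")
# interactive_qa(transcription)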
Highlight Key Moments
Prompt example: "Identify the most important statements made in this audio."
GPT-4o excels at identifying and extracting crucial moments from audio content through its advanced natural language understanding capabilities. The model can:
- Identify key decisions and action items
- Extract important quotes and statements
- Highlight strategic discussions and conclusions
- Pinpoint critical transitions in conversations
This feature is particularly valuable for:
- Meeting participants who need to quickly review important takeaways
- Executives scanning long recordings for decision points
- Teams tracking project milestones discussed in calls
- Researchers identifying significant moments in interviews
The model provides timestamps and contextual summaries for each highlighted moment, making it easier to navigate directly to the most relevant parts of the recording without reviewing the entire audio file.
Example:
This script follows the established two-step pattern: transcribing the audio with Whisper and then analyzing the text with GPT-4o using a prompt designed to identify significant statements, decisions, or conclusions.
Download the sample audio: https://files.cuantum.tech/audio/key_discussion.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location used for logging context
current_timestamp = "2025-02-14 15:52:00 CDT"
current_location = "Tampa, Florida, United States"
print(f"Running GPT-4o key moment highlighting example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'key_discussion.mp3' with the actual filename.
audio_file_path = "key_discussion.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
    # Warn if the file exceeds the 25MB per-request limit (chunking would be required)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before highlighting.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Highlight Key Moments from Text using GPT-4o ---
def highlight_key_moments(client, text_to_analyze):
"""Sends transcribed text to GPT-4o to identify and extract key moments."""
print("\nStep 2: Identifying key moments from transcription...")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Prompt designed specifically for identifying key moments/statements
system_prompt = "You are an expert analyst skilled at identifying the most significant parts of a discussion or presentation."
user_prompt = f"""Analyze the following transcription text. Identify and extract the key moments, which could include:
- Important decisions made
- Critical conclusions reached
- Significant statements or impactful quotes
- Major topic shifts or transitions
- Key questions asked or answered
For each key moment identified, provide the relevant quote or a concise summary of the moment. Present the output as a list.
Transcription Text:
---
{text_to_analyze}
---
Key Moments:
"""
try:
print("Sending text to GPT-4o for key moment identification...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong comprehension
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=700, # Adjust based on expected number/length of key moments
temperature=0.3 # Lean towards factual identification
)
key_moments = response.choices[0].message.content
print("Key moment identification successful.")
return key_moments.strip()
except OpenAIError as e:
print(f"OpenAI API Error during highlighting: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during highlighting: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("------------------------------------")
# Step 2: Highlight Key Moments
highlights = highlight_key_moments(
client,
full_transcription
)
if highlights:
print("\n--- Identified Key Moments ---")
print(highlights)
print("----------------------------")
print("\nThis demonstrates GPT-4o extracting significant parts from the discussion.")
print("\nNote: Adding precise timestamps to these moments requires further processing using Whisper's 'verbose_json' output and correlating the text.")
else:
print("\nFailed to identify key moments.")
else:
print("\nTranscription failed, cannot proceed to highlight key moments.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability to Highlight Key Moments from spoken content. After transcription via Whisper, GPT-4o analyzes the text to pinpoint and extract the most significant parts, such as crucial decisions, important statements, or major topic shifts.
- Two-Step Process:
  - Step 1 (Whisper): Transcribe the audio (client.audio.transcriptions.create) to get the full text. The necessity of chunking/concatenating for audio files > 25MB is reiterated.
  - Step 2 (GPT-4o): Analyze the complete transcription using client.chat.completions.create with a prompt specifically asking for key moments.
- Prerequisites: Standard setup (openai, python-dotenv, API key) and an audio file containing a discussion or presentation where significant moments occur (key_discussion.mp3).
- Transcription Function (transcribe_speech): Handles Step 1.
- Highlighting Function (highlight_key_moments):
  - Handles Step 2, taking the full transcription text.
  - Prompt Engineering for Highlights: The prompt instructs GPT-4o to act as an analyst and identify various types of key moments (decisions, conclusions, impactful quotes, transitions). It asks for the relevant quote or a concise summary for each identified moment, formatted as a list.
  - Uses gpt-4o for its ability to discern importance and context within text.
- Output: The function returns a text string containing the list of identified key moments.
- Timestamp Note: The explanation and code output explicitly mention that while this process identifies the text of key moments, adding precise timestamps would require additional steps. This involves using Whisper's verbose_json output format (which includes segment timestamps) and then correlating the text identified by GPT-4o back to those specific timed segments, a more complex task not covered in the main example; a brief sketch of the approach appears after this breakdown.
- Main Execution: The script transcribes the audio, passes the text to the highlighting function, and prints the resulting list of key moments.
- Use Case Relevance: This addresses the "Highlight Key Moments" capability by showing how AI can quickly sift through potentially long recordings to surface the most critical parts. This is highly valuable for efficient review of meetings, interviews, or lectures, allowing users to focus on what matters most without listening to the entire audio.
For testing purposes, use an audio file that contains a relevant discussion with clear, identifiable key segments (you can use the sample audio file provided).
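The sketch below outlines the timestamp approach mentioned in the breakdown: request verbose_json from Whisper to get segment start and end times, then locate a GPT-4o-identified key moment with a simple substring match. The matching strategy is deliberately naive (paraphrased quotes or moments that span several segments will not be found), and the key_discussion.mp3 path and sample quote are assumptions.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def transcribe_with_timestamps(file_path):
    """Return (full_text, segments), where each segment carries start/end times."""
    with open(file_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",  # includes per-segment timing
        )
    segments = []
    for seg in result.segments:
        # Depending on SDK version, segments may be objects or plain dicts.
        get = (lambda k: seg[k]) if isinstance(seg, dict) else (lambda k: getattr(seg, k))
        segments.append({"start": get("start"), "end": get("end"), "text": get("text").strip()})
    return result.text, segments

def locate_moment(segments, key_quote):
    """Naive lookup: first segment containing the start of the key moment's text."""
    probe = key_quote.lower()[:40]
    for seg in segments:
        if probe and probe in seg["text"].lower():
            return f"{seg['start']:.1f}s - {seg['end']:.1f}s"
    return "timestamp not found"

# text, segments = transcribe_with_timestamps("key_discussion.mp3")
# print(locate_moment(segments, "we agreed to move the launch to the third quarter"))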
2.3.3 Real-World Use Cases
The modern business landscape increasingly relies on audio communication across various sectors, from sales and customer service to education and personal development. Understanding and effectively utilizing these audio interactions has become crucial for organizations seeking to improve their operations, enhance customer relationships, and drive better outcomes. This section explores several key applications where advanced audio processing and analysis can create significant value, demonstrating how AI-powered tools can transform raw audio data into actionable insights.
From analyzing sales conversations to enhancing educational experiences, these use cases showcase the versatility and power of audio understanding technologies in addressing real-world challenges. Each application represents a unique opportunity to leverage voice data for improved decision-making, process optimization, and better user experiences.
1. Sales Enablement
Advanced analysis of sales call recordings provides a comprehensive toolkit for sales teams to optimize their performance. The system can identify key objections raised by prospects, allowing teams to develop better counter-arguments and prepare responses in advance. It tracks successful closing techniques by analyzing patterns in successful deals, revealing which approaches work best for different customer segments and situations.
The system also measures crucial metrics like conversion rates, call duration, talk-to-listen ratios, and key phrase usage. This data helps sales teams understand which behaviors correlate with successful outcomes. By analyzing customer responses and reaction patterns, teams can refine their pitch timing, improve their questioning techniques, and better understand buying signals.
This technology also enables sales managers to document and share effective approaches across the team, creating a knowledge base of best practices for common challenges. This institutional knowledge can be particularly valuable for onboarding new team members and maintaining consistent sales excellence across the organization.
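A minimal sketch of such a sales-call analysis is shown below, assuming the call has already been transcribed (for example with the transcribe_speech helper used throughout this chapter). The prompt wording and the sales_call.mp3 filename are illustrative; a production system would typically add speaker diarization and aggregate metrics such as talk-to-listen ratio across many calls.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def analyze_sales_call(call_transcript):
    """Ask GPT-4o to surface objections, responses, and closing techniques."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a sales coach analyzing call transcripts."},
            {"role": "user", "content": (
                "From the transcript below, list: 1) objections raised by the prospect, "
                "2) how the rep responded, 3) any closing techniques used, and "
                "4) buying signals worth noting.\n\n" + call_transcript
            )},
        ],
        temperature=0.2,
        max_tokens=500,
    )
    return response.choices[0].message.content

# transcript = transcribe_speech(client, "sales_call.mp3")
# print(analyze_sales_call(transcript))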
2. Meeting Intelligence
Comprehensive meeting analysis transforms how organizations capture and utilize meeting content. The system goes beyond basic transcription by:
- Identifying and categorizing key discussion points for easy reference
- Automatically detecting and extracting action items from conversations
- Assigning responsibilities to specific team members based on verbal commitments
- Creating structured timelines and tracking deadlines mentioned during meetings
- Generating automated task lists with clear ownership and due dates
- Highlighting decision points and meeting outcomes
- Providing searchable meeting archives for future reference
The system employs advanced natural language processing to understand context, relationships, and commitments expressed during conversations. This enables automatic task creation and assignment, ensuring nothing falls through the cracks. Integration with project management tools allows for seamless workflow automation, while smart reminders help keep team members accountable for their commitments.
3. Customer Support
Deep analysis of customer service interactions provides comprehensive insights into customer experience and support team performance. The system can:
- Evaluate customer sentiment in real-time by analyzing tone, word choice, and conversation flow
- Automatically categorize and prioritize urgent issues based on keyword detection and context analysis
- Generate detailed satisfaction metrics through conversation analysis and customer feedback
- Track key performance indicators like first-response time and resolution time
- Identify common pain points and recurring issues across multiple interactions
- Monitor support agent performance and consistency in service delivery
This enables support teams to improve response times, identify trending problems, and maintain consistent service quality across all interactions. The system can also provide automated coaching suggestions for support agents and generate insights for product improvement based on customer feedback patterns.
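As a rough sketch, a support-call triage step might ask GPT-4o for a small JSON object describing sentiment, urgency, and issue category, which downstream systems can route on. The field names and category values below are assumptions rather than a standard schema, and true real-time analysis would require streaming transcription instead of the batch approach shown here.
import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def triage_support_call(call_transcript):
    """Return sentiment, urgency, and an issue category for a transcribed call."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You analyze customer support call transcripts."},
            {"role": "user", "content": (
                "Return a JSON object with 'sentiment' (positive/neutral/negative), "
                "'urgency' (low/medium/high), 'issue_category', and a one-sentence "
                "'summary' for this call transcript:\n\n" + call_transcript
            )},
        ],
        temperature=0.1,
    )
    return json.loads(response.choices[0].message.content)

# result = triage_support_call(transcript)
# if result.get("urgency") == "high":
#     pass  # e.g., escalate to a senior agent or open a priority ticket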
4. Personal Journaling
Transform voice memos into structured reflections with emotional context analysis. Using advanced natural language processing, the system analyzes voice recordings to detect emotional states, stress levels, and overall sentiment through tone of voice, word choice, and speaking patterns. This creates a rich, multi-dimensional journal entry that captures not just what was said, but how it was expressed.
The system's mood tracking capabilities go beyond simple positive/negative classifications, identifying nuanced emotional states like excitement, uncertainty, confidence, or concern. By analyzing these patterns over time, users can gain valuable insights into their emotional well-being and identify triggers or patterns that affect their mental state.
For personal goal tracking, the system can automatically categorize and tag mentions of objectives, progress updates, and setbacks. It can generate progress reports showing momentum toward specific goals, highlight common obstacles, and even suggest potential solutions based on past successful strategies. The behavioral trend analysis examines patterns in decision-making, habit formation, and personal growth, providing users with actionable insights for self-improvement.
5. Education & Language Practice
Comprehensive language learning support revolutionizes how students practice and improve their language skills. The system provides several key benefits:
- Speech Analysis: Advanced algorithms analyze pronunciation patterns, detecting subtle variations in phonemes, stress patterns, and intonation. This helps learners understand exactly where their pronunciation differs from native speakers.
- Error Detection: The system identifies not just pronunciation errors, but also grammatical mistakes, incorrect word usage, and syntactical issues in real-time. This immediate feedback helps prevent the formation of bad habits.
- Personalized Feedback: Instead of generic corrections, the system provides context-aware feedback that considers the learner's proficiency level, native language, and common interference patterns specific to their language background.
- Progress Tracking: Sophisticated metrics track various aspects of language development, including vocabulary range, speaking fluency, grammar accuracy, and pronunciation improvement over time. Visual progress reports help motivate learners and identify areas needing focus.
- Adaptive Learning: Based on performance analysis, the system creates customized exercise plans targeting specific weaknesses. These might include focused pronunciation drills, grammar exercises, or vocabulary building activities tailored to the learner's needs.
The system can track improvement over time and suggest targeted exercises for areas needing improvement, creating a dynamic and responsive learning environment that adapts to each student's progress.
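A simple text-based slice of this workflow can be sketched as follows: transcribe the learner's spoken response, then ask GPT-4o for grammar and word-choice feedback plus a short practice suggestion. Detailed pronunciation and intonation feedback would require audio-level analysis beyond a plain transcript, so this sketch covers only the text-based portion; the tutor persona, proficiency level, and learner_response.mp3 filename are assumptions.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def language_feedback(learner_transcript, target_language="English", level="B1"):
    """Give text-based feedback (grammar, vocabulary, phrasing) on learner speech."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are a patient {target_language} tutor for {level}-level learners."},
            {"role": "user", "content": (
                "The learner said the following (transcribed from speech). "
                "Point out grammar and word-choice issues, suggest a corrected version, "
                "and propose one short practice exercise:\n\n" + learner_transcript
            )},
        ],
        temperature=0.4,
        max_tokens=400,
    )
    return response.choices[0].message.content

# learner_text = transcribe_speech(client, "learner_response.mp3")
# print(language_feedback(learner_text))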
2.3.4 Privacy Considerations
Privacy is paramount when handling audio recordings. First and foremost, obtaining consent before analyzing third-party voice recordings is a crucial legal and ethical requirement. It's essential to secure written or documented permission from all participants before processing any voice recordings, whether they're from meetings, interviews, calls, or other audio content involving third parties. Organizations should implement a formal consent process that clearly outlines how the audio will be used and analyzed.
Security measures must be implemented throughout the processing workflow. After analysis is complete, it's critical to delete audio that is no longer needed: remove local copies, and if the audio was uploaded through the Files API, call client.files.delete(file_id) to remove it from OpenAI's servers. This practice minimizes data exposure and helps prevent unauthorized access and potential data breaches. Organizations should establish automated cleanup procedures to ensure consistent deletion of processed files (a brief cleanup sketch appears at the end of this section).
Long-term storage of voice data requires special consideration. Never store sensitive voice recordings without explicit approval from all parties involved. Organizations should implement strict data handling policies that clearly specify storage duration, security measures, and intended use. Extra caution should be taken with recordings containing personal information, business secrets, or confidential discussions. Best practices include implementing encryption for stored audio files and maintaining detailed access logs.
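A minimal cleanup sketch along these lines is shown below. It assumes the audio was processed from a local file and, optionally, that a copy was uploaded through the Files API; audio sent directly to the transcription endpoint is not normally stored as a retrievable File object, so the uploaded_file_id argument will often be None.
import os
from openai import OpenAI, OpenAIError

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def cleanup_after_analysis(local_audio_path, uploaded_file_id=None):
    """Remove the local recording and, if applicable, the server-side copy."""
    # Delete the local recording once the transcription/analysis has been stored.
    if os.path.exists(local_audio_path):
        os.remove(local_audio_path)
        print(f"Deleted local file: {local_audio_path}")

    # If the audio was uploaded via the Files API, delete that copy as well.
    if uploaded_file_id:
        try:
            client.files.delete(uploaded_file_id)
            print(f"Deleted uploaded file: {uploaded_file_id}")
        except OpenAIError as e:
            print(f"Could not delete uploaded file: {e}")

# cleanup_after_analysis("meeting_tasks.mp3")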
Semantic Understanding
Example:
Since the standard OpenAI API interaction for this typically involves first converting speech to text (using Whisper) and then analyzing that text for deeper meaning (using GPT-4o), the code example will demonstrate this two-step process.
This script will:
- Transcribe an audio file containing potentially nuanced language using Whisper.
- Send the transcribed text to GPT-4o with a prompt asking for semantic interpretation.
Download the audio sample: https://files.cuantum.tech/audio/idiom_speech.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location used for logging context
current_timestamp = "2025-04-21 19:37:00 CDT"
current_location = "Dallas, Texas, United States"
print(f"Running GPT-4o semantic speech understanding example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with nuanced speech
# IMPORTANT: Replace 'idiom_speech.mp3' with the actual filename.
# Good examples for audio content: "Wow, that presentation just knocked my socks off!",
# "Sure, I'd LOVE to attend another three-hour meeting.", "He really spilled the beans."
audio_file_path = "idiom_speech.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Text for Semantic Meaning using GPT-4o ---
def analyze_text_meaning(client, text_to_analyze):
"""Sends transcribed text to GPT-4o for semantic analysis."""
print(f"\nStep 2: Analyzing text for semantic meaning: \"{text_to_analyze}\"")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Construct prompt to ask for deeper meaning
system_prompt = "You are an expert in linguistics and communication."
user_prompt = (
f"Analyze the following phrase or sentence:\n\n'{text_to_analyze}'\n\n"
"Explain its likely intended meaning, considering context, idioms, "
"metaphors, sarcasm, humor, cultural references, or other nuances. "
"Go beyond a literal, word-for-word interpretation."
)
try:
print("Sending text to GPT-4o for analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for its strong understanding capabilities
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=250, # Adjust as needed
temperature=0.5 # Lower temperature for more focused analysis
)
analysis = response.choices[0].message.content
print("Semantic analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio
transcribed_text = transcribe_speech(client, audio_file_path)
if transcribed_text:
print(f"\nTranscription Result: {transcribed_text}")
# Step 2: Analyze the transcription for meaning
semantic_analysis = analyze_text_meaning(client, transcribed_text)
if semantic_analysis:
print("\n--- Semantic Analysis Result ---")
print(semantic_analysis)
print("--------------------------------\n")
print("This demonstrates GPT-4o understanding nuances beyond literal text.")
else:
print("\nSemantic analysis failed.")
else:
print("\nTranscription failed, cannot proceed to analysis.")
Code breakdown:
- Context: This code demonstrates GPT-4o's advanced semantic understanding of speech. It goes beyond simple transcription by interpreting the meaning, including nuances like idioms, sarcasm, or context-dependent phrases.
- Two-Step Process: The example uses a standard two-step API approach:
  - Step 1 (Whisper): The audio file is first converted into text using the Whisper API (client.audio.transcriptions.create). This captures the spoken words accurately.
  - Step 2 (GPT-4o): The transcribed text is then sent to the GPT-4o model (client.chat.completions.create) with a specific prompt asking it to analyze the meaning behind the words, considering non-literal interpretations.
- Prerequisites: Requires the standard openai and python-dotenv setup, API key, and crucially, an audio file containing speech that has some nuance (e.g., includes an idiom like "spill the beans", a sarcastic remark like "Oh great, another meeting", or a culturally specific phrase).
- Transcription Function (transcribe_speech): This function handles Step 1, taking the audio file path and returning the plain text transcription from Whisper.
- Semantic Analysis Function (analyze_text_meaning):
  - This function handles Step 2. It takes the transcribed text.
  - Prompt Design: It constructs a prompt specifically asking GPT-4o to act as a linguistic expert and explain the intended meaning, considering idioms, sarcasm, context, etc., explicitly requesting analysis beyond the literal interpretation.
  - Uses gpt-4o as the model for its strong reasoning and understanding capabilities.
  - Returns the analysis provided by GPT-4o.
- Main Execution: The script first transcribes the audio. If successful, it passes the text to the analysis function. Finally, it prints both the literal transcription and GPT-4o's semantic interpretation.
- Use Case Relevance: This example clearly shows how combining Whisper and GPT-4o allows for a deeper understanding of spoken language than transcription alone. It demonstrates the capability described – comprehending idioms ("raining cats and dogs"), sarcasm, humor, and context – making AI interaction more aligned with human communication.
To best showcase the semantic analysis step, test with an audio file containing non-literal language. Replace 'idiom_speech.mp3' with your actual file path.
Contextual Analysis
Interprets statements within their broader context, taking into account surrounding information, previous discussions, cultural references, and situational factors. This includes understanding how time, place, speaker relationships, and prior conversations influence meaning. The analysis considers multiple layers of context:
- Temporal Context: When something is said (time of day, day of week, season, or historical period)
- Social Context: The relationships between speakers, power dynamics, and social norms
- Physical Context: The location and environment where communication occurs
- Cultural Context: Shared knowledge, beliefs, and customs that influence interpretation
For example, the phrase "it's getting late" could mean different things in different contexts:
- During a workday meeting: A polite suggestion to wrap up the discussion
- At a social gathering: An indication that someone needs to leave
- From a parent to a child: A reminder about bedtime
- In a project discussion: Concern about approaching deadlines
GPT-4o analyzes these contextual clues along with additional factors such as tone of voice, speech patterns, and conversation history to provide more accurate and nuanced interpretations of spoken communication. This deep contextual understanding allows the system to capture the true intended meaning behind words, rather than just their literal interpretation.
Example:
This use case focuses on GPT-4o's ability to interpret transcribed speech within its broader context (temporal, social, physical, cultural). Like the semantic understanding example, this typically involves a two-step process: transcribing the speech with Whisper, then analyzing the text with GPT-4o, but this time explicitly providing contextual information to GPT-4o.
This code example will:
- Transcribe a simple, context-dependent phrase from an audio file using Whisper.
- Send the transcribed text to GPT-4o multiple times, each time providing a different context description.
- Show how GPT-4o's interpretation of the same phrase changes based on the provided context.
Download the sample audio: https://files.cuantum.tech/audio/context_phrase.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location used for logging context
current_timestamp = "2025-02-11 11:44:00 CDT"
current_location = "Miami, Florida, United States"
print(f"Running GPT-4o contextual speech analysis example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with the context-dependent phrase
# IMPORTANT: Replace 'context_phrase.mp3' with the actual filename.
# The audio content should ideally be just "It's getting late."
audio_file_path = "context_phrase.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from the previous example (gpt4o_speech_semantic_py)
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Text for Meaning WITHIN a Given Context using GPT-4o ---
def analyze_text_with_context(client, text_to_analyze, context_description):
"""Sends transcribed text and context description to GPT-4o for analysis."""
print(f"\nStep 2: Analyzing text \"{text_to_analyze}\" within context...")
print(f"Context Provided: {context_description}")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
if not context_description:
print("Error: Context description must be provided for this analysis.")
return None
# Construct prompt asking for interpretation based on context
system_prompt = "You are an expert in analyzing communication and understanding context."
user_prompt = (
f"Consider the phrase: '{text_to_analyze}'\n\n"
f"Now, consider the specific context in which it was said: '{context_description}'\n\n"
"Based *only* on this context, explain the likely intended meaning, implication, "
"or function of the phrase in this situation."
)
try:
print("Sending text and context to GPT-4o for analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong contextual reasoning
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=200, # Adjust as needed
temperature=0.3 # Lower temperature for more focused contextual interpretation
)
analysis = response.choices[0].message.content
print("Contextual analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio phrase
transcribed_phrase = transcribe_speech(client, audio_file_path)
if transcribed_phrase:
print(f"\nTranscription Result: \"{transcribed_phrase}\"")
# Define different contexts for the same phrase
contexts = [
"Said during a business meeting scheduled to end at 5:00 PM, spoken at 4:55 PM.",
"Said by a guest at a social party around 1:00 AM.",
"Said by a parent to a young child at 9:00 PM on a school night.",
"Said during a critical project discussion about an upcoming deadline, spoken late in the evening.",
"Said by someone looking out the window on a short winter afternoon."
]
print("\n--- Analyzing Phrase in Different Contexts ---")
# Step 2: Analyze the phrase within each context
for i, context in enumerate(contexts):
print(f"\n--- Analysis for Context {i+1} ---")
contextual_meaning = analyze_text_with_context(
client,
transcribed_phrase,
context
)
if contextual_meaning:
print(f"Meaning in Context: {contextual_meaning}")
else:
print("Contextual analysis failed for this context.")
print("------------------------------------")
print("\nThis demonstrates how GPT-4o interprets the same phrase differently based on provided context.")
else:
print("\nTranscription failed, cannot proceed to contextual analysis.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability for contextual analysis of speech. It shows how the interpretation of a spoken phrase can change dramatically depending on the surrounding situation (temporal, social, situational factors).
- Two-Step Process with Context Injection:
  - Step 1 (Whisper): The audio file containing a context-dependent phrase (e.g., "It's getting late.") is transcribed into text using `client.audio.transcriptions.create`.
  - Step 2 (GPT-4o): The transcribed text is then sent to GPT-4o (`client.chat.completions.create`), but crucially, the prompt now includes a description of the specific context in which the phrase was spoken.
- Prerequisites: Requires the standard `openai` and `python-dotenv` setup, an API key, and an audio file containing a simple phrase whose meaning heavily depends on context (the example uses "It's getting late.").
- Transcription Function (`transcribe_speech`): This function (reused from the previous example) handles Step 1.
- Contextual Analysis Function (`analyze_text_with_context`):
  - Handles Step 2 and now accepts an additional argument: `context_description`.
  - Prompt Design: The prompt explicitly provides both the transcribed phrase and the `context_description` to GPT-4o, asking it to interpret the phrase within that specific situation.
  - Uses `gpt-4o` for its ability to reason based on provided context.
- Demonstrating Context Dependency (Main Execution):
  - The script first transcribes the phrase (e.g., "It's getting late.").
  - It then defines a list of different context descriptions (meeting ending, late-night party, bedtime, project deadline, short winter day).
  - It calls the `analyze_text_with_context` function repeatedly, using the same transcribed phrase but providing a different context description each time.
  - By printing the analysis result for each context, the script clearly shows how GPT-4o's interpretation shifts based on the context provided (e.g., suggesting wrapping up vs. indicating tiredness vs. noting dwindling daylight).
- Use Case Relevance: This highlights GPT-4o's sophisticated understanding, moving beyond literal words to grasp intended meaning influenced by temporal, social, and situational factors. This is vital for applications needing accurate interpretation of real-world communication in business, social interactions, or any context-rich environment. It shows how developers can provide relevant context alongside transcribed text to get more accurate and nuanced interpretations from the AI.

For testing this code effectively, either create an audio file containing just the phrase "It's getting late" (or another context-dependent phrase), or download the provided sample file. Remember to update the `context_phrase.mp3` path to match your file location.
Summary Generation
GPT-4o's summary generation capabilities represent a significant advancement in AI-powered content analysis. The system creates concise, meaningful summaries of complex discussions by intelligently distilling key information from lengthy conversations, meetings, or presentations. Using advanced natural language processing and contextual understanding, GPT-4o can identify main themes, critical points, and essential takeaways while maintaining the core meaning and context of the original discussion.
The system employs several sophisticated techniques:
- Pattern Recognition: Identifies recurring themes and important discussion points across long conversations
- Contextual Analysis: Understands the broader context and relationships between different parts of the discussion
- Priority Detection: Automatically determines which information is most crucial for the summary
- Semantic Understanding: Captures underlying meanings and implications beyond just surface-level content
The generated summaries can be customized for different purposes and audiences:
- Executive Briefings: Focused on strategic insights and high-level decisions
- Meeting Minutes: Detailed documentation of discussions and action items
- Quick Overviews: Condensed highlights for rapid information consumption
- Technical Summaries: Emphasis on specific technical details and specifications
What sets GPT-4o apart is its ability to preserve important details while significantly reducing information overload, making it an invaluable tool for modern business communication and knowledge management.
Example:
This example focuses on GPT-4o's ability to generate concise and meaningful summaries from potentially lengthy spoken content obtained via Whisper.
This involves the familiar two-step process: first, transcribing the audio with Whisper to get the full text, and second, using GPT-4o's language understanding capabilities to analyze and summarize that text according to specific needs. This example will demonstrate generating different types of summaries from the same transcription.
Download the sample audio: https://files.cuantum.tech/audio/discussion_audio.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location used for logging context in this example
current_timestamp = "2025-04-10 15:59:00 CDT"
current_location = "Houston, Texas, United States"
print(f"Running GPT-4o speech summarization example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'discussion_audio.mp3' with the actual filename.
audio_file_path = "discussion_audio.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Add note about chunking for long files if size check implemented
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before summarization.")
except OSError:
pass # Ignore size check error, proceed with transcription attempt
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Generate Summary from Text using GPT-4o ---
def summarize_text(client, text_to_summarize, summary_style="concise overview"):
"""Sends transcribed text to GPT-4o for summarization."""
print(f"\nStep 2: Generating '{summary_style}' summary...")
if not text_to_summarize:
print("Error: No text provided for summarization.")
return None
# Tailor the prompt based on the desired summary style
system_prompt = "You are an expert meeting summarizer and information distiller."
user_prompt = f"""Please generate a {summary_style} of the following discussion transcription.
Focus on accurately capturing the key information relevant to a {summary_style}. For example:
- For an 'executive briefing', focus on strategic points, decisions, and outcomes.
- For 'detailed meeting minutes', include main topics, key arguments, decisions, and action items.
- For a 'concise overview', provide the absolute main points and purpose.
- For a 'technical summary', emphasize technical details, specifications, or findings.
Transcription Text:
---
{text_to_summarize}
---
Generate the {summary_style}:
"""
try:
print(f"Sending text to GPT-4o for {summary_style}...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong summarization
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=400, # Adjust based on expected summary length
temperature=0.5 # Balance creativity and focus
)
summary = response.choices[0].message.content
print(f"'{summary_style}' generation successful.")
return summary.strip()
except OpenAIError as e:
print(f"OpenAI API Error during summarization: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during summarization: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("--------------------------")
# Step 2: Generate summaries in different styles
summary_styles_to_generate = [
"concise overview",
"detailed meeting minutes with action items",
"executive briefing focusing on decisions",
# "technical summary" # Add if relevant to your audio content
]
print("\n--- Generating Summaries ---")
for style in summary_styles_to_generate:
print(f"\n--- Summary Style: {style} ---")
summary_result = summarize_text(
client,
full_transcription,
summary_style=style
)
if summary_result:
print(summary_result)
else:
print(f"Failed to generate '{style}'.")
print("------------------------------------")
print("\nThis demonstrates GPT-4o generating different summaries from the same transcription based on the prompt.")
else:
print("\nTranscription failed, cannot proceed to summarization.")
Code breakdown:
- Context: This code demonstrates GPT-4o's advanced capability for summary generation from spoken content. It leverages the two-step process: transcribing audio with Whisper and then using GPT-4o to intelligently distill the key information from the transcription into a concise summary.
- Handling Lengthy Audio (Crucial Note): The prerequisites and code comments explicitly address the 25MB limit of the Whisper API. For real-world long meetings or presentations, the audio must be chunked, each chunk transcribed separately, and the resulting texts concatenated before being passed to the summarization step. The code example itself processes a single audio file for simplicity but highlights this essential workflow for longer content; a minimal chunking sketch follows this breakdown.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file representing the discussion to be summarized (`discussion_audio.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1, converting the input audio (or audio chunk) into plain text using Whisper.
- Summarization Function (`summarize_text`):
  - Handles Step 2, taking the full transcribed text as input.
  - Customizable Summaries: Accepts a `summary_style` argument (e.g., "executive briefing", "detailed meeting minutes").
  - Prompt Engineering: The prompt sent to GPT-4o is dynamically constructed based on the requested `summary_style`. It instructs GPT-4o to act as an expert summarizer and tailor the output (focusing on strategic points, action items, technical details, etc.) according to the desired style.
  - Uses `gpt-4o` for its advanced understanding and summarization skills.
- Demonstrating Different Summary Types (Main Execution):
  - The script first gets the full transcription.
  - It then defines a list of different `summary_styles_to_generate`.
  - It calls the `summarize_text` function multiple times, passing the same full transcription each time but varying the `summary_style` argument.
  - By printing each resulting summary, the script clearly shows how GPT-4o adapts the level of detail and focus based on the prompt, generating distinct outputs (e.g., a brief overview vs. detailed minutes) from the identical source text.
- Use Case Relevance: This directly addresses the "Summary Generation" capability. It shows how combining Whisper and GPT-4o can transform lengthy spoken discussions into various useful formats (executive briefings, meeting minutes, quick overviews), saving time and improving knowledge management in business, education, and content creation.
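The chunking workflow described in the breakdown above can be implemented with a small helper. The sketch below is a minimal illustration rather than an official utility: it assumes the third-party `pydub` package (which requires the `ffmpeg` binary) is installed, reuses a `client` like the one created earlier, and uses an arbitrary 10-minute chunk length and the placeholder filename `long_meeting.mp3`.

import os
from openai import OpenAI
from pydub import AudioSegment  # assumption: pydub + ffmpeg are installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_long_audio(file_path, chunk_minutes=10):
    """Split a long recording into chunks, transcribe each, and join the text."""
    audio = AudioSegment.from_file(file_path)
    chunk_ms = chunk_minutes * 60 * 1000
    texts = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk_path = f"chunk_{i}.mp3"
        # Export the slice as a temporary MP3 small enough for the 25MB API limit
        audio[start:start + chunk_ms].export(chunk_path, format="mp3")
        print(f"Transcribing chunk {i}...")
        with open(chunk_path, "rb") as f:
            texts.append(client.audio.transcriptions.create(
                model="whisper-1", file=f, response_format="text"
            ))
        os.remove(chunk_path)  # clean up the temporary file
    return " ".join(texts)

# Example usage (placeholder filename):
# full_text = transcribe_long_audio("long_meeting.mp3")
# The concatenated full_text can then be passed to summarize_text() as usual.

In practice you may prefer to split at silent points so words are not cut in half at chunk boundaries, and you should verify that each exported chunk stays under the 25MB limit.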
Key Point Extraction
Identifies and highlights crucial information by leveraging GPT-4o's advanced natural language processing capabilities. Through sophisticated algorithms and contextual understanding, the model analyzes spoken content to extract meaningful insights. The model can:
- Extract core concepts and main arguments from spoken content - This involves identifying the fundamental ideas, key messages, and supporting evidence presented in conversations, presentations, or discussions. The model distinguishes between primary and secondary points, ensuring that essential information is captured.
- Identify critical decision points and action items - By analyzing conversation flow and context, GPT-4o recognizes moments when decisions are made, commitments are established, or tasks are assigned. This includes detecting both explicit assignments ("John will handle this") and implicit ones ("We should look into this further").
- Prioritize information based on context and relevance - The model evaluates the significance of different pieces of information within their specific context, considering factors such as urgency, impact, and relationship to overall objectives. This helps in creating hierarchical summaries that emphasize what matters most.
- Track key themes and recurring topics across conversations - GPT-4o maintains awareness of discussion patterns, identifying when certain subjects resurface and how they evolve over time. This capability is particularly valuable for long-term project monitoring or tracking ongoing concerns across multiple meetings.
Example:
This example focuses on using GPT-4o to extract specific, crucial information—key points, decisions, action items—from transcribed speech, going beyond a general summary.
This again uses the two-step approach: Whisper transcribes the audio, and then GPT-4o analyzes the text based on a prompt designed for extraction.
Download the audio sample: https://files.cuantum.tech/audio/meeting_for_extraction.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location used for logging context in this example
current_timestamp = "2025-03-21 22:07:00 CDT"
current_location = "Austin, Texas, United States"
print(f"Running GPT-4o key point extraction from speech example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'meeting_for_extraction.mp3' with the actual filename.
audio_file_path = "meeting_for_extraction.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Add note about chunking for long files if size check implemented
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before extraction.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Extract Key Points, Decisions, Actions using GPT-4o ---
def extract_key_points(client, text_to_analyze):
"""Sends transcribed text to GPT-4o for key point extraction."""
print("\nStep 2: Extracting key points, decisions, and actions...")
if not text_to_analyze:
print("Error: No text provided for extraction.")
return None
# Prompt designed specifically for extraction
system_prompt = "You are an expert meeting analyst. Your task is to carefully read the provided transcript and extract specific types of information."
user_prompt = f"""Analyze the following meeting or discussion transcription. Identify and extract the following information, presenting each under a clear heading:
1. **Key Points / Core Concepts:** List the main topics, arguments, or fundamental ideas discussed.
2. **Decisions Made:** List any clear decisions that were reached during the discussion.
3. **Action Items:** List specific tasks assigned to individuals or the group. If possible, note who is responsible and any mentioned deadlines.
If any category has no relevant items, state "None identified".
Transcription Text:
---
{text_to_analyze}
---
Extracted Information:
"""
try:
print("Sending text to GPT-4o for extraction...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong analytical capabilities
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=600, # Adjust based on expected length of extracted info
temperature=0.2 # Lower temperature for more factual extraction
)
extracted_info = response.choices[0].message.content
print("Extraction successful.")
return extracted_info.strip()
except OpenAIError as e:
print(f"OpenAI API Error during extraction: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during extraction: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("------------------------------------")
# Step 2: Extract Key Information
extracted_details = extract_key_points(
client,
full_transcription
)
if extracted_details:
print("\n--- Extracted Key Information ---")
print(extracted_details)
print("---------------------------------")
print("\nThis demonstrates GPT-4o identifying and structuring key takeaways from the discussion.")
else:
print("\nFailed to extract key information.")
else:
print("\nTranscription failed, cannot proceed to key point extraction.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability for Key Point Extraction from spoken content. After transcribing audio using Whisper, GPT-4o analyzes the text to identify and isolate crucial information like core concepts, decisions made, and action items assigned.
- Two-Step Process: Like summarization, this relies on:
  - Step 1 (Whisper): Transcribing the audio (`client.audio.transcriptions.create`) to get the full text. The critical note about handling audio files larger than 25MB via chunking and concatenation still applies.
  - Step 2 (GPT-4o): Analyzing the complete transcription using `client.chat.completions.create` with a prompt specifically designed for extraction.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file from a meeting or discussion where key information is likely present (`meeting_for_extraction.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1, returning the plain text transcription.
- Extraction Function (`extract_key_points`):
  - Handles Step 2, taking the full transcription text.
  - Prompt Engineering for Extraction: This is key. The prompt explicitly instructs GPT-4o to act as an analyst and extract information under specific headings: "Key Points / Core Concepts," "Decisions Made," and "Action Items." This structured request guides GPT-4o to identify and categorize the relevant information accurately. A lower `temperature` (e.g., 0.2) is suggested to encourage more factual, less creative output suitable for extraction.
  - Uses `gpt-4o` for its advanced analytical skills.
- Output: The function returns a text string containing the extracted information, ideally structured under the requested headings.
- Main Execution: The script transcribes the audio, then passes the text to the extraction function, and finally prints the structured output.
- Use Case Relevance: This directly addresses the "Key Point Extraction" capability. It shows how AI can automatically process lengthy discussions to pull out the most important concepts, track decisions, and list actionable tasks, saving significant time in reviewing recordings or generating meeting follow-ups. It highlights GPT-4o's ability to understand conversational flow and identify significant moments (decisions, assignments) within the text.
Emotional Intelligence
Detects tone, sentiment, and emotional undertones in spoken communication through GPT-4o's advanced natural language processing capabilities. This sophisticated system performs deep analysis of speech patterns and contextual elements to understand the emotional layers of communication. The model can identify subtle emotional cues such as:
- Voice inflections and patterns that indicate excitement, hesitation, or concern - Including pitch variations, speech rhythm changes, and vocal stress patterns that humans naturally use to convey emotions
- Changes in speaking tempo and volume that suggest emotional states - For example, rapid speech might indicate excitement or anxiety, while slower speech could suggest thoughtfulness or uncertainty
- Contextual emotional markers like laughter, sighs, or pauses - The model recognizes non-verbal sounds and silence that carry significant emotional meaning in conversation
- Cultural and situational nuances that affect emotional expression - Understanding how different cultures express emotions differently and how context influences emotional interpretation
This emotional awareness enables GPT-4o to provide more nuanced and context-appropriate responses, making it particularly valuable for applications in customer service (where understanding customer frustration or satisfaction is crucial), therapeutic conversations (where emotional support and understanding are paramount), and personal coaching (where motivation and emotional growth are key objectives). The system's ability to detect these subtle emotional signals allows for more empathetic and effective communication across various professional and personal contexts.
Example:
This example explores using GPT-4o for "Emotional Intelligence" – detecting tone, sentiment, and emotional undertones in speech.
It's important to understand how this works with current standard OpenAI APIs. While GPT-4o excels at understanding emotion from text, directly analyzing audio features like pitch, tone variance, tempo, sighs, or laughter as audio isn't a primary function of the standard Whisper transcription or the Chat Completions API endpoint when processing transcribed text.
Therefore, the most practical way to demonstrate this concept using these APIs is a two-step process:
- Transcribe Speech to Text: Use Whisper to get the words spoken.
- Analyze Text for Emotion: Use GPT-4o to analyze the transcribed text for indicators of emotion, sentiment, or tone based on word choice, phrasing, and context described in the text.
Download the sample audio: https://files.cuantum.tech/audio/emotional_speech.mp3
This code example implements this two-step, text-based analysis approach.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location used for logging context in this example
current_timestamp = "2025-03-21 20:13:00 CDT"
current_location = "Atlanta, Georgia, United States"
print(f"Running GPT-4o speech emotion analysis (text-based) example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with potentially emotional speech
# IMPORTANT: Replace 'emotional_speech.mp3' with the actual filename.
audio_file_path = "emotional_speech.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Transcribed Text for Emotion/Sentiment using GPT-4o ---
def analyze_text_emotion(client, text_to_analyze):
"""
Sends transcribed text to GPT-4o for emotion and sentiment analysis.
Note: This analyzes the text content, not acoustic features of the original audio.
"""
print("\nStep 2: Analyzing transcribed text for emotion/sentiment...")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Prompt designed for text-based emotion/sentiment analysis
system_prompt = "You are an expert in communication analysis, skilled at detecting sentiment, tone, and potential underlying emotions from text."
user_prompt = f"""Analyze the following text for emotional indicators:
Text:
---
{text_to_analyze}
---
Based *only* on the words, phrasing, and punctuation in the text provided:
1. What is the overall sentiment (e.g., Positive, Negative, Neutral, Mixed)?
2. What is the likely emotional tone (e.g., Frustrated, Excited, Calm, Anxious, Sarcastic, Happy, Sad)?
3. Are there specific words or phrases that indicate these emotions? Explain briefly.
Provide the analysis:
"""
try:
print("Sending text to GPT-4o for emotion analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for nuanced understanding
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=300, # Adjust as needed
temperature=0.4 # Slightly lower temp for more grounded analysis
)
analysis = response.choices[0].message.content
print("Emotion analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio
transcribed_text = transcribe_speech(client, audio_file_path)
if transcribed_text:
print(f"\n--- Transcription Result ---")
print(transcribed_text)
print("----------------------------")
# Step 2: Analyze the transcription for emotion/sentiment
emotion_analysis = analyze_text_emotion(
client,
transcribed_text
)
if emotion_analysis:
print("\n--- Emotion/Sentiment Analysis (from Text) ---")
print(emotion_analysis)
print("----------------------------------------------")
print("\nNote: This analysis is based on the transcribed text content. It does not directly analyze acoustic features like tone of voice from the original audio.")
else:
print("\nEmotion analysis failed.")
else:
print("\nTranscription failed, cannot proceed to emotion analysis.")
# --- End of Code Example ---
Code breakdown:
- Context: This code demonstrates how GPT-4o can be used to infer emotional tone and sentiment from spoken language. It utilizes a two-step process common for this type of analysis with current APIs.
- Two-Step Process & Limitation:
  - Step 1 (Whisper): The audio is first transcribed into text using `client.audio.transcriptions.create`.
  - Step 2 (GPT-4o): The resulting text is then analyzed by GPT-4o (`client.chat.completions.create`) using a prompt specifically designed to identify sentiment and emotional indicators within the text.
  - Important Limitation: This method analyzes the linguistic content (words, phrasing) provided by Whisper. It does not directly analyze acoustic features of the original audio like pitch, tempo, or specific non-verbal sounds (sighs, laughter) unless those happen to be transcribed by Whisper (which is often not the case for subtle cues). True acoustic emotion detection would require different tools or APIs.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file where the speaker's words might suggest an emotion (`emotional_speech.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1, returning plain text.
- Emotion Analysis Function (`analyze_text_emotion`):
  - Handles Step 2, taking the transcribed text.
  - Prompt Design: The prompt explicitly asks GPT-4o to analyze the provided text for overall sentiment (Positive/Negative/Neutral), likely emotional tone (Frustrated, Excited, etc.), and supporting textual evidence. It clarifies the analysis should be based only on the text.
  - Uses `gpt-4o` for its sophisticated language understanding.
- Output: The function returns GPT-4o's textual analysis of the inferred emotion and sentiment.
- Main Execution: The script transcribes the audio, passes the text for analysis, prints both results, and reiterates the limitation regarding acoustic features.
- Use Case Relevance: While not analyzing acoustics directly, this text-based approach is still valuable for applications like customer service (detecting frustration/satisfaction from word choice), analyzing feedback, or getting a general sense of sentiment from spoken interactions, complementing other forms of analysis. It showcases GPT-4o's ability to interpret emotional language.
Remember to use an audio file where the spoken words convey some emotion for this example to be effective. Replace `emotional_speech.mp3` with your file path.
Implicit Understanding
GPT-4o demonstrates remarkable capabilities in understanding the deeper layers of human communication, going far beyond simple word recognition to grasp the intricate nuances of speech. The model's sophisticated comprehension abilities include:
- Detect underlying context and assumptions
  - Understands implicit knowledge shared between speakers
  - Recognizes unstated but commonly accepted facts within specific domains
  - Identifies hidden premises in conversations
- Understand cultural references and idiomatic expressions
  - Processes region-specific sayings and colloquialisms
  - Recognizes culture-specific metaphors and analogies
  - Adapts understanding based on cultural context
- Interpret rhetorical devices
Example:
Similar to the previous examples involving deeper understanding (Semantic, Contextual, Emotional), this typically uses the two-step approach: Whisper transcribes the words, and then GPT-4o analyzes the resulting text, this time specifically prompted to look for implicit layers.
Download the sample audio: https://files.cuantum.tech/audio/implicit_speech.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location used for logging context in this example
current_timestamp = "2025-03-12 16:21:00 CDT"
current_location = "Dallas, Texas, United States"
print(f"Running GPT-4o implicit speech understanding example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with implicit meaning
# IMPORTANT: Replace 'implicit_speech.mp3' with the actual filename.
audio_file_path = "implicit_speech.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Transcribed Text for Implicit Meaning using GPT-4o ---
def analyze_implicit_meaning(client, text_to_analyze):
"""
Sends transcribed text to GPT-4o to analyze implicit meanings,
assumptions, references, or rhetorical devices.
"""
print("\nStep 2: Analyzing transcribed text for implicit meaning...")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Prompt designed for identifying implicit communication layers
system_prompt = "You are an expert analyst of human communication, skilled at identifying meaning that is implied but not explicitly stated."
user_prompt = f"""Analyze the following statement or question:
Statement/Question:
---
{text_to_analyze}
---
Based on common knowledge, cultural context, and conversational patterns, please explain:
1. Any underlying assumptions the speaker might be making.
2. Any implicit meanings or suggestions conveyed beyond the literal words.
3. Any cultural references, idioms, or sayings being used or alluded to.
4. If it's a rhetorical question, what point is likely being made?
Provide a breakdown of the implicit layers of communication present:
"""
try:
print("Sending text to GPT-4o for implicit meaning analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for deep understanding
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=400, # Adjust as needed
temperature=0.5 # Allow for some interpretation
)
analysis = response.choices[0].message.content
print("Implicit meaning analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio
transcribed_text = transcribe_speech(client, audio_file_path)
if transcribed_text:
print(f"\n--- Transcription Result ---")
print(transcribed_text)
print("----------------------------")
# Step 2: Analyze the transcription for implicit meaning
implicit_analysis = analyze_implicit_meaning(
client,
transcribed_text
)
if implicit_analysis:
print("\n--- Implicit Meaning Analysis ---")
print(implicit_analysis)
print("-------------------------------")
print("\nThis demonstrates GPT-4o identifying meaning beyond the literal text, based on common knowledge and context.")
else:
print("\nImplicit meaning analysis failed.")
else:
print("\nTranscription failed, cannot proceed to implicit meaning analysis.")
Code breakdown:
- Context: This code example demonstrates GPT-4o's capability for Implicit Understanding – grasping the unstated assumptions, references, and meanings embedded within spoken language.
- Two-Step Process: It follows the established pattern:
  - Step 1 (Whisper): Transcribe the audio containing the implicitly meaningful speech into text using `client.audio.transcriptions.create`.
  - Step 2 (GPT-4o): Analyze the transcribed text using `client.chat.completions.create`, with a prompt specifically designed to uncover hidden layers of meaning.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file where the meaning relies on shared knowledge, cultural context, or isn't fully literal (e.g., using an idiom, a rhetorical question, or making an assumption clear only through context). `implicit_speech.mp3` is used as the placeholder.
- Transcription Function (`transcribe_speech`): Handles Step 1, returning the plain text transcription.
- Implicit Analysis Function (`analyze_implicit_meaning`):
  - Handles Step 2, taking the transcribed text.
  - Prompt Engineering for Implicit Meaning: The prompt is key here. It instructs GPT-4o to look beyond the literal words and identify underlying assumptions, implied suggestions, cultural references/idioms, and the purpose behind rhetorical questions.
  - Uses `gpt-4o` for its extensive knowledge base and reasoning ability needed to infer these implicit elements.
- Output: The function returns GPT-4o's textual analysis of the unstated meanings detected in the input text.
- Main Execution: The script transcribes the audio, passes the text for implicit analysis, and prints both the literal transcription and GPT-4o's interpretation of the hidden meanings.
- Use Case Relevance: This demonstrates how GPT-4o can process communication more like a human, understanding not just what was said, but also what was meant or assumed. This is crucial for applications requiring deep comprehension, such as analyzing user feedback, understanding nuanced dialogue in meetings, or interpreting culturally rich content.
Remember to use an audio file containing speech that requires some level of inference or background knowledge to fully understand for testing this code effectively. Replace `implicit_speech.mp3` with your file path.
From Transcription to Comprehensive Understanding
This advance marks a revolutionary transformation in AI's ability to process human speech. While traditional systems like Whisper excel at transcription - the mechanical process of converting spoken words into written text - modern AI systems like GPT-4o achieve true comprehension, understanding not just the words themselves but their deeper meaning, context, and implications. This leap forward enables AI to process human communication in ways that are remarkably similar to how humans naturally understand conversation, including subtle nuances, implied meanings, and contextual relevance.
To illustrate this transformative evolution in capability, let's examine a detailed example that highlights the stark contrast between simple transcription and advanced comprehension:
- Consider this statement: "I think we should delay the product launch until next quarter." A traditional transcription system like Whisper would perfectly capture these words, but that's where its understanding ends - it simply converts speech to text with high accuracy.
- GPT-4o, however, demonstrates a sophisticated level of understanding that mirrors human comprehension:
- Primary Message Analysis: Beyond just identifying the suggestion to reschedule, it understands this as a strategic proposal that requires careful consideration
- Business Impact Evaluation: Comprehensively assesses how this delay would affect various aspects of the business, from resource allocation to team scheduling to budget implications
- Strategic Market Analysis: Examines the broader market context, including competitor movements, market trends, and potential windows of opportunity
- Comprehensive Risk Assessment: Evaluates both immediate and long-term consequences, considering everything from technical readiness to market positioning
What makes GPT-4o truly remarkable is its ability to engage in nuanced analytical discussions about the content, addressing complex strategic questions that require deep understanding (a brief prompt sketch follows the list below):
- External Factors: What specific market conditions, competitive pressures, or industry trends might have motivated this delay suggestion?
- Stakeholder Impact: How would this timeline adjustment affect relationships with investors, partners, and customers? What communication strategies might be needed?
- Strategic Opportunities: What potential advantages could emerge from this delay, such as additional feature development or market timing optimization?
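As a concrete illustration of the analytical questions above, the following minimal sketch passes the transcribed statement to GPT-4o together with those questions. It reuses the two-step pattern from the earlier examples; the system prompt and question wording are illustrative assumptions, not a fixed recipe.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# In a real pipeline this statement would come from a Whisper transcription
statement = "I think we should delay the product launch until next quarter."

prompt = f"""A team member said the following in a planning meeting:

"{statement}"

Going beyond the literal words, briefly address:
1. What market conditions or competitive pressures might motivate this suggestion?
2. How could the timeline change affect investors, partners, and customers?
3. What strategic opportunities could the delay create?
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a strategic business analyst."},
        {"role": "user", "content": prompt},
    ],
    max_tokens=400,
    temperature=0.4,
)
print(response.choices[0].message.content)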
2.3.2 What Can GPT-4o Do with Speech Input?
GPT-4o represents a significant advancement in audio processing technology, offering a comprehensive suite of capabilities that transform how we interact with and understand spoken content. As a cutting-edge language model with multimodal processing abilities, it combines sophisticated speech recognition with deep contextual understanding to deliver powerful audio analysis features. Let's explore some of GPT-4o's other functions and capabilities:
Action Item Extraction
Prompt example: "List all the tasks mentioned in this voice note."
GPT-4o excels at identifying and extracting action items from spoken content through sophisticated natural language processing. The model can:
- Parse complex conversations to detect both explicit ("Please do X") and implicit ("We should consider Y") tasks
- Distinguish between hypothetical discussions and actual commitments
- Categorize tasks by priority, deadline, and assignee
- Identify dependencies between different action items
- Flag follow-up requirements and recurring tasks
This capability transforms unstructured audio discussions into structured, actionable task lists, significantly improving meeting productivity and follow-through. By automatically maintaining a comprehensive record of commitments, it ensures accountability while reducing the cognitive load on participants who would otherwise need to manually track these items. The system can also integrate with popular task management tools, making it seamless to convert spoken assignments into trackable tickets or to-dos.
Example:
This script uses the familiar two-step process: first transcribing the audio with Whisper, then analyzing the text with GPT-4o using a prompt specifically designed to identify and structure action items.
Download the audio sample: https://files.cuantum.tech/audio/meeting_tasks.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location used for logging context in this example
current_timestamp = "2025-03-24 10:29:00 CDT"
current_location = "Plano, Texas, United States"
print(f"Running GPT-4o action item extraction from speech example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'meeting_tasks.mp3' with the actual filename.
audio_file_path = "meeting_tasks.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Add note about chunking for long files if size check implemented
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before extraction.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Extract Action Items from Text using GPT-4o ---
def extract_action_items(client, text_to_analyze):
"""Sends transcribed text to GPT-4o for action item extraction."""
print("\nStep 2: Extracting action items...")
if not text_to_analyze:
print("Error: No text provided for extraction.")
return None
# Prompt designed specifically for extracting structured action items
system_prompt = "You are an expert meeting analyst focused on identifying actionable tasks."
user_prompt = f"""Analyze the following meeting or discussion transcription. Identify and extract all specific action items mentioned.
For each action item, provide:
- A clear description of the task.
- The person assigned (if mentioned, otherwise state 'Unassigned' or 'Group').
- Any deadline mentioned (if mentioned, otherwise state 'No deadline mentioned').
Distinguish between definite commitments/tasks and mere suggestions or hypothetical possibilities. Only list items that sound like actual tasks or commitments.
Format the output as a numbered list.
Transcription Text:
---
{text_to_analyze}
---
Extracted Action Items:
"""
try:
print("Sending text to GPT-4o for action item extraction...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong analytical capabilities
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=500, # Adjust based on expected number of action items
temperature=0.1 # Very low temperature for factual extraction
)
extracted_actions = response.choices[0].message.content
print("Action item extraction successful.")
return extracted_actions.strip()
except OpenAIError as e:
print(f"OpenAI API Error during extraction: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during extraction: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("------------------------------------")
# Step 2: Extract Action Items
action_items_list = extract_action_items(
client,
full_transcription
)
if action_items_list:
print("\n--- Extracted Action Items ---")
print(action_items_list)
print("------------------------------")
print("\nThis demonstrates GPT-4o identifying and structuring actionable tasks from the discussion.")
else:
print("\nFailed to extract action items.")
else:
print("\nTranscription failed, cannot proceed to action item extraction.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability for Action Item Extraction from spoken content. After transcribing audio with Whisper, GPT-4o analyzes the text to identify specific tasks, assignments, and deadlines discussed.
- Two-Step Process: It uses the standard workflow:
  - Step 1 (Whisper): Transcribe the meeting/discussion audio (`client.audio.transcriptions.create`) into text. The note about handling audio files > 25MB via chunking/concatenation remains critical for real-world use.
  - Step 2 (GPT-4o): Analyze the complete transcription using `client.chat.completions.create` with a prompt tailored for task extraction.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file from a meeting where tasks were assigned (`meeting_tasks.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1.
- Action Item Extraction Function (`extract_action_items`):
  - Handles Step 2, taking the full transcription text.
  - Prompt Engineering for Tasks: This is the core. The prompt explicitly instructs GPT-4o to identify action items, distinguish them from mere suggestions, and extract the task description, assigned person (if mentioned), and deadline (if mentioned). It requests a structured, numbered list format. A very low `temperature` (e.g., 0.1) is recommended to keep the output focused on factual extraction.
  - Uses `gpt-4o` for its ability to understand conversational context and identify commitments.
- Output: The function returns a text string containing the structured list of extracted action items.
- Main Execution: The script transcribes the audio, passes the text to the extraction function, and prints the resulting list of tasks.
- Use Case Relevance: This directly addresses the "Action Item Extraction" capability. It shows how AI can automatically convert unstructured verbal discussions into organized, actionable task lists. This significantly boosts productivity by ensuring follow-through, clarifying responsibilities, and reducing the manual effort of tracking commitments made during meetings. It highlights GPT-4o's ability to parse complex conversations and identify both explicit and implicit task assignments.
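The example above returns a human-readable numbered list. To feed the extracted tasks into a task-management tool, as the capability description mentions, you can instead ask GPT-4o for machine-readable output. The sketch below is one possible approach using the Chat Completions JSON mode; it assumes the `client` and `full_transcription` variables from the example above, and the field names (`task`, `assignee`, `deadline`) are illustrative choices, not a required schema.

import json

def extract_action_items_json(client, text_to_analyze):
    """Ask GPT-4o to return action items as JSON for downstream task tools."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # JSON mode: forces valid JSON output
        messages=[
            {"role": "system", "content": "You extract action items from meeting transcripts."},
            {"role": "user", "content": (
                "Return a JSON object with a key 'action_items' containing a list of "
                "objects with fields 'task', 'assignee', and 'deadline'. Use null when "
                "a field is not mentioned.\n\nTranscript:\n" + text_to_analyze
            )},
        ],
        temperature=0.1,
    )
    return json.loads(response.choices[0].message.content)

# Example usage, reusing client and full_transcription from the example above:
# tasks = extract_action_items_json(client, full_transcription)
# for item in tasks["action_items"]:
#     print(f"{item['task']} -> {item['assignee']} (due: {item['deadline']})")

From here, each item can be posted to whatever ticketing or to-do system your team uses.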
Q&A about the Audio
Prompt Example: "What did the speaker say about the budget?"
GPT-4o's advanced query capabilities allow for natural conversations about audio content, enabling users to ask specific questions and receive contextually relevant answers. The model can:
- Extract precise information from specific segments
- Understand context and references across the entire audio
- Handle follow-up questions about previously discussed topics
- Provide time-stamped references to relevant portions (a short timestamp sketch appears just before the example below)
- Cross-reference information from multiple parts of the recording
This functionality transforms how we interact with audio content, making it as searchable and queryable as text documents. Instead of manually scrubbing through recordings, users can simply ask questions in natural language and receive accurate, concise responses. The system is particularly valuable for:
- Meeting participants who need to verify specific details
- Researchers analyzing interview recordings
- Students reviewing lecture content
- Professionals fact-checking client conversations
- Teams seeking to understand historical discussions
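One practical building block for time-stamped answers is Whisper's `verbose_json` response format, which returns per-segment start and end times. The minimal sketch below shows how to request it; the filename is a placeholder, and depending on your SDK version the segments may be returned as objects (attribute access, as assumed here) or as plain dictionaries. These timestamps can then be included in the text you pass to GPT-4o so its answers can cite approximate positions in the recording.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting_for_qa.mp3", "rb") as audio_file:  # placeholder filename
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",  # includes per-segment timing information
    )

# Each segment carries start/end times in seconds plus its text
for segment in result.segments:
    print(f"[{segment.start:6.1f}s - {segment.end:6.1f}s] {segment.text.strip()}")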
Example:
This script first transcribes an audio file using Whisper and then uses GPT-4o to answer a specific question asked by the user about the content of that transcription.
Download the audio sample: https://files.cuantum.tech/audio/meeting_for_qa.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location used for logging context in this example
current_timestamp = "2025-01-11 11:47:00 CDT"
current_location = "Orlando, Florida, United States"
print(f"Running GPT-4o Q&A about audio example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'meeting_for_qa.mp3' with the actual filename.
audio_file_path = "meeting_for_qa.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Add note about chunking for long files if size check implemented
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before Q&A.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Answer Question Based on Text using GPT-4o ---
def answer_question_about_text(client, full_text, question):
"""Sends transcribed text and a question to GPT-4o to get an answer."""
print(f"\nStep 2: Answering question about the transcription...")
print(f"Question: \"{question}\"")
if not full_text:
print("Error: No transcription text provided to answer questions about.")
return None
if not question:
print("Error: No question provided.")
return None
# Prompt designed specifically for answering questions based on provided text
system_prompt = "You are an AI assistant specialized in answering questions based *only* on the provided text transcription. Do not use outside knowledge."
user_prompt = f"""Based *solely* on the following transcription text, please answer the question below. If the answer is not found in the text, state that clearly.
Transcription Text:
---
{full_text}
---
Question: {question}
Answer:
"""
try:
print("Sending transcription and question to GPT-4o...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong comprehension and answering
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=300, # Adjust based on expected answer length
temperature=0.1 # Low temperature for factual answers based on text
)
answer = response.choices[0].message.content
print("Answer generation successful.")
return answer.strip()
except OpenAIError as e:
print(f"OpenAI API Error during Q&A: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during Q&A: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
transcription = transcribe_speech(client, audio_file_path)
if transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(transcription[:1000] + "..." if len(transcription) > 1000 else transcription)
print("------------------------------------")
# --- Ask Questions about the Transcription ---
# Define the question(s) you want to ask
user_question = "What was decided about the email marketing CTA button?"
# user_question = "Who is responsible for the A/B test on Platform B?"
# user_question = "What was the engagement increase on Platform A?"
print(f"\n--- Answering Question ---")
# Step 2: Get the answer from GPT-4o
answer = answer_question_about_text(
client,
transcription,
user_question
)
if answer:
print(f"\nAnswer to '{user_question}':")
print(answer)
print("------------------------------")
print("\nThis demonstrates GPT-4o answering specific questions based on the transcribed audio content.")
else:
print(f"\nFailed to get an answer for the question: '{user_question}'")
else:
print("\nTranscription failed, cannot proceed to Q&A.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability to function as a Q&A system for audio content. After transcribing speech with Whisper, users can ask specific questions in natural language, and GPT-4o will provide answers based on the information contained within the transcription.
- Two-Step Process: The workflow involves:
  - Step 1 (Whisper): Transcribe the relevant audio file (or concatenated text from chunks of a longer file) using `client.audio.transcriptions.create`.
  - Step 2 (GPT-4o): Send the complete transcription along with the user's specific question to `client.chat.completions.create`.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file containing the discussion or information the user might ask questions about (`meeting_for_qa.mp3`). The critical note about handling audio > 25MB via chunking/concatenation before the Q&A step remains essential.
- Transcription Function (`transcribe_speech`): Handles Step 1.
- Q&A Function (`answer_question_about_text`):
  - Handles Step 2, taking both the `full_text` transcription and the `question` as input.
  - Prompt Engineering for Q&A: The prompt is crucial. It instructs GPT-4o to act as a specialized assistant that answers questions based only on the provided transcription text, explicitly telling it not to use external knowledge and to state when the answer isn't found in the text. This grounding is important for accuracy. A low `temperature` (e.g., 0.1) helps ensure factual answers derived directly from the source text.
  - Uses `gpt-4o` for its excellent reading comprehension and question-answering abilities.
- Output: The function returns GPT-4o's answer to the specific question asked.
- Main Execution: The script transcribes the audio, defines a sample `user_question`, passes the transcription and question to the Q&A function, and prints the resulting answer.
- Use Case Relevance: This directly addresses the "Q&A about the Audio" capability. It transforms audio recordings from passive archives into interactive knowledge sources. Users can quickly find specific details, verify facts, or understand parts of a discussion without manually searching through the audio, making it invaluable for reviewing meetings, lectures, interviews, or any recorded conversation.
Remember to use an audio file containing information relevant to potential questions for testing (you can use the sample audio provided). Modify the `user_question` variable to test different queries against the transcribed content.
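Because the transcription only needs to be produced once, several questions can be answered from the same text without re-running Whisper. Below is a small sketch of that pattern, intended as a drop-in replacement for the script's main block; it assumes the `client`, `audio_file_path`, and the two functions defined above, and the questions themselves are only illustrative.
if __name__ == "__main__":
    questions = [
        "What was decided about the email marketing CTA button?",
        "Who is responsible for the A/B test on Platform B?",
        "What was the engagement increase on Platform A?",
    ]
    transcription = transcribe_speech(client, audio_file_path)  # Whisper runs once
    if transcription:
        for q in questions:
            # GPT-4o is called once per question, reusing the same transcription
            answer = answer_question_about_text(client, transcription, q)
            print(f"\nQ: {q}\nA: {answer if answer else 'No answer returned.'}")
    else:
        print("Transcription failed, cannot run Q&A.")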
Highlight Key Moments
Prompt example: "Identify the most important statements made in this audio."
GPT-4o excels at identifying and extracting crucial moments from audio content through its advanced natural language understanding capabilities. The model can:
- Identify key decisions and action items
- Extract important quotes and statements
- Highlight strategic discussions and conclusions
- Pinpoint critical transitions in conversations
This feature is particularly valuable for:
- Meeting participants who need to quickly review important takeaways
- Executives scanning long recordings for decision points
- Teams tracking project milestones discussed in calls
- Researchers identifying significant moments in interviews
For each highlighted moment, the model provides a contextual summary, and when the transcription includes segment timestamps the highlights can be tied back to specific points in the recording, making it easier to navigate directly to the most relevant parts without reviewing the entire audio file.
Example:
This script follows the established two-step pattern: transcribing the audio with Whisper and then analyzing the text with GPT-4o using a prompt designed to identify significant statements, decisions, or conclusions.
Download the sample audio: https://files.cuantum.tech/audio/key_discussion.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location context used for logging (illustrative values)
current_timestamp = "2025-02-14 15:52:00 CDT"
current_location = "Tampa, Florida, United States"
print(f"Running GPT-4o key moment highlighting example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'key_discussion.mp3' with the actual filename.
audio_file_path = "key_discussion.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Add note about chunking for long files if size check implemented
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before highlighting.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Highlight Key Moments from Text using GPT-4o ---
def highlight_key_moments(client, text_to_analyze):
"""Sends transcribed text to GPT-4o to identify and extract key moments."""
print("\nStep 2: Identifying key moments from transcription...")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Prompt designed specifically for identifying key moments/statements
system_prompt = "You are an expert analyst skilled at identifying the most significant parts of a discussion or presentation."
user_prompt = f"""Analyze the following transcription text. Identify and extract the key moments, which could include:
- Important decisions made
- Critical conclusions reached
- Significant statements or impactful quotes
- Major topic shifts or transitions
- Key questions asked or answered
For each key moment identified, provide the relevant quote or a concise summary of the moment. Present the output as a list.
Transcription Text:
---
{text_to_analyze}
---
Key Moments:
"""
try:
print("Sending text to GPT-4o for key moment identification...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong comprehension
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=700, # Adjust based on expected number/length of key moments
temperature=0.3 # Lean towards factual identification
)
key_moments = response.choices[0].message.content
print("Key moment identification successful.")
return key_moments.strip()
except OpenAIError as e:
print(f"OpenAI API Error during highlighting: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during highlighting: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("------------------------------------")
# Step 2: Highlight Key Moments
highlights = highlight_key_moments(
client,
full_transcription
)
if highlights:
print("\n--- Identified Key Moments ---")
print(highlights)
print("----------------------------")
print("\nThis demonstrates GPT-4o extracting significant parts from the discussion.")
print("\nNote: Adding precise timestamps to these moments requires further processing using Whisper's 'verbose_json' output and correlating the text.")
else:
print("\nFailed to identify key moments.")
else:
print("\nTranscription failed, cannot proceed to highlight key moments.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability to Highlight Key Moments from spoken content. After transcription via Whisper, GPT-4o analyzes the text to pinpoint and extract the most significant parts, such as crucial decisions, important statements, or major topic shifts.
- Two-Step Process:
  - Step 1 (Whisper): Transcribe the audio (`client.audio.transcriptions.create`) to get the full text. The necessity of chunking/concatenating for audio files > 25MB is reiterated.
  - Step 2 (GPT-4o): Analyze the complete transcription using `client.chat.completions.create` with a prompt specifically asking for key moments.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file containing a discussion or presentation where significant moments occur (`key_discussion.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1.
- Highlighting Function (`highlight_key_moments`):
  - Handles Step 2, taking the full transcription text.
  - Prompt Engineering for Highlights: The prompt instructs GPT-4o to act as an analyst and identify various types of key moments (decisions, conclusions, impactful quotes, transitions). It asks for the relevant quote or a concise summary for each identified moment, formatted as a list.
  - Uses `gpt-4o` for its ability to discern importance and context within text.
- Output: The function returns a text string containing the list of identified key moments.
- Timestamp Note: The explanation and code output explicitly mention that while this process identifies the text of key moments, adding precise timestamps would require additional steps. This involves using Whisper's `verbose_json` output format (which includes segment timestamps) and then correlating the text identified by GPT-4o back to those specific timed segments – a more complex task not covered in this basic example (a short sketch of the `verbose_json` output follows after this breakdown).
- Main Execution: The script transcribes the audio, passes the text to the highlighting function, and prints the resulting list of key moments.
- Use Case Relevance: This addresses the "Highlight Key Moments" capability by showing how AI can quickly sift through potentially long recordings to surface the most critical parts. This is highly valuable for efficient review of meetings, interviews, or lectures, allowing users to focus on what matters most without listening to the entire audio.
For testing purposes, use an audio file that contains a relevant discussion with clear, identifiable key segments (you can use the sample audio file provided).
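To attach approximate timestamps to the highlighted moments, as noted in the breakdown above, one option is to request Whisper's `verbose_json` output and keep the segment times for later matching. The snippet below is a minimal sketch under that assumption; it reuses the same sample file name, and segment fields may be exposed slightly differently across SDK versions.
import os
from openai import OpenAI

client = OpenAI()  # Assumes OPENAI_API_KEY is set in the environment
audio_path = "key_discussion.mp3"  # Same sample file as the script above

with open(audio_path, "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",  # Includes segment-level start/end times
    )

# Each segment carries start/end times (in seconds) plus its text.
# (In some SDK versions segments are plain dicts; adjust attribute access if needed.)
for segment in result.segments:
    print(f"[{segment.start:7.2f}s - {segment.end:7.2f}s] {segment.text.strip()}")
A simple follow-up step is to search these segment texts for each quote GPT-4o extracts and report the start time of the best-matching segment as that highlight's timestamp.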
2.3.3 Real-World Use Cases
The modern business landscape increasingly relies on audio communication across various sectors, from sales and customer service to education and personal development. Understanding and effectively utilizing these audio interactions has become crucial for organizations seeking to improve their operations, enhance customer relationships, and drive better outcomes. This section explores several key applications where advanced audio processing and analysis can create significant value, demonstrating how AI-powered tools can transform raw audio data into actionable insights.
From analyzing sales conversations to enhancing educational experiences, these use cases showcase the versatility and power of audio understanding technologies in addressing real-world challenges. Each application represents a unique opportunity to leverage voice data for improved decision-making, process optimization, and better user experiences.
1. Sales Enablement
Advanced analysis of sales call recordings provides a comprehensive toolkit for sales teams to optimize their performance. The system can identify key objections raised by prospects, allowing teams to develop better counter-arguments and prepare responses in advance. It tracks successful closing techniques by analyzing patterns in successful deals, revealing which approaches work best for different customer segments and situations.
The system also measures crucial metrics like conversion rates, call duration, talk-to-listen ratios, and key phrase usage. This data helps sales teams understand which behaviors correlate with successful outcomes. By analyzing customer responses and reaction patterns, teams can refine their pitch timing, improve their questioning techniques, and better understand buying signals.
This technology also enables sales managers to document and share effective approaches across the team, creating a knowledge base of best practices for common challenges. This institutional knowledge can be particularly valuable for onboarding new team members and maintaining consistent sales excellence across the organization.
2. Meeting Intelligence
Comprehensive meeting analysis transforms how organizations capture and utilize meeting content. The system goes beyond basic transcription by:
- Identifying and categorizing key discussion points for easy reference
- Automatically detecting and extracting action items from conversations
- Assigning responsibilities to specific team members based on verbal commitments
- Creating structured timelines and tracking deadlines mentioned during meetings
- Generating automated task lists with clear ownership and due dates
- Highlighting decision points and meeting outcomes
- Providing searchable meeting archives for future reference
The system employs advanced natural language processing to understand context, relationships, and commitments expressed during conversations. This enables automatic task creation and assignment, ensuring nothing falls through the cracks. Integration with project management tools allows for seamless workflow automation, while smart reminders help keep team members accountable for their commitments.
3. Customer Support
Deep analysis of customer service interactions provides comprehensive insights into customer experience and support team performance. The system can:
- Evaluate customer sentiment in real-time by analyzing tone, word choice, and conversation flow
- Automatically categorize and prioritize urgent issues based on keyword detection and context analysis
- Generate detailed satisfaction metrics through conversation analysis and customer feedback
- Track key performance indicators like first-response time and resolution time
- Identify common pain points and recurring issues across multiple interactions
- Monitor support agent performance and consistency in service delivery
This enables support teams to improve response times, identify trending problems, and maintain consistent service quality across all interactions. The system can also provide automated coaching suggestions for support agents and generate insights for product improvement based on customer feedback patterns.
4. Personal Journaling
Transform voice memos into structured reflections with emotional context analysis. Using advanced natural language processing, the system analyzes voice recordings to detect emotional states, stress levels, and overall sentiment through tone of voice, word choice, and speaking patterns. This creates a rich, multi-dimensional journal entry that captures not just what was said, but how it was expressed.
The system's mood tracking capabilities go beyond simple positive/negative classifications, identifying nuanced emotional states like excitement, uncertainty, confidence, or concern. By analyzing these patterns over time, users can gain valuable insights into their emotional well-being and identify triggers or patterns that affect their mental state.
For personal goal tracking, the system can automatically categorize and tag mentions of objectives, progress updates, and setbacks. It can generate progress reports showing momentum toward specific goals, highlight common obstacles, and even suggest potential solutions based on past successful strategies. The behavioral trend analysis examines patterns in decision-making, habit formation, and personal growth, providing users with actionable insights for self-improvement.
5. Education & Language Practice
Comprehensive language learning support revolutionizes how students practice and improve their language skills. The system provides several key benefits:
- Speech Analysis: Advanced algorithms analyze pronunciation patterns, detecting subtle variations in phonemes, stress patterns, and intonation. This helps learners understand exactly where their pronunciation differs from native speakers.
- Error Detection: The system identifies not just pronunciation errors, but also grammatical mistakes, incorrect word usage, and syntactical issues in real-time. This immediate feedback helps prevent the formation of bad habits.
- Personalized Feedback: Instead of generic corrections, the system provides context-aware feedback that considers the learner's proficiency level, native language, and common interference patterns specific to their language background.
- Progress Tracking: Sophisticated metrics track various aspects of language development, including vocabulary range, speaking fluency, grammar accuracy, and pronunciation improvement over time. Visual progress reports help motivate learners and identify areas needing focus.
- Adaptive Learning: Based on performance analysis, the system creates customized exercise plans targeting specific weaknesses. These might include focused pronunciation drills, grammar exercises, or vocabulary building activities tailored to the learner's needs.
The system can track improvement over time and suggest targeted exercises for areas needing improvement, creating a dynamic and responsive learning environment that adapts to each student's progress.
2.3.4 Privacy Considerations
Privacy is paramount when handling audio recordings. First and foremost, obtaining consent before analyzing third-party voice recordings is a crucial legal and ethical requirement. It's essential to secure written or documented permission from all participants before processing any voice recordings, whether they're from meetings, interviews, calls, or other audio content involving third parties. Organizations should implement a formal consent process that clearly outlines how the audio will be used and analyzed.
Security measures must be implemented throughout the processing workflow. If audio is uploaded to OpenAI's servers (for example, via the Files API), delete it once analysis is complete, for instance with `client.files.delete(file_id)` in the current Python SDK. This practice minimizes data exposure and helps prevent unauthorized access and potential data breaches. Organizations should establish automated cleanup procedures to ensure consistent deletion of processed files.
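A minimal sketch of that cleanup step is shown below. It assumes the audio was uploaded through the Files API (so a `file_id` exists) and that a local copy should also be removed; both names are placeholders.
import os
from openai import OpenAI, OpenAIError

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def cleanup_audio(file_id: str | None, local_path: str | None) -> None:
    """Remove an uploaded audio file from OpenAI storage and delete the local copy."""
    if file_id:
        try:
            client.files.delete(file_id)  # Remove the uploaded file from OpenAI's servers
            print(f"Deleted remote file: {file_id}")
        except OpenAIError as e:
            print(f"Could not delete remote file {file_id}: {e}")
    if local_path and os.path.exists(local_path):
        os.remove(local_path)  # Delete the local recording once processing is done
        print(f"Deleted local file: {local_path}")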
Long-term storage of voice data requires special consideration. Never store sensitive voice recordings without explicit approval from all parties involved. Organizations should implement strict data handling policies that clearly specify storage duration, security measures, and intended use. Extra caution should be taken with recordings containing personal information, business secrets, or confidential discussions. Best practices include implementing encryption for stored audio files and maintaining detailed access logs.
Semantic Understanding
Example:
Since the standard OpenAI API interaction for this typically involves first converting speech to text (using Whisper) and then analyzing that text for deeper meaning (using GPT-4o), the code example will demonstrate this two-step process.
This script will:
- Transcribe an audio file containing potentially nuanced language using Whisper.
- Send the transcribed text to GPT-4o with a prompt asking for semantic interpretation.
Download the audio sample: https://files.cuantum.tech/audio/idiom_speech.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location context used for logging (illustrative values)
current_timestamp = "2025-04-21 19:37:00 CDT"
current_location = "Dallas, Texas, United States"
print(f"Running GPT-4o semantic speech understanding example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with nuanced speech
# IMPORTANT: Replace 'idiom_speech.mp3' with the actual filename.
# Good examples for audio content: "Wow, that presentation just knocked my socks off!",
# "Sure, I'd LOVE to attend another three-hour meeting.", "He really spilled the beans."
audio_file_path = "idiom_speech.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Text for Semantic Meaning using GPT-4o ---
def analyze_text_meaning(client, text_to_analyze):
"""Sends transcribed text to GPT-4o for semantic analysis."""
print(f"\nStep 2: Analyzing text for semantic meaning: \"{text_to_analyze}\"")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Construct prompt to ask for deeper meaning
system_prompt = "You are an expert in linguistics and communication."
user_prompt = (
f"Analyze the following phrase or sentence:\n\n'{text_to_analyze}'\n\n"
"Explain its likely intended meaning, considering context, idioms, "
"metaphors, sarcasm, humor, cultural references, or other nuances. "
"Go beyond a literal, word-for-word interpretation."
)
try:
print("Sending text to GPT-4o for analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for its strong understanding capabilities
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=250, # Adjust as needed
temperature=0.5 # Lower temperature for more focused analysis
)
analysis = response.choices[0].message.content
print("Semantic analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio
transcribed_text = transcribe_speech(client, audio_file_path)
if transcribed_text:
print(f"\nTranscription Result: {transcribed_text}")
# Step 2: Analyze the transcription for meaning
semantic_analysis = analyze_text_meaning(client, transcribed_text)
if semantic_analysis:
print("\n--- Semantic Analysis Result ---")
print(semantic_analysis)
print("--------------------------------\n")
print("This demonstrates GPT-4o understanding nuances beyond literal text.")
else:
print("\nSemantic analysis failed.")
else:
print("\nTranscription failed, cannot proceed to analysis.")
Code breakdown:
- Context: This code demonstrates GPT-4o's advanced semantic understanding of speech. It goes beyond simple transcription by interpreting the meaning, including nuances like idioms, sarcasm, or context-dependent phrases.
- Two-Step Process: The example uses a standard two-step API approach:
  - Step 1 (Whisper): The audio file is first converted into text using the Whisper API (`client.audio.transcriptions.create`). This captures the spoken words accurately.
  - Step 2 (GPT-4o): The transcribed text is then sent to the GPT-4o model (`client.chat.completions.create`) with a specific prompt asking it to analyze the meaning behind the words, considering non-literal interpretations.
- Prerequisites: Requires the standard `openai` and `python-dotenv` setup, an API key, and, crucially, an audio file containing speech that has some nuance (e.g., an idiom like "spill the beans", a sarcastic remark like "Oh great, another meeting", or a culturally specific phrase).
- Transcription Function (`transcribe_speech`): This function handles Step 1, taking the audio file path and returning the plain text transcription from Whisper.
- Semantic Analysis Function (`analyze_text_meaning`):
  - This function handles Step 2. It takes the transcribed text.
  - Prompt Design: It constructs a prompt specifically asking GPT-4o to act as a linguistic expert and explain the intended meaning, considering idioms, sarcasm, context, etc., explicitly requesting analysis beyond the literal interpretation.
  - Uses `gpt-4o` as the model for its strong reasoning and understanding capabilities.
  - Returns the analysis provided by GPT-4o.
- Main Execution: The script first transcribes the audio. If successful, it passes the text to the analysis function. Finally, it prints both the literal transcription and GPT-4o's semantic interpretation.
- Use Case Relevance: This example clearly shows how combining Whisper and GPT-4o allows for a deeper understanding of spoken language than transcription alone. It demonstrates the capability described – comprehending idioms ("raining cats and dogs"), sarcasm, humor, and context – making AI interaction more aligned with human communication.
Remember to use an audio file containing non-literal language for testing to best showcase the semantic analysis step. Replace `'idiom_speech.mp3'` with your actual file path.
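The two-step pattern above is the most broadly compatible approach, but OpenAI also offers audio-capable chat models that accept an audio file directly alongside a text prompt, collapsing the workflow into a single call. The sketch below is only an illustration: the model name (`gpt-4o-audio-preview`) and the `input_audio` content part reflect the API at the time of writing and should be verified against current documentation.
import base64
from openai import OpenAI

client = OpenAI()  # Assumes OPENAI_API_KEY is set in the environment

# Read and base64-encode the same nuanced-speech sample used above
with open("idiom_speech.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",   # Audio-capable chat model (name may change over time)
    modalities=["text"],            # Only a text analysis is requested back
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Explain the intended meaning of what the speaker says, "
                         "including any idioms or sarcasm."},
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "mp3"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)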
Contextual Analysis
Interprets statements within their broader context, taking into account surrounding information, previous discussions, cultural references, and situational factors. This includes understanding how time, place, speaker relationships, and prior conversations influence meaning. The analysis considers multiple layers of context:
- Temporal Context: When something is said (time of day, day of week, season, or historical period)
- Social Context: The relationships between speakers, power dynamics, and social norms
- Physical Context: The location and environment where communication occurs
- Cultural Context: Shared knowledge, beliefs, and customs that influence interpretation
For example, the phrase "it's getting late" could mean different things in different contexts:
- During a workday meeting: A polite suggestion to wrap up the discussion
- At a social gathering: An indication that someone needs to leave
- From a parent to a child: A reminder about bedtime
- In a project discussion: Concern about approaching deadlines
GPT-4o analyzes these contextual clues along with additional factors such as tone of voice, speech patterns, and conversation history to provide more accurate and nuanced interpretations of spoken communication. This deep contextual understanding allows the system to capture the true intended meaning behind words, rather than just their literal interpretation.
Example:
This use case focuses on GPT-4o's ability to interpret transcribed speech within its broader context (temporal, social, physical, cultural). Like the semantic understanding example, this typically involves a two-step process: transcribing the speech with Whisper, then analyzing the text with GPT-4o, but this time explicitly providing contextual information to GPT-4o.
This code example will:
- Transcribe a simple, context-dependent phrase from an audio file using Whisper.
- Send the transcribed text to GPT-4o multiple times, each time providing a different context description.
- Show how GPT-4o's interpretation of the same phrase changes based on the provided context.
Download the sample audio: https://files.cuantum.tech/audio/context_phrase.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location context used for logging (illustrative values)
current_timestamp = "2025-02-11 11:44:00 CDT"
current_location = "Miami, Florida, United States"
print(f"Running GPT-4o contextual speech analysis example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with the context-dependent phrase
# IMPORTANT: Replace 'context_phrase.mp3' with the actual filename.
# The audio content should ideally be just "It's getting late."
audio_file_path = "context_phrase.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from the previous example (gpt4o_speech_semantic_py)
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Text for Meaning WITHIN a Given Context using GPT-4o ---
def analyze_text_with_context(client, text_to_analyze, context_description):
"""Sends transcribed text and context description to GPT-4o for analysis."""
print(f"\nStep 2: Analyzing text \"{text_to_analyze}\" within context...")
print(f"Context Provided: {context_description}")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
if not context_description:
print("Error: Context description must be provided for this analysis.")
return None
# Construct prompt asking for interpretation based on context
system_prompt = "You are an expert in analyzing communication and understanding context."
user_prompt = (
f"Consider the phrase: '{text_to_analyze}'\n\n"
f"Now, consider the specific context in which it was said: '{context_description}'\n\n"
"Based *only* on this context, explain the likely intended meaning, implication, "
"or function of the phrase in this situation."
)
try:
print("Sending text and context to GPT-4o for analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong contextual reasoning
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=200, # Adjust as needed
temperature=0.3 # Lower temperature for more focused contextual interpretation
)
analysis = response.choices[0].message.content
print("Contextual analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio phrase
transcribed_phrase = transcribe_speech(client, audio_file_path)
if transcribed_phrase:
print(f"\nTranscription Result: \"{transcribed_phrase}\"")
# Define different contexts for the same phrase
contexts = [
"Said during a business meeting scheduled to end at 5:00 PM, spoken at 4:55 PM.",
"Said by a guest at a social party around 1:00 AM.",
"Said by a parent to a young child at 9:00 PM on a school night.",
"Said during a critical project discussion about an upcoming deadline, spoken late in the evening.",
"Said by someone looking out the window on a short winter afternoon."
]
print("\n--- Analyzing Phrase in Different Contexts ---")
# Step 2: Analyze the phrase within each context
for i, context in enumerate(contexts):
print(f"\n--- Analysis for Context {i+1} ---")
contextual_meaning = analyze_text_with_context(
client,
transcribed_phrase,
context
)
if contextual_meaning:
print(f"Meaning in Context: {contextual_meaning}")
else:
print("Contextual analysis failed for this context.")
print("------------------------------------")
print("\nThis demonstrates how GPT-4o interprets the same phrase differently based on provided context.")
else:
print("\nTranscription failed, cannot proceed to contextual analysis.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability for contextual analysis of speech. It shows how the interpretation of a spoken phrase can change dramatically depending on the surrounding situation (temporal, social, situational factors).
- Two-Step Process with Context Injection:
  - Step 1 (Whisper): The audio file containing a context-dependent phrase (e.g., "It's getting late.") is transcribed into text using `client.audio.transcriptions.create`.
  - Step 2 (GPT-4o): The transcribed text is then sent to GPT-4o (`client.chat.completions.create`), but crucially, the prompt now includes a description of the specific context in which the phrase was spoken.
- Prerequisites: Requires the standard `openai` and `python-dotenv` setup, an API key, and an audio file containing a simple phrase whose meaning heavily depends on context (the example uses "It's getting late.").
- Transcription Function (`transcribe_speech`): This function (reused from the previous example) handles Step 1.
- Contextual Analysis Function (`analyze_text_with_context`):
  - This function handles Step 2 and now accepts an additional argument: `context_description`.
  - Prompt Design: The prompt explicitly provides both the transcribed phrase and the `context_description` to GPT-4o, asking it to interpret the phrase within that specific situation.
  - Uses `gpt-4o` for its ability to reason based on provided context.
- Demonstrating Context Dependency (Main Execution):
  - The script first transcribes the phrase (e.g., "It's getting late.").
  - It then defines a list of different context descriptions (meeting ending, late-night party, bedtime, project deadline, short winter day).
  - It calls the `analyze_text_with_context` function repeatedly, using the same transcribed phrase but providing a different context description each time.
  - By printing the analysis result for each context, the script clearly shows how GPT-4o's interpretation shifts based on the context provided (e.g., suggesting wrapping up vs. indicating tiredness vs. noting dwindling daylight).
- Use Case Relevance: This highlights GPT-4o's sophisticated understanding, moving beyond literal words to grasp intended meaning influenced by temporal, social, and situational factors. This is vital for applications needing accurate interpretation of real-world communication in business, social interactions, or any context-rich environment. It shows how developers can provide relevant context alongside transcribed text to get more accurate and nuanced interpretations from the AI.
For testing this code effectively, either create an audio file containing just the phrase "It's getting late" (or another context-dependent phrase), or download the provided sample file. Remember to update the `'context_phrase.mp3'` path to match your file location.
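Context does not have to arrive as a written description; in a real assistant it often arrives as the preceding conversation turns. The sketch below shows that variation under the same two-step assumptions; the prior messages here are invented purely for illustration.
from openai import OpenAI

client = OpenAI()  # Assumes OPENAI_API_KEY is set in the environment

def analyze_with_history(client, phrase, history):
    """Interpret a transcribed phrase using prior conversation turns as its context."""
    messages = [
        {"role": "system",
         "content": "You are an expert in analyzing communication and understanding context."},
        # Prior turns supply the situational context instead of a written description.
        *history,
        {"role": "user",
         "content": f"Given the conversation so far, what does the speaker most likely "
                     f"mean by: '{phrase}'?"},
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=200,
        temperature=0.3,
    )
    return response.choices[0].message.content.strip()

# Hypothetical prior turns from a project stand-up
history = [
    {"role": "user", "content": "We still have three open tickets before Friday's release."},
    {"role": "assistant", "content": "Understood. The release cutoff is Friday at noon."},
]
print(analyze_with_history(client, "It's getting late.", history))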
Summary Generation
GPT-4o's summary generation capabilities represent a significant advancement in AI-powered content analysis. The system creates concise, meaningful summaries of complex discussions by intelligently distilling key information from lengthy conversations, meetings, or presentations. Using advanced natural language processing and contextual understanding, GPT-4o can identify main themes, critical points, and essential takeaways while maintaining the core meaning and context of the original discussion.
The system employs several sophisticated techniques:
- Pattern Recognition: Identifies recurring themes and important discussion points across long conversations
- Contextual Analysis: Understands the broader context and relationships between different parts of the discussion
- Priority Detection: Automatically determines which information is most crucial for the summary
- Semantic Understanding: Captures underlying meanings and implications beyond just surface-level content
The generated summaries can be customized for different purposes and audiences:
- Executive Briefings: Focused on strategic insights and high-level decisions
- Meeting Minutes: Detailed documentation of discussions and action items
- Quick Overviews: Condensed highlights for rapid information consumption
- Technical Summaries: Emphasis on specific technical details and specifications
What sets GPT-4o apart is its ability to preserve important details while significantly reducing information overload, making it an invaluable tool for modern business communication and knowledge management.
Example:
This example focuses on GPT-4o's ability to generate concise and meaningful summaries from potentially lengthy spoken content obtained via Whisper.
This involves the familiar two-step process: first, transcribing the audio with Whisper to get the full text, and second, using GPT-4o's language understanding capabilities to analyze and summarize that text according to specific needs. This example will demonstrate generating different types of summaries from the same transcription.
Download the sample audio: https://files.cuantum.tech/audio/discussion_audio.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location context used for logging (illustrative values)
current_timestamp = "2025-04-10 15:59:00 CDT"
current_location = "Houston, Texas, United States"
print(f"Running GPT-4o speech summarization example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'discussion_audio.mp3' with the actual filename.
audio_file_path = "discussion_audio.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Add note about chunking for long files if size check implemented
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before summarization.")
except OSError:
pass # Ignore size check error, proceed with transcription attempt
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Generate Summary from Text using GPT-4o ---
def summarize_text(client, text_to_summarize, summary_style="concise overview"):
"""Sends transcribed text to GPT-4o for summarization."""
print(f"\nStep 2: Generating '{summary_style}' summary...")
if not text_to_summarize:
print("Error: No text provided for summarization.")
return None
# Tailor the prompt based on the desired summary style
system_prompt = "You are an expert meeting summarizer and information distiller."
user_prompt = f"""Please generate a {summary_style} of the following discussion transcription.
Focus on accurately capturing the key information relevant to a {summary_style}. For example:
- For an 'executive briefing', focus on strategic points, decisions, and outcomes.
- For 'detailed meeting minutes', include main topics, key arguments, decisions, and action items.
- For a 'concise overview', provide the absolute main points and purpose.
- For a 'technical summary', emphasize technical details, specifications, or findings.
Transcription Text:
---
{text_to_summarize}
---
Generate the {summary_style}:
"""
try:
print(f"Sending text to GPT-4o for {summary_style}...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong summarization
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=400, # Adjust based on expected summary length
temperature=0.5 # Balance creativity and focus
)
summary = response.choices[0].message.content
print(f"'{summary_style}' generation successful.")
return summary.strip()
except OpenAIError as e:
print(f"OpenAI API Error during summarization: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during summarization: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("--------------------------")
# Step 2: Generate summaries in different styles
summary_styles_to_generate = [
"concise overview",
"detailed meeting minutes with action items",
"executive briefing focusing on decisions",
# "technical summary" # Add if relevant to your audio content
]
print("\n--- Generating Summaries ---")
for style in summary_styles_to_generate:
print(f"\n--- Summary Style: {style} ---")
summary_result = summarize_text(
client,
full_transcription,
summary_style=style
)
if summary_result:
print(summary_result)
else:
print(f"Failed to generate '{style}'.")
print("------------------------------------")
print("\nThis demonstrates GPT-4o generating different summaries from the same transcription based on the prompt.")
else:
print("\nTranscription failed, cannot proceed to summarization.")
Code breakdown:
- Context: This code demonstrates GPT-4o's advanced capability for summary generation from spoken content. It leverages the two-step process: transcribing audio with Whisper and then using GPT-4o to intelligently distill the key information from the transcription into a concise summary.
- Handling Lengthy Audio (Crucial Note): The prerequisites and code comments explicitly address the 25MB limit of the Whisper API. For real-world long meetings or presentations, the audio must be chunked, each chunk transcribed separately, and the resulting texts concatenated before being passed to the summarization step (see the chunking sketch after this breakdown). The code example itself processes a single audio file for simplicity but highlights this essential workflow for longer content.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file representing the discussion to be summarized (`discussion_audio.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1, converting the input audio (or audio chunk) into plain text using Whisper.
- Summarization Function (`summarize_text`):
  - Handles Step 2, taking the full transcribed text as input.
  - Customizable Summaries: Accepts a `summary_style` argument (e.g., "executive briefing", "detailed meeting minutes").
  - Prompt Engineering: The prompt sent to GPT-4o is dynamically constructed based on the requested `summary_style`. It instructs GPT-4o to act as an expert summarizer and tailor the output (focusing on strategic points, action items, technical details, etc.) according to the desired style.
  - Uses `gpt-4o` for its advanced understanding and summarization skills.
- Demonstrating Different Summary Types (Main Execution):
  - The script first gets the full transcription.
  - It then defines a list of different `summary_styles_to_generate`.
  - It calls the `summarize_text` function multiple times, passing the same full transcription each time but varying the `summary_style` argument.
  - By printing each resulting summary, the script clearly shows how GPT-4o adapts the level of detail and focus based on the prompt, generating distinct outputs (e.g., a brief overview vs. detailed minutes) from the identical source text.
- Use Case Relevance: This directly addresses the "Summary Generation" capability. It shows how combining Whisper and GPT-4o can transform lengthy spoken discussions into various useful formats (executive briefings, meeting minutes, quick overviews), saving time and improving knowledge management in business, education, and content creation.
Key Point Extraction
Identifies and highlights crucial information by leveraging GPT-4o's advanced natural language processing capabilities. Through sophisticated algorithms and contextual understanding, the model analyzes spoken content to extract meaningful insights. The model can:
- Extract core concepts and main arguments from spoken content - This involves identifying the fundamental ideas, key messages, and supporting evidence presented in conversations, presentations, or discussions. The model distinguishes between primary and secondary points, ensuring that essential information is captured.
- Identify critical decision points and action items - By analyzing conversation flow and context, GPT-4o recognizes moments when decisions are made, commitments are established, or tasks are assigned. This includes detecting both explicit assignments ("John will handle this") and implicit ones ("We should look into this further").
- Prioritize information based on context and relevance - The model evaluates the significance of different pieces of information within their specific context, considering factors such as urgency, impact, and relationship to overall objectives. This helps in creating hierarchical summaries that emphasize what matters most.
- Track key themes and recurring topics across conversations - GPT-4o maintains awareness of discussion patterns, identifying when certain subjects resurface and how they evolve over time. This capability is particularly valuable for long-term project monitoring or tracking ongoing concerns across multiple meetings.
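When the extracted decisions and action items need to feed another tool (a task tracker, for example), the same kind of extraction prompt can request JSON output via the Chat Completions JSON mode. The brief sketch below, separate from the full walkthrough that follows, illustrates that idea; the schema shown is an assumption for illustration rather than a fixed format.
import json
from openai import OpenAI

client = OpenAI()  # Assumes OPENAI_API_KEY is set in the environment

def extract_action_items_json(transcript_text: str) -> dict:
    """Ask GPT-4o for key points, decisions, and action items as machine-readable JSON."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # JSON mode: forces valid JSON output
        messages=[
            {"role": "system",
             "content": "You extract information from meeting transcripts and reply only in JSON."},
            {"role": "user",
             "content": (
                 "Return a JSON object with keys 'key_points', 'decisions', and "
                 "'action_items' (each action item should have 'task', 'owner', 'deadline'). "
                 f"Transcript:\n{transcript_text}"
             )},
        ],
        temperature=0.2,
    )
    return json.loads(response.choices[0].message.content)

# Example usage (assumes a transcript string is already available):
# data = extract_action_items_json(full_transcription)
# for item in data.get("action_items", []):
#     print(item.get("owner"), "->", item.get("task"))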
Example:
This example focuses on using GPT-4o to extract specific, crucial information—key points, decisions, action items—from transcribed speech, going beyond a general summary.
This again uses the two-step approach: Whisper transcribes the audio, and then GPT-4o analyzes the text based on a prompt designed for extraction.
Download the audio sample: https://files.cuantum.tech/audio/meeting_for_extraction.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Example timestamp and location context used for logging (illustrative values)
current_timestamp = "2025-03-21 22:07:00 CDT"
current_location = "Austin, Texas, United States"
print(f"Running GPT-4o key point extraction from speech example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'meeting_for_extraction.mp3' with the actual filename.
audio_file_path = "meeting_for_extraction.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Add note about chunking for long files if size check implemented
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before extraction.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Extract Key Points, Decisions, Actions using GPT-4o ---
def extract_key_points(client, text_to_analyze):
"""Sends transcribed text to GPT-4o for key point extraction."""
print("\nStep 2: Extracting key points, decisions, and actions...")
if not text_to_analyze:
print("Error: No text provided for extraction.")
return None
# Prompt designed specifically for extraction
system_prompt = "You are an expert meeting analyst. Your task is to carefully read the provided transcript and extract specific types of information."
user_prompt = f"""Analyze the following meeting or discussion transcription. Identify and extract the following information, presenting each under a clear heading:
1. **Key Points / Core Concepts:** List the main topics, arguments, or fundamental ideas discussed.
2. **Decisions Made:** List any clear decisions that were reached during the discussion.
3. **Action Items:** List specific tasks assigned to individuals or the group. If possible, note who is responsible and any mentioned deadlines.
If any category has no relevant items, state "None identified".
Transcription Text:
---
{text_to_analyze}
---
Extracted Information:
"""
try:
print("Sending text to GPT-4o for extraction...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong analytical capabilities
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=600, # Adjust based on expected length of extracted info
temperature=0.2 # Lower temperature for more factual extraction
)
extracted_info = response.choices[0].message.content
print("Extraction successful.")
return extracted_info.strip()
except OpenAIError as e:
print(f"OpenAI API Error during extraction: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during extraction: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("------------------------------------")
# Step 2: Extract Key Information
extracted_details = extract_key_points(
client,
full_transcription
)
if extracted_details:
print("\n--- Extracted Key Information ---")
print(extracted_details)
print("---------------------------------")
print("\nThis demonstrates GPT-4o identifying and structuring key takeaways from the discussion.")
else:
print("\nFailed to extract key information.")
else:
print("\nTranscription failed, cannot proceed to key point extraction.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability for Key Point Extraction from spoken content. After transcribing audio using Whisper, GPT-4o analyzes the text to identify and isolate crucial information like core concepts, decisions made, and action items assigned.
- Two-Step Process: Like summarization, this relies on:
  - Step 1 (Whisper): Transcribing the audio (`client.audio.transcriptions.create`) to get the full text. The critical note about handling audio files larger than 25MB via chunking and concatenation still applies (see the chunking sketch after this code breakdown).
  - Step 2 (GPT-4o): Analyzing the complete transcription using `client.chat.completions.create` with a prompt specifically designed for extraction.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file from a meeting or discussion where key information is likely present (`meeting_for_extraction.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1, returning the plain text transcription.
- Extraction Function (`extract_key_points`):
  - Handles Step 2, taking the full transcription text.
  - Prompt Engineering for Extraction: This is key. The prompt explicitly instructs GPT-4o to act as an analyst and extract information under specific headings: "Key Points / Core Concepts," "Decisions Made," and "Action Items." This structured request guides GPT-4o to identify and categorize the relevant information accurately. A lower `temperature` (e.g., 0.2) is suggested to encourage more factual, less creative output suitable for extraction.
  - Uses `gpt-4o` for its advanced analytical skills.
- Output: The function returns a text string containing the extracted information, ideally structured under the requested headings.
- Main Execution: The script transcribes the audio, then passes the text to the extraction function, and finally prints the structured output.
- Use Case Relevance: This directly addresses the "Key Point Extraction" capability. It shows how AI can automatically process lengthy discussions to pull out the most important concepts, track decisions, and list actionable tasks, saving significant time in reviewing recordings or generating meeting follow-ups. It highlights GPT-4o's ability to understand conversational flow and identify significant moments (decisions, assignments) within the text.
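The chunking note above can be made concrete. Below is a minimal sketch of one way to split a long recording into segments under the 25MB limit and stitch the transcriptions together before running the extraction step. It assumes the pydub library (and ffmpeg) is installed; the ten-minute segment length and the long_meeting.mp3 filename are illustrative choices, not requirements.
import os
from openai import OpenAI
from pydub import AudioSegment  # Assumed dependency for splitting the audio

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def transcribe_long_audio(client, file_path, segment_minutes=10):
    """Splits a long recording into segments and concatenates their transcriptions."""
    audio = AudioSegment.from_file(file_path)
    segment_ms = segment_minutes * 60 * 1000
    transcripts = []
    for i, start in enumerate(range(0, len(audio), segment_ms)):
        chunk_path = f"chunk_{i}.mp3"
        audio[start:start + segment_ms].export(chunk_path, format="mp3")
        with open(chunk_path, "rb") as chunk_file:
            text = client.audio.transcriptions.create(
                model="whisper-1",
                file=chunk_file,
                response_format="text"
            )
        transcripts.append(text)
        os.remove(chunk_path)  # Clean up the temporary chunk
    return " ".join(transcripts)

# Example usage (hypothetical file name):
# full_transcription = transcribe_long_audio(client, "long_meeting.mp3")
The concatenated text can then be passed to the extraction function exactly as in the example above.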
Emotional Intelligence
Detects tone, sentiment, and emotional undertones in spoken communication through GPT-4o's advanced natural language processing capabilities. This sophisticated system performs deep analysis of speech patterns and contextual elements to understand the emotional layers of communication. The model can identify subtle emotional cues such as:
- Voice inflections and patterns that indicate excitement, hesitation, or concern - Including pitch variations, speech rhythm changes, and vocal stress patterns that humans naturally use to convey emotions
- Changes in speaking tempo and volume that suggest emotional states - For example, rapid speech might indicate excitement or anxiety, while slower speech could suggest thoughtfulness or uncertainty
- Contextual emotional markers like laughter, sighs, or pauses - The model recognizes non-verbal sounds and silence that carry significant emotional meaning in conversation
- Cultural and situational nuances that affect emotional expression - Understanding how different cultures express emotions differently and how context influences emotional interpretation
This emotional awareness enables GPT-4o to provide more nuanced and context-appropriate responses, making it particularly valuable for applications in customer service (where understanding customer frustration or satisfaction is crucial), therapeutic conversations (where emotional support and understanding are paramount), and personal coaching (where motivation and emotional growth are key objectives). The system's ability to detect these subtle emotional signals allows for more empathetic and effective communication across various professional and personal contexts.
Example:
This example explores using GPT-4o for "Emotional Intelligence" – detecting tone, sentiment, and emotional undertones in speech.
It's important to understand how this works with current standard OpenAI APIs. While GPT-4o excels at understanding emotion from text, directly analyzing acoustic features such as pitch, tone variance, tempo, sighs, or laughter is not a primary function of the standard Whisper transcription endpoint or of the Chat Completions API when it processes transcribed text.
Therefore, the most practical way to demonstrate this concept using these APIs is a two-step process:
- Transcribe Speech to Text: Use Whisper to get the words spoken.
- Analyze Text for Emotion: Use GPT-4o to analyze the transcribed text for indicators of emotion, sentiment, or tone based on word choice, phrasing, and context described in the text.
Download the sample audio: https://files.cuantum.tech/audio/emotional_speech.mp3
This code example implements this two-step, text-based analysis approach.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Optional context values for the log output below (example placeholders)
current_timestamp = "2025-03-21 20:13:00 CDT"
current_location = "Atlanta, Georgia, United States"
print(f"Running GPT-4o speech emotion analysis (text-based) example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with potentially emotional speech
# IMPORTANT: Replace 'emotional_speech.mp3' with the actual filename.
audio_file_path = "emotional_speech.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Transcribed Text for Emotion/Sentiment using GPT-4o ---
def analyze_text_emotion(client, text_to_analyze):
"""
Sends transcribed text to GPT-4o for emotion and sentiment analysis.
Note: This analyzes the text content, not acoustic features of the original audio.
"""
print("\nStep 2: Analyzing transcribed text for emotion/sentiment...")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Prompt designed for text-based emotion/sentiment analysis
system_prompt = "You are an expert in communication analysis, skilled at detecting sentiment, tone, and potential underlying emotions from text."
user_prompt = f"""Analyze the following text for emotional indicators:
Text:
---
{text_to_analyze}
---
Based *only* on the words, phrasing, and punctuation in the text provided:
1. What is the overall sentiment (e.g., Positive, Negative, Neutral, Mixed)?
2. What is the likely emotional tone (e.g., Frustrated, Excited, Calm, Anxious, Sarcastic, Happy, Sad)?
3. Are there specific words or phrases that indicate these emotions? Explain briefly.
Provide the analysis:
"""
try:
print("Sending text to GPT-4o for emotion analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for nuanced understanding
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=300, # Adjust as needed
temperature=0.4 # Slightly lower temp for more grounded analysis
)
analysis = response.choices[0].message.content
print("Emotion analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio
transcribed_text = transcribe_speech(client, audio_file_path)
if transcribed_text:
print(f"\n--- Transcription Result ---")
print(transcribed_text)
print("----------------------------")
# Step 2: Analyze the transcription for emotion/sentiment
emotion_analysis = analyze_text_emotion(
client,
transcribed_text
)
if emotion_analysis:
print("\n--- Emotion/Sentiment Analysis (from Text) ---")
print(emotion_analysis)
print("----------------------------------------------")
print("\nNote: This analysis is based on the transcribed text content. It does not directly analyze acoustic features like tone of voice from the original audio.")
else:
print("\nEmotion analysis failed.")
else:
print("\nTranscription failed, cannot proceed to emotion analysis.")
# --- End of Code Example ---
Code breakdown:
- Context: This code demonstrates how GPT-4o can be used to infer emotional tone and sentiment from spoken language. It utilizes a two-step process common for this type of analysis with current APIs.
- Two-Step Process & Limitation:
  - Step 1 (Whisper): The audio is first transcribed into text using `client.audio.transcriptions.create`.
  - Step 2 (GPT-4o): The resulting text is then analyzed by GPT-4o (`client.chat.completions.create`) using a prompt specifically designed to identify sentiment and emotional indicators within the text.
  - Important Limitation: This method analyzes the linguistic content (words, phrasing) provided by Whisper. It does not directly analyze acoustic features of the original audio like pitch, tempo, or specific non-verbal sounds (sighs, laughter) unless those happen to be transcribed by Whisper (which is often not the case for subtle cues). True acoustic emotion detection would require different tools or APIs.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file where the speaker's words might suggest an emotion (`emotional_speech.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1, returning plain text.
- Emotion Analysis Function (`analyze_text_emotion`):
  - Handles Step 2, taking the transcribed text.
  - Prompt Design: The prompt explicitly asks GPT-4o to analyze the provided text for overall sentiment (Positive/Negative/Neutral), likely emotional tone (Frustrated, Excited, etc.), and supporting textual evidence. It clarifies the analysis should be based only on the text.
  - Uses `gpt-4o` for its sophisticated language understanding.
- Output: The function returns GPT-4o's textual analysis of the inferred emotion and sentiment.
- Main Execution: The script transcribes the audio, passes the text for analysis, prints both results, and reiterates the limitation regarding acoustic features.
- Use Case Relevance: While not analyzing acoustics directly, this text-based approach is still valuable for applications like customer service (detecting frustration/satisfaction from word choice), analyzing feedback, or getting a general sense of sentiment from spoken interactions, complementing other forms of analysis. It showcases GPT-4o's ability to interpret emotional language.
Remember to use an audio file where the spoken words convey some emotion for this example to be effective. Replace 'emotional_speech.mp3' with your file path.
Implicit Understanding
GPT-4o demonstrates remarkable capabilities in understanding the deeper layers of human communication, going far beyond simple word recognition to grasp the intricate nuances of speech. The model's sophisticated comprehension abilities include:
- Detect underlying context and assumptions
- Understands implicit knowledge shared between speakers
- Recognizes unstated but commonly accepted facts within specific domains
- Identifies hidden premises in conversations
- Understand cultural references and idiomatic expressions
- Processes region-specific sayings and colloquialisms
- Recognizes cultural-specific metaphors and analogies
- Adapts understanding based on cultural context
- Interpret rhetorical devices
Example:
Similar to the previous examples involving deeper understanding (Semantic, Contextual, Emotional), this typically uses the two-step approach: Whisper transcribes the words, and then GPT-4o analyzes the resulting text, this time specifically prompted to look for implicit layers.
Download the sample audio: https://files.cuantum.tech/audio/implicit_speech.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Optional context values for the log output below (example placeholders)
current_timestamp = "2025-03-12 16:21:00 CDT"
current_location = "Dallas, Texas, United States"
print(f"Running GPT-4o implicit speech understanding example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with implicit meaning
# IMPORTANT: Replace 'implicit_speech.mp3' with the actual filename.
audio_file_path = "implicit_speech.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Transcribed Text for Implicit Meaning using GPT-4o ---
def analyze_implicit_meaning(client, text_to_analyze):
"""
Sends transcribed text to GPT-4o to analyze implicit meanings,
assumptions, references, or rhetorical devices.
"""
print("\nStep 2: Analyzing transcribed text for implicit meaning...")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Prompt designed for identifying implicit communication layers
system_prompt = "You are an expert analyst of human communication, skilled at identifying meaning that is implied but not explicitly stated."
user_prompt = f"""Analyze the following statement or question:
Statement/Question:
---
{text_to_analyze}
---
Based on common knowledge, cultural context, and conversational patterns, please explain:
1. Any underlying assumptions the speaker might be making.
2. Any implicit meanings or suggestions conveyed beyond the literal words.
3. Any cultural references, idioms, or sayings being used or alluded to.
4. If it's a rhetorical question, what point is likely being made?
Provide a breakdown of the implicit layers of communication present:
"""
try:
print("Sending text to GPT-4o for implicit meaning analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for deep understanding
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=400, # Adjust as needed
temperature=0.5 # Allow for some interpretation
)
analysis = response.choices[0].message.content
print("Implicit meaning analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio
transcribed_text = transcribe_speech(client, audio_file_path)
if transcribed_text:
print(f"\n--- Transcription Result ---")
print(transcribed_text)
print("----------------------------")
# Step 2: Analyze the transcription for implicit meaning
implicit_analysis = analyze_implicit_meaning(
client,
transcribed_text
)
if implicit_analysis:
print("\n--- Implicit Meaning Analysis ---")
print(implicit_analysis)
print("-------------------------------")
print("\nThis demonstrates GPT-4o identifying meaning beyond the literal text, based on common knowledge and context.")
else:
print("\nImplicit meaning analysis failed.")
else:
print("\nTranscription failed, cannot proceed to implicit meaning analysis.")
Code breakdown:
- Context: This code example demonstrates GPT-4o's capability for Implicit Understanding – grasping the unstated assumptions, references, and meanings embedded within spoken language.
- Two-Step Process: It follows the established pattern:
  - Step 1 (Whisper): Transcribe the audio containing the implicitly meaningful speech into text using `client.audio.transcriptions.create`.
  - Step 2 (GPT-4o): Analyze the transcribed text using `client.chat.completions.create`, with a prompt specifically designed to uncover hidden layers of meaning.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file where the meaning relies on shared knowledge, cultural context, or isn't fully literal (e.g., using an idiom, a rhetorical question, or making an assumption clear only through context). `implicit_speech.mp3` is used as the placeholder.
- Transcription Function (`transcribe_speech`): Handles Step 1, returning the plain text transcription.
- Implicit Analysis Function (`analyze_implicit_meaning`):
  - Handles Step 2, taking the transcribed text.
  - Prompt Engineering for Implicit Meaning: The prompt is key here. It instructs GPT-4o to look beyond the literal words and identify underlying assumptions, implied suggestions, cultural references/idioms, and the purpose behind rhetorical questions.
  - Uses `gpt-4o` for its extensive knowledge base and the reasoning ability needed to infer these implicit elements.
- Output: The function returns GPT-4o's textual analysis of the unstated meanings detected in the input text.
- Main Execution: The script transcribes the audio, passes the text for implicit analysis, and prints both the literal transcription and GPT-4o's interpretation of the hidden meanings.
- Use Case Relevance: This demonstrates how GPT-4o can process communication more like a human, understanding not just what was said, but also what was meant or assumed. This is crucial for applications requiring deep comprehension, such as analyzing user feedback, understanding nuanced dialogue in meetings, or interpreting culturally rich content.
Remember to use an audio file containing speech that requires some level of inference or background knowledge to fully understand for testing this code effectively. Replace 'implicit_speech.mp3' with your file path.
From Transcription to Comprehensive Understanding
This advance marks a revolutionary transformation in AI's ability to process human speech. While traditional systems like Whisper excel at transcription - the mechanical process of converting spoken words into written text - modern AI systems like GPT-4o achieve true comprehension, understanding not just the words themselves but their deeper meaning, context, and implications. This leap forward enables AI to process human communication in ways that are remarkably similar to how humans naturally understand conversation, including subtle nuances, implied meanings, and contextual relevance.
To illustrate this transformative evolution in capability, let's examine a detailed example that highlights the stark contrast between simple transcription and advanced comprehension:
- Consider this statement: "I think we should delay the product launch until next quarter." A traditional transcription system like Whisper would perfectly capture these words, but that's where its understanding ends - it simply converts speech to text with high accuracy.
- GPT-4o, however, demonstrates a sophisticated level of understanding that mirrors human comprehension:
- Primary Message Analysis: Beyond just identifying the suggestion to reschedule, it understands this as a strategic proposal that requires careful consideration
- Business Impact Evaluation: Comprehensively assesses how this delay would affect various aspects of the business, from resource allocation to team scheduling to budget implications
- Strategic Market Analysis: Examines the broader market context, including competitor movements, market trends, and potential windows of opportunity
- Comprehensive Risk Assessment: Evaluates both immediate and long-term consequences, considering everything from technical readiness to market positioning
What makes GPT-4o truly remarkable is its ability to engage in nuanced analytical discussions about the content, addressing complex strategic questions that require deep understanding, as sketched in the example after this list:
- External Factors: What specific market conditions, competitive pressures, or industry trends might have motivated this delay suggestion?
- Stakeholder Impact: How would this timeline adjustment affect relationships with investors, partners, and customers? What communication strategies might be needed?
- Strategic Opportunities: What potential advantages could emerge from this delay, such as additional feature development or market timing optimization?
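To make this concrete, here is a minimal sketch of that kind of analytical follow-up: it feeds the example statement to GPT-4o and asks the strategic questions listed above. The statement and questions are illustrative placeholders; in practice the text would come from a Whisper transcription, as in the surrounding examples.
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

statement = "I think we should delay the product launch until next quarter."
questions = [
    "What market conditions or competitive pressures might motivate this delay?",
    "How could the timeline change affect investors, partners, and customers?",
    "What strategic opportunities could the extra time create?",
]

for question in questions:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a strategy analyst. Base your answer on the statement provided."},
            {"role": "user", "content": f'Statement: "{statement}"\n\nQuestion: {question}'},
        ],
        max_tokens=200,
        temperature=0.4,
    )
    print(f"Q: {question}\nA: {response.choices[0].message.content}\n")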
2.3.2 What Can GPT-4o Do with Speech Input?
GPT-4o represents a significant advancement in audio processing technology, offering a comprehensive suite of capabilities that transform how we interact with and understand spoken content. As a cutting-edge language model with multimodal processing abilities, it combines sophisticated speech recognition with deep contextual understanding to deliver powerful audio analysis features. Let's explore some of GPT-4o's other functions and capabilities:
Action Item Extraction
Prompt example: "List all the tasks mentioned in this voice note."
GPT-4o excels at identifying and extracting action items from spoken content through sophisticated natural language processing. The model can:
- Parse complex conversations to detect both explicit ("Please do X") and implicit ("We should consider Y") tasks
- Distinguish between hypothetical discussions and actual commitments
- Categorize tasks by priority, deadline, and assignee
- Identify dependencies between different action items
- Flag follow-up requirements and recurring tasks
This capability transforms unstructured audio discussions into structured, actionable task lists, significantly improving meeting productivity and follow-through. By automatically maintaining a comprehensive record of commitments, it ensures accountability while reducing the cognitive load on participants who would otherwise need to manually track these items. The system can also integrate with popular task management tools, making it seamless to convert spoken assignments into trackable tickets or to-dos.
Example:
This script uses the familiar two-step process: first transcribing the audio with Whisper, then analyzing the text with GPT-4o using a prompt specifically designed to identify and structure action items.
Download the audio sample: https://files.cuantum.tech/audio/meeting_tasks.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Optional context values for the log output below (example placeholders)
current_timestamp = "2025-03-24 10:29:00 CDT"
current_location = "Plano, Texas, United States"
print(f"Running GPT-4o action item extraction from speech example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'meeting_tasks.mp3' with the actual filename.
audio_file_path = "meeting_tasks.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Warn if the file exceeds the 25MB per-request limit (larger files must be chunked first)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before extraction.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Extract Action Items from Text using GPT-4o ---
def extract_action_items(client, text_to_analyze):
"""Sends transcribed text to GPT-4o for action item extraction."""
print("\nStep 2: Extracting action items...")
if not text_to_analyze:
print("Error: No text provided for extraction.")
return None
# Prompt designed specifically for extracting structured action items
system_prompt = "You are an expert meeting analyst focused on identifying actionable tasks."
user_prompt = f"""Analyze the following meeting or discussion transcription. Identify and extract all specific action items mentioned.
For each action item, provide:
- A clear description of the task.
- The person assigned (if mentioned, otherwise state 'Unassigned' or 'Group').
- Any deadline mentioned (if mentioned, otherwise state 'No deadline mentioned').
Distinguish between definite commitments/tasks and mere suggestions or hypothetical possibilities. Only list items that sound like actual tasks or commitments.
Format the output as a numbered list.
Transcription Text:
---
{text_to_analyze}
---
Extracted Action Items:
"""
try:
print("Sending text to GPT-4o for action item extraction...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong analytical capabilities
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=500, # Adjust based on expected number of action items
temperature=0.1 # Very low temperature for factual extraction
)
extracted_actions = response.choices[0].message.content
print("Action item extraction successful.")
return extracted_actions.strip()
except OpenAIError as e:
print(f"OpenAI API Error during extraction: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during extraction: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("------------------------------------")
# Step 2: Extract Action Items
action_items_list = extract_action_items(
client,
full_transcription
)
if action_items_list:
print("\n--- Extracted Action Items ---")
print(action_items_list)
print("------------------------------")
print("\nThis demonstrates GPT-4o identifying and structuring actionable tasks from the discussion.")
else:
print("\nFailed to extract action items.")
else:
print("\nTranscription failed, cannot proceed to action item extraction.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability for Action Item Extraction from spoken content. After transcribing audio with Whisper, GPT-4o analyzes the text to identify specific tasks, assignments, and deadlines discussed.
- Two-Step Process: It uses the standard workflow:
  - Step 1 (Whisper): Transcribe the meeting/discussion audio (`client.audio.transcriptions.create`) into text. The note about handling audio files > 25MB via chunking/concatenation remains critical for real-world use.
  - Step 2 (GPT-4o): Analyze the complete transcription using `client.chat.completions.create` with a prompt tailored for task extraction.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file from a meeting where tasks were assigned (`meeting_tasks.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1.
- Action Item Extraction Function (`extract_action_items`):
  - Handles Step 2, taking the full transcription text.
  - Prompt Engineering for Tasks: This is the core. The prompt explicitly instructs GPT-4o to identify action items, distinguish them from mere suggestions, and extract the task description, assigned person (if mentioned), and deadline (if mentioned). It requests a structured, numbered list format. A very low `temperature` (e.g., 0.1) is recommended to keep the output focused on factual extraction.
  - Uses `gpt-4o` for its ability to understand conversational context and identify commitments.
- Output: The function returns a text string containing the structured list of extracted action items.
- Main Execution: The script transcribes the audio, passes the text to the extraction function, and prints the resulting list of tasks.
- Use Case Relevance: This directly addresses the "Action Item Extraction" capability. It shows how AI can automatically convert unstructured verbal discussions into organized, actionable task lists. This significantly boosts productivity by ensuring follow-through, clarifying responsibilities, and reducing the manual effort of tracking commitments made during meetings. It highlights GPT-4o's ability to parse complex conversations and identify both explicit and implicit task assignments.
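When the goal is to push these tasks into a project-management tool rather than read them as prose, the extraction prompt can request machine-readable output. The sketch below is one possible variation on the example above: it asks GPT-4o for JSON (using the Chat Completions JSON mode) and parses it with json.loads. The 'tasks' field names and the ticket-printing loop are illustrative assumptions, not part of the original example; full_transcription is assumed to hold the Whisper transcript from Step 1.
import json
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def extract_action_items_json(client, text_to_analyze):
    """Asks GPT-4o for action items as a JSON object with a 'tasks' list."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # JSON mode keeps the output parseable
        messages=[
            {"role": "system", "content": "You extract action items from meeting transcripts."},
            {"role": "user", "content": (
                "Return a JSON object with a 'tasks' array. Each task needs "
                "'description', 'assignee', and 'deadline' fields (use null when unknown).\n\n"
                f"Transcript:\n{text_to_analyze}"
            )},
        ],
        temperature=0.1,
    )
    return json.loads(response.choices[0].message.content)

# Example usage (hypothetical ticket-creation step):
# data = extract_action_items_json(client, full_transcription)
# for task in data["tasks"]:
#     print(f"Would create ticket: {task['description']} -> {task['assignee']} (due {task['deadline']})")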
Q&A about the Audio
Prompt Example: "What did the speaker say about the budget?"
GPT-4o's advanced query capabilities allow for natural conversations about audio content, enabling users to ask specific questions and receive contextually relevant answers. The model can:
- Extract precise information from specific segments
- Understand context and references across the entire audio
- Handle follow-up questions about previously discussed topics
- Provide time-stamped references to relevant portions
- Cross-reference information from multiple parts of the recording
This functionality transforms how we interact with audio content, making it as searchable and queryable as text documents. Instead of manually scrubbing through recordings, users can simply ask questions in natural language and receive accurate, concise responses. The system is particularly valuable for:
- Meeting participants who need to verify specific details
- Researchers analyzing interview recordings
- Students reviewing lecture content
- Professionals fact-checking client conversations
- Teams seeking to understand historical discussions
Example:
This script first transcribes an audio file using Whisper and then uses GPT-4o to answer a specific question asked by the user about the content of that transcription.
Download the audio sample: https://files.cuantum.tech/audio/meeting_for_qa.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Optional context values for the log output below (example placeholders)
current_timestamp = "2025-01-11 11:47:00 CDT"
current_location = "Orlando, Florida, United States"
print(f"Running GPT-4o Q&A about audio example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'meeting_for_qa.mp3' with the actual filename.
audio_file_path = "meeting_for_qa.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Warn if the file exceeds the 25MB per-request limit (larger files must be chunked first)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before Q&A.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Answer Question Based on Text using GPT-4o ---
def answer_question_about_text(client, full_text, question):
"""Sends transcribed text and a question to GPT-4o to get an answer."""
print(f"\nStep 2: Answering question about the transcription...")
print(f"Question: \"{question}\"")
if not full_text:
print("Error: No transcription text provided to answer questions about.")
return None
if not question:
print("Error: No question provided.")
return None
# Prompt designed specifically for answering questions based on provided text
system_prompt = "You are an AI assistant specialized in answering questions based *only* on the provided text transcription. Do not use outside knowledge."
user_prompt = f"""Based *solely* on the following transcription text, please answer the question below. If the answer is not found in the text, state that clearly.
Transcription Text:
---
{full_text}
---
Question: {question}
Answer:
"""
try:
print("Sending transcription and question to GPT-4o...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong comprehension and answering
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=300, # Adjust based on expected answer length
temperature=0.1 # Low temperature for factual answers based on text
)
answer = response.choices[0].message.content
print("Answer generation successful.")
return answer.strip()
except OpenAIError as e:
print(f"OpenAI API Error during Q&A: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during Q&A: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
transcription = transcribe_speech(client, audio_file_path)
if transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(transcription[:1000] + "..." if len(transcription) > 1000 else transcription)
print("------------------------------------")
# --- Ask Questions about the Transcription ---
# Define the question(s) you want to ask
user_question = "What was decided about the email marketing CTA button?"
# user_question = "Who is responsible for the A/B test on Platform B?"
# user_question = "What was the engagement increase on Platform A?"
print(f"\n--- Answering Question ---")
# Step 2: Get the answer from GPT-4o
answer = answer_question_about_text(
client,
transcription,
user_question
)
if answer:
print(f"\nAnswer to '{user_question}':")
print(answer)
print("------------------------------")
print("\nThis demonstrates GPT-4o answering specific questions based on the transcribed audio content.")
else:
print(f"\nFailed to get an answer for the question: '{user_question}'")
else:
print("\nTranscription failed, cannot proceed to Q&A.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability to function as a Q&A system for audio content. After transcribing speech with Whisper, users can ask specific questions in natural language, and GPT-4o will provide answers based on the information contained within the transcription.
- Two-Step Process: The workflow involves:
  - Step 1 (Whisper): Transcribe the relevant audio file (or concatenated text from chunks of a longer file) using `client.audio.transcriptions.create`.
  - Step 2 (GPT-4o): Send the complete transcription along with the user's specific question to `client.chat.completions.create`.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file containing the discussion or information the user might ask questions about (`meeting_for_qa.mp3`). The critical note about handling audio > 25MB via chunking/concatenation before the Q&A step remains essential.
- Transcription Function (`transcribe_speech`): Handles Step 1.
- Q&A Function (`answer_question_about_text`):
  - Handles Step 2, taking both the `full_text` transcription and the `question` as input.
  - Prompt Engineering for Q&A: The prompt is crucial. It instructs GPT-4o to act as a specialized assistant that answers questions based only on the provided transcription text, explicitly telling it not to use external knowledge and to state if the answer isn't found in the text. This grounding is important for accuracy. A low `temperature` (e.g., 0.1) helps ensure factual answers derived directly from the source text.
  - Uses `gpt-4o` for its excellent reading comprehension and question-answering abilities.
- Output: The function returns GPT-4o's answer to the specific question asked.
- Main Execution: The script transcribes the audio, defines a sample `user_question`, passes the transcription and question to the Q&A function, and prints the resulting answer.
, passes the transcription and question to the Q&A function, and prints the resulting answer. - Use Case Relevance: This directly addresses the "Q&A about the Audio" capability. It transforms audio recordings from passive archives into interactive knowledge sources. Users can quickly find specific details, verify facts, or understand parts of a discussion without manually searching through the audio, making it invaluable for reviewing meetings, lectures, interviews, or any recorded conversation.
Remember to use an audio file containing information relevant to potential questions for testing (you can use the sample audio provided). Modify the `user_question` variable to test different queries against the transcribed content.
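The capability list above also mentions follow-up questions. One way to support them with the same APIs is to keep the running conversation in the messages list, so GPT-4o can resolve references such as "that change" in a second question. The sketch below illustrates the pattern under that assumption; transcription is the Whisper output from Step 1, and the example questions are hypothetical.
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def start_qa_session(transcription):
    """Builds the initial message history that grounds GPT-4o in the transcript."""
    return [
        {"role": "system", "content": (
            "Answer questions using only the transcription below. "
            "If the answer is not in it, say so.\n\n" + transcription
        )}
    ]

def ask(client, history, question):
    """Appends a question, gets the answer, and records it so follow-ups keep context."""
    history.append({"role": "user", "content": question})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=history,
        max_tokens=300,
        temperature=0.1,
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

# Example usage (hypothetical questions):
# history = start_qa_session(transcription)
# print(ask(client, history, "What was decided about the email marketing CTA button?"))
# print(ask(client, history, "Who is responsible for implementing that change?"))  # Follow-up resolves "that change"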
Highlight Key Moments
Prompt example: "Identify the most important statements made in this audio."
GPT-4o excels at identifying and extracting crucial moments from audio content through its advanced natural language understanding capabilities. The model can:
- Identify key decisions and action items
- Extract important quotes and statements
- Highlight strategic discussions and conclusions
- Pinpoint critical transitions in conversations
This feature is particularly valuable for:
- Meeting participants who need to quickly review important takeaways
- Executives scanning long recordings for decision points
- Teams tracking project milestones discussed in calls
- Researchers identifying significant moments in interviews
The model provides contextual summaries for each highlighted moment - and, when paired with Whisper's timestamped (verbose_json) output, approximate timestamps - making it easier to navigate directly to the most relevant parts of the recording without reviewing the entire audio file.
Example:
This script follows the established two-step pattern: transcribing the audio with Whisper and then analyzing the text with GPT-4o using a prompt designed to identify significant statements, decisions, or conclusions.
Download the sample audio: https://files.cuantum.tech/audio/key_discussion.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Optional context values for the log output below (example placeholders)
current_timestamp = "2025-02-14 15:52:00 CDT"
current_location = "Tampa, Florida, United States"
print(f"Running GPT-4o key moment highlighting example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'key_discussion.mp3' with the actual filename.
audio_file_path = "key_discussion.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Warn if the file exceeds the 25MB per-request limit (larger files must be chunked first)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before highlighting.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Highlight Key Moments from Text using GPT-4o ---
def highlight_key_moments(client, text_to_analyze):
"""Sends transcribed text to GPT-4o to identify and extract key moments."""
print("\nStep 2: Identifying key moments from transcription...")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Prompt designed specifically for identifying key moments/statements
system_prompt = "You are an expert analyst skilled at identifying the most significant parts of a discussion or presentation."
user_prompt = f"""Analyze the following transcription text. Identify and extract the key moments, which could include:
- Important decisions made
- Critical conclusions reached
- Significant statements or impactful quotes
- Major topic shifts or transitions
- Key questions asked or answered
For each key moment identified, provide the relevant quote or a concise summary of the moment. Present the output as a list.
Transcription Text:
---
{text_to_analyze}
---
Key Moments:
"""
try:
print("Sending text to GPT-4o for key moment identification...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong comprehension
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=700, # Adjust based on expected number/length of key moments
temperature=0.3 # Lean towards factual identification
)
key_moments = response.choices[0].message.content
print("Key moment identification successful.")
return key_moments.strip()
except OpenAIError as e:
print(f"OpenAI API Error during highlighting: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during highlighting: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("------------------------------------")
# Step 2: Highlight Key Moments
highlights = highlight_key_moments(
client,
full_transcription
)
if highlights:
print("\n--- Identified Key Moments ---")
print(highlights)
print("----------------------------")
print("\nThis demonstrates GPT-4o extracting significant parts from the discussion.")
print("\nNote: Adding precise timestamps to these moments requires further processing using Whisper's 'verbose_json' output and correlating the text.")
else:
print("\nFailed to identify key moments.")
else:
print("\nTranscription failed, cannot proceed to highlight key moments.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability to Highlight Key Moments from spoken content. After transcription via Whisper, GPT-4o analyzes the text to pinpoint and extract the most significant parts, such as crucial decisions, important statements, or major topic shifts.
- Two-Step Process:
  - Step 1 (Whisper): Transcribe the audio (`client.audio.transcriptions.create`) to get the full text. The necessity of chunking/concatenating for audio files > 25MB is reiterated.
  - Step 2 (GPT-4o): Analyze the complete transcription using `client.chat.completions.create` with a prompt specifically asking for key moments.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file containing a discussion or presentation where significant moments occur (`key_discussion.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1.
- Highlighting Function (`highlight_key_moments`):
  - Handles Step 2, taking the full transcription text.
  - Prompt Engineering for Highlights: The prompt instructs GPT-4o to act as an analyst and identify various types of key moments (decisions, conclusions, impactful quotes, transitions). It asks for the relevant quote or a concise summary for each identified moment, formatted as a list.
  - Uses `gpt-4o` for its ability to discern importance and context within text.
- Output: The function returns a text string containing the list of identified key moments.
- Timestamp Note: While this process identifies the text of key moments, adding precise timestamps requires additional steps: using Whisper's `verbose_json` output format (which includes segment timestamps) and then correlating the text identified by GPT-4o back to those specific timed segments. A sketch of the `verbose_json` step appears after this example.
- Main Execution: The script transcribes the audio, passes the text to the highlighting function, and prints the resulting list of key moments.
- Use Case Relevance: This addresses the "Highlight Key Moments" capability by showing how AI can quickly sift through potentially long recordings to surface the most critical parts. This is highly valuable for efficient review of meetings, interviews, or lectures, allowing users to focus on what matters most without listening to the entire audio.
For testing purposes, use an audio file that contains a relevant discussion with clear, identifiable key segments (you can use the sample audio file provided).
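The timestamp note in the breakdown can be made concrete with Whisper's verbose_json response format, which returns per-segment start and end times. The sketch below shows how to request it and print the timestamped segments; matching GPT-4o's highlighted quotes back to these segments (for example, by substring search) is a further step not shown here, and the file name is a placeholder.
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def transcribe_with_timestamps(client, file_path):
    """Transcribes audio and returns Whisper's list of timestamped segments."""
    with open(file_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json"  # Includes per-segment start/end times
        )
    return response.segments or []

# Example usage:
# for segment in transcribe_with_timestamps(client, "key_discussion.mp3"):
#     print(f"[{segment.start:7.2f}s - {segment.end:7.2f}s] {segment.text.strip()}")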
2.3.3 Real-World Use Cases
The modern business landscape increasingly relies on audio communication across various sectors, from sales and customer service to education and personal development. Understanding and effectively utilizing these audio interactions has become crucial for organizations seeking to improve their operations, enhance customer relationships, and drive better outcomes. This section explores several key applications where advanced audio processing and analysis can create significant value, demonstrating how AI-powered tools can transform raw audio data into actionable insights.
From analyzing sales conversations to enhancing educational experiences, these use cases showcase the versatility and power of audio understanding technologies in addressing real-world challenges. Each application represents a unique opportunity to leverage voice data for improved decision-making, process optimization, and better user experiences.
1. Sales Enablement
Advanced analysis of sales call recordings provides a comprehensive toolkit for sales teams to optimize their performance. The system can identify key objections raised by prospects, allowing teams to develop better counter-arguments and prepare responses in advance. It tracks successful closing techniques by analyzing patterns in successful deals, revealing which approaches work best for different customer segments and situations.
The system also measures crucial metrics like conversion rates, call duration, talk-to-listen ratios, and key phrase usage. This data helps sales teams understand which behaviors correlate with successful outcomes. By analyzing customer responses and reaction patterns, teams can refine their pitch timing, improve their questioning techniques, and better understand buying signals.
This technology also enables sales managers to document and share effective approaches across the team, creating a knowledge base of best practices for common challenges. This institutional knowledge can be particularly valuable for onboarding new team members and maintaining consistent sales excellence across the organization.
2. Meeting Intelligence
Comprehensive meeting analysis transforms how organizations capture and utilize meeting content. The system goes beyond basic transcription by:
- Identifying and categorizing key discussion points for easy reference
- Automatically detecting and extracting action items from conversations
- Assigning responsibilities to specific team members based on verbal commitments
- Creating structured timelines and tracking deadlines mentioned during meetings
- Generating automated task lists with clear ownership and due dates
- Highlighting decision points and meeting outcomes
- Providing searchable meeting archives for future reference
The system employs advanced natural language processing to understand context, relationships, and commitments expressed during conversations. This enables automatic task creation and assignment, ensuring nothing falls through the cracks. Integration with project management tools allows for seamless workflow automation, while smart reminders help keep team members accountable for their commitments.
3. Customer Support
Deep analysis of customer service interactions provides comprehensive insights into customer experience and support team performance. The system can:
- Evaluate customer sentiment in real-time by analyzing tone, word choice, and conversation flow
- Automatically categorize and prioritize urgent issues based on keyword detection and context analysis
- Generate detailed satisfaction metrics through conversation analysis and customer feedback
- Track key performance indicators like first-response time and resolution time
- Identify common pain points and recurring issues across multiple interactions
- Monitor support agent performance and consistency in service delivery
This enables support teams to improve response times, identify trending problems, and maintain consistent service quality across all interactions. The system can also provide automated coaching suggestions for support agents and generate insights for product improvement based on customer feedback patterns.
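A minimal triage sketch along these lines, assuming the conversation has already been transcribed, could classify sentiment and urgency in a single call. The labels, JSON keys, and sample exchange below are illustrative assumptions to adapt to your own support workflow.
import os
import json
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

support_transcript = (
    "Customer: This is the third time my order shipped to the wrong address. "
    "I need this fixed today or I want a refund.\n"
    "Agent: I'm really sorry about that. I'll correct the address and expedite a replacement."
)

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "You triage customer support conversations."},
        {"role": "user", "content": (
            "Analyze the conversation and return JSON with keys 'sentiment' "
            "(positive/neutral/negative), 'urgency' (low/medium/high), "
            "'issue_category', and 'suggested_next_step'.\n\n"
            + support_transcript
        )},
    ],
    temperature=0.2,
)
print(json.loads(response.choices[0].message.content))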
4. Personal Journaling
Transform voice memos into structured reflections with emotional context analysis. Using advanced natural language processing, the system analyzes voice recordings to detect emotional states, stress levels, and overall sentiment through tone of voice, word choice, and speaking patterns. This creates a rich, multi-dimensional journal entry that captures not just what was said, but how it was expressed.
The system's mood tracking capabilities go beyond simple positive/negative classifications, identifying nuanced emotional states like excitement, uncertainty, confidence, or concern. By analyzing these patterns over time, users can gain valuable insights into their emotional well-being and identify triggers or patterns that affect their mental state.
For personal goal tracking, the system can automatically categorize and tag mentions of objectives, progress updates, and setbacks. It can generate progress reports showing momentum toward specific goals, highlight common obstacles, and even suggest potential solutions based on past successful strategies. The behavioral trend analysis examines patterns in decision-making, habit formation, and personal growth, providing users with actionable insights for self-improvement.
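A possible sketch of this workflow, assuming a hypothetical voice memo file named journal_memo.mp3, first transcribes the memo with Whisper and then asks GPT-4o to turn it into a structured entry with mood tags and goal mentions; the output format is an assumption you would adapt to your own journaling template.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Transcribe the voice memo (hypothetical file name).
with open("journal_memo.mp3", "rb") as audio_file:
    memo_text = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file, response_format="text"
    )

# Turn the raw memo into a structured journal entry with an emotional read of the wording.
entry = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a reflective journaling assistant."},
        {"role": "user", "content": (
            "Rewrite this voice memo as a short journal entry, then add a 'Mood' line "
            "with two or three emotion tags and a 'Goals mentioned' line.\n\n"
            + memo_text
        )},
    ],
    temperature=0.6,
)
print(entry.choices[0].message.content)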
5. Education & Language Practice
Comprehensive language learning support revolutionizes how students practice and improve their language skills. The system provides several key benefits:
- Speech Analysis: Advanced algorithms analyze pronunciation patterns, detecting subtle variations in phonemes, stress patterns, and intonation. This helps learners understand exactly where their pronunciation differs from native speakers.
- Error Detection: The system identifies not just pronunciation errors, but also grammatical mistakes, incorrect word usage, and syntactical issues in real-time. This immediate feedback helps prevent the formation of bad habits.
- Personalized Feedback: Instead of generic corrections, the system provides context-aware feedback that considers the learner's proficiency level, native language, and common interference patterns specific to their language background.
- Progress Tracking: Sophisticated metrics track various aspects of language development, including vocabulary range, speaking fluency, grammar accuracy, and pronunciation improvement over time. Visual progress reports help motivate learners and identify areas needing focus.
- Adaptive Learning: Based on performance analysis, the system creates customized exercise plans targeting specific weaknesses. These might include focused pronunciation drills, grammar exercises, or vocabulary building activities tailored to the learner's needs.
The system can track improvement over time and suggest targeted exercises for areas needing improvement, creating a dynamic and responsive learning environment that adapts to each student's progress.
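As one hedged illustration, the sketch below assumes a learner has recorded themselves reading a known target sentence (hypothetical file learner_attempt.mp3). It transcribes the attempt and asks GPT-4o for corrective feedback against the target; note that this catches wording and grammar issues visible in the transcript, while fine-grained pronunciation scoring would need audio-level analysis beyond a plain transcript.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

target_sentence = "I have been living in Dallas for three years."

# Transcribe the learner's recording of the target sentence (hypothetical file name).
with open("learner_attempt.mp3", "rb") as audio_file:
    attempt_text = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file, response_format="text"
    )

feedback = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a supportive English language tutor."},
        {"role": "user", "content": (
            f"Target sentence: {target_sentence}\n"
            f"Learner's attempt (transcribed): {attempt_text}\n\n"
            "Point out grammar or word-choice differences, suggest a corrected version, "
            "and propose one short practice exercise."
        )},
    ],
    temperature=0.4,
)
print(feedback.choices[0].message.content)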
2.3.4 Privacy Considerations
Privacy is paramount when handling audio recordings. First and foremost, obtaining consent before analyzing third-party voice recordings is a crucial legal and ethical requirement. It's essential to secure written or documented permission from all participants before processing any voice recordings, whether they're from meetings, interviews, calls, or other audio content involving third parties. Organizations should implement a formal consent process that clearly outlines how the audio will be used and analyzed.
Security measures must be implemented throughout the processing workflow. After analysis is complete, delete any audio you uploaded through the Files API with openai.files.delete(file_id); audio sent directly to the transcription endpoint is not exposed as a retrievable file object, so cleanup there means removing your own local and intermediate copies (a short cleanup sketch appears at the end of this subsection). This practice minimizes data exposure and helps prevent unauthorized access and potential data breaches. Organizations should establish automated cleanup procedures to ensure consistent deletion of processed files.
Long-term storage of voice data requires special consideration. Never store sensitive voice recordings without explicit approval from all parties involved. Organizations should implement strict data handling policies that clearly specify storage duration, security measures, and intended use. Extra caution should be taken with recordings containing personal information, business secrets, or confidential discussions. Best practices include implementing encryption for stored audio files and maintaining detailed access logs.
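As a minimal cleanup sketch (assuming the client-based openai SDK used throughout this chapter), the code below removes files that were uploaded through the Files API more than 24 hours ago and then deletes a hypothetical local recording. The retention window, file names, and filtering logic are illustrative assumptions to adapt to your own policy.
import os
import time
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Remove files uploaded through the Files API that are older than 24 hours.
cutoff = time.time() - 24 * 60 * 60
for stored_file in client.files.list():
    if stored_file.created_at < cutoff:
        client.files.delete(stored_file.id)
        print(f"Deleted stored file: {stored_file.id} ({stored_file.filename})")

# Audio sent straight to the transcription endpoint only exists locally,
# so the equivalent step there is removing your own copies on schedule.
local_audio = "meeting_recording.mp3"  # hypothetical local path
if os.path.exists(local_audio):
    os.remove(local_audio)
    print(f"Deleted local audio file: {local_audio}")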
Semantic Understanding
Comprehends the actual meaning behind the words, going beyond simple word-for-word translation. This advanced capability allows GPT-4o to process language at multiple levels simultaneously, understanding not only the literal meaning but also the deeper semantic layers, cultural context, and intended message. This includes understanding idioms, metaphors, cultural references, and regional expressions within the speech, as well as detecting subtle nuances in communication that might be lost in simple transcription.
For example, when someone says "it's raining cats and dogs," GPT-4o understands this means heavy rainfall rather than literally interpreting animals falling from the sky. Similarly, when processing phrases like "break a leg" before a performance or "piece of cake" to describe an easy task, the system correctly interprets these idiomatic expressions within their cultural context.
It can also grasp complex concepts like sarcasm ("Oh, great, another meeting"), humor ("Why did the GPT model cross the road?"), and rhetorical questions ("Who wouldn't want that?"), making it capable of truly understanding human communication in its full context. This sophisticated understanding extends to cultural-specific references, professional jargon, and even regional dialectical variations, ensuring accurate interpretation regardless of the speaker's background or communication style.
Example:
Since the standard OpenAI API interaction for this typically involves first converting speech to text (using Whisper) and then analyzing that text for deeper meaning (using GPT-4o), the code example will demonstrate this two-step process.
This script will:
- Transcribe an audio file containing potentially nuanced language using Whisper.
- Send the transcribed text to GPT-4o with a prompt asking for semantic interpretation.
Download the audio sample: https://files.cuantum.tech/audio/idiom_speech.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Optional run context for the demo output (illustrative values only)
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Dallas, Texas, United States"  # placeholder location label
print(f"Running GPT-4o semantic speech understanding example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with nuanced speech
# IMPORTANT: Replace 'idiom_speech.mp3' with the actual filename.
# Good examples for audio content: "Wow, that presentation just knocked my socks off!",
# "Sure, I'd LOVE to attend another three-hour meeting.", "He really spilled the beans."
audio_file_path = "idiom_speech.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Text for Semantic Meaning using GPT-4o ---
def analyze_text_meaning(client, text_to_analyze):
"""Sends transcribed text to GPT-4o for semantic analysis."""
print(f"\nStep 2: Analyzing text for semantic meaning: \"{text_to_analyze}\"")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Construct prompt to ask for deeper meaning
system_prompt = "You are an expert in linguistics and communication."
user_prompt = (
f"Analyze the following phrase or sentence:\n\n'{text_to_analyze}'\n\n"
"Explain its likely intended meaning, considering context, idioms, "
"metaphors, sarcasm, humor, cultural references, or other nuances. "
"Go beyond a literal, word-for-word interpretation."
)
try:
print("Sending text to GPT-4o for analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for its strong understanding capabilities
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=250, # Adjust as needed
temperature=0.5 # Lower temperature for more focused analysis
)
analysis = response.choices[0].message.content
print("Semantic analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio
transcribed_text = transcribe_speech(client, audio_file_path)
if transcribed_text:
print(f"\nTranscription Result: {transcribed_text}")
# Step 2: Analyze the transcription for meaning
semantic_analysis = analyze_text_meaning(client, transcribed_text)
if semantic_analysis:
print("\n--- Semantic Analysis Result ---")
print(semantic_analysis)
print("--------------------------------\n")
print("This demonstrates GPT-4o understanding nuances beyond literal text.")
else:
print("\nSemantic analysis failed.")
else:
print("\nTranscription failed, cannot proceed to analysis.")
Code breakdown:
- Context: This code demonstrates GPT-4o's advanced semantic understanding of speech. It goes beyond simple transcription by interpreting the meaning, including nuances like idioms, sarcasm, or context-dependent phrases.
- Two-Step Process: The example uses a standard two-step API approach:
  - Step 1 (Whisper): The audio file is first converted into text using the Whisper API (client.audio.transcriptions.create). This captures the spoken words accurately.
  - Step 2 (GPT-4o): The transcribed text is then sent to the GPT-4o model (client.chat.completions.create) with a specific prompt asking it to analyze the meaning behind the words, considering non-literal interpretations.
- Prerequisites: Requires the standard openai and python-dotenv setup, an API key, and, crucially, an audio file containing speech that has some nuance (e.g., an idiom like "spill the beans", a sarcastic remark like "Oh great, another meeting", or a culturally specific phrase).
- Transcription Function (transcribe_speech): Handles Step 1, taking the audio file path and returning the plain text transcription from Whisper.
- Semantic Analysis Function (analyze_text_meaning):
  - Handles Step 2, taking the transcribed text as input.
  - Prompt Design: Constructs a prompt asking GPT-4o to act as a linguistic expert and explain the intended meaning, considering idioms, sarcasm, context, etc., explicitly requesting analysis beyond the literal interpretation.
  - Uses gpt-4o as the model for its strong reasoning and understanding capabilities.
  - Returns the analysis provided by GPT-4o.
- Main Execution: The script first transcribes the audio. If successful, it passes the text to the analysis function. Finally, it prints both the literal transcription and GPT-4o's semantic interpretation.
- Use Case Relevance: This example clearly shows how combining Whisper and GPT-4o allows for a deeper understanding of spoken language than transcription alone. It demonstrates the capability described – comprehending idioms ("raining cats and dogs"), sarcasm, humor, and context – making AI interaction more aligned with human communication.
Remember to use an audio file containing non-literal language for testing to best showcase the semantic analysis step. Replace 'idiom_speech.mp3' with your actual file path.
Contextual Analysis
Interprets statements within their broader context, taking into account surrounding information, previous discussions, cultural references, and situational factors. This includes understanding how time, place, speaker relationships, and prior conversations influence meaning. The analysis considers multiple layers of context:
- Temporal Context: When something is said (time of day, day of week, season, or historical period)
- Social Context: The relationships between speakers, power dynamics, and social norms
- Physical Context: The location and environment where communication occurs
- Cultural Context: Shared knowledge, beliefs, and customs that influence interpretation
For example, the phrase "it's getting late" could mean different things in different contexts:
- During a workday meeting: A polite suggestion to wrap up the discussion
- At a social gathering: An indication that someone needs to leave
- From a parent to a child: A reminder about bedtime
- In a project discussion: Concern about approaching deadlines
GPT-4o analyzes these contextual clues along with additional factors such as tone of voice, speech patterns, and conversation history to provide more accurate and nuanced interpretations of spoken communication. This deep contextual understanding allows the system to capture the true intended meaning behind words, rather than just their literal interpretation.
Example:
This use case focuses on GPT-4o's ability to interpret transcribed speech within its broader context (temporal, social, physical, cultural). Like the semantic understanding example, this typically involves a two-step process: transcribing the speech with Whisper, then analyzing the text with GPT-4o, but this time explicitly providing contextual information to GPT-4o.
This code example will:
- Transcribe a simple, context-dependent phrase from an audio file using Whisper.
- Send the transcribed text to GPT-4o multiple times, each time providing a different context description.
- Show how GPT-4o's interpretation of the same phrase changes based on the provided context.
Download the sample audio: https://files.cuantum.tech/audio/context_phrase.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Optional run context for the demo output (illustrative values only)
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Miami, Florida, United States"  # placeholder location label
print(f"Running GPT-4o contextual speech analysis example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with the context-dependent phrase
# IMPORTANT: Replace 'context_phrase.mp3' with the actual filename.
# The audio content should ideally be just "It's getting late."
audio_file_path = "context_phrase.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from the previous example (gpt4o_speech_semantic_py)
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Text for Meaning WITHIN a Given Context using GPT-4o ---
def analyze_text_with_context(client, text_to_analyze, context_description):
"""Sends transcribed text and context description to GPT-4o for analysis."""
print(f"\nStep 2: Analyzing text \"{text_to_analyze}\" within context...")
print(f"Context Provided: {context_description}")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
if not context_description:
print("Error: Context description must be provided for this analysis.")
return None
# Construct prompt asking for interpretation based on context
system_prompt = "You are an expert in analyzing communication and understanding context."
user_prompt = (
f"Consider the phrase: '{text_to_analyze}'\n\n"
f"Now, consider the specific context in which it was said: '{context_description}'\n\n"
"Based *only* on this context, explain the likely intended meaning, implication, "
"or function of the phrase in this situation."
)
try:
print("Sending text and context to GPT-4o for analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong contextual reasoning
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=200, # Adjust as needed
temperature=0.3 # Lower temperature for more focused contextual interpretation
)
analysis = response.choices[0].message.content
print("Contextual analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio phrase
transcribed_phrase = transcribe_speech(client, audio_file_path)
if transcribed_phrase:
print(f"\nTranscription Result: \"{transcribed_phrase}\"")
# Define different contexts for the same phrase
contexts = [
"Said during a business meeting scheduled to end at 5:00 PM, spoken at 4:55 PM.",
"Said by a guest at a social party around 1:00 AM.",
"Said by a parent to a young child at 9:00 PM on a school night.",
"Said during a critical project discussion about an upcoming deadline, spoken late in the evening.",
"Said by someone looking out the window on a short winter afternoon."
]
print("\n--- Analyzing Phrase in Different Contexts ---")
# Step 2: Analyze the phrase within each context
for i, context in enumerate(contexts):
print(f"\n--- Analysis for Context {i+1} ---")
contextual_meaning = analyze_text_with_context(
client,
transcribed_phrase,
context
)
if contextual_meaning:
print(f"Meaning in Context: {contextual_meaning}")
else:
print("Contextual analysis failed for this context.")
print("------------------------------------")
print("\nThis demonstrates how GPT-4o interprets the same phrase differently based on provided context.")
else:
print("\nTranscription failed, cannot proceed to contextual analysis.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability for contextual analysis of speech. It shows how the interpretation of a spoken phrase can change dramatically depending on the surrounding situation (temporal, social, situational factors).
- Two-Step Process with Context Injection:
  - Step 1 (Whisper): The audio file containing a context-dependent phrase (e.g., "It's getting late.") is transcribed into text using client.audio.transcriptions.create.
  - Step 2 (GPT-4o): The transcribed text is then sent to GPT-4o (client.chat.completions.create), but crucially, the prompt now includes a description of the specific context in which the phrase was spoken.
- Prerequisites: Requires the standard openai and python-dotenv setup, an API key, and an audio file containing a simple phrase whose meaning heavily depends on context (the example uses "It's getting late.").
- Transcription Function (transcribe_speech): This function (reused from the previous example) handles Step 1.
- Contextual Analysis Function (analyze_text_with_context):
  - Handles Step 2 and now accepts an additional argument: context_description.
  - Prompt Design: The prompt explicitly provides both the transcribed phrase and the context_description to GPT-4o, asking it to interpret the phrase within that specific situation.
  - Uses gpt-4o for its ability to reason based on provided context.
- Demonstrating Context Dependency (Main Execution):
  - The script first transcribes the phrase (e.g., "It's getting late.").
  - It then defines a list of different context descriptions (meeting ending, late-night party, bedtime, project deadline, short winter day).
  - It calls the analyze_text_with_context function repeatedly, using the same transcribed phrase but providing a different context description each time.
  - By printing the analysis result for each context, the script clearly shows how GPT-4o's interpretation shifts based on the context provided (e.g., suggesting wrapping up vs. indicating tiredness vs. noting dwindling daylight).
- Use Case Relevance: This highlights GPT-4o's sophisticated understanding, moving beyond literal words to grasp intended meaning influenced by temporal, social, and situational factors. This is vital for applications needing accurate interpretation of real-world communication in business, social interactions, or any context-rich environment. It shows how developers can provide relevant context alongside transcribed text to get more accurate and nuanced interpretations from the AI.
For testing this code effectively, either create an audio file containing just the phrase "It's getting late" (or another context-dependent phrase), or download the provided sample file. Remember to update the 'context_phrase.mp3' path to match your file location.
Summary Generation
GPT-4o's summary generation capabilities represent a significant advancement in AI-powered content analysis. The system creates concise, meaningful summaries of complex discussions by intelligently distilling key information from lengthy conversations, meetings, or presentations. Using advanced natural language processing and contextual understanding, GPT-4o can identify main themes, critical points, and essential takeaways while maintaining the core meaning and context of the original discussion.
The system employs several sophisticated techniques:
- Pattern Recognition: Identifies recurring themes and important discussion points across long conversations
- Contextual Analysis: Understands the broader context and relationships between different parts of the discussion
- Priority Detection: Automatically determines which information is most crucial for the summary
- Semantic Understanding: Captures underlying meanings and implications beyond just surface-level content
The generated summaries can be customized for different purposes and audiences:
- Executive Briefings: Focused on strategic insights and high-level decisions
- Meeting Minutes: Detailed documentation of discussions and action items
- Quick Overviews: Condensed highlights for rapid information consumption
- Technical Summaries: Emphasis on specific technical details and specifications
What sets GPT-4o apart is its ability to preserve important details while significantly reducing information overload, making it an invaluable tool for modern business communication and knowledge management.
Example:
This example focuses on GPT-4o's ability to generate concise and meaningful summaries from potentially lengthy spoken content obtained via Whisper.
This involves the familiar two-step process: first, transcribing the audio with Whisper to get the full text, and second, using GPT-4o's language understanding capabilities to analyze and summarize that text according to specific needs. This example will demonstrate generating different types of summaries from the same transcription.
Download the sample audio: https://files.cuantum.tech/audio/discussion_audio.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Optional run context for the demo output (illustrative values only)
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Houston, Texas, United States"  # placeholder location label
print(f"Running GPT-4o speech summarization example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'discussion_audio.mp3' with the actual filename.
audio_file_path = "discussion_audio.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Check the file size; audio over the 25MB API limit must be chunked before transcription
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before summarization.")
except OSError:
pass # Ignore size check error, proceed with transcription attempt
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Generate Summary from Text using GPT-4o ---
def summarize_text(client, text_to_summarize, summary_style="concise overview"):
"""Sends transcribed text to GPT-4o for summarization."""
print(f"\nStep 2: Generating '{summary_style}' summary...")
if not text_to_summarize:
print("Error: No text provided for summarization.")
return None
# Tailor the prompt based on the desired summary style
system_prompt = "You are an expert meeting summarizer and information distiller."
user_prompt = f"""Please generate a {summary_style} of the following discussion transcription.
Focus on accurately capturing the key information relevant to a {summary_style}. For example:
- For an 'executive briefing', focus on strategic points, decisions, and outcomes.
- For 'detailed meeting minutes', include main topics, key arguments, decisions, and action items.
- For a 'concise overview', provide the absolute main points and purpose.
- For a 'technical summary', emphasize technical details, specifications, or findings.
Transcription Text:
---
{text_to_summarize}
---
Generate the {summary_style}:
"""
try:
print(f"Sending text to GPT-4o for {summary_style}...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong summarization
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=400, # Adjust based on expected summary length
temperature=0.5 # Balance creativity and focus
)
summary = response.choices[0].message.content
print(f"'{summary_style}' generation successful.")
return summary.strip()
except OpenAIError as e:
print(f"OpenAI API Error during summarization: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during summarization: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("--------------------------")
# Step 2: Generate summaries in different styles
summary_styles_to_generate = [
"concise overview",
"detailed meeting minutes with action items",
"executive briefing focusing on decisions",
# "technical summary" # Add if relevant to your audio content
]
print("\n--- Generating Summaries ---")
for style in summary_styles_to_generate:
print(f"\n--- Summary Style: {style} ---")
summary_result = summarize_text(
client,
full_transcription,
summary_style=style
)
if summary_result:
print(summary_result)
else:
print(f"Failed to generate '{style}'.")
print("------------------------------------")
print("\nThis demonstrates GPT-4o generating different summaries from the same transcription based on the prompt.")
else:
print("\nTranscription failed, cannot proceed to summarization.")
Code breakdown:
- Context: This code demonstrates GPT-4o's advanced capability for summary generation from spoken content. It leverages the two-step process: transcribing audio with Whisper and then using GPT-4o to intelligently distill the key information from the transcription into a concise summary.
- Handling Lengthy Audio (Crucial Note): The prerequisites and code comments explicitly address the 25MB limit of the Whisper API. For real-world long meetings or presentations, the audio must be chunked, each chunk transcribed separately, and the resulting texts concatenated before being passed to the summarization step. The code example itself processes a single audio file for simplicity but highlights this essential workflow for longer content; a minimal chunking sketch follows this breakdown.
- Prerequisites: Standard setup (openai, python-dotenv, API key) and an audio file representing the discussion to be summarized (discussion_audio.mp3).
- Transcription Function (transcribe_speech): Handles Step 1, converting the input audio (or audio chunk) into plain text using Whisper.
- Summarization Function (summarize_text):
  - Handles Step 2, taking the full transcribed text as input.
  - Customizable Summaries: Accepts a summary_style argument (e.g., "executive briefing", "detailed meeting minutes").
  - Prompt Engineering: The prompt sent to GPT-4o is dynamically constructed based on the requested summary_style. It instructs GPT-4o to act as an expert summarizer and tailor the output (focusing on strategic points, action items, technical details, etc.) according to the desired style.
  - Uses gpt-4o for its advanced understanding and summarization skills.
- Demonstrating Different Summary Types (Main Execution):
  - The script first gets the full transcription.
  - It then defines a list of different summary_styles_to_generate.
  - It calls the summarize_text function multiple times, passing the same full transcription each time but varying the summary_style argument.
  - By printing each resulting summary, the script clearly shows how GPT-4o adapts the level of detail and focus based on the prompt, generating distinct outputs (e.g., a brief overview vs. detailed minutes) from the identical source text.
- Use Case Relevance: This directly addresses the "Summary Generation" capability. It shows how combining Whisper and GPT-4o can transform lengthy spoken discussions into various useful formats (executive briefings, meeting minutes, quick overviews), saving time and improving knowledge management in business, education, and content creation.
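To make the chunking workflow described above concrete, here is a minimal sketch that assumes the pydub library (which in turn needs ffmpeg) is installed. It splits a long recording into roughly ten-minute chunks, transcribes each chunk with Whisper, and joins the text before summarization; the chunk length, temporary file names, and long_meeting.mp3 path are illustrative assumptions.
import os
from openai import OpenAI
from pydub import AudioSegment  # assumes pydub + ffmpeg are installed

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def transcribe_long_audio(file_path, chunk_minutes=10):
    """Split a long recording into chunks, transcribe each, and join the text."""
    audio = AudioSegment.from_file(file_path)
    chunk_ms = chunk_minutes * 60 * 1000
    transcripts = []
    for index, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk = audio[start:start + chunk_ms]
        chunk_path = f"chunk_{index}.mp3"
        chunk.export(chunk_path, format="mp3")
        try:
            with open(chunk_path, "rb") as chunk_file:
                text = client.audio.transcriptions.create(
                    model="whisper-1", file=chunk_file, response_format="text"
                )
            transcripts.append(text)
        finally:
            os.remove(chunk_path)  # clean up the temporary chunk file
    return " ".join(transcripts)

# Example usage with the summarize_text function defined above:
# full_text = transcribe_long_audio("long_meeting.mp3")
# print(summarize_text(client, full_text, "detailed meeting minutes with action items"))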
Key Point Extraction
Identifies and highlights crucial information by leveraging GPT-4o's advanced natural language processing capabilities. Through sophisticated algorithms and contextual understanding, the model analyzes spoken content to extract meaningful insights. The model can:
- Extract core concepts and main arguments from spoken content - This involves identifying the fundamental ideas, key messages, and supporting evidence presented in conversations, presentations, or discussions. The model distinguishes between primary and secondary points, ensuring that essential information is captured.
- Identify critical decision points and action items - By analyzing conversation flow and context, GPT-4o recognizes moments when decisions are made, commitments are established, or tasks are assigned. This includes detecting both explicit assignments ("John will handle this") and implicit ones ("We should look into this further").
- Prioritize information based on context and relevance - The model evaluates the significance of different pieces of information within their specific context, considering factors such as urgency, impact, and relationship to overall objectives. This helps in creating hierarchical summaries that emphasize what matters most.
- Track key themes and recurring topics across conversations - GPT-4o maintains awareness of discussion patterns, identifying when certain subjects resurface and how they evolve over time. This capability is particularly valuable for long-term project monitoring or tracking ongoing concerns across multiple meetings.
Example:
This example focuses on using GPT-4o to extract specific, crucial information—key points, decisions, action items—from transcribed speech, going beyond a general summary.
This again uses the two-step approach: Whisper transcribes the audio, and then GPT-4o analyzes the text based on a prompt designed for extraction.
Download the audio sample: https://files.cuantum.tech/audio/meeting_for_extraction.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Optional run context for the demo output (illustrative values only)
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Austin, Texas, United States"  # placeholder location label
print(f"Running GPT-4o key point extraction from speech example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'meeting_for_extraction.mp3' with the actual filename.
audio_file_path = "meeting_for_extraction.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Check the file size; audio over the 25MB API limit must be chunked before transcription
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before extraction.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Extract Key Points, Decisions, Actions using GPT-4o ---
def extract_key_points(client, text_to_analyze):
"""Sends transcribed text to GPT-4o for key point extraction."""
print("\nStep 2: Extracting key points, decisions, and actions...")
if not text_to_analyze:
print("Error: No text provided for extraction.")
return None
# Prompt designed specifically for extraction
system_prompt = "You are an expert meeting analyst. Your task is to carefully read the provided transcript and extract specific types of information."
user_prompt = f"""Analyze the following meeting or discussion transcription. Identify and extract the following information, presenting each under a clear heading:
1. **Key Points / Core Concepts:** List the main topics, arguments, or fundamental ideas discussed.
2. **Decisions Made:** List any clear decisions that were reached during the discussion.
3. **Action Items:** List specific tasks assigned to individuals or the group. If possible, note who is responsible and any mentioned deadlines.
If any category has no relevant items, state "None identified".
Transcription Text:
---
{text_to_analyze}
---
Extracted Information:
"""
try:
print("Sending text to GPT-4o for extraction...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong analytical capabilities
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=600, # Adjust based on expected length of extracted info
temperature=0.2 # Lower temperature for more factual extraction
)
extracted_info = response.choices[0].message.content
print("Extraction successful.")
return extracted_info.strip()
except OpenAIError as e:
print(f"OpenAI API Error during extraction: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during extraction: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("------------------------------------")
# Step 2: Extract Key Information
extracted_details = extract_key_points(
client,
full_transcription
)
if extracted_details:
print("\n--- Extracted Key Information ---")
print(extracted_details)
print("---------------------------------")
print("\nThis demonstrates GPT-4o identifying and structuring key takeaways from the discussion.")
else:
print("\nFailed to extract key information.")
else:
print("\nTranscription failed, cannot proceed to key point extraction.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability for Key Point Extraction from spoken content. After transcribing audio using Whisper, GPT-4o analyzes the text to identify and isolate crucial information like core concepts, decisions made, and action items assigned.
- Two-Step Process: Like summarization, this relies on:
  - Step 1 (Whisper): Transcribing the audio (client.audio.transcriptions.create) to get the full text. The critical note about handling audio files larger than 25MB via chunking and concatenation still applies.
  - Step 2 (GPT-4o): Analyzing the complete transcription using client.chat.completions.create with a prompt specifically designed for extraction.
- Prerequisites: Standard setup (openai, python-dotenv, API key) and an audio file from a meeting or discussion where key information is likely present (meeting_for_extraction.mp3).
- Transcription Function (transcribe_speech): Handles Step 1, returning the plain text transcription.
- Extraction Function (extract_key_points):
  - Handles Step 2, taking the full transcription text.
  - Prompt Engineering for Extraction: This is key. The prompt explicitly instructs GPT-4o to act as an analyst and extract information under specific headings: "Key Points / Core Concepts," "Decisions Made," and "Action Items." This structured request guides GPT-4o to identify and categorize the relevant information accurately. A lower temperature (e.g., 0.2) is suggested to encourage more factual, less creative output suitable for extraction.
  - Uses gpt-4o for its advanced analytical skills.
- Output: The function returns a text string containing the extracted information, ideally structured under the requested headings.
- Main Execution: The script transcribes the audio, then passes the text to the extraction function, and finally prints the structured output.
- Use Case Relevance: This directly addresses the "Key Point Extraction" capability. It shows how AI can automatically process lengthy discussions to pull out the most important concepts, track decisions, and list actionable tasks, saving significant time in reviewing recordings or generating meeting follow-ups. It highlights GPT-4o's ability to understand conversational flow and identify significant moments (decisions, assignments) within the text.
Emotional Intelligence
Detects tone, sentiment, and emotional undertones in spoken communication through GPT-4o's advanced natural language processing capabilities. This sophisticated system performs deep analysis of speech patterns and contextual elements to understand the emotional layers of communication. The model can identify subtle emotional cues such as:
- Voice inflections and patterns that indicate excitement, hesitation, or concern - Including pitch variations, speech rhythm changes, and vocal stress patterns that humans naturally use to convey emotions
- Changes in speaking tempo and volume that suggest emotional states - For example, rapid speech might indicate excitement or anxiety, while slower speech could suggest thoughtfulness or uncertainty
- Contextual emotional markers like laughter, sighs, or pauses - The model recognizes non-verbal sounds and silence that carry significant emotional meaning in conversation
- Cultural and situational nuances that affect emotional expression - Understanding how different cultures express emotions differently and how context influences emotional interpretation
This emotional awareness enables GPT-4o to provide more nuanced and context-appropriate responses, making it particularly valuable for applications in customer service (where understanding customer frustration or satisfaction is crucial), therapeutic conversations (where emotional support and understanding are paramount), and personal coaching (where motivation and emotional growth are key objectives). The system's ability to detect these subtle emotional signals allows for more empathetic and effective communication across various professional and personal contexts.
Example:
This example explores using GPT-4o for "Emotional Intelligence" – detecting tone, sentiment, and emotional undertones in speech.
It's important to understand how this works with current standard OpenAI APIs. While GPT-4o excels at understanding emotion from text, directly analyzing audio features like pitch, tone variance, tempo, sighs, or laughter as audio isn't a primary function of the standard Whisper transcription or the Chat Completions API endpoint when processing transcribed text.
Therefore, the most practical way to demonstrate this concept using these APIs is a two-step process:
- Transcribe Speech to Text: Use Whisper to get the words spoken.
- Analyze Text for Emotion: Use GPT-4o to analyze the transcribed text for indicators of emotion, sentiment, or tone based on word choice, phrasing, and context described in the text.
Download the sample audio: https://files.cuantum.tech/audio/emotional_speech.mp3
This code example implements this two-step, text-based analysis approach.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Optional run context for the demo output (illustrative values only)
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Atlanta, Georgia, United States"  # placeholder location label
print(f"Running GPT-4o speech emotion analysis (text-based) example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with potentially emotional speech
# IMPORTANT: Replace 'emotional_speech.mp3' with the actual filename.
audio_file_path = "emotional_speech.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Transcribed Text for Emotion/Sentiment using GPT-4o ---
def analyze_text_emotion(client, text_to_analyze):
"""
Sends transcribed text to GPT-4o for emotion and sentiment analysis.
Note: This analyzes the text content, not acoustic features of the original audio.
"""
print("\nStep 2: Analyzing transcribed text for emotion/sentiment...")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Prompt designed for text-based emotion/sentiment analysis
system_prompt = "You are an expert in communication analysis, skilled at detecting sentiment, tone, and potential underlying emotions from text."
user_prompt = f"""Analyze the following text for emotional indicators:
Text:
---
{text_to_analyze}
---
Based *only* on the words, phrasing, and punctuation in the text provided:
1. What is the overall sentiment (e.g., Positive, Negative, Neutral, Mixed)?
2. What is the likely emotional tone (e.g., Frustrated, Excited, Calm, Anxious, Sarcastic, Happy, Sad)?
3. Are there specific words or phrases that indicate these emotions? Explain briefly.
Provide the analysis:
"""
try:
print("Sending text to GPT-4o for emotion analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for nuanced understanding
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=300, # Adjust as needed
temperature=0.4 # Slightly lower temp for more grounded analysis
)
analysis = response.choices[0].message.content
print("Emotion analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio
transcribed_text = transcribe_speech(client, audio_file_path)
if transcribed_text:
print(f"\n--- Transcription Result ---")
print(transcribed_text)
print("----------------------------")
# Step 2: Analyze the transcription for emotion/sentiment
emotion_analysis = analyze_text_emotion(
client,
transcribed_text
)
if emotion_analysis:
print("\n--- Emotion/Sentiment Analysis (from Text) ---")
print(emotion_analysis)
print("----------------------------------------------")
print("\nNote: This analysis is based on the transcribed text content. It does not directly analyze acoustic features like tone of voice from the original audio.")
else:
print("\nEmotion analysis failed.")
else:
print("\nTranscription failed, cannot proceed to emotion analysis.")
# --- End of Code Example ---
Code breakdown:
- Context: This code demonstrates how GPT-4o can be used to infer emotional tone and sentiment from spoken language. It utilizes a two-step process common for this type of analysis with current APIs.
- Two-Step Process & Limitation:
  - Step 1 (Whisper): The audio is first transcribed into text using client.audio.transcriptions.create.
  - Step 2 (GPT-4o): The resulting text is then analyzed by GPT-4o (client.chat.completions.create) using a prompt specifically designed to identify sentiment and emotional indicators within the text.
  - Important Limitation: This method analyzes the linguistic content (words, phrasing) provided by Whisper. It does not directly analyze acoustic features of the original audio, such as pitch, tempo, or non-verbal sounds (sighs, laughter), unless those happen to be transcribed by Whisper, which is often not the case for subtle cues. True acoustic emotion detection requires different tooling or an audio-capable model.
- Prerequisites: Standard setup (openai, python-dotenv, API key) and an audio file where the speaker's words suggest an emotion (emotional_speech.mp3).
- Transcription Function (transcribe_speech): Handles Step 1, returning plain text.
- Emotion Analysis Function (analyze_text_emotion):
  - Handles Step 2, taking the transcribed text.
  - Prompt Design: The prompt explicitly asks GPT-4o to analyze the provided text for overall sentiment (Positive/Negative/Neutral), likely emotional tone (Frustrated, Excited, etc.), and supporting textual evidence. It clarifies that the analysis should be based only on the text.
  - Uses gpt-4o for its sophisticated language understanding.
- Output: The function returns GPT-4o's textual analysis of the inferred emotion and sentiment.
- Main Execution: The script transcribes the audio, passes the text for analysis, prints both results, and reiterates the limitation regarding acoustic features.
- Use Case Relevance: While not analyzing acoustics directly, this text-based approach is still valuable for applications like customer service (detecting frustration/satisfaction from word choice), analyzing feedback, or getting a general sense of sentiment from spoken interactions, complementing other forms of analysis. It showcases GPT-4o's ability to interpret emotional language.
Remember to use an audio file where the spoken words convey some emotion for this example to be effective. Replace 'emotional_speech.mp3' with your file path.
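If you need the model to reason about how something was said rather than only what was said, OpenAI also offers audio-capable chat models (for example, gpt-4o-audio-preview at the time of writing) that accept the audio itself as input. The sketch below is a hedged illustration of that approach using the input_audio content type; treat the model name, its availability, and the supported formats as assumptions to verify against the current API documentation.
import os
import base64
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Read and base64-encode the audio so it can be passed directly to the model.
with open("emotional_speech.mp3", "rb") as audio_file:
    encoded_audio = base64.b64encode(audio_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # audio-capable model; availability may vary
    modalities=["text"],           # we only need a text analysis back
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Listen to this clip and describe the speaker's emotional tone, "
                    "noting cues such as pace, emphasis, or hesitation."
                )},
                {"type": "input_audio",
                 "input_audio": {"data": encoded_audio, "format": "mp3"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)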
Implicit Understanding
GPT-4o demonstrates remarkable capabilities in understanding the deeper layers of human communication, going far beyond simple word recognition to grasp the intricate nuances of speech. The model's sophisticated comprehension abilities include:
- Detect underlying context and assumptions
  - Understands implicit knowledge shared between speakers
  - Recognizes unstated but commonly accepted facts within specific domains
  - Identifies hidden premises in conversations
- Understand cultural references and idiomatic expressions
  - Processes region-specific sayings and colloquialisms
  - Recognizes cultural-specific metaphors and analogies
  - Adapts understanding based on cultural context
- Interpret rhetorical devices
Example:
Similar to the previous examples involving deeper understanding (Semantic, Contextual, Emotional), this typically uses the two-step approach: Whisper transcribes the words, and then GPT-4o analyzes the resulting text, this time specifically prompted to look for implicit layers.
Download the sample audio: https://files.cuantum.tech/audio/implicit_speech.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Optional run context for the demo output (illustrative values only)
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Dallas, Texas, United States"  # placeholder location label
print(f"Running GPT-4o implicit speech understanding example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file with implicit meaning
# IMPORTANT: Replace 'implicit_speech.mp3' with the actual filename.
audio_file_path = "implicit_speech.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Analyze Transcribed Text for Implicit Meaning using GPT-4o ---
def analyze_implicit_meaning(client, text_to_analyze):
"""
Sends transcribed text to GPT-4o to analyze implicit meanings,
assumptions, references, or rhetorical devices.
"""
print("\nStep 2: Analyzing transcribed text for implicit meaning...")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Prompt designed for identifying implicit communication layers
system_prompt = "You are an expert analyst of human communication, skilled at identifying meaning that is implied but not explicitly stated."
user_prompt = f"""Analyze the following statement or question:
Statement/Question:
---
{text_to_analyze}
---
Based on common knowledge, cultural context, and conversational patterns, please explain:
1. Any underlying assumptions the speaker might be making.
2. Any implicit meanings or suggestions conveyed beyond the literal words.
3. Any cultural references, idioms, or sayings being used or alluded to.
4. If it's a rhetorical question, what point is likely being made?
Provide a breakdown of the implicit layers of communication present:
"""
try:
print("Sending text to GPT-4o for implicit meaning analysis...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for deep understanding
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=400, # Adjust as needed
temperature=0.5 # Allow for some interpretation
)
analysis = response.choices[0].message.content
print("Implicit meaning analysis successful.")
return analysis.strip()
except OpenAIError as e:
print(f"OpenAI API Error during analysis: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during analysis: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio
transcribed_text = transcribe_speech(client, audio_file_path)
if transcribed_text:
print(f"\n--- Transcription Result ---")
print(transcribed_text)
print("----------------------------")
# Step 2: Analyze the transcription for implicit meaning
implicit_analysis = analyze_implicit_meaning(
client,
transcribed_text
)
if implicit_analysis:
print("\n--- Implicit Meaning Analysis ---")
print(implicit_analysis)
print("-------------------------------")
print("\nThis demonstrates GPT-4o identifying meaning beyond the literal text, based on common knowledge and context.")
else:
print("\nImplicit meaning analysis failed.")
else:
print("\nTranscription failed, cannot proceed to implicit meaning analysis.")
Code breakdown:
- Context: This code example demonstrates GPT-4o's capability for Implicit Understanding – grasping the unstated assumptions, references, and meanings embedded within spoken language.
- Two-Step Process: It follows the established pattern:
  - Step 1 (Whisper): Transcribe the audio containing the implicitly meaningful speech into text using `client.audio.transcriptions.create`.
  - Step 2 (GPT-4o): Analyze the transcribed text using `client.chat.completions.create`, with a prompt specifically designed to uncover hidden layers of meaning.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file whose meaning relies on shared knowledge or cultural context, or isn't fully literal (e.g., an idiom, a rhetorical question, or an assumption made clear only through context). `implicit_speech.mp3` is used as the placeholder.
- Transcription Function (`transcribe_speech`): Handles Step 1, returning the plain text transcription.
- Implicit Analysis Function (`analyze_implicit_meaning`):
  - Handles Step 2, taking the transcribed text as input.
  - Prompt Engineering for Implicit Meaning: The prompt is key here. It instructs GPT-4o to look beyond the literal words and identify underlying assumptions, implied suggestions, cultural references or idioms, and the purpose behind rhetorical questions.
  - Uses `gpt-4o` for the extensive knowledge base and reasoning ability needed to infer these implicit elements.
- Output: The function returns GPT-4o's textual analysis of the unstated meanings detected in the input text.
- Main Execution: The script transcribes the audio, passes the text for implicit analysis, and prints both the literal transcription and GPT-4o's interpretation of the hidden meanings.
- Use Case Relevance: This demonstrates how GPT-4o can process communication more like a human, understanding not just what was said, but also what was meant or assumed. This is crucial for applications requiring deep comprehension, such as analyzing user feedback, understanding nuanced dialogue in meetings, or interpreting culturally rich content.
Remember to use an audio file containing speech that requires some inference or background knowledge to fully understand, and replace 'implicit_speech.mp3' with your file path. If you don't have a suitable recording yet, you can also exercise the analysis step directly on sample text, as sketched below.
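The snippet below is a minimal sketch of that text-only test: it reuses the `client` and the `analyze_implicit_meaning` function defined in the example above and feeds them a hard-coded sample sentence (the sentence is purely illustrative).

# Quick text-only test of Step 2, skipping Whisper entirely.
# Assumes `client` and `analyze_implicit_meaning` from the example above are in scope.
sample_text = "Well, someone clearly forgot to book the conference room again."  # illustrative sample
analysis = analyze_implicit_meaning(client, sample_text)
if analysis:
    print("--- Implicit Meaning Analysis (text-only test) ---")
    print(analysis)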
From Transcription to Comprehensive Understanding
This advance marks a revolutionary transformation in AI's ability to process human speech. While traditional systems like Whisper excel at transcription - the mechanical process of converting spoken words into written text - modern AI systems like GPT-4o achieve true comprehension, understanding not just the words themselves but their deeper meaning, context, and implications. This leap forward enables AI to process human communication in ways that are remarkably similar to how humans naturally understand conversation, including subtle nuances, implied meanings, and contextual relevance.
To illustrate this transformative evolution in capability, let's examine a detailed example that highlights the stark contrast between simple transcription and advanced comprehension:
- Consider this statement: "I think we should delay the product launch until next quarter." A traditional transcription system like Whisper would perfectly capture these words, but that's where its understanding ends - it simply converts speech to text with high accuracy.
- GPT-4o, however, demonstrates a sophisticated level of understanding that mirrors human comprehension:
- Primary Message Analysis: Beyond just identifying the suggestion to reschedule, it understands this as a strategic proposal that requires careful consideration
- Business Impact Evaluation: Comprehensively assesses how this delay would affect various aspects of the business, from resource allocation to team scheduling to budget implications
- Strategic Market Analysis: Examines the broader market context, including competitor movements, market trends, and potential windows of opportunity
- Comprehensive Risk Assessment: Evaluates both immediate and long-term consequences, considering everything from technical readiness to market positioning
What makes GPT-4o truly remarkable is its ability to engage in nuanced analytical discussion about the content, addressing complex strategic questions that require deep understanding (a minimal prompt sketch follows this list):
- External Factors: What specific market conditions, competitive pressures, or industry trends might have motivated this delay suggestion?
- Stakeholder Impact: How would this timeline adjustment affect relationships with investors, partners, and customers? What communication strategies might be needed?
- Strategic Opportunities: What potential advantages could emerge from this delay, such as additional feature development or market timing optimization?
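As a rough sketch of what such an analytical follow-up looks like in code, the snippet below sends the example statement to GPT-4o together with one of the strategic questions above. The statement, the question, and the prompt wording are illustrative assumptions rather than a prescribed format; in practice the statement would come from a Whisper transcription, as in the examples that follow.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# The statement (as transcribed) and one strategic follow-up question.
statement = "I think we should delay the product launch until next quarter."
question = "What market conditions or competitive pressures might motivate this delay?"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a business analyst discussing a transcribed statement from a meeting."},
        {"role": "user", "content": f"Statement: {statement}\n\nQuestion: {question}"}
    ],
    max_tokens=300,
    temperature=0.5
)
print(response.choices[0].message.content)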
2.3.2 What Can GPT-4o Do with Speech Input?
GPT-4o represents a significant advancement in audio processing technology, offering a comprehensive suite of capabilities that transform how we interact with and understand spoken content. As a cutting-edge language model with multimodal processing abilities, it combines sophisticated speech recognition with deep contextual understanding to deliver powerful audio analysis features. Let's explore some of GPT-4o's other functions and capabilities:
Action Item Extraction
Prompt example: "List all the tasks mentioned in this voice note."
GPT-4o excels at identifying and extracting action items from spoken content through sophisticated natural language processing. The model can:
- Parse complex conversations to detect both explicit ("Please do X") and implicit ("We should consider Y") tasks
- Distinguish between hypothetical discussions and actual commitments
- Categorize tasks by priority, deadline, and assignee
- Identify dependencies between different action items
- Flag follow-up requirements and recurring tasks
This capability transforms unstructured audio discussions into structured, actionable task lists, significantly improving meeting productivity and follow-through. By automatically maintaining a comprehensive record of commitments, it ensures accountability while reducing the cognitive load on participants who would otherwise need to manually track these items. The system can also integrate with popular task management tools, making it seamless to convert spoken assignments into trackable tickets or to-dos.
Example:
This script uses the familiar two-step process: first transcribing the audio with Whisper, then analyzing the text with GPT-4o using a prompt specifically designed to identify and structure action items.
Download the audio sample: https://files.cuantum.tech/audio/meeting_tasks.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Capture a timestamp and an example location label for the run log
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Plano, Texas, United States"  # example label; adjust or remove as needed
print(f"Running GPT-4o action item extraction from speech example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'meeting_tasks.mp3' with the actual filename.
audio_file_path = "meeting_tasks.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Check the file size and warn if it exceeds the 25 MB API limit (larger files need chunking)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before extraction.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Extract Action Items from Text using GPT-4o ---
def extract_action_items(client, text_to_analyze):
"""Sends transcribed text to GPT-4o for action item extraction."""
print("\nStep 2: Extracting action items...")
if not text_to_analyze:
print("Error: No text provided for extraction.")
return None
# Prompt designed specifically for extracting structured action items
system_prompt = "You are an expert meeting analyst focused on identifying actionable tasks."
user_prompt = f"""Analyze the following meeting or discussion transcription. Identify and extract all specific action items mentioned.
For each action item, provide:
- A clear description of the task.
- The person assigned (if mentioned, otherwise state 'Unassigned' or 'Group').
- Any deadline mentioned (if mentioned, otherwise state 'No deadline mentioned').
Distinguish between definite commitments/tasks and mere suggestions or hypothetical possibilities. Only list items that sound like actual tasks or commitments.
Format the output as a numbered list.
Transcription Text:
---
{text_to_analyze}
---
Extracted Action Items:
"""
try:
print("Sending text to GPT-4o for action item extraction...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong analytical capabilities
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=500, # Adjust based on expected number of action items
temperature=0.1 # Very low temperature for factual extraction
)
extracted_actions = response.choices[0].message.content
print("Action item extraction successful.")
return extracted_actions.strip()
except OpenAIError as e:
print(f"OpenAI API Error during extraction: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during extraction: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("------------------------------------")
# Step 2: Extract Action Items
action_items_list = extract_action_items(
client,
full_transcription
)
if action_items_list:
print("\n--- Extracted Action Items ---")
print(action_items_list)
print("------------------------------")
print("\nThis demonstrates GPT-4o identifying and structuring actionable tasks from the discussion.")
else:
print("\nFailed to extract action items.")
else:
print("\nTranscription failed, cannot proceed to action item extraction.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability for Action Item Extraction from spoken content. After transcribing audio with Whisper, GPT-4o analyzes the text to identify specific tasks, assignments, and deadlines discussed.
- Two-Step Process: It uses the standard workflow:
  - Step 1 (Whisper): Transcribe the meeting or discussion audio into text with `client.audio.transcriptions.create`. The note about handling audio files larger than 25 MB via chunking and concatenation remains critical for real-world use.
  - Step 2 (GPT-4o): Analyze the complete transcription using `client.chat.completions.create` with a prompt tailored for task extraction.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file from a meeting where tasks were assigned (`meeting_tasks.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1.
- Action Item Extraction Function (`extract_action_items`):
  - Handles Step 2, taking the full transcription text as input.
  - Prompt Engineering for Tasks: This is the core. The prompt explicitly instructs GPT-4o to identify action items, distinguish them from mere suggestions, and extract the task description, the assigned person (if mentioned), and the deadline (if mentioned). It requests a structured, numbered list. A very low `temperature` (e.g., 0.1) is recommended to keep the output focused on factual extraction.
  - Uses `gpt-4o` for its ability to understand conversational context and identify commitments.
- Output: The function returns a text string containing the structured list of extracted action items.
- Main Execution: The script transcribes the audio, passes the text to the extraction function, and prints the resulting list of tasks.
- Use Case Relevance: This directly addresses the "Action Item Extraction" capability. It shows how AI can automatically convert unstructured verbal discussions into organized, actionable task lists, boosting productivity by ensuring follow-through, clarifying responsibilities, and reducing the manual effort of tracking commitments made during meetings. It also highlights GPT-4o's ability to parse complex conversations and identify both explicit and implicit task assignments. If you want to hand these tasks to a task-management tool, a JSON-output variant is sketched just after this breakdown.
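To hand the extracted tasks to a task-management tool, machine-readable output is usually more convenient than a numbered list. The sketch below is one way to request it; it reuses `client` and `full_transcription` from the example above, and the JSON field names (task, assignee, deadline) are our own choice rather than a fixed schema.

import json

# Ask GPT-4o for the action items as JSON so they can be pushed into a task tracker.
# Assumes `client` and `full_transcription` from the example above are in scope.
json_prompt = f"""Extract the action items from the transcription below.
Return a JSON object with a single key "action_items" whose value is a list of
objects with the keys "task", "assignee", and "deadline" (use null when unknown).

Transcription:
---
{full_transcription}
---"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": json_prompt}],
    response_format={"type": "json_object"},  # ask the API for well-formed JSON
    temperature=0.1,
)

action_items = json.loads(response.choices[0].message.content)["action_items"]
for item in action_items:
    print(f"- {item['task']} | assignee: {item['assignee']} | deadline: {item['deadline']}")

The `response_format={"type": "json_object"}` option instructs the API to return valid JSON, which makes the result straightforward to parse and forward to a ticketing system.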
Q&A about the Audio
Prompt example: "What did the speaker say about the budget?"
GPT-4o's advanced query capabilities allow for natural conversations about audio content, enabling users to ask specific questions and receive contextually relevant answers. The model can:
- Extract precise information from specific segments
- Understand context and references across the entire audio
- Handle follow-up questions about previously discussed topics
- Provide time-stamped references to relevant portions
- Cross-reference information from multiple parts of the recording
This functionality transforms how we interact with audio content, making it as searchable and queryable as text documents. Instead of manually scrubbing through recordings, users can simply ask questions in natural language and receive accurate, concise responses. The system is particularly valuable for:
- Meeting participants who need to verify specific details
- Researchers analyzing interview recordings
- Students reviewing lecture content
- Professionals fact-checking client conversations
- Teams seeking to understand historical discussions
Example:
This script first transcribes an audio file using Whisper and then uses GPT-4o to answer a specific question asked by the user about the content of that transcription.
Download the audio sample: https://files.cuantum.tech/audio/meeting_for_qa.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Capture a timestamp and an example location label for the run log
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Orlando, Florida, United States"  # example label; adjust or remove as needed
print(f"Running GPT-4o Q&A about audio example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'meeting_for_qa.mp3' with the actual filename.
audio_file_path = "meeting_for_qa.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Check the file size and warn if it exceeds the 25 MB API limit (larger files need chunking)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before Q&A.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Answer Question Based on Text using GPT-4o ---
def answer_question_about_text(client, full_text, question):
"""Sends transcribed text and a question to GPT-4o to get an answer."""
print(f"\nStep 2: Answering question about the transcription...")
print(f"Question: \"{question}\"")
if not full_text:
print("Error: No transcription text provided to answer questions about.")
return None
if not question:
print("Error: No question provided.")
return None
# Prompt designed specifically for answering questions based on provided text
system_prompt = "You are an AI assistant specialized in answering questions based *only* on the provided text transcription. Do not use outside knowledge."
user_prompt = f"""Based *solely* on the following transcription text, please answer the question below. If the answer is not found in the text, state that clearly.
Transcription Text:
---
{full_text}
---
Question: {question}
Answer:
"""
try:
print("Sending transcription and question to GPT-4o...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong comprehension and answering
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=300, # Adjust based on expected answer length
temperature=0.1 # Low temperature for factual answers based on text
)
answer = response.choices[0].message.content
print("Answer generation successful.")
return answer.strip()
except OpenAIError as e:
print(f"OpenAI API Error during Q&A: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during Q&A: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
transcription = transcribe_speech(client, audio_file_path)
if transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(transcription[:1000] + "..." if len(transcription) > 1000 else transcription)
print("------------------------------------")
# --- Ask Questions about the Transcription ---
# Define the question(s) you want to ask
user_question = "What was decided about the email marketing CTA button?"
# user_question = "Who is responsible for the A/B test on Platform B?"
# user_question = "What was the engagement increase on Platform A?"
print(f"\n--- Answering Question ---")
# Step 2: Get the answer from GPT-4o
answer = answer_question_about_text(
client,
transcription,
user_question
)
if answer:
print(f"\nAnswer to '{user_question}':")
print(answer)
print("------------------------------")
print("\nThis demonstrates GPT-4o answering specific questions based on the transcribed audio content.")
else:
print(f"\nFailed to get an answer for the question: '{user_question}'")
else:
print("\nTranscription failed, cannot proceed to Q&A.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability to function as a Q&A system for audio content. After transcribing speech with Whisper, users can ask specific questions in natural language, and GPT-4o will provide answers based on the information contained within the transcription.
- Two-Step Process: The workflow involves:
  - Step 1 (Whisper): Transcribe the relevant audio file (or the concatenated text from chunks of a longer file) using `client.audio.transcriptions.create`.
  - Step 2 (GPT-4o): Send the complete transcription along with the user's specific question to `client.chat.completions.create`.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file containing the discussion or information the user might ask about (`meeting_for_qa.mp3`). The critical note about handling audio larger than 25 MB via chunking and concatenation before the Q&A step remains essential; a chunking sketch follows this breakdown.
- Transcription Function (`transcribe_speech`): Handles Step 1.
- Q&A Function (`answer_question_about_text`):
  - Handles Step 2, taking both the `full_text` transcription and the `question` as input.
  - Prompt Engineering for Q&A: The prompt is crucial. It instructs GPT-4o to act as a specialized assistant that answers questions based only on the provided transcription text, explicitly telling it not to use external knowledge and to state when the answer is not found in the text. This grounding is important for accuracy, and a low `temperature` (e.g., 0.1) helps ensure factual answers derived directly from the source text.
  - Uses `gpt-4o` for its excellent reading comprehension and question-answering abilities.
- Output: The function returns GPT-4o's answer to the specific question asked.
- Main Execution: The script transcribes the audio, defines a sample `user_question`, passes the transcription and question to the Q&A function, and prints the resulting answer.
- Use Case Relevance: This directly addresses the "Q&A about the Audio" capability. It transforms audio recordings from passive archives into interactive knowledge sources: users can quickly find specific details, verify facts, or understand parts of a discussion without manually searching through the audio, making it invaluable for reviewing meetings, lectures, interviews, or any recorded conversation.
Remember to use an audio file containing information relevant to potential questions for testing (the sample audio provided above works well), and modify the `user_question` variable to test different queries against the transcribed content.
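For recordings larger than Whisper's 25 MB upload limit, the chunking mentioned in the breakdown can be done with any audio library. The sketch below uses pydub as one reasonable choice (an assumption, not a requirement): it splits the audio into ten-minute pieces, transcribes each piece, and joins the text before the Q&A step.

from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

def transcribe_long_audio(client, file_path, chunk_minutes=10):
    """Split a long recording into chunks, transcribe each, and join the text."""
    audio = AudioSegment.from_file(file_path)
    chunk_ms = chunk_minutes * 60 * 1000
    texts = []
    for start in range(0, len(audio), chunk_ms):
        chunk = audio[start:start + chunk_ms]
        chunk_path = f"chunk_{start // chunk_ms}.mp3"
        chunk.export(chunk_path, format="mp3")
        with open(chunk_path, "rb") as f:
            part = client.audio.transcriptions.create(
                model="whisper-1", file=f, response_format="text"
            )
        texts.append(part)
    return " ".join(texts)

# Usage sketch (assumes `client` and `answer_question_about_text` from the example above):
# transcription = transcribe_long_audio(client, "long_meeting.mp3")
# answer = answer_question_about_text(client, transcription, "What was decided?")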
Highlight Key Moments
Prompt example: "Identify the most important statements made in this audio."
GPT-4o excels at identifying and extracting crucial moments from audio content through its advanced natural language understanding capabilities. The model can:
- Identify key decisions and action items
- Extract important quotes and statements
- Highlight strategic discussions and conclusions
- Pinpoint critical transitions in conversations
This feature is particularly valuable for:
- Meeting participants who need to quickly review important takeaways
- Executives scanning long recordings for decision points
- Teams tracking project milestones discussed in calls
- Researchers identifying significant moments in interviews
Paired with Whisper's segment timestamps, the model can supply contextual summaries and time references for each highlighted moment, making it easier to navigate directly to the most relevant parts of the recording without reviewing the entire audio file.
Example:
This script follows the established two-step pattern: transcribing the audio with Whisper and then analyzing the text with GPT-4o using a prompt designed to identify significant statements, decisions, or conclusions.
Download the sample audio: https://files.cuantum.tech/audio/key_discussion.mp3
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import datetime
# --- Configuration ---
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Capture a timestamp and an example location label for the run log
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Tampa, Florida, United States"  # example label; adjust or remove as needed
print(f"Running GPT-4o key moment highlighting example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not found in environment variables.")
client = OpenAI(api_key=api_key)
print("OpenAI client initialized.")
except ValueError as e:
print(f"Configuration Error: {e}")
exit()
except Exception as e:
print(f"Error initializing OpenAI client: {e}")
exit()
# Define the path to your audio file (or audio chunk)
# IMPORTANT: Replace 'key_discussion.mp3' with the actual filename.
audio_file_path = "key_discussion.mp3"
# --- Step 1: Transcribe Speech to Text using Whisper ---
# Reusing the function from previous examples
def transcribe_speech(client, file_path):
"""Transcribes the audio file using Whisper."""
print(f"\nStep 1: Transcribing audio file: {file_path}")
if not os.path.exists(file_path):
print(f"Error: Audio file not found at '{file_path}'")
return None
# Check the file size and warn if it exceeds the 25 MB API limit (larger files need chunking)
try:
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB (Limit: 25MB per API call)")
if file_size_mb > 25:
print("Warning: File exceeds 25MB. For full content, chunking and multiple transcriptions are needed before highlighting.")
except OSError:
pass # Ignore size check error
try:
with open(file_path, "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
print("Transcription successful.")
return response # Returns plain text
except OpenAIError as e:
print(f"OpenAI API Error during transcription: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during transcription: {e}")
return None
# --- Step 2: Highlight Key Moments from Text using GPT-4o ---
def highlight_key_moments(client, text_to_analyze):
"""Sends transcribed text to GPT-4o to identify and extract key moments."""
print("\nStep 2: Identifying key moments from transcription...")
if not text_to_analyze:
print("Error: No text provided for analysis.")
return None
# Prompt designed specifically for identifying key moments/statements
system_prompt = "You are an expert analyst skilled at identifying the most significant parts of a discussion or presentation."
user_prompt = f"""Analyze the following transcription text. Identify and extract the key moments, which could include:
- Important decisions made
- Critical conclusions reached
- Significant statements or impactful quotes
- Major topic shifts or transitions
- Key questions asked or answered
For each key moment identified, provide the relevant quote or a concise summary of the moment. Present the output as a list.
Transcription Text:
---
{text_to_analyze}
---
Key Moments:
"""
try:
print("Sending text to GPT-4o for key moment identification...")
response = client.chat.completions.create(
model="gpt-4o", # Use GPT-4o for strong comprehension
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
max_tokens=700, # Adjust based on expected number/length of key moments
temperature=0.3 # Lean towards factual identification
)
key_moments = response.choices[0].message.content
print("Key moment identification successful.")
return key_moments.strip()
except OpenAIError as e:
print(f"OpenAI API Error during highlighting: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred during highlighting: {e}")
return None
# --- Main Execution ---
if __name__ == "__main__":
# Step 1: Transcribe the audio discussion (or chunk)
full_transcription = transcribe_speech(client, audio_file_path)
if full_transcription:
print(f"\n--- Full Transcription (Preview) ---")
# Limit printing very long transcripts in the example output
print(full_transcription[:1000] + "..." if len(full_transcription) > 1000 else full_transcription)
print("------------------------------------")
# Step 2: Highlight Key Moments
highlights = highlight_key_moments(
client,
full_transcription
)
if highlights:
print("\n--- Identified Key Moments ---")
print(highlights)
print("----------------------------")
print("\nThis demonstrates GPT-4o extracting significant parts from the discussion.")
print("\nNote: Adding precise timestamps to these moments requires further processing using Whisper's 'verbose_json' output and correlating the text.")
else:
print("\nFailed to identify key moments.")
else:
print("\nTranscription failed, cannot proceed to highlight key moments.")
Code breakdown:
- Context: This code demonstrates GPT-4o's capability to Highlight Key Moments from spoken content. After transcription via Whisper, GPT-4o analyzes the text to pinpoint and extract the most significant parts, such as crucial decisions, important statements, or major topic shifts.
- Two-Step Process:
  - Step 1 (Whisper): Transcribe the audio with `client.audio.transcriptions.create` to get the full text. The need to chunk and concatenate audio files larger than 25 MB is reiterated.
  - Step 2 (GPT-4o): Analyze the complete transcription using `client.chat.completions.create` with a prompt that specifically asks for key moments.
- Prerequisites: Standard setup (`openai`, `python-dotenv`, API key) and an audio file containing a discussion or presentation with significant moments (`key_discussion.mp3`).
- Transcription Function (`transcribe_speech`): Handles Step 1.
- Highlighting Function (`highlight_key_moments`):
  - Handles Step 2, taking the full transcription text as input.
  - Prompt Engineering for Highlights: The prompt instructs GPT-4o to act as an analyst and identify various types of key moments (decisions, conclusions, impactful quotes, transitions), returning the relevant quote or a concise summary of each as a list.
  - Uses `gpt-4o` for its ability to discern importance and context within text.
- Output: The function returns a text string containing the list of identified key moments.
- Timestamp Note: The explanation and the code output both state that while this process identifies the text of key moments, adding precise timestamps requires additional steps: using Whisper's `verbose_json` response format (which includes segment timestamps) and then correlating the text identified by GPT-4o back to those timed segments. A minimal sketch of retrieving segment timestamps follows this breakdown.
- Main Execution: The script transcribes the audio, passes the text to the highlighting function, and prints the resulting list of key moments.
- Use Case Relevance: This addresses the "Highlight Key Moments" capability by showing how AI can quickly sift through potentially long recordings to surface the most critical parts. This is highly valuable for efficient review of meetings, interviews, or lectures, allowing users to focus on what matters most without listening to the entire audio.
For testing purposes, use an audio file that contains a relevant discussion with clear, identifiable key segments (you can use the sample audio file provided).
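As noted in the breakdown, attaching playback positions to the highlights starts with Whisper's segment timestamps. The sketch below requests them with `response_format="verbose_json"`; in recent versions of the openai Python SDK the response exposes a `segments` list with start and end times (if your version differs, inspect the raw JSON). The substring match used to align a highlight with a segment is a deliberately naive illustration, not a robust alignment method.

# Request segment-level timestamps from Whisper (verbose_json format).
# Assumes `client` and the audio path from the example above are in scope.
with open("key_discussion.mp3", "rb") as audio_file:
    verbose = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json"
    )

# Each segment carries start/end times (in seconds) plus its text.
for segment in verbose.segments:
    print(f"[{segment.start:7.2f}s - {segment.end:7.2f}s] {segment.text.strip()}")

# Naive alignment: find segments containing a phrase GPT-4o flagged as a key moment.
key_phrase = "launch date"  # illustrative placeholder; use text from the highlights
matches = [s for s in verbose.segments if key_phrase.lower() in s.text.lower()]
for s in matches:
    print(f"Key phrase found around {s.start:.1f}s")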
2.3.3 Real-World Use Cases
The modern business landscape increasingly relies on audio communication across various sectors, from sales and customer service to education and personal development. Understanding and effectively utilizing these audio interactions has become crucial for organizations seeking to improve their operations, enhance customer relationships, and drive better outcomes. This section explores several key applications where advanced audio processing and analysis can create significant value, demonstrating how AI-powered tools can transform raw audio data into actionable insights.
From analyzing sales conversations to enhancing educational experiences, these use cases showcase the versatility and power of audio understanding technologies in addressing real-world challenges. Each application represents a unique opportunity to leverage voice data for improved decision-making, process optimization, and better user experiences.
1. Sales Enablement
Advanced analysis of sales call recordings provides a comprehensive toolkit for sales teams to optimize their performance. The system can identify key objections raised by prospects, allowing teams to develop better counter-arguments and prepare responses in advance. It tracks successful closing techniques by analyzing patterns in successful deals, revealing which approaches work best for different customer segments and situations.
The system also measures crucial metrics like conversion rates, call duration, talk-to-listen ratios, and key phrase usage. This data helps sales teams understand which behaviors correlate with successful outcomes. By analyzing customer responses and reaction patterns, teams can refine their pitch timing, improve their questioning techniques, and better understand buying signals.
This technology also enables sales managers to document and share effective approaches across the team, creating a knowledge base of best practices for common challenges. This institutional knowledge can be particularly valuable for onboarding new team members and maintaining consistent sales excellence across the organization.
2. Meeting Intelligence
Comprehensive meeting analysis transforms how organizations capture and utilize meeting content. The system goes beyond basic transcription by:
- Identifying and categorizing key discussion points for easy reference
- Automatically detecting and extracting action items from conversations
- Assigning responsibilities to specific team members based on verbal commitments
- Creating structured timelines and tracking deadlines mentioned during meetings
- Generating automated task lists with clear ownership and due dates
- Highlighting decision points and meeting outcomes
- Providing searchable meeting archives for future reference
The system employs advanced natural language processing to understand context, relationships, and commitments expressed during conversations. This enables automatic task creation and assignment, ensuring nothing falls through the cracks. Integration with project management tools allows for seamless workflow automation, while smart reminders help keep team members accountable for their commitments.
3. Customer Support
Deep analysis of customer service interactions provides comprehensive insights into customer experience and support team performance. The system can:
- Evaluate customer sentiment in real-time by analyzing tone, word choice, and conversation flow
- Automatically categorize and prioritize urgent issues based on keyword detection and context analysis
- Generate detailed satisfaction metrics through conversation analysis and customer feedback
- Track key performance indicators like first-response time and resolution time
- Identify common pain points and recurring issues across multiple interactions
- Monitor support agent performance and consistency in service delivery
This enables support teams to improve response times, identify trending problems, and maintain consistent service quality across all interactions. The system can also provide automated coaching suggestions for support agents and generate insights for product improvement based on customer feedback patterns.
4. Personal Journaling
Transform voice memos into structured reflections with emotional context analysis. Using advanced natural language processing, the system analyzes voice recordings to detect emotional states, stress levels, and overall sentiment through tone of voice, word choice, and speaking patterns. This creates a rich, multi-dimensional journal entry that captures not just what was said, but how it was expressed.
The system's mood tracking capabilities go beyond simple positive/negative classifications, identifying nuanced emotional states like excitement, uncertainty, confidence, or concern. By analyzing these patterns over time, users can gain valuable insights into their emotional well-being and identify triggers or patterns that affect their mental state.
For personal goal tracking, the system can automatically categorize and tag mentions of objectives, progress updates, and setbacks. It can generate progress reports showing momentum toward specific goals, highlight common obstacles, and even suggest potential solutions based on past successful strategies. The behavioral trend analysis examines patterns in decision-making, habit formation, and personal growth, providing users with actionable insights for self-improvement.
5. Education & Language Practice
Comprehensive language learning support revolutionizes how students practice and improve their language skills. The system provides several key benefits:
- Speech Analysis: Advanced algorithms analyze pronunciation patterns, detecting subtle variations in phonemes, stress patterns, and intonation. This helps learners understand exactly where their pronunciation differs from native speakers.
- Error Detection: The system identifies not just pronunciation errors, but also grammatical mistakes, incorrect word usage, and syntactical issues in real-time. This immediate feedback helps prevent the formation of bad habits.
- Personalized Feedback: Instead of generic corrections, the system provides context-aware feedback that considers the learner's proficiency level, native language, and common interference patterns specific to their language background.
- Progress Tracking: Sophisticated metrics track various aspects of language development, including vocabulary range, speaking fluency, grammar accuracy, and pronunciation improvement over time. Visual progress reports help motivate learners and identify areas needing focus.
- Adaptive Learning: Based on performance analysis, the system creates customized exercise plans targeting specific weaknesses. These might include focused pronunciation drills, grammar exercises, or vocabulary building activities tailored to the learner's needs.
The system can track improvement over time and suggest targeted exercises for areas needing improvement, creating a dynamic and responsive learning environment that adapts to each student's progress.
2.3.4 Privacy Considerations
Privacy is paramount when handling audio recordings. First and foremost, obtaining consent before analyzing third-party voice recordings is a crucial legal and ethical requirement. It's essential to secure written or documented permission from all participants before processing any voice recordings, whether they're from meetings, interviews, calls, or other audio content involving third parties. Organizations should implement a formal consent process that clearly outlines how the audio will be used and analyzed.
Security measures must be implemented throughout the processing workflow. If you upload audio through OpenAI's Files API, delete those files once analysis is complete (for example, with client.files.delete(file_id) in the current Python SDK) to minimize data exposure and reduce the risk of unauthorized access or data breaches. Organizations should establish automated cleanup procedures, covering both files on OpenAI's side and any local copies or chunks created during processing, so that processed audio is deleted consistently.
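A minimal cleanup sketch for that case (uploads made through the Files API) might look like the following; the filename-based matching rule is illustrative only, and the loop touches nothing outside the Files API.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Delete files previously uploaded via client.files.create() whose names mark them
# as processed audio. Adapt the matching rule to your own naming convention.
for stored_file in client.files.list().data:
    if stored_file.filename.endswith(".mp3"):  # illustrative matching rule
        client.files.delete(stored_file.id)
        print(f"Deleted {stored_file.filename} ({stored_file.id})")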
Long-term storage of voice data requires special consideration. Never store sensitive voice recordings without explicit approval from all parties involved. Organizations should implement strict data handling policies that clearly specify storage duration, security measures, and intended use. Extra caution should be taken with recordings containing personal information, business secrets, or confidential discussions. Best practices include implementing encryption for stored audio files and maintaining detailed access logs.
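Where recordings must be kept, encrypting them at rest is inexpensive to add. The sketch below uses the cryptography package's Fernet recipe as one reasonable option (an assumption, not a mandate); real deployments would also need proper key management, which is out of scope here.

from cryptography.fernet import Fernet  # pip install cryptography

# Generate (and securely store!) a symmetric key, then encrypt an audio file at rest.
key = Fernet.generate_key()
fernet = Fernet(key)

with open("meeting_tasks.mp3", "rb") as f:
    encrypted = fernet.encrypt(f.read())

with open("meeting_tasks.mp3.enc", "wb") as f:
    f.write(encrypted)

# Decrypt only when the audio is needed for processing, then remove the plaintext copy.
with open("meeting_tasks.mp3.enc", "rb") as f:
    plaintext = fernet.decrypt(f.read())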