Chapter 2: Audio Understanding and Generation with Whisper and GPT-4o
Practical Exercises — Chapter 2
Exercise 1: Transcribe an English Audio File
Task:
Use the Whisper API to transcribe a short .mp3 audio file containing English speech.
Solution:
import os

from dotenv import load_dotenv
from openai import OpenAI

# Load OPENAI_API_KEY from a local .env file
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Open the file in binary mode and send it to the Whisper endpoint
with open("english_note.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text",  # return a plain string rather than JSON
    )

print("Transcript:\n", transcript)
Exercise 2: Translate Foreign Language Audio into English
Task:
Upload a non-English audio file and translate it into English using the Whisper API.
Solution:
# The translations endpoint always produces English output
with open("spanish_clip.mp3", "rb") as audio_file:
    translated = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        response_format="text",
    )

print("Translation:\n", translated)
Exercise 3: Upload and Analyze an Audio File with GPT-4o
Task:
Send an .mp3 audio file to GPT-4o and ask it to summarize the content.
Solution:
import base64

# Read the audio and base64-encode it; the Chat Completions API accepts
# audio inline as an "input_audio" content part, not by uploaded file ID
with open("meeting_summary.mp3", "rb") as f:
    encoded_audio = base64.b64encode(f.read()).decode("utf-8")

# Direct audio input requires an audio-capable snapshot such as
# "gpt-4o-audio-preview"; base "gpt-4o" does not accept audio parts
response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text"],
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please summarize this meeting."},
                {
                    "type": "input_audio",
                    "input_audio": {"data": encoded_audio, "format": "mp3"},
                },
            ],
        }
    ],
)

print("Summary:\n", response.choices[0].message.content)
Exercise 4: Generate a Spoken Response Using Text-to-Speech
Task:
Take a GPT-generated reply and convert it to speech using OpenAI’s TTS API.
Solution:
text_to_speak = "Sure! The marketing meeting discussed Q3 strategies and budget allocations."

# Synthesize speech with one of the built-in voices
# (alloy, echo, fable, onyx, nova, shimmer)
speech = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    input=text_to_speak,
)

# The response body is the binary MP3 audio
with open("spoken_reply.mp3", "wb") as f:
    f.write(speech.content)

print("Voice reply saved as 'spoken_reply.mp3'")
Exercise 5: Build a Voice-to-Voice Mini Assistant
Task:
Build a basic pipeline that accepts audio, generates a response using GPT-4o, and replies back using synthesized voice.
Solution:
import base64

# Step 1: Read and base64-encode the user's voice prompt
with open("user_voice_prompt.mp3", "rb") as f:
    user_audio = base64.b64encode(f.read()).decode("utf-8")

# Step 2: An audio-capable GPT-4o snapshot processes it
chat = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text"],
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please answer this question politely."},
                {
                    "type": "input_audio",
                    "input_audio": {"data": user_audio, "format": "mp3"},
                },
            ],
        }
    ],
)
reply = chat.choices[0].message.content

# Step 3: Convert the text reply back into speech
tts = client.audio.speech.create(
    model="tts-1",
    voice="echo",
    input=reply,
)
with open("voice_response.mp3", "wb") as f:
    f.write(tts.content)

print("Assistant reply saved as 'voice_response.mp3'")
In these exercises, you practiced:
- Transcribing audio files with Whisper
- Translating foreign speech to English
- Summarizing and interpreting audio with GPT-4o
- Converting GPT replies into natural-sounding speech
- Building your first voice-to-voice assistant pipeline
You now have the tools to integrate speech into any AI project, whether you're building a language tutor, an accessibility assistant, a voice-based productivity tool, or a smart speaker experience.