Chapter 5: Image and Audio Integration Projects
Practical Exercises — Chapter 5
Exercise 1: Create a DALL·E Image Generator with Flask
Task:
Build a Flask web app that allows users to enter a text prompt and receive an AI-generated image using OpenAI's DALL·E 3 model.
Solution:
# app.py
from flask import Flask, request, render_template
import openai
import os
from dotenv import load_dotenv

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    image_url = None
    if request.method == "POST":
        prompt = request.form["prompt"]
        response = openai.Image.create(
            prompt=prompt,
            model="dall-e-3",
            size="1024x1024",
            response_format="url"
        )
        image_url = response["data"][0]["url"]
    return render_template("index.html", image_url=image_url)

if __name__ == "__main__":
    app.run(debug=True)
<!-- templates/index.html -->
<form method="post">
<textarea name="prompt" rows="3" placeholder="Describe an image..."></textarea><br>
<input type="submit" value="Generate Image">
</form>
{% if image_url %}
<img src="{{ image_url }}" alt="Generated Image">
{% endif %}
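Note that this solution uses the legacy openai 0.x module-level helpers. If your environment has openai 1.0 or newer installed, openai.Image.create no longer exists; the sketch below is an equivalent call with the client-based 1.x SDK (an assumption about your setup, not part of the original solution):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_image_url(prompt: str) -> str:
    # DALL·E 3 via the 1.x images endpoint; returns a temporary hosted URL
    result = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        n=1,
    )
    return result.data[0].url

The route logic stays the same; only the call that produces image_url changes.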
Exercise 2: Transcribe Audio Using Whisper
Task:
Create a Flask app that allows users to upload an audio file and view the transcription using Whisper.
Solution:
@app.route("/", methods=["GET", "POST"])
def index():
    transcript = None
    if request.method == "POST":
        file = request.files["audio_file"]
        if file:
            file.save("temp.m4a")
            with open("temp.m4a", "rb") as audio_file:
                result = openai.Audio.transcribe(
                    model="whisper-1",
                    file=audio_file
                )
            transcript = result["text"]
    return render_template("index.html", transcript=transcript)
<!-- templates/index.html -->
<form method="post" enctype="multipart/form-data">
<input type="file" name="audio_file" required>
<input type="submit" value="Transcribe">
</form>
{% if transcript %}
<textarea readonly>{{ transcript }}</textarea>
{% endif %}
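Saving every upload to a fixed temp.m4a is fine for a single-user demo, but it assumes the file really is M4A and lets concurrent requests overwrite each other. A slightly more defensive sketch for the same app.py, still using the openai.Audio.transcribe call from the solution (the transcribe_upload helper name is illustrative), keeps the upload's own extension and a unique temporary path:

import os
import tempfile
from werkzeug.utils import secure_filename

def transcribe_upload(file_storage) -> str:
    # Keep the original extension so Whisper can detect the audio format
    suffix = os.path.splitext(secure_filename(file_storage.filename))[1] or ".m4a"
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        file_storage.save(tmp_path)
        with open(tmp_path, "rb") as audio_file:
            result = openai.Audio.transcribe(model="whisper-1", file=audio_file)
        return result["text"]
    finally:
        os.remove(tmp_path)  # clean up the temporary copy

Inside the route you would then call transcript = transcribe_upload(file) instead of saving and reopening the file by hand.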
Exercise 3: Convert Transcription into Image Prompt
Task:
After transcribing the audio, send the transcript to GPT-4o to generate a creative scene description.
Solution (within app.py):
# After getting `transcript`
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Convert this transcription into a visual scene prompt for DALL·E."},
        {"role": "user", "content": transcript}
    ]
)
prompt_summary = response["choices"][0]["message"]["content"]
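As with the image call, openai.ChatCompletion is the legacy 0.x interface. If your project is on the 1.x SDK instead, the equivalent step looks roughly like this (assuming the same client object shown in the Exercise 1 note):

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Convert this transcription into a visual scene prompt for DALL·E."},
        {"role": "user", "content": transcript},
    ],
)
prompt_summary = response.choices[0].message.content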
Exercise 4: Build a Multimodal Assistant (Audio ➝ Text ➝ Image)
Task:
Integrate the full pipeline: upload audio → transcribe with Whisper → generate visual prompt with GPT-4o → generate image with DALL·E.
Solution:
@app.route("/", methods=["GET", "POST"])
def index():
    transcript, prompt_summary, image_url = None, None, None
    if request.method == "POST":
        file = request.files["audio_file"]
        if file:
            file.save("temp.m4a")

            # Step 1: Transcribe
            with open("temp.m4a", "rb") as audio_file:
                transcript = openai.Audio.transcribe("whisper-1", file=audio_file)["text"]

            # Step 2: Convert to visual scene
            response = openai.ChatCompletion.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": "Convert this message into a vivid scene description for image generation."},
                    {"role": "user", "content": transcript}
                ]
            )
            prompt_summary = response["choices"][0]["message"]["content"]

            # Step 3: Generate image
            image_url = openai.Image.create(
                prompt=prompt_summary,
                model="dall-e-3",
                size="1024x1024",
                response_format="url"
            )["data"][0]["url"]
    return render_template("index.html", transcript=transcript, prompt_summary=prompt_summary, image_url=image_url)
<!-- Display in index.html -->
{% if transcript %}
<h3>Transcript:</h3>
<textarea readonly>{{ transcript }}</textarea>
{% endif %}
{% if prompt_summary %}
<h3>Image Prompt:</h3>
<p>{{ prompt_summary }}</p>
{% endif %}
{% if image_url %}
<h3>Generated Image:</h3>
<img src="{{ image_url }}" alt="Generated">
{% endif %}
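Each request in this pipeline makes three network calls (Whisper, GPT-4o, DALL·E), and any of them can fail on bad audio, rate limits, or a content-policy rejection. A minimal sketch, assuming the legacy 0.x SDK used above, pulls the three steps into a helper and reports failures instead of crashing the page; the run_pipeline name and the extra error value passed to the template are assumptions, not part of the original solution:

from openai.error import OpenAIError  # base class for 0.x API errors

def run_pipeline(audio_path):
    # Run transcribe -> describe -> generate; return an error message on failure
    try:
        with open(audio_path, "rb") as audio_file:
            transcript = openai.Audio.transcribe("whisper-1", file=audio_file)["text"]
        response = openai.ChatCompletion.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Convert this message into a vivid scene description for image generation."},
                {"role": "user", "content": transcript},
            ],
        )
        prompt_summary = response["choices"][0]["message"]["content"]
        image_url = openai.Image.create(
            prompt=prompt_summary,
            model="dall-e-3",
            size="1024x1024",
            response_format="url"
        )["data"][0]["url"]
        return transcript, prompt_summary, image_url, None
    except OpenAIError as exc:
        return None, None, None, f"OpenAI request failed: {exc}"

The route can then call run_pipeline("temp.m4a") and pass the fourth value to index.html, where a small {% if error %} block displays it.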
Exercise 5: Experiment with Input Variations
Task:
Test your app with different types of audio content:
- Descriptions (e.g., "a cat sitting on a window sill during a snowstorm")
- Emotions (e.g., "I felt so peaceful at the beach")
- Narratives (e.g., "Yesterday I saw a dragonfly land on my laptop")
Observe how the transcription → GPT → DALL·E chain adapts to each.
Reflection Questions:
- How accurate is Whisper’s transcription?
- How creative or literal is GPT-4o’s interpretation?
- Do DALL·E’s images reflect the intended mood or concept?
In these exercises, you:
- Built a DALL·E-powered image generator
- Used Whisper to transcribe real voice input
- Chained together speech → text → GPT prompt → image
- Learned to manage multiple response types (text, image) in Flask
- Created a simple multimodal AI experience
These tools are just the beginning — the same building blocks can be used for story apps, educational tools, marketing generators, voice design assistants, and more.