Chapter 5: Image and Audio Integration Projects
Practical Exercises — Chapter 5
Exercise 1: Create a DALL·E Image Generator with Flask
Task:
Build a Flask web app that allows users to enter a text prompt and receive an AI-generated image using OpenAI's DALL·E 3 model.
Solution:
# app.py
from flask import Flask, request, render_template
import openai
import os
from dotenv import load_dotenv

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    image_url = None
    if request.method == "POST":
        prompt = request.form["prompt"]
        response = openai.Image.create(
            prompt=prompt,
            model="dall-e-3",
            size="1024x1024",
            response_format="url"
        )
        image_url = response["data"][0]["url"]
    return render_template("index.html", image_url=image_url)

if __name__ == "__main__":
    app.run(debug=True)
<!-- templates/index.html -->
<form method="post">
<textarea name="prompt" rows="3" placeholder="Describe an image..."></textarea><br>
<input type="submit" value="Generate Image">
</form>
{% if image_url %}
<img src="{{ image_url }}" alt="Generated Image">
{% endif %}
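Note that this solution uses the legacy openai 0.x module-level helpers. If your environment has openai 1.0 or newer installed, openai.Image.create no longer exists; the sketch below is an equivalent call with the client-based 1.x SDK (an assumption about your setup, not part of the original solution):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_image_url(prompt: str) -> str:
    # DALL·E 3 via the 1.x images endpoint; returns a temporary hosted URL
    result = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        n=1,
    )
    return result.data[0].url

The route logic stays the same; only the call that produces image_url changes.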
Exercise 2: Transcribe Audio Using Whisper
Task:
Create a Flask app that allows users to upload an audio file and view the transcription using Whisper.
Solution:
@app.route("/", methods=["GET", "POST"])
def index():
    transcript = None
    if request.method == "POST":
        file = request.files["audio_file"]
        if file:
            file.save("temp.m4a")
            with open("temp.m4a", "rb") as audio_file:
                result = openai.Audio.transcribe(
                    model="whisper-1",
                    file=audio_file
                )
            transcript = result["text"]
    return render_template("index.html", transcript=transcript)
<!-- templates/index.html -->
<form method="post" enctype="multipart/form-data">
<input type="file" name="audio_file" required>
<input type="submit" value="Transcribe">
</form>
{% if transcript %}
<textarea readonly>{{ transcript }}</textarea>
{% endif %}
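Saving every upload to a fixed temp.m4a is fine for a single-user demo, but it assumes the file really is M4A and lets concurrent requests overwrite each other. A slightly more defensive sketch for the same app.py, still using the openai.Audio.transcribe call from the solution (the transcribe_upload helper name is illustrative), keeps the upload's own extension and a unique temporary path:

import os
import tempfile
from werkzeug.utils import secure_filename

def transcribe_upload(file_storage) -> str:
    # Keep the original extension so Whisper can detect the audio format
    suffix = os.path.splitext(secure_filename(file_storage.filename))[1] or ".m4a"
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        file_storage.save(tmp_path)
        with open(tmp_path, "rb") as audio_file:
            result = openai.Audio.transcribe(model="whisper-1", file=audio_file)
        return result["text"]
    finally:
        os.remove(tmp_path)  # clean up the temporary copy

Inside the route you would then call transcript = transcribe_upload(file) instead of saving and reopening the file by hand.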
Exercise 3: Convert Transcription into Image Prompt
Task:
After transcribing the audio, send the transcript to GPT-4o to generate a creative scene description.
Solution (within app.py):
# After getting `transcript`
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Convert this transcription into a visual scene prompt for DALL·E."},
        {"role": "user", "content": transcript}
    ]
)
prompt_summary = response["choices"][0]["message"]["content"]
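As with the image call, openai.ChatCompletion is the legacy 0.x interface. If your project is on the 1.x SDK instead, the equivalent step looks roughly like this (assuming the same client object shown in the Exercise 1 note):

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Convert this transcription into a visual scene prompt for DALL·E."},
        {"role": "user", "content": transcript},
    ],
)
prompt_summary = response.choices[0].message.content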
Exercise 4: Build a Multimodal Assistant (Audio ➝ Text ➝ Image)
Task:
Integrate the full pipeline: upload audio → transcribe with Whisper → generate visual prompt with GPT-4o → generate image with DALL·E.
Solution:
@app.route("/", methods=["GET", "POST"])
def index():
    transcript, prompt_summary, image_url = None, None, None
    if request.method == "POST":
        file = request.files["audio_file"]
        if file:
            file.save("temp.m4a")

            # Step 1: Transcribe
            with open("temp.m4a", "rb") as audio_file:
                transcript = openai.Audio.transcribe("whisper-1", file=audio_file)["text"]

            # Step 2: Convert to visual scene
            response = openai.ChatCompletion.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": "Convert this message into a vivid scene description for image generation."},
                    {"role": "user", "content": transcript}
                ]
            )
            prompt_summary = response["choices"][0]["message"]["content"]

            # Step 3: Generate image
            image_url = openai.Image.create(
                prompt=prompt_summary,
                model="dall-e-3",
                size="1024x1024",
                response_format="url"
            )["data"][0]["url"]
    return render_template("index.html", transcript=transcript, prompt_summary=prompt_summary, image_url=image_url)
<!-- Display in index.html -->
{% if transcript %}
<h3>Transcript:</h3>
<textarea readonly>{{ transcript }}</textarea>
{% endif %}
{% if prompt_summary %}
<h3>Image Prompt:</h3>
<p>{{ prompt_summary }}</p>
{% endif %}
{% if image_url %}
<h3>Generated Image:</h3>
<img src="{{ image_url }}" alt="Generated">
{% endif %}
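Each request in this pipeline makes three network calls (Whisper, GPT-4o, DALL·E), and any of them can fail on bad audio, rate limits, or a content-policy rejection. A minimal sketch, assuming the legacy 0.x SDK used above, pulls the three steps into a helper and reports failures instead of crashing the page; the run_pipeline name and the extra error value passed to the template are assumptions, not part of the original solution:

from openai.error import OpenAIError  # base class for 0.x API errors

def run_pipeline(audio_path):
    # Run transcribe -> describe -> generate; return an error message on failure
    try:
        with open(audio_path, "rb") as audio_file:
            transcript = openai.Audio.transcribe("whisper-1", file=audio_file)["text"]
        response = openai.ChatCompletion.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Convert this message into a vivid scene description for image generation."},
                {"role": "user", "content": transcript},
            ],
        )
        prompt_summary = response["choices"][0]["message"]["content"]
        image_url = openai.Image.create(
            prompt=prompt_summary,
            model="dall-e-3",
            size="1024x1024",
            response_format="url"
        )["data"][0]["url"]
        return transcript, prompt_summary, image_url, None
    except OpenAIError as exc:
        return None, None, None, f"OpenAI request failed: {exc}"

The route can then call run_pipeline("temp.m4a") and pass the fourth value to index.html, where a small {% if error %} block displays it.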
Exercise 5: Experiment with Input Variations
Task:
Test your app with different types of audio content:
- Descriptions (e.g., "a cat sitting on a window sill during a snowstorm")
- Emotions (e.g., "I felt so peaceful at the beach")
- Narratives (e.g., "Yesterday I saw a dragonfly land on my laptop")
Observe how the transcription → GPT → DALL·E chain adapts to each.
Reflection Questions:
- How accurate is Whisper’s transcription?
- How creative or literal is GPT-4o’s interpretation?
- Do DALL·E’s images reflect the intended mood or concept?
In these exercises, you:
- Built a DALL·E-powered image generator
- Used Whisper to transcribe real voice input
- Chained together speech → text → GPT prompt → image
- Learned to manage multiple response types (text, image) in Flask
- Created a simple multimodal AI experience
These tools are just the beginning — the same building blocks can be used for story apps, educational tools, marketing generators, voice design assistants, and more.