Chapter 5: Image and Audio Integration Projects
5.5 Basic Integration of Multiple Modalities
In this section, we'll explore how to combine multiple AI modalities - speech, language understanding, and image generation - into a single cohesive application. While previous sections focused on working with individual technologies, here we'll learn how these powerful tools can work together to create more sophisticated and engaging user experiences.
This integration represents a significant step forward in AI application development, moving beyond simple single-purpose tools to create systems that can process and respond across different forms of communication. By combining Whisper's audio transcription capabilities, GPT-4o's natural language understanding, and DALL·E's image generation, we can build applications that truly demonstrate the potential of modern AI technologies.
The project we'll build serves as an excellent introduction to multimodal AI integration, demonstrating how different AI models can be orchestrated to create a seamless experience. This approach opens up exciting possibilities for developers looking to create more natural and intuitive human-AI interactions.
5.5.1 What You'll Build
In this section, you'll create a sophisticated Flask-based web application that functions as a basic multimodal assistant. This application demonstrates the seamless integration of multiple AI technologies to create an interactive and intelligent user experience. The assistant is capable of:
- Accepting an audio message from the user through a clean web interface, supporting various audio formats
- Leveraging OpenAI's Whisper technology to accurately transcribe the audio message into text, handling different accents and languages
- Utilizing GPT-4o's advanced natural language processing capabilities to analyze the transcribed text, understanding context, intent, and key themes
- Employing DALL·E 3's sophisticated image generation abilities to create relevant, high-quality visuals based on the understood context
- Presenting a cohesive user experience by displaying both the original transcription and the AI-generated image on a single, well-designed webpage
This project serves as an excellent example of a multimodal assistant that seamlessly combines three different types of AI processing: audio processing, natural language understanding, and image generation. By processing audio input, extracting meaningful context, and creating visual representations, it demonstrates the potential of integrated AI technologies. This foundation opens up exciting possibilities for various real-world applications, such as:
- Spoken design prompts for visual creators - enabling artists and designers to verbally describe their vision and instantly see it rendered
- Audio journaling with illustrated output - transforming spoken diary entries into visual memories with matching AI-generated artwork
- Voice-controlled storytelling applications - creating interactive narratives where spoken words come to life through instant visual generation
5.5.2 Step-by-Step Implementation
Step 1: Install Required Packages
Download the sample audio: https://files.cuantum.tech/audio/audio-file-sample.mp3
Ensure you have the necessary Python libraries installed. Open your terminal and execute the following command:
pip install flask openai python-dotenv
This command installs:
- flask: A micro web framework for building the web application.
- openai: The OpenAI Python library for interacting with the Whisper, Chat Completion, and DALL·E 3 APIs.
- python-dotenv: A library to load environment variables from a .env file.
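With the packages installed, create the .env file referenced above. A minimal example is shown below; the value is a placeholder, so replace it with your own OpenAI API key and keep the file out of version control:

# .env (placeholder value; replace with your real key)
OPENAI_API_KEY=sk-your-api-key-here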
Step 2: Set Up Project Structure
Create the following folder structure for your project:
/multimodal_app
│
├── app.py
├── .env
└── templates/
└── index.html
- /multimodal_app: The root directory for your project.
- app.py: The Python file containing the Flask application code.
- .env: A file to store your OpenAI API key.
- templates/: A directory to store your HTML templates.
- templates/index.html: The HTML template for the main page of your application.
Step 3: Create the Flask App (app.py)
Create a Python file named app.py in the root directory of your project and add the following code:
from flask import Flask, request, render_template
import openai
import os
from dotenv import load_dotenv
import logging
from typing import Optional

# The openai>=1.0 library exposes a module-level client configured via openai.api_key.
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

app = Flask(__name__)

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'}  # Allowed audio file extensions


def allowed_file(filename: str) -> bool:
    """
    Checks if the uploaded file has an allowed extension.

    Args:
        filename (str): The name of the file.

    Returns:
        bool: True if the file has an allowed extension, False otherwise.
    """
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


def transcribe_audio(file_path: str) -> Optional[str]:
    """
    Transcribes an audio file using OpenAI's Whisper API.

    Args:
        file_path (str): The path to the audio file.

    Returns:
        Optional[str]: The transcribed text, or None on error.
    """
    try:
        logger.info(f"Transcribing audio file: {file_path}")
        with open(file_path, "rb") as audio_file:
            response = openai.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
            )
        transcript = response.text
        logger.info(f"Transcription successful. Length: {len(transcript)} characters.")
        return transcript
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error during transcription: {e}")
        return None


def generate_image_prompt(text: str) -> Optional[str]:
    """
    Generates a prompt for DALL·E 3 based on the transcribed text using GPT-4o.

    Args:
        text (str): The transcribed text.

    Returns:
        Optional[str]: The generated image prompt, or None on error.
    """
    try:
        logger.info("Generating image prompt using GPT-4o")
        response = openai.chat.completions.create(
            model="gpt-4o",  # You can also experiment with other chat models
            messages=[
                {
                    "role": "system",
                    "content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content. Do not include any phrases like 'based on the audio' or 'from the user audio'.",
                },
                {"role": "user", "content": text},
            ],
        )
        prompt = response.choices[0].message.content
        logger.info(f"Generated image prompt: {prompt}")
        return prompt
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image prompt: {e}")
        return None


def generate_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024", response_format: str = "url") -> Optional[str]:
    """
    Generates an image using OpenAI's DALL·E API.

    Args:
        prompt (str): The text prompt to generate the image from.
        model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
        size (str, optional): The size of the generated image. Defaults to "1024x1024".
        response_format (str, optional): The format of the response. Defaults to "url".

    Returns:
        Optional[str]: The URL of the generated image, or None on error.
    """
    try:
        logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}")
        response = openai.images.generate(
            prompt=prompt,
            model=model,
            size=size,
            response_format=response_format,
        )
        image_url = response.data[0].url
        logger.info(f"Image URL: {image_url}")
        return image_url
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image: {e}")
        return None


@app.route("/", methods=["GET", "POST"])
def index():
    """
    Handles the main route for the web application.
    Processes audio uploads, transcribes them, generates image prompts, and displays images.
    """
    transcript = None
    image_url = None
    prompt_summary = None
    error_message = None

    if request.method == "POST":
        if 'audio_file' not in request.files:
            error_message = "No file part"
            logger.warning(error_message)
            return render_template("index.html", error=error_message)

        file = request.files['audio_file']
        if file.filename == '':
            error_message = "No file selected"
            logger.warning(error_message)
            return render_template("index.html", error=error_message)

        if file and allowed_file(file.filename):
            try:
                # Save the uploaded file to a temporary location, keeping only its extension
                temp_file_path = os.path.join(app.root_path, "temp_audio." + file.filename.rsplit('.', 1)[1].lower())
                file.save(temp_file_path)

                transcript = transcribe_audio(temp_file_path)  # Transcribe audio
                if not transcript:
                    error_message = "Audio transcription failed. Please try again."
                    return render_template("index.html", error=error_message)

                prompt_summary = generate_image_prompt(transcript)  # Generate prompt
                if not prompt_summary:
                    error_message = "Failed to generate image prompt. Please try again."
                    return render_template("index.html", error=error_message)

                image_url = generate_image(prompt_summary)  # Generate image
                if not image_url:
                    error_message = "Failed to generate image. Please try again."
                    return render_template("index.html", error=error_message)

                # Delete the temporary file after processing
                os.remove(temp_file_path)
            except Exception as e:
                error_message = f"An error occurred: {e}"
                logger.error(error_message)
                return render_template("index.html", error=error_message)
        else:
            error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
            logger.warning(error_message)
            return render_template("index.html", error=error_message)

    return render_template("index.html", transcript=transcript, image_url=image_url, prompt_summary=prompt_summary, error=error_message)


@app.errorhandler(500)
def internal_server_error(e):
    """Handles internal server errors."""
    logger.error(f"Internal Server Error: {e}")
    return render_template("index.html", error="Internal Server Error"), 500


if __name__ == "__main__":
    app.run(debug=True)
Code Breakdown:
- Import Statements: Imports the necessary Flask modules, the OpenAI library, os, dotenv, logging, and Optional for type hinting.
- Environment Variables: Loads the OpenAI API key from the .env file.
- Flask Application: Creates a Flask application instance.
- Logging Configuration: Configures logging.
- allowed_file Function: Checks if the uploaded file has an allowed audio extension.
- transcribe_audio Function:
  - Takes the audio file path as input.
  - Opens the audio file in binary mode ("rb").
  - Calls openai.audio.transcriptions.create() with the whisper-1 model to transcribe the audio.
  - Extracts the transcribed text from the response.
  - Logs the file path before transcription and the length of the transcribed text after successful transcription.
  - Includes error handling for OpenAI API errors and other exceptions.
- generate_image_prompt Function:
  - Takes the transcribed text as input.
  - Uses the Chat Completions API (openai.chat.completions.create()) with the gpt-4o model to generate a text prompt suitable for image generation.
  - The system message instructs the model to act as a creative assistant and provide a vivid description of a scene based on the audio.
  - Extracts the generated prompt from the API response.
  - Includes error handling.
- generate_image Function:
  - Takes the image prompt as input.
  - Calls openai.images.generate() to generate an image using DALL·E 3.
  - Extracts the image URL from the API response.
  - Includes error handling.
- index Route:
  - Handles both GET and POST requests.
  - For GET requests, it renders the initial HTML page.
  - For POST requests (when the user uploads an audio file):
    - It validates the uploaded file.
    - It saves the uploaded file temporarily.
    - It calls transcribe_audio() to transcribe the audio.
    - It calls generate_image_prompt() to generate an image prompt from the transcription.
    - It calls generate_image() to generate an image from the prompt.
    - It renders the index.html template, passing the transcription text, the generated prompt, and the image URL.
  - Includes comprehensive error handling to catch potential issues during file upload, transcription, prompt generation, and image generation.
- @app.errorhandler(500): Handles HTTP 500 errors (Internal Server Error) by logging the error and re-rendering the main page with an error message.
- if __name__ == "__main__": Starts the Flask development server if the script is executed directly.
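One practical note on the temporary file handling: the fixed temp_audio.<extension> filename is fine for a single-user demo, but two simultaneous uploads would overwrite each other. A minimal sketch of a safer alternative using Python's tempfile module is shown below; the helper name save_upload_to_temp is illustrative and not part of the app above:

# Sketch: save an uploaded Werkzeug FileStorage object to a uniquely named temp file.
# The helper name is illustrative; the tutorial app uses a fixed filename instead.
import os
import tempfile

def save_upload_to_temp(file_storage, extension: str) -> str:
    """Write the upload to a unique temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix="." + extension)
    with os.fdopen(fd, "wb") as tmp:
        file_storage.save(tmp)  # FileStorage.save also accepts an open file object
    return path

# Usage inside the POST handler (the caller removes the file with os.remove afterwards):
# temp_file_path = save_upload_to_temp(file, file.filename.rsplit('.', 1)[1].lower())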
Step 4: Create HTML Template (templates/index.html)
Create a folder named templates in the same directory as app.py. Inside the templates folder, create a file named index.html with the following HTML code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Multimodal Assistant</title>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
<style>
/* --- General Styles --- */
body {
font-family: 'Inter', sans-serif;
padding: 40px;
background-color: #f9fafb; /* Tailwind's gray-50 */
display: flex;
justify-content: center;
align-items: center;
min-height: 100vh;
margin: 0;
color: #374151; /* Tailwind's gray-700 */
}
.container {
max-width: 800px; /* Increased max-width */
width: 95%; /* Take up most of the viewport */
background-color: #fff;
padding: 2rem;
border-radius: 0.75rem; /* Tailwind's rounded-lg */
box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
text-align: center;
}
h2 {
font-size: 2.25rem; /* Tailwind's text-3xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1.5rem; /* Tailwind's mb-6 */
color: #1e293b; /* Tailwind's gray-900 */
}
p{
color: #6b7280; /* Tailwind's gray-500 */
margin-bottom: 1rem;
}
/* --- Form Styles --- */
form {
margin-top: 1rem; /* Tailwind's mt-4 */
display: flex;
flex-direction: column;
align-items: center; /* Center form elements */
gap: 0.5rem; /* Tailwind's gap-2 */
}
label {
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
color: #4b5563; /* Tailwind's gray-600 */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
display: block; /* Ensure label takes full width */
text-align: left;
width: 100%;
max-width: 400px; /* Added max-width for label */
margin-left: auto;
margin-right: auto;
}
input[type="file"] {
width: 100%;
max-width: 400px; /* Added max-width for file input */
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
font-size: 1rem; /* Tailwind's text-base */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
margin-left: auto;
margin-right: auto;
}
input[type="submit"] {
padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #4f46e5; /* Tailwind's bg-indigo-500 */
color: #fff;
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
cursor: pointer;
transition: background-color 0.3s ease; /* Smooth transition */
border: none;
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
margin-top: 1rem;
}
input[type="submit"]:hover {
background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
}
input[type="submit"]:focus {
outline: none;
box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
}
/* --- Result Styles --- */
.result-container {
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
}
.transcript-container{
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
}
h3 {
font-size: 1.5rem; /* Tailwind's text-2xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1rem; /* Tailwind's mb-4 */
color: #1e293b; /* Tailwind's gray-900 */
}
textarea {
width: 100%;
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
resize: none;
font-size: 1rem; /* Tailwind's text-base */
line-height: 1.5rem; /* Tailwind's leading-relaxed */
margin-top: 0.5rem; /* Tailwind's mt-2 */
margin-bottom: 0;
box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
}
textarea:focus {
outline: none;
border-color: #3b82f6; /* Tailwind's border-blue-500 */
box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
}
img {
max-width: 100%;
border-radius: 0.5rem; /* Tailwind's rounded-md */
margin-top: 1.5rem; /* Tailwind's mt-6 */
box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
}
/* --- Error Styles --- */
.error-message {
color: #dc2626; /* Tailwind's text-red-600 */
margin-top: 1rem; /* Tailwind's mt-4 */
padding: 0.75rem;
background-color: #fee2e2; /* Tailwind's bg-red-100 */
border-radius: 0.375rem; /* Tailwind's rounded-md */
border: 1px solid #fecaca; /* Tailwind's border-red-300 */
text-align: center;
}
</style>
</head>
<body>
<div class="container">
<h2>🎤🧠🎨 Multimodal Assistant</h2>
<p> Upload an audio file to transcribe and generate a corresponding image. </p>
<form method="POST" enctype="multipart/form-data">
<label for="audio_file">Upload your voice note:</label><br>
<input type="file" name="audio_file" accept="audio/*" required><br><br>
<input type="submit" value="Generate Visual Response">
</form>
{% if transcript %}
<div class = "transcript-container">
<h3>📝 Transcript:</h3>
<textarea readonly>{{ transcript }}</textarea>
</div>
{% endif %}
{% if prompt_summary %}
<div class = "result-container">
<h3>🎯 Prompt Used for Image:</h3>
<p>{{ prompt_summary }}</p>
</div>
{% endif %}
{% if image_url %}
<div class = "result-container">
<h3>🖼️ Generated Image:</h3>
<img src="{{ image_url }}" alt="Generated image">
</div>
{% endif %}
{% if error %}
<div class="error-message">{{ error }}</div>
{% endif %}
</div>
</body>
</html>
Key elements in the HTML template:
- HTML Structure:
  - The <head> section defines the title, links a CSS stylesheet, and sets the viewport for responsiveness.
  - The <body> contains the visible content, including a form for uploading audio and sections to display the transcription and generated image.
- CSS Styling:
  - Modern Design: The CSS uses a modern design, similar to Tailwind CSS.
  - Responsive Layout: The layout adapts well, especially to smaller screens.
  - User Experience: Improved form and input styling.
  - Clear Error Display: Error messages are styled to be clearly visible.
- Form:
  - A <form> with enctype="multipart/form-data" is used to handle file uploads.
  - A <label> and <input type="file"> allow the user to select an audio file. The accept="audio/*" attribute restricts uploads to audio files.
  - An <input type="submit"> button allows the user to submit the form.
- Transcription and Image Display:
  - The template uses Jinja2 templating to conditionally display the transcription text and the generated image if they are available. The transcription is displayed in a textarea, and the image is displayed using an <img> tag.
- Error Handling:
  - A <div class="error-message"> is used to display any error messages to the user.
Try It Out
- Save the files as app.py and templates/index.html.
- Ensure you have your OpenAI API key in the .env file.
- Run the application: python app.py
- Open http://localhost:5000 in your browser.
- Upload an audio file (you can use the provided sample .mp3 file).
- View the transcription and the generated image on the page.
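If you prefer to exercise the endpoint from a script rather than the browser, a small sanity check along the following lines should work. It assumes the requests package is installed (pip install requests), the Flask server is running locally on port 5000, and a sample.mp3 file exists in the current directory:

# Sketch: post an audio file to the running app and check the response.
# Assumes: pip install requests, app running at localhost:5000, sample.mp3 present.
import requests

with open("sample.mp3", "rb") as f:
    resp = requests.post("http://localhost:5000/", files={"audio_file": f})

print(resp.status_code)                 # 200 on success
print("Generated Image" in resp.text)   # True if the rendered page includes the image section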
5.5.3 How This Works (Behind the Scenes)
This demonstrates multimodal orchestration at work - a sophisticated process where different AI models collaborate through your application's logic layer. Each model specializes in a different form of data processing (audio, text, and image), and together they create a seamless experience. The application coordinates these models, handling the data transformation between each step and ensuring proper communication flow.
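To see the orchestration without the web layer, the same three calls can be chained in a short standalone script. The sketch below reuses the functions defined in app.py; it assumes that file is importable from the current directory and that a local voice_note.mp3 recording exists:

# Sketch: the Whisper -> GPT-4o -> DALL·E pipeline, run outside Flask.
# Assumes app.py is importable and voice_note.mp3 exists in the current directory.
from app import transcribe_audio, generate_image_prompt, generate_image

transcript = transcribe_audio("voice_note.mp3")        # audio -> text
if transcript:
    prompt = generate_image_prompt(transcript)         # text -> visual description
    if prompt:
        image_url = generate_image(prompt)             # description -> image URL
        print(transcript, prompt, image_url, sep="\n\n")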
Example Flow: A Detailed Walkthrough
To better understand how this multimodal system works, let's walk through a complete example. When a user uploads a voice note saying:
"I had the most peaceful morning — sitting by a lake with birds singing and the sun rising behind the trees."
The system processes this input through three distinct stages:
- Audio Transcription: First, Whisper converts the audio to text, maintaining accuracy even with background noise or accent variations. Result: "I had the most peaceful morning…"
- Scene Analysis and Enhancement: GPT-4o analyzes the transcribed text, identifying key visual elements and spatial relationships to create an optimized image prompt. Result: "A sunrise over a tranquil lake, with birds in the sky and trees reflecting in the water"
- Visual Creation: DALL·E takes this refined prompt and generates a photorealistic image, carefully balancing all the described elements into a cohesive scene
The Power of Integration
In this final section of the chapter, you created a multimodal mini-assistant that demonstrates the seamless integration of three distinct AI capabilities:
- Whisper: Advanced speech recognition that handles various accents, languages, and audio qualities with remarkable accuracy
- GPT-4o: Sophisticated language processing that understands context, emotion, and scene composition to create detailed image descriptions
- DALL·E: State-of-the-art image generation that translates text descriptions into vivid, coherent visual scenes
This integration showcases the future of AI applications where multiple models work in concert to process, understand, and respond to user input in rich, meaningful ways. By orchestrating these models together, you've created an intuitive interface that bridges the gap between human communication and AI capabilities.
5.5 Basic Integration of Multiple Modalities
In this section, we'll explore how to combine multiple AI modalities - speech, language understanding, and image generation - into a single cohesive application. While previous sections focused on working with individual technologies, here we'll learn how these powerful tools can work together to create more sophisticated and engaging user experiences.
This integration represents a significant step forward in AI application development, moving beyond simple single-purpose tools to create systems that can process and respond across different forms of communication. By combining Whisper's audio transcription capabilities, GPT-4o's natural language understanding, and DALL·E's image generation, we can build applications that truly demonstrate the potential of modern AI technologies.
The project we'll build serves as an excellent introduction to multimodal AI integration, demonstrating how different AI models can be orchestrated to create a seamless experience. This approach opens up exciting possibilities for developers looking to create more natural and intuitive human-AI interactions.
5.5.1 What You'll Build
In this section, you'll create a sophisticated Flask-based web application that functions as a basic multimodal assistant. This application demonstrates the seamless integration of multiple AI technologies to create an interactive and intelligent user experience. The assistant is capable of:
- Accepting an audio message from the user through a clean web interface, supporting various audio formats
- Leveraging OpenAI's Whisper technology to accurately transcribe the audio message into text, handling different accents and languages
- Utilizing GPT-4o's advanced natural language processing capabilities to analyze the transcribed text, understanding context, intent, and key themes
- Employing DALL·E 3's sophisticated image generation abilities to create relevant, high-quality visuals based on the understood context
- Presenting a cohesive user experience by displaying both the original transcription and the AI-generated image on a single, well-designed webpage
This project serves as an excellent example of a multimodal assistant that seamlessly combines three different types of AI processing: audio processing, natural language understanding, and image generation. By processing audio input, extracting meaningful context, and creating visual representations, it demonstrates the potential of integrated AI technologies. This foundation opens up exciting possibilities for various real-world applications, such as:
- Spoken design prompts for visual creators - enabling artists and designers to verbally describe their vision and instantly see it rendered
- Audio journaling with illustrated output - transforming spoken diary entries into visual memories with matching AI-generated artwork
- Voice-controlled storytelling applications - creating interactive narratives where spoken words come to life through instant visual generation
5.5.2 Step-by-Step Implementation
Step 1: Install Required Packages
Download the sample audio: https://files.cuantum.tech/audio/audio-file-sample.mp3
Ensure you have the necessary Python libraries installed. Open your terminal and execute the following command:
pip install flask openai python-dotenv
This command installs:
flask
: A micro web framework for building the web application.openai
: The OpenAI Python library for interacting with the Whisper, Chat Completion, and DALL·E 3 APIs.python-dotenv
: A library to load environment variables from a.env
file.
Step 2: Set Up Project Structure
Create the following folder structure for your project:
/multimodal_app
│
├── app.py
├── .env
└── templates/
└── index.html
/multimodal_app
: The root directory for your project.app.py
: The Python file containing the Flask application code..env
: A file to store your OpenAI API key.templates/
: A directory to store your HTML templates.templates/index.html
: The HTML template for the main page of your application.
Step 3: Create the Flask App (app.py)
Create a Python file named app.py
in the root directory of your project and add the following code:
from flask import Flask, request, render_template, jsonify, make_response
import openai
import os
from dotenv import load_dotenv
import logging
from typing import Optional, Dict
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
app = Flask(__name__)
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'} # Allowed audio file extensions
def allowed_file(filename: str) -> bool:
"""
Checks if the uploaded file has an allowed extension.
Args:
filename (str): The name of the file.
Returns:
bool: True if the file has an allowed extension, False otherwise.
"""
return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
def transcribe_audio(file_path: str) -> Optional[str]:
"""
Transcribes an audio file using OpenAI's Whisper API.
Args:
file_path (str): The path to the audio file.
Returns:
Optional[str]: The transcribed text, or None on error.
"""
try:
logger.info(f"Transcribing audio file: {file_path}")
audio_file = open(file_path, "rb")
response = openai.Audio.transcriptions.create(
model="whisper-1",
file=audio_file,
)
transcript = response.text
logger.info(f"Transcription successful. Length: {len(transcript)} characters.")
return transcript
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error during transcription: {e}")
return None
def generate_image_prompt(text: str) -> Optional[str]:
"""
Generates a prompt for DALL·E 3 based on the transcribed text using GPT-4o.
Args:
text (str): The transcribed text.
Returns:
Optional[str]: The generated image prompt, or None on error.
"""
try:
logger.info("Generating image prompt using GPT-4o")
response = openai.chat.completions.create(
model="gpt-4o", # You can also experiment with other chat models
messages=[
{
"role": "system",
"content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content. Do not include any phrases like 'based on the audio' or 'from the user audio'.",
},
{"role": "user", "content": text},
],
)
prompt = response.choices[0].message.content
logger.info(f"Generated image prompt: {prompt}")
return prompt
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error generating image prompt: {e}")
return None
def generate_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024", response_format: str = "url") -> Optional[str]:
"""
Generates an image using OpenAI's DALL·E API.
Args:
prompt (str): The text prompt to generate the image from.
model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
size (str, optional): The size of the generated image. Defaults to "1024x1024".
response_format (str, optional): The format of the response. Defaults to "url".
Returns:
Optional[str]: The URL of the generated image, or None on error.
"""
try:
logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}")
response = openai.Image.create(
prompt=prompt,
model=model,
size=size,
response_format=response_format,
)
image_url = response.data[0].url
logger.info(f"Image URL: {image_url}")
return image_url
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error generating image: {e}")
return None
@app.route("/", methods=["GET", "POST"])
def index():
"""
Handles the main route for the web application.
Processes audio uploads, transcribes them, generates image prompts, and displays images.
"""
transcript = None
image_url = None
prompt_summary = None
error_message = None
if request.method == "POST":
if 'audio_file' not in request.files:
error_message = "No file part"
logger.warning(error_message)
return render_template("index.html", error=error_message)
file = request.files['audio_file']
if file.filename == '':
error_message = "No file selected"
logger.warning(error_message)
return render_template("index.html", error=error_message)
if file and allowed_file(file.filename):
try:
# Securely save the uploaded file to a temporary location
temp_file_path = os.path.join(app.root_path, "temp_audio." + file.filename.rsplit('.', 1)[1].lower())
file.save(temp_file_path)
transcript = transcribe_audio(temp_file_path) # Transcribe audio
if not transcript:
error_message = "Audio transcription failed. Please try again."
return render_template("index.html", error=error_message)
prompt_summary = generate_image_prompt(transcript) # Generate prompt
if not prompt_summary:
error_message = "Failed to generate image prompt. Please try again."
return render_template("index.html", error=error_message)
image_url = generate_image(prompt_summary) # Generate image
if not image_url:
error_message = "Failed to generate image. Please try again."
return render_template("index.html", error=error_message)
# Optionally, delete the temporary file after processing
os.remove(temp_file_path)
except Exception as e:
error_message = f"An error occurred: {e}"
logger.error(error_message)
return render_template("index.html", error=error_message)
else:
error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
logger.warning(error_message)
return render_template("index.html", error=error_message)
return render_template("index.html", transcript=transcript, image_url=image_url, prompt_summary=prompt_summary, error=error_message)
@app.errorhandler(500)
def internal_server_error(e):
"""Handles internal server errors."""
logger.error(f"Internal Server Error: {e}")
return render_template("error.html", error="Internal Server Error"), 500
if __name__ == "__main__":
app.run(debug=True)
Code Breakdown:
- Import Statements: Imports the necessary Flask modules, OpenAI library,
os
,dotenv
,logging
, andOptional
andDict
for type hinting. - Environment Variables: Loads the OpenAI API key from the
.env
file. - Flask Application: Creates a Flask application instance.
- Logging Configuration: Configures logging.
allowed_file
Function: Checks if the uploaded file has an allowed audio extension.transcribe_audio
Function:- Takes the audio file path as input.
- Opens the audio file in binary mode (
"rb"
). - Calls the OpenAI API's
openai.Audio.transcribe()
method to transcribe the audio. - Extracts the transcribed text from the response.
- Logs the file path before transcription and the length of the transcribed text after successful transcription.
- Includes error handling for OpenAI API errors and other exceptions.
generate_image_prompt
Function:- Takes the transcribed text as input.
- Uses the OpenAI Chat Completion API (
openai.chat.completions.create()
) with thegpt-4o
model to generate a text prompt suitable for image generation. - The system message instructs the model to act as a creative assistant and provide a vivid description of a scene based on the audio.
- Extracts the generated prompt from the API response.
- Includes error handling.
generate_image
Function:- Takes the image prompt as input.
- Calls the OpenAI API's
openai.Image.create()
method to generate an image using DALL·E 3. - Extracts the image URL from the API response.
- Includes error handling.
index
Route:- Handles both GET and POST requests.
- For GET requests, it renders the initial HTML page.
- For POST requests (when the user uploads an audio file):
- It validates the uploaded file.
- It saves the uploaded file temporarily.
- It calls
transcribe_audio()
to transcribe the audio. - It calls
generate_image_prompt()
to generate an image prompt from the transcription. - It calls
generate_image()
to generate an image from the prompt. - It renders the
index.html
template, passing the transcription text and the image URL.
- Includes comprehensive error handling to catch potential issues during file upload, transcription, prompt generation, and image generation.
@app.errorhandler(500)
: Handles HTTP 500 errors (Internal Server Error) by logging the error and rendering a user-friendly error page.if __name__ == "__main__":
: Starts the Flask development server if the script is executed directly.
Step 4: Create HTML Template (templates/index.html)
Create a folder named templates
in the same directory as app.py
. Inside the templates
folder, create a file named index.html
with the following HTML code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Multimodal Assistant</title>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
<style>
/* --- General Styles --- */
body {
font-family: 'Inter', sans-serif;
padding: 40px;
background-color: #f9fafb; /* Tailwind's gray-50 */
display: flex;
justify-content: center;
align-items: center;
min-height: 100vh;
margin: 0;
color: #374151; /* Tailwind's gray-700 */
}
.container {
max-width: 800px; /* Increased max-width */
width: 95%; /* Take up most of the viewport */
background-color: #fff;
padding: 2rem;
border-radius: 0.75rem; /* Tailwind's rounded-lg */
box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
text-align: center;
}
h2 {
font-size: 2.25rem; /* Tailwind's text-3xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1.5rem; /* Tailwind's mb-6 */
color: #1e293b; /* Tailwind's gray-900 */
}
p{
color: #6b7280; /* Tailwind's gray-500 */
margin-bottom: 1rem;
}
/* --- Form Styles --- */
form {
margin-top: 1rem; /* Tailwind's mt-4 */
display: flex;
flex-direction: column;
align-items: center; /* Center form elements */
gap: 0.5rem; /* Tailwind's gap-2 */
}
label {
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
color: #4b5563; /* Tailwind's gray-600 */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
display: block; /* Ensure label takes full width */
text-align: left;
width: 100%;
max-width: 400px; /* Added max-width for label */
margin-left: auto;
margin-right: auto;
}
input[type="file"] {
width: 100%;
max-width: 400px; /* Added max-width for file input */
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
font-size: 1rem; /* Tailwind's text-base */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
margin-left: auto;
margin-right: auto;
}
input[type="submit"] {
padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #4f46e5; /* Tailwind's bg-indigo-500 */
color: #fff;
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
cursor: pointer;
transition: background-color 0.3s ease; /* Smooth transition */
border: none;
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
margin-top: 1rem;
}
input[type="submit"]:hover {
background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
}
input[type="submit"]:focus {
outline: none;
box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
}
/* --- Result Styles --- */
.result-container {
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
}
.transcript-container{
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
}
h3 {
font-size: 1.5rem; /* Tailwind's text-2xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1rem; /* Tailwind's mb-4 */
color: #1e293b; /* Tailwind's gray-900 */
}
textarea {
width: 100%;
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
resize: none;
font-size: 1rem; /* Tailwind's text-base */
line-height: 1.5rem; /* Tailwind's leading-relaxed */
margin-top: 0.5rem; /* Tailwind's mt-2 */
margin-bottom: 0;
box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
}
textarea:focus {
outline: none;
border-color: #3b82f6; /* Tailwind's border-blue-500 */
box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
}
img {
max-width: 100%;
border-radius: 0.5rem; /* Tailwind's rounded-md */
margin-top: 1.5rem; /* Tailwind's mt-6 */
box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
}
/* --- Error Styles --- */
.error-message {
color: #dc2626; /* Tailwind's text-red-600 */
margin-top: 1rem; /* Tailwind's mt-4 */
padding: 0.75rem;
background-color: #fee2e2; /* Tailwind's bg-red-100 */
border-radius: 0.375rem; /* Tailwind's rounded-md */
border: 1px solid #fecaca; /* Tailwind's border-red-300 */
text-align: center;
}
</style>
</head>
<body>
<div class="container">
<h2>🎤🧠🎨 Multimodal Assistant</h2>
<p> Upload an audio file to transcribe and generate a corresponding image. </p>
<form method="POST" enctype="multipart/form-data">
<label for="audio_file">Upload your voice note:</label><br>
<input type="file" name="audio_file" accept="audio/*" required><br><br>
<input type="submit" value="Generate Visual Response">
</form>
{% if transcript %}
<div class = "transcript-container">
<h3>📝 Transcript:</h3>
<textarea readonly>{{ transcript }}</textarea>
</div>
{% endif %}
{% if prompt_summary %}
<div class = "result-container">
<h3>🎯 Prompt Used for Image:</h3>
<p>{{ prompt_summary }}</p>
</div>
{% endif %}
{% if image_url %}
<div class = "result-container">
<h3>🖼️ Generated Image:</h3>
<img src="{{ image_url }}" alt="Generated image">
</div>
{% endif %}
{% if error %}
<div class="error-message">{{ error }}</div>
{% endif %}
</div>
</body>
</html>
Key elements in the HTML template:
- HTML Structure:
- The
<head>
section defines the title, links a CSS stylesheet, and sets the viewport for responsiveness. - The
<body>
contains the visible content, including a form for uploading audio and sections to display the transcription and generated image.
- The
- CSS Styling:
- Modern Design: The CSS is updated to use a modern design, similar to Tailwind CSS.
- Responsive Layout: The layout is more responsive, especially for smaller screens.
- User Experience: Improved form and input styling.
- Clear Error Display: Error messages are styled to be clearly visible.
- Form:
- A
<form>
withenctype="multipart/form-data"
is used to handle file uploads. - A
<label>
and<input type="file">
allow the user to select an audio file. Theaccept="audio/*"
attribute restricts the user to uploading audio files. - A
<input type="submit">
button allows the user to submit the form.
- A
- Transcription and Image Display:
- The template uses Jinja2 templating to conditionally display the transcription text and the generated image if they are available. The transcription is displayed in a
textarea
, and the image is displayed using an<img>
tag.
- The template uses Jinja2 templating to conditionally display the transcription text and the generated image if they are available. The transcription is displayed in a
- Error Handling:
- A
<div class="error-message">
is used to display any error messages to the user.
- A
Try It Out
- Save the files as
app.py
andtemplates/index.html
. - Ensure you have your OpenAI API key in the
.env
file. - Run the application:
python app.py
- Open
http://localhost:5000
in your browser. - Upload an audio file (you can use the provided sample .mp3 file).
- View the transcription and the generated image on the page.
5.5.3 How This Works (Behind the Scenes)
This demonstrates multimodal orchestration at work - a sophisticated process where different AI models collaborate through your application's logic layer. Each model specializes in a different form of data processing (audio, text, and image), and together they create a seamless experience. The application coordinates these models, handling the data transformation between each step and ensuring proper communication flow.
Example Flow: A Detailed Walkthrough
To better understand how this multimodal system works, let's walk through a complete example. When a user uploads a voice note saying:
"I had the most peaceful morning — sitting by a lake with birds singing and the sun rising behind the trees."
The system processes this input through three distinct stages:
- Audio Processing with Transcript: First, Whisper converts the audio to text, maintaining accuracy even with background noise or accent variations. Result: "I had the most peaceful morning…"
- Scene Analysis and Enhancement: GPT-4o analyzes the transcribed text, identifying key visual elements and spatial relationships to create an optimized image prompt. Result: "A sunrise over a tranquil lake, with birds in the sky and trees reflecting in the water"
- Visual Creation: DALL·E takes this refined prompt and generates a photorealistic image, carefully balancing all the described elements into a cohesive scene
The Power of Integration
In this final section of the chapter, you created a multimodal mini-assistant that demonstrates the seamless integration of three distinct AI capabilities:
- Whisper: Advanced speech recognition that handles various accents, languages, and audio qualities with remarkable accuracy
- GPT-4o: Sophisticated language processing that understands context, emotion, and scene composition to create detailed image descriptions
- DALL·E: State-of-the-art image generation that translates text descriptions into vivid, coherent visual scenes
This integration showcases the future of AI applications where multiple models work in concert to process, understand, and respond to user input in rich, meaningful ways. By orchestrating these models together, you've created an intuitive interface that bridges the gap between human communication and AI capabilities.
5.5 Basic Integration of Multiple Modalities
In this section, we'll explore how to combine multiple AI modalities - speech, language understanding, and image generation - into a single cohesive application. While previous sections focused on working with individual technologies, here we'll learn how these powerful tools can work together to create more sophisticated and engaging user experiences.
This integration represents a significant step forward in AI application development, moving beyond simple single-purpose tools to create systems that can process and respond across different forms of communication. By combining Whisper's audio transcription capabilities, GPT-4o's natural language understanding, and DALL·E's image generation, we can build applications that truly demonstrate the potential of modern AI technologies.
The project we'll build serves as an excellent introduction to multimodal AI integration, demonstrating how different AI models can be orchestrated to create a seamless experience. This approach opens up exciting possibilities for developers looking to create more natural and intuitive human-AI interactions.
5.5.1 What You'll Build
In this section, you'll create a sophisticated Flask-based web application that functions as a basic multimodal assistant. This application demonstrates the seamless integration of multiple AI technologies to create an interactive and intelligent user experience. The assistant is capable of:
- Accepting an audio message from the user through a clean web interface, supporting various audio formats
- Leveraging OpenAI's Whisper technology to accurately transcribe the audio message into text, handling different accents and languages
- Utilizing GPT-4o's advanced natural language processing capabilities to analyze the transcribed text, understanding context, intent, and key themes
- Employing DALL·E 3's sophisticated image generation abilities to create relevant, high-quality visuals based on the understood context
- Presenting a cohesive user experience by displaying both the original transcription and the AI-generated image on a single, well-designed webpage
This project serves as an excellent example of a multimodal assistant that seamlessly combines three different types of AI processing: audio processing, natural language understanding, and image generation. By processing audio input, extracting meaningful context, and creating visual representations, it demonstrates the potential of integrated AI technologies. This foundation opens up exciting possibilities for various real-world applications, such as:
- Spoken design prompts for visual creators - enabling artists and designers to verbally describe their vision and instantly see it rendered
- Audio journaling with illustrated output - transforming spoken diary entries into visual memories with matching AI-generated artwork
- Voice-controlled storytelling applications - creating interactive narratives where spoken words come to life through instant visual generation
5.5.2 Step-by-Step Implementation
Step 1: Install Required Packages
Download the sample audio: https://files.cuantum.tech/audio/audio-file-sample.mp3
Ensure you have the necessary Python libraries installed. Open your terminal and execute the following command:
pip install flask openai python-dotenv
This command installs:
flask
: A micro web framework for building the web application.openai
: The OpenAI Python library for interacting with the Whisper, Chat Completion, and DALL·E 3 APIs.python-dotenv
: A library to load environment variables from a.env
file.
Step 2: Set Up Project Structure
Create the following folder structure for your project:
/multimodal_app
│
├── app.py
├── .env
└── templates/
└── index.html
/multimodal_app
: The root directory for your project.app.py
: The Python file containing the Flask application code..env
: A file to store your OpenAI API key.templates/
: A directory to store your HTML templates.templates/index.html
: The HTML template for the main page of your application.
Step 3: Create the Flask App (app.py)
Create a Python file named app.py
in the root directory of your project and add the following code:
from flask import Flask, request, render_template, jsonify, make_response
import openai
import os
from dotenv import load_dotenv
import logging
from typing import Optional, Dict
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
app = Flask(__name__)
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'} # Allowed audio file extensions
def allowed_file(filename: str) -> bool:
"""
Checks if the uploaded file has an allowed extension.
Args:
filename (str): The name of the file.
Returns:
bool: True if the file has an allowed extension, False otherwise.
"""
return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
def transcribe_audio(file_path: str) -> Optional[str]:
"""
Transcribes an audio file using OpenAI's Whisper API.
Args:
file_path (str): The path to the audio file.
Returns:
Optional[str]: The transcribed text, or None on error.
"""
try:
logger.info(f"Transcribing audio file: {file_path}")
audio_file = open(file_path, "rb")
response = openai.Audio.transcriptions.create(
model="whisper-1",
file=audio_file,
)
transcript = response.text
logger.info(f"Transcription successful. Length: {len(transcript)} characters.")
return transcript
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error during transcription: {e}")
return None
def generate_image_prompt(text: str) -> Optional[str]:
"""
Generates a prompt for DALL·E 3 based on the transcribed text using GPT-4o.
Args:
text (str): The transcribed text.
Returns:
Optional[str]: The generated image prompt, or None on error.
"""
try:
logger.info("Generating image prompt using GPT-4o")
response = openai.chat.completions.create(
model="gpt-4o", # You can also experiment with other chat models
messages=[
{
"role": "system",
"content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content. Do not include any phrases like 'based on the audio' or 'from the user audio'.",
},
{"role": "user", "content": text},
],
)
prompt = response.choices[0].message.content
logger.info(f"Generated image prompt: {prompt}")
return prompt
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error generating image prompt: {e}")
return None
def generate_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024", response_format: str = "url") -> Optional[str]:
"""
Generates an image using OpenAI's DALL·E API.
Args:
prompt (str): The text prompt to generate the image from.
model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
size (str, optional): The size of the generated image. Defaults to "1024x1024".
response_format (str, optional): The format of the response. Defaults to "url".
Returns:
Optional[str]: The URL of the generated image, or None on error.
"""
try:
logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}")
response = openai.Image.create(
prompt=prompt,
model=model,
size=size,
response_format=response_format,
)
image_url = response.data[0].url
logger.info(f"Image URL: {image_url}")
return image_url
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error generating image: {e}")
return None
@app.route("/", methods=["GET", "POST"])
def index():
"""
Handles the main route for the web application.
Processes audio uploads, transcribes them, generates image prompts, and displays images.
"""
transcript = None
image_url = None
prompt_summary = None
error_message = None
if request.method == "POST":
if 'audio_file' not in request.files:
error_message = "No file part"
logger.warning(error_message)
return render_template("index.html", error=error_message)
file = request.files['audio_file']
if file.filename == '':
error_message = "No file selected"
logger.warning(error_message)
return render_template("index.html", error=error_message)
if file and allowed_file(file.filename):
try:
# Securely save the uploaded file to a temporary location
temp_file_path = os.path.join(app.root_path, "temp_audio." + file.filename.rsplit('.', 1)[1].lower())
file.save(temp_file_path)
transcript = transcribe_audio(temp_file_path) # Transcribe audio
if not transcript:
error_message = "Audio transcription failed. Please try again."
return render_template("index.html", error=error_message)
prompt_summary = generate_image_prompt(transcript) # Generate prompt
if not prompt_summary:
error_message = "Failed to generate image prompt. Please try again."
return render_template("index.html", error=error_message)
image_url = generate_image(prompt_summary) # Generate image
if not image_url:
error_message = "Failed to generate image. Please try again."
return render_template("index.html", error=error_message)
# Optionally, delete the temporary file after processing
os.remove(temp_file_path)
except Exception as e:
error_message = f"An error occurred: {e}"
logger.error(error_message)
return render_template("index.html", error=error_message)
else:
error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
logger.warning(error_message)
return render_template("index.html", error=error_message)
return render_template("index.html", transcript=transcript, image_url=image_url, prompt_summary=prompt_summary, error=error_message)
@app.errorhandler(500)
def internal_server_error(e):
"""Handles internal server errors."""
logger.error(f"Internal Server Error: {e}")
return render_template("error.html", error="Internal Server Error"), 500
if __name__ == "__main__":
app.run(debug=True)
Code Breakdown:
- Import Statements: Imports the necessary Flask modules, the OpenAI library, `os`, `dotenv`, `logging`, and `Optional` and `Dict` for type hinting.
- Environment Variables: Loads the OpenAI API key from the `.env` file.
- Flask Application: Creates a Flask application instance.
- Logging Configuration: Configures logging so each processing step is traceable.
- `allowed_file` Function: Checks whether the uploaded file has an allowed audio extension.
- `transcribe_audio` Function:
  - Takes the audio file path as input.
  - Opens the audio file in binary mode (`"rb"`).
  - Calls `openai.Audio.transcriptions.create()` with the `whisper-1` model to transcribe the audio (see the SDK version note after this breakdown).
  - Extracts the transcribed text from the response.
  - Logs the file path before transcription and the length of the transcribed text after a successful transcription.
  - Includes error handling for OpenAI API errors and other exceptions.
- `generate_image_prompt` Function:
  - Takes the transcribed text as input.
  - Uses the Chat Completions API (`openai.chat.completions.create()`) with the `gpt-4o` model to generate a text prompt suitable for image generation.
  - The system message instructs the model to act as a creative assistant and produce a vivid description of a scene based on the transcribed audio.
  - Extracts the generated prompt from the API response.
  - Includes error handling.
- `generate_image` Function:
  - Takes the image prompt as input.
  - Calls `openai.Image.create()` to generate an image with DALL·E 3.
  - Extracts the image URL from the API response.
  - Includes error handling.
- `index` Route:
  - Handles both GET and POST requests.
  - For GET requests, it renders the initial HTML page.
  - For POST requests (when the user uploads an audio file), it validates the uploaded file, saves it temporarily, calls `transcribe_audio()` to transcribe the audio, calls `generate_image_prompt()` to turn the transcription into an image prompt, calls `generate_image()` to generate the image, and finally renders the `index.html` template with the transcription, the prompt, and the image URL.
  - Includes error handling to catch problems during file upload, transcription, prompt generation, and image generation.
- `@app.errorhandler(500)`: Handles HTTP 500 errors (Internal Server Error) by logging the error and rendering a user-friendly error page.
- `if __name__ == "__main__":`: Starts the Flask development server when the script is executed directly.
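A note on SDK versions: the listing above mixes the module-level calls of the pre-1.0 `openai` package (`openai.Audio.transcriptions.create()`, `openai.Image.create()`, `openai.error.OpenAIError`) with the newer `openai.chat.completions.create()` style. If you have the 1.x SDK installed, the equivalent calls go through an explicit client object instead. The sketch below shows the two helpers most affected, under that assumption; it also closes the audio file with a context manager, which the original `transcribe_audio` does not.

```python
# Minimal sketch assuming the OpenAI Python SDK >= 1.0 is installed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_audio_v1(file_path: str) -> str:
    """Transcribe an audio file with Whisper through the 1.x client interface."""
    with open(file_path, "rb") as audio_file:  # context manager closes the handle
        result = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
    return result.text

def generate_image_v1(prompt: str) -> str:
    """Generate an image with DALL·E 3 through the 1.x client interface."""
    result = client.images.generate(model="dall-e-3", prompt=prompt, size="1024x1024")
    return result.data[0].url
```

A related cleanup applies to the temporary upload in the `index` route: the current code only reaches `os.remove(temp_file_path)` on the success path, so wrapping the three processing calls in `try`/`finally` (or using `tempfile.NamedTemporaryFile`) would guarantee the temporary audio file is deleted even when transcription or image generation fails.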
Step 4: Create HTML Template (templates/index.html)
Create a folder named `templates` in the same directory as `app.py`. Inside the `templates` folder, create a file named `index.html` with the following HTML code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Multimodal Assistant</title>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
<style>
/* --- General Styles --- */
body {
font-family: 'Inter', sans-serif;
padding: 40px;
background-color: #f9fafb; /* Tailwind's gray-50 */
display: flex;
justify-content: center;
align-items: center;
min-height: 100vh;
margin: 0;
color: #374151; /* Tailwind's gray-700 */
}
.container {
max-width: 800px; /* Increased max-width */
width: 95%; /* Take up most of the viewport */
background-color: #fff;
padding: 2rem;
border-radius: 0.75rem; /* Tailwind's rounded-lg */
box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
text-align: center;
}
h2 {
font-size: 2.25rem; /* Tailwind's text-3xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1.5rem; /* Tailwind's mb-6 */
color: #1e293b; /* Tailwind's gray-900 */
}
p{
color: #6b7280; /* Tailwind's gray-500 */
margin-bottom: 1rem;
}
/* --- Form Styles --- */
form {
margin-top: 1rem; /* Tailwind's mt-4 */
display: flex;
flex-direction: column;
align-items: center; /* Center form elements */
gap: 0.5rem; /* Tailwind's gap-2 */
}
label {
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
color: #4b5563; /* Tailwind's gray-600 */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
display: block; /* Ensure label takes full width */
text-align: left;
width: 100%;
max-width: 400px; /* Added max-width for label */
margin-left: auto;
margin-right: auto;
}
input[type="file"] {
width: 100%;
max-width: 400px; /* Added max-width for file input */
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
font-size: 1rem; /* Tailwind's text-base */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
margin-left: auto;
margin-right: auto;
}
input[type="submit"] {
padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #4f46e5; /* Tailwind's bg-indigo-500 */
color: #fff;
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
cursor: pointer;
transition: background-color 0.3s ease; /* Smooth transition */
border: none;
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
margin-top: 1rem;
}
input[type="submit"]:hover {
background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
}
input[type="submit"]:focus {
outline: none;
box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
}
/* --- Result Styles --- */
.result-container {
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
}
.transcript-container{
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
}
h3 {
font-size: 1.5rem; /* Tailwind's text-2xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1rem; /* Tailwind's mb-4 */
color: #1e293b; /* Tailwind's gray-900 */
}
textarea {
width: 100%;
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
resize: none;
font-size: 1rem; /* Tailwind's text-base */
line-height: 1.5rem; /* Tailwind's leading-relaxed */
margin-top: 0.5rem; /* Tailwind's mt-2 */
margin-bottom: 0;
box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
}
textarea:focus {
outline: none;
border-color: #3b82f6; /* Tailwind's border-blue-500 */
box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
}
img {
max-width: 100%;
border-radius: 0.5rem; /* Tailwind's rounded-md */
margin-top: 1.5rem; /* Tailwind's mt-6 */
box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
}
/* --- Error Styles --- */
.error-message {
color: #dc2626; /* Tailwind's text-red-600 */
margin-top: 1rem; /* Tailwind's mt-4 */
padding: 0.75rem;
background-color: #fee2e2; /* Tailwind's bg-red-100 */
border-radius: 0.375rem; /* Tailwind's rounded-md */
border: 1px solid #fecaca; /* Tailwind's border-red-300 */
text-align: center;
}
</style>
</head>
<body>
<div class="container">
<h2>🎤🧠🎨 Multimodal Assistant</h2>
<p> Upload an audio file to transcribe and generate a corresponding image. </p>
<form method="POST" enctype="multipart/form-data">
<label for="audio_file">Upload your voice note:</label><br>
<input type="file" name="audio_file" accept="audio/*" required><br><br>
<input type="submit" value="Generate Visual Response">
</form>
{% if transcript %}
<div class = "transcript-container">
<h3>📝 Transcript:</h3>
<textarea readonly>{{ transcript }}</textarea>
</div>
{% endif %}
{% if prompt_summary %}
<div class = "result-container">
<h3>🎯 Prompt Used for Image:</h3>
<p>{{ prompt_summary }}</p>
</div>
{% endif %}
{% if image_url %}
<div class = "result-container">
<h3>🖼️ Generated Image:</h3>
<img src="{{ image_url }}" alt="Generated image">
</div>
{% endif %}
{% if error %}
<div class="error-message">{{ error }}</div>
{% endif %}
</div>
</body>
</html>
Key elements in the HTML template:
- HTML Structure:
  - The `<head>` section defines the title, loads the Inter font stylesheet, and sets the viewport for responsiveness.
  - The `<body>` contains the visible content: a form for uploading audio and sections that display the transcription, the prompt used for the image, and the generated image.
- CSS Styling:
  - Modern design: the embedded CSS uses Tailwind-inspired spacing, colors, and shadows.
  - Responsive layout: the container adapts to smaller screens.
  - User experience: clear, consistent form and input styling.
  - Clear error display: error messages are styled to stand out.
- Form:
  - A `<form>` with `enctype="multipart/form-data"` handles the file upload.
  - A `<label>` and an `<input type="file">` let the user select an audio file; the `accept="audio/*"` attribute restricts the picker to audio files.
  - An `<input type="submit">` button submits the form.
- Transcription and Image Display:
  - Jinja2 templating conditionally shows the transcription, the prompt, and the generated image when they are available. The transcription is displayed in a `textarea`, and the image is displayed with an `<img>` tag.
- Error Handling:
  - A `<div class="error-message">` displays any error message returned by the server. (A minimal `error.html` for the 500 handler, which the Flask code references but the steps above do not create, is sketched after this list.)
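One loose end: the `@app.errorhandler(500)` handler in `app.py` renders `templates/error.html`, which the steps above never create. A minimal version in the same spirit as `index.html` might look like the sketch below; the exact markup is up to you.

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Error - Multimodal Assistant</title>
</head>
<body>
    <div class="container">
        <h2>⚠️ Something went wrong</h2>
        <p>{{ error }}</p>
        <p><a href="/">Back to the assistant</a></p>
    </div>
</body>
</html>
```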
Try It Out
- Save the files as `app.py` and `templates/index.html`.
- Ensure your OpenAI API key is in the `.env` file.
- Run the application: `python app.py`
- Open http://localhost:5000 in your browser.
- Upload an audio file (you can use the provided sample .mp3 file).
- View the transcription and the generated image on the page. (A script-based smoke test is sketched below if you prefer testing outside the browser.)
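If you would rather exercise the endpoint from a script than from the browser, a quick smoke test could look like the following. It assumes the `requests` package is installed (`pip install requests`) and that the sample MP3 sits in the working directory; neither is part of the project setup above.

```python
import requests

# Post the sample audio file to the running Flask app as a multipart form upload.
with open("audio-file-sample.mp3", "rb") as f:
    response = requests.post("http://localhost:5000/", files={"audio_file": f})

print("Status:", response.status_code)
# The route returns rendered HTML, so check for the generated-image section heading.
print("Image generated:", "Generated Image" in response.text)
```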
5.5.3 How This Works (Behind the Scenes)
This demonstrates multimodal orchestration at work - a sophisticated process where different AI models collaborate through your application's logic layer. Each model specializes in a different form of data processing (audio, text, and image), and together they create a seamless experience. The application coordinates these models, handling the data transformation between each step and ensuring proper communication flow.
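Stripped of Flask's request handling, the orchestration in this application reduces to a straight hand-off between the three helper functions defined in `app.py`. The sketch below is illustrative rather than part of the app itself:

```python
def run_pipeline(audio_path: str) -> dict:
    """Chain the three models: audio -> transcript -> prompt -> image URL."""
    transcript = transcribe_audio(audio_path)       # Whisper: speech to text
    if not transcript:
        return {"error": "transcription failed"}

    prompt = generate_image_prompt(transcript)      # GPT-4o: text to visual prompt
    if not prompt:
        return {"error": "prompt generation failed"}

    image_url = generate_image(prompt)              # DALL·E 3: prompt to image
    if not image_url:
        return {"error": "image generation failed"}

    return {"transcript": transcript, "prompt": prompt, "image_url": image_url}
```

Each stage consumes the previous stage's output, which is exactly what the walkthrough below traces with a concrete voice note.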
Example Flow: A Detailed Walkthrough
To better understand how this multimodal system works, let's walk through a complete example. When a user uploads a voice note saying:
"I had the most peaceful morning — sitting by a lake with birds singing and the sun rising behind the trees."
The system processes this input through three distinct stages:
- Audio Transcription: First, Whisper converts the audio to text, maintaining accuracy even with background noise or accent variations. Result: "I had the most peaceful morning…"
- Scene Analysis and Enhancement: GPT-4o analyzes the transcribed text, identifying key visual elements and spatial relationships to create an optimized image prompt. Result: "A sunrise over a tranquil lake, with birds in the sky and trees reflecting in the water"
- Visual Creation: DALL·E takes this refined prompt and generates a photorealistic image, carefully balancing all the described elements into a cohesive scene
The Power of Integration
In this final section of the chapter, you created a multimodal mini-assistant that demonstrates the seamless integration of three distinct AI capabilities:
- Whisper: Advanced speech recognition that handles various accents, languages, and audio qualities with remarkable accuracy
- GPT-4o: Sophisticated language processing that understands context, emotion, and scene composition to create detailed image descriptions
- DALL·E: State-of-the-art image generation that translates text descriptions into vivid, coherent visual scenes
This integration showcases the future of AI applications where multiple models work in concert to process, understand, and respond to user input in rich, meaningful ways. By orchestrating these models together, you've created an intuitive interface that bridges the gap between human communication and AI capabilities.