OpenAI API Bible Volume 2

Chapter 6: Cross-Model AI Suites

6.1 Combining GPT + DALL·E + Whisper

In previous chapters, you've explored the individual capabilities of three powerful AI models: GPT for natural language processing and text generation, DALL·E for creating detailed images from text descriptions, and Whisper for accurate speech-to-text conversion. You've learned to build standalone applications like chatbots that engage in natural conversations, visual generators that bring ideas to life, and transcription tools that convert spoken words to text. You've even dabbled in simple multimodal applications that use two of these technologies together.

Now, we're taking a significant leap forward: you'll learn to create cohesive AI suites — sophisticated systems that seamlessly integrate speech recognition, text processing, and image generation into unified, powerful applications. These suites can handle complex workflows, such as converting a spoken description into a written narrative and then transforming that narrative into a visual representation, all in one smooth process.

Consider this chapter your advancement from being an API-proficient developer to becoming an architect of modular, orchestrated AI workflows. This is the same technology that powers many of today's leading software solutions. Major companies implement these integrated AI systems in their productivity tools (like advanced document processors), customer experience platforms (such as intelligent support systems), and creative applications (including AI-powered design tools) that serve massive user bases ranging from thousands to millions.

Throughout this chapter, you'll master several crucial skills:

  • Chain models with dynamic logic - Learn to create intelligent decision trees that determine how different AI models should interact and when to invoke specific capabilities
  • Handle input across multiple formats - Develop robust systems that can process and validate various types of input, from audio files to text prompts to image data
  • Return meaningful output across modalities - Create sophisticated response handling that can deliver results in multiple formats while maintaining context and coherence
  • Build real-time or near real-time pipelines - Optimize your applications for performance, ensuring quick response times even when multiple AI models are working together

And we begin with a foundational section:

Building sophisticated AI applications requires the careful integration of multiple specialized models to create a seamless, intelligent experience. Each AI model serves as a master of its domain: GPT excels in understanding and generating human-like text, DALL·E specializes in creating stunning visual artwork from textual descriptions, and Whisper demonstrates remarkable accuracy in converting spoken words to text. However, the true innovation emerges when these individual powerhouses are orchestrated to work together in perfect harmony.

This sophisticated integration enables the creation of applications that mirror human cognitive processes by handling multiple types of information simultaneously. Consider the natural flow of human communication: we speak, understand context, and visualize concepts seamlessly. Now imagine an AI system that matches this natural process: you describe a scene verbally, the system processes your speech into text, understands the context and details of your description, and then transforms that understanding into a visual representation - all flowing smoothly from one step to the next, just as your brain would process the same information.

In this section, we'll explore the intricacies of building such a cross-model pipeline. You'll master essential concepts like efficient data transformation between models (ensuring that the output from one model is optimally formatted for the next), sophisticated asynchronous process management (allowing multiple models to work simultaneously when possible), and the implementation of clean, maintainable code architecture. We'll dive deep into handling edge cases, managing model-specific quirks, and ensuring smooth data flow throughout the entire pipeline. Through a comprehensive practical example, you'll gain hands-on experience with these concepts, preparing you to architect and deploy your own sophisticated multi-model AI applications that can scale efficiently and maintain high performance under real-world conditions.

In this section, you'll create a Flask-based web application that integrates multiple AI models to process audio input and generate a corresponding image.  Specifically, the application will:

  • Accept an audio file uploaded by the user.
  • Transcribe the audio content into text using OpenAI's Whisper API.
  • Analyze the transcribed text using GPT-4o to extract a descriptive scene representation.
  • Generate an image based on the scene description using OpenAI's DALL·E 3 API.
  • Display both the text transcription and the generated image on a single webpage.

This project demonstrates a basic multimodal AI pipeline, combining speech-to-text and text-to-image generation.  It establishes a foundation for building more sophisticated applications.

6.1.1 Step-by-Step Implementation

Step 1: Set Up Project Structure

Download the sample file: https://files.cuantum.tech/audio/gpt-dalle-whisper-sample.mp3

Organize your project files as follows:

/multimodal_app

├── app.py
├── .env
└── templates/
    └── index.html
└── utils/
    ├── transcribe.py
    ├── generate_prompt.py
    ├── generate_image.py
    └── audio_analysis.py  # Optional placeholder for audio analysis (not used in this example)
  • /multimodal_app: The root directory for your project.
  • app.py: The main Flask application file.
  • .env: A file to store your OpenAI API key.
  • templates/: A directory for HTML templates.
  • templates/index.html: The HTML template for the user interface.
  • utils/: A directory for Python modules containing reusable functions.
    • transcribe.py: Contains the function to transcribe audio using Whisper.
    • generate_prompt.py: Contains the function to generate an image prompt using GPT-4o.
    • generate_image.py: Contains the function to generate an image with DALL·E 3.
    • audio_analysis.py: Optional placeholder for future audio analysis; it is not used in the steps below.

Step 2: Install Required Packages

Install the necessary Python libraries:

pip install flask openai python-dotenv
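
The .env file referenced in Step 1 only needs to hold your OpenAI API key, which python-dotenv loads at startup. A minimal example (the value shown is a placeholder, not a real key):

OPENAI_API_KEY=sk-your-api-key-here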

Step 3: Create Utility Modules

Create the following Python files in the utils/ directory:

utils/transcribe.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def transcribe_audio(file_path: str) -> Optional[str]:
    """
    Transcribes an audio file using OpenAI's Whisper API.

    Args:
        file_path (str): The path to the audio file.

    Returns:
        Optional[str]: The transcribed text, or None on error.
    """
    try:
        logger.info(f"Transcribing audio: {file_path}")
        # Open the file in a context manager so it is closed even if an error occurs
        with open(file_path, "rb") as audio_file:
            response = openai.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
            )
        transcript = response.text
        logger.info(f"Transcription length: {len(transcript)} characters")
        return transcript
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error during transcription: {e}")
        return None
  • This module defines the transcribe_audio function, which takes the path to an audio file and uses OpenAI's Whisper API to generate a text transcription.
  • The function opens the audio file in binary read mode ("rb") inside a with block, so the file is closed automatically whether or not the call succeeds.
  • It calls openai.audio.transcriptions.create() to perform the transcription, specifying the "whisper-1" model.
  • It extracts the transcribed text from the API response.
  • It includes error handling using a try...except block that catches openai.OpenAIError (specific to OpenAI) as well as a general Exception for other errors. If an error occurs, it logs the error and returns None.
  • It logs the file path before transcription and the length of the transcribed text after a successful transcription.
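
As a quick sanity check, you can call the function on its own before building the rest of the app. The script below is only a sketch: it assumes your key is already in .env and that the sample audio file from Step 1 has been saved locally (adjust the path to wherever you placed it):

import os
import openai
from dotenv import load_dotenv

from utils.transcribe import transcribe_audio

# The utility relies on the module-level API key being set
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

text = transcribe_audio("gpt-dalle-whisper-sample.mp3")  # example path
print(text if text else "Transcription failed")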

utils/generate_prompt.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def create_image_prompt(transcription: str) -> Optional[str]:
    """
    Generates a detailed image prompt from a text transcription using OpenAI's Chat Completion API.

    Args:
        transcription (str): The text transcription of the audio.

    Returns:
        Optional[str]: A detailed text prompt suitable for image generation, or None on error.
    """
    try:
        logger.info("Generating image prompt from transcription")
        response = openai.chat.completions.create(
            model="gpt-4o",  #  Use a powerful chat model
            messages=[
                {
                    "role": "system",
                    "content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content.  Do not include any phrases like 'based on the audio' or 'from the user audio'.  Incorporate scene lighting, time of day, weather, and camera angle into the description.",
                },
                {"role": "user", "content": transcription},
            ],
        )
        prompt = response.choices[0].message.content
        logger.info(f"Generated prompt: {prompt}")
        return prompt
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image prompt: {e}")
        return None
  • This module defines the create_image_prompt function, which takes the transcribed text as input and uses OpenAI's Chat Completion API to generate a detailed text prompt for image generation.
  • The system message instructs the model to act as a creative assistant and to generate a vivid scene description. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle.
  • The user message provides the transcribed text as the content for the model to work with.
  • The function extracts the generated prompt from the API response.
  • It includes error handling.

utils/generate_image.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def generate_dalle_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024", response_format: str = "url") -> Optional[str]:
    """
    Generates an image using OpenAI's DALL·E API.

    Args:
        prompt (str): The text prompt to generate the image from.
        model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
        size (str, optional): The size of the generated image. Defaults to "1024x1024".
        response_format (str, optional): The format of the response. Defaults to "url".

    Returns:
        Optional[str]: The URL of the generated image, or None on error.
    """
    try:
        logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}")
        response = openai.images.generate(
            prompt=prompt,
            model=model,
            size=size,
            response_format=response_format,
        )
        image_url = response.data[0].url
        logger.info(f"Image URL: {image_url}")
        return image_url
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image: {e}")
        return None
  • This module defines the generate_dalle_image function, which takes a text prompt as input and uses OpenAI's DALL·E API to generate an image.
  • It calls the openai.images.generate() method to generate the image.
  • It extracts the URL of the generated image from the API response.
  • It includes error handling.
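
Before wiring these three utilities into Flask, it can be useful to confirm that the whole chain works from a plain script. The sketch below makes the same assumptions as the earlier example (API key in .env, sample audio file saved locally under an example path):

import os
import openai
from dotenv import load_dotenv

from utils.transcribe import transcribe_audio
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Run the pipeline end to end: audio -> transcript -> scene prompt -> image URL
transcript = transcribe_audio("gpt-dalle-whisper-sample.mp3")  # example path
if transcript:
    prompt = create_image_prompt(transcript)
    if prompt:
        image_url = generate_dalle_image(prompt)
        print("Transcript:", transcript)
        print("Prompt:", prompt)
        print("Image URL:", image_url)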

Step 4: Create the Main App (app.py)

Create a Python file named app.py in the root directory of your project and add the following code:

from flask import Flask, request, render_template
import os
import openai
from dotenv import load_dotenv
import logging
from werkzeug.utils import secure_filename
from werkzeug.datastructures import FileStorage

# Import the utility functions from the utils directory
from utils.transcribe import transcribe_audio
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads'  # Store uploaded files
app.config['MAX_CONTENT_LENGTH'] = 25 * 1024 * 1024  # 25MB max file size - increased for larger audio files
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)  # Create the upload folder if it doesn't exist

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'}  # Allowed audio file extensions


def allowed_file(filename: str) -> bool:
    """
    Checks if the uploaded file has an allowed extension.

    Args:
        filename (str): The name of the file.

    Returns:
        bool: True if the file has an allowed extension, False otherwise.
    """
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


@app.route("/", methods=["GET", "POST"])
def index():
    """
    Handles the main route for the web application.
    Processes audio uploads, transcribes them, generates image prompts, and displays images.
    """
    transcript = None
    image_url = None
    prompt_summary = None
    error_message = None

    if request.method == "POST":
        if 'audio_file' not in request.files:
            error_message = "No file part"
            logger.warning(error_message)
            return render_template("index.html", error=error_message, error_message=error_message)

        file: FileStorage = request.files['audio_file']  # Use type hinting
        if file.filename == '':
            error_message = "No file selected"
            logger.warning(error_message)
            return render_template("index.html", error=error_message, error_message=error_message)

        if file and allowed_file(file.filename):
            try:
                # Secure the filename and construct a safe path
                filename = secure_filename(file.filename)
                file_path = os.path.join(app.config['UPLOAD_FOLDER'], filename)
                file.save(file_path)  # Save the uploaded file

                transcript = transcribe_audio(file_path)  # Transcribe audio
                if not transcript:
                    error_message = "Audio transcription failed. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error=error_message, error_message=error_message)

                prompt_summary = create_image_prompt(transcript)  # Generate prompt
                if not prompt_summary:
                    error_message = "Failed to generate image prompt. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error=error_message, error_message=error_message)

                image_url = generate_dalle_image(prompt_summary)  # Generate image
                if not image_url:
                    error_message = "Failed to generate image. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error=error_message, error_message=error_message)

                # Optionally, delete the uploaded file after processing
                os.remove(file_path)
                logger.info(f"Successfully processed audio file and generated image.")
                return render_template("index.html", transcript=transcript, image_url=image_url, prompt_summary=prompt_summary)

            except Exception as e:
                error_message = f"An error occurred: {e}"
                logger.error(error_message)
                return render_template("index.html", error=error_message, error_message=error_message)
        else:
            error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
            logger.warning(error_message)
            return render_template("index.html", error=error_message, error_message=error_message)

    return render_template("index.html", transcript=transcript, image_url=image_url, prompt_summary=prompt_summary,
                           error=error_message)



@app.errorhandler(500)
def internal_server_error(e):
    """Handles internal server errors by re-rendering the main page with a generic message."""
    logger.error(f"Internal Server Error: {e}")
    return render_template("index.html", error="Internal Server Error"), 500


if __name__ == "__main__":
    app.run(debug=True)

Code Breakdown:

  • Import Statements: Imports the Flask objects used by the app, the OpenAI library, os, dotenv, and logging, plus secure_filename and FileStorage from Werkzeug, and the three utility functions from the utils/ package.
  • Environment Variables: Loads the OpenAI API key from the .env file.
  • Flask Application:
    • Creates a Flask application instance.
    • Configures an upload folder and maximum file size. The UPLOAD_FOLDER is set to 'uploads', and MAX_CONTENT_LENGTH is set to 25MB. The upload folder is created if it does not exist.
  • Logging Configuration: Configures logging.
  • allowed_file Function: Checks if the uploaded file has an allowed audio extension.
  • transcribe_audio Function (imported from utils/transcribe.py):
    • Takes the audio file path as input.
    • Opens the audio file in binary mode ("rb").
    • Calls the OpenAI API's openai.audio.transcriptions.create() method to transcribe the audio.
    • Extracts the transcribed text from the API response.
    • Logs the file path before transcription and the length of the transcribed text after successful transcription.
    • Includes error handling for OpenAI API errors and other exceptions. The audio file is closed after transcription.
  • create_image_prompt Function (imported from utils/generate_prompt.py):
    • Takes the transcribed text as input.
    • Uses the OpenAI Chat Completion API (openai.chat.completions.create()) with the gpt-4o model to generate a text prompt suitable for image generation.
    • The system message instructs the model to act as a creative assistant and provide a vivid and detailed description of a scene that could be used to generate an image with an AI image generation model.
    • Extracts the generated prompt from the API response.
    • Includes error handling.
  • generate_dalle_image Function (imported from utils/generate_image.py):
    • Takes the image prompt as input.
    • Calls the OpenAI API's openai.images.generate() method to generate an image using DALL·E 3.
    • Extracts the image URL from the API response.
    • Includes error handling.
  • index Route:
    • Handles both GET and POST requests.
    • For GET requests, it renders the initial HTML page.
    • For POST requests (when the user uploads an audio file):
      • It validates the uploaded file:
        • Checks if the file part exists in the request.
        • Checks if a file was selected.
        • Checks if the file type is allowed using the allowed_file function.
      • It saves the uploaded file to a temporary location using a secure filename.
      • It calls the utility functions to:
        • Transcribe the audio using transcribe_audio().
        • Generate an image prompt from the transcription using create_image_prompt().
        • Generate an image from the prompt using generate_dalle_image().
      • It handles errors that may occur during any of these steps, logging the error and rendering the index.html template with an appropriate error message.
      • If all steps are successful, it renders the index.html template, passing the transcription text, image URL, and generated prompt to be displayed.
      • It deletes the uploaded file after processing
  • @app.errorhandler(500): Handles HTTP 500 errors (Internal Server Error) by logging the error and re-rendering the main page with a generic error message.
  • if __name__ == "__main__":: Starts the Flask development server if the script is executed directly.

Step 5: Create the HTML Template (templates/index.html)

Create a folder named templates in the same directory as app.py. Inside the templates folder, create a file named index.html with the following HTML code:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Multimodal AI Assistant</title>
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
    <style>
        /* --- General Styles --- */
        body {
            font-family: 'Inter', sans-serif;
            padding: 40px;
            background-color: #f9fafb; /* Tailwind's gray-50 */
            display: flex;
            justify-content: center;
            align-items: center;
            min-height: 100vh;
            margin: 0;
            color: #374151; /* Tailwind's gray-700 */
        }
        .container {
            max-width: 800px; /* Increased max-width */
            width: 95%; /* Take up most of the viewport */
            background-color: #fff;
            padding: 2rem;
            border-radius: 0.75rem; /* Tailwind's rounded-lg */
            box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
            text-align: center;
        }
        h2 {
            font-size: 2.25rem; /* Tailwind's text-3xl */
            font-weight: 600;  /* Tailwind's font-semibold */
            margin-bottom: 1.5rem; /* Tailwind's mb-6 */
            color: #1e293b; /* Tailwind's gray-900 */
        }
        p{
            color: #6b7280; /* Tailwind's gray-500 */
            margin-bottom: 1rem;
        }

        /* --- Form Styles --- */
        form {
            margin-top: 1rem; /* Tailwind's mt-4 */
            margin-bottom: 1.5rem;
            display: flex;
            flex-direction: column;
            align-items: center; /* Center form elements */
            gap: 0.5rem; /* Tailwind's gap-2 */
        }
        label {
            font-size: 1rem; /* Tailwind's text-base */
            font-weight: 600;  /* Tailwind's font-semibold */
            color: #4b5563; /* Tailwind's gray-600 */
            margin-bottom: 0.25rem; /* Tailwind's mb-1 */
            display: block; /* Ensure label takes full width */
            text-align: left;
            width: 100%;
            max-width: 400px; /* Added max-width for label */
            margin-left: auto;
            margin-right: auto;
        }
        input[type="file"] {
            width: 100%;
            max-width: 400px; /* Added max-width for file input */
            padding: 0.75rem; /* Tailwind's p-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
            font-size: 1rem; /* Tailwind's text-base */
            margin-bottom: 0.25rem; /* Tailwind's mb-1 */
            margin-left: auto;
            margin-right: auto;
        }
        input[type="submit"] {
            padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            background-color: #4f46e5; /* Tailwind's bg-indigo-500 */
            color: #fff;
            font-size: 1rem; /* Tailwind's text-base */
            font-weight: 600; /* Tailwind's font-semibold */
            cursor: pointer;
            transition: background-color 0.3s ease; /* Smooth transition */
            border: none;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
            margin-top: 1rem;
        }
        input[type="submit"]:hover {
            background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
        }
        input[type="submit"]:focus {
            outline: none;
            box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
        }

        /* --- Result Styles --- */
        .result-container {
            margin-top: 2rem; /* Tailwind's mt-8 */
            padding: 1.5rem; /* Tailwind's p-6 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            background-color: #f8fafc; /* Tailwind's bg-gray-50 */
            border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
            box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
            text-align: left;
        }
        h3 {
            font-size: 1.5rem; /* Tailwind's text-2xl */
            font-weight: 600;  /* Tailwind's font-semibold */
            margin-bottom: 1rem; /* Tailwind's mb-4 */
            color: #1e293b; /* Tailwind's gray-900 */
        }
        textarea {
            width: 100%;
            padding: 0.75rem; /* Tailwind's p-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
            resize: none;
            font-size: 1rem; /* Tailwind's text-base */
            line-height: 1.5rem; /* Tailwind's leading-relaxed */
            margin-top: 0.5rem; /* Tailwind's mt-2 */
            margin-bottom: 0;
            box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
            min-height: 100px;
        }
        textarea:focus {
            outline: none;
            border-color: #3b82f6; /* Tailwind's border-blue-500 */
            box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
        }
        img {
            max-width: 100%;
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            margin-top: 1.5rem; /* Tailwind's mt-6 */
            box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
        }

        /* --- Error Styles --- */
        .error-message {
            color: #dc2626; /* Tailwind's text-red-600 */
            margin-top: 1rem; /* Tailwind's mt-4 */
            padding: 0.75rem;
            background-color: #fee2e2; /* Tailwind's bg-red-100 */
            border-radius: 0.375rem; /* Tailwind's rounded-md */
            border: 1px solid #fecaca; /* Tailwind's border-red-300 */
            text-align: center;
        }
    </style>
</head>
<body>
    <div class="container">
        <h2>🎤🧠🎨 Multimodal Assistant</h2>
        <p> Upload an audio file to transcribe and generate a corresponding image. </p>
        <form method="POST" enctype="multipart/form-data">
            <label for="audio_file">Upload your voice note:</label><br>
            <input type="file" name="audio_file" accept="audio/*" required><br><br>
            <input type="submit" value="Generate Visual Response">
        </form>

        {% if transcript %}
            <div class="result-container">
                <h3>📝 Transcript:</h3>
                <textarea readonly>{{ transcript }}</textarea>
            </div>
        {% endif %}

        {% if prompt_summary %}
            <div class="result-container">
                <h3>🎯 Scene Prompt:</h3>
                <p>{{ prompt_summary }}</p>
            </div>
        {% endif %}

        {% if image_url %}
            <div class="result-container">
                <h3>🖼️ Generated Image:</h3>
                <img src="{{ image_url }}" alt="Generated image">
            </div>
        {% endif %}
        {% if error %}
            <div class="error-message">{{ error }}</div>
        {% endif %}
    </div>
</body>
</html>

Key elements in the HTML template:

  • HTML Structure:
    • The <head> section defines the title, links a CSS stylesheet, and sets the viewport for responsiveness.
    • The <body> contains the visible content, including a form for uploading audio and sections to display the transcription and generated image.
  • CSS Styling:
    • Modern, responsive design.
    • Styled form and input elements.
    • Clear presentation of results (transcription and image).
    • User-friendly error message display.
  • Form:
    • <form> with enctype="multipart/form-data" is used to handle file uploads.
    • <label> and <input type="file"> allow the user to select an audio file. The accept="audio/*" attribute restricts the user to uploading audio files.
    • <input type="submit"> button allows the user to submit the form.
  • Transcription and Image Display:
    • The template uses Jinja2 templating to conditionally display the transcription text and the generated image if they are available. The transcription is displayed in a textarea, and the image is displayed using an <img> tag.
  • Error Handling:
    • <div class="error-message"> is used to display any error messages to the user.

In this section, you've gained valuable insights into advanced AI integration techniques. Let's break down what you've learned:

  • Organize multimodal logic into reusable modules
    • Create clean, maintainable code structures
    • Develop modular components that can be easily updated and reused
    • Implement proper error handling and logging
  • Chain audio ➝ text ➝ prompt ➝ image cleanly
    • Process audio inputs using Whisper for accurate transcription
    • Transform transcribed text into meaningful prompts with GPT
    • Generate relevant images using DALL·E based on processed text
  • Build a professional Flask app that uses all three major OpenAI models in one flow
    • Set up proper routing and request handling
    • Manage API interactions efficiently
    • Create an intuitive user interface for seamless interaction

You now understand the power of chaining models to create sophisticated AI experiences. This knowledge opens up countless possibilities for innovation. Whether you're building an AI journaling tool that converts voice notes into illustrated entries, a voice-controlled design app that transforms spoken descriptions into visual art, or a multimodal content assistant that helps create rich media content, this foundational workflow can take you far. The skills you've learned here form the basis for creating complex, user-friendly AI applications that combine multiple modalities effectively.

6.1 Combining GPT + DALL·E + Whisper

In previous chapters, you've explored the individual capabilities of three powerful AI models: GPT for natural language processing and text generation, DALL·E for creating detailed images from text descriptions, and Whisper for accurate speech-to-text conversion. You've learned to build standalone applications like chatbots that engage in natural conversations, visual generators that bring ideas to life, and transcription tools that convert spoken words to text. You've even dabbled in simple multimodal applications that use two of these technologies together.

Now, we're taking a significant leap forward: you'll learn to create cohesive AI suites — sophisticated systems that seamlessly integrate speech recognition, text processing, and image generation into unified, powerful applications. These suites can handle complex workflows, such as converting a spoken description into a written narrative and then transforming that narrative into a visual representation, all in one smooth process.

Consider this chapter your advancement from being an API-proficient developer to becoming an architect of modular, orchestrated AI workflows. This is the same technology that powers many of today's leading software solutions. Major companies implement these integrated AI systems in their productivity tools (like advanced document processors), customer experience platforms (such as intelligent support systems), and creative applications (including AI-powered design tools) that serve massive user bases ranging from thousands to millions.

Throughout this chapter, you'll master several crucial skills:

  • Chain models with dynamic logic - Learn to create intelligent decision trees that determine how different AI models should interact and when to invoke specific capabilities
  • Handle input across multiple formats - Develop robust systems that can process and validate various types of input, from audio files to text prompts to image data
  • Return meaningful output across modalities - Create sophisticated response handling that can deliver results in multiple formats while maintaining context and coherence
  • Build real-time or near real-time pipelines - Optimize your applications for performance, ensuring quick response times even when multiple AI models are working together

And we begin with a foundational section:

Building sophisticated AI applications requires the careful integration of multiple specialized models to create a seamless, intelligent experience. Each AI model serves as a master of its domain: GPT excels in understanding and generating human-like text, DALL·E specializes in creating stunning visual artwork from textual descriptions, and Whisper demonstrates remarkable accuracy in converting spoken words to text. However, the true innovation emerges when these individual powerhouses are orchestrated to work together in perfect harmony.

This sophisticated integration enables the creation of applications that mirror human cognitive processes by handling multiple types of information simultaneously. Consider the natural flow of human communication: we speak, understand context, and visualize concepts seamlessly. Now imagine an AI system that matches this natural process: you describe a scene verbally, the system processes your speech into text, understands the context and details of your description, and then transforms that understanding into a visual representation - all flowing smoothly from one step to the next, just as your brain would process the same information.

In this section, we'll explore the intricacies of building such a cross-model pipeline. You'll master essential concepts like efficient data transformation between models (ensuring that the output from one model is optimally formatted for the next), sophisticated asynchronous process management (allowing multiple models to work simultaneously when possible), and the implementation of clean, maintainable code architecture. We'll dive deep into handling edge cases, managing model-specific quirks, and ensuring smooth data flow throughout the entire pipeline. Through a comprehensive practical example, you'll gain hands-on experience with these concepts, preparing you to architect and deploy your own sophisticated multi-model AI applications that can scale efficiently and maintain high performance under real-world conditions.

In this section, you'll create a Flask-based web application that integrates multiple AI models to process audio input and generate a corresponding image.  Specifically, the application will:

  • Accept an audio file uploaded by the user.
  • Transcribe the audio content into text using OpenAI's Whisper API.
  • Analyze the transcribed text using GPT-4o to extract a descriptive scene representation.
  • Generate an image based on the scene description using OpenAI's DALL·E 3 API.
  • Display both the text transcription and the generated image on a single webpage.

This project demonstrates a basic multimodal AI pipeline, combining speech-to-text and text-to-image generation.  It establishes a foundation for building more sophisticated applications.

6.1.1 Step-by-Step Implementation

Step 1: Set Up Project Structure

Download the sample file: https://files.cuantum.tech/audio/gpt-dalle-whisper-sample.mp3

Organize your project files as follows:

/multimodal_app

├── app.py
├── .env
└── templates/
    └── index.html
└── utils/
    ├── transcribe.py
    ├── generate_prompt.py
    ├── generate_image.py
    └── audio_analysis.py  # New module for audio analysis
  • /multimodal_app: The root directory for your project.
  • app.py: The main Flask application file.
  • .env: A file to store your OpenAI API key.
  • templates/: A directory for HTML templates.
  • templates/index.html: The HTML template for the user interface.
  • utils/: A directory for Python modules containing reusable functions.
    • transcribe.py: Contains the function to transcribe audio using Whisper.
    • generate_prompt.py: Contains the function to generate an image prompt using GPT-4o.
    • generate_image.py: Contains the function to generate an image with DALL·E 3.
    • audio_analysis.py: New module to analyze audio.

Step 2: Install Required Packages

Install the necessary Python libraries:

pip install flask openai python-dotenv

Step 3: Create Utility Modules

Create the following Python files in the utils/ directory:

utils/transcribe.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def transcribe_audio(file_path: str) -> Optional[str]:
    """
    Transcribes an audio file using OpenAI's Whisper API.

    Args:
        file_path (str): The path to the audio file.

    Returns:
        Optional[str]: The transcribed text, or None on error.
    """
    try:
        logger.info(f"Transcribing audio: {file_path}")
        audio_file = open(file_path, "rb")
        response = openai.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
        transcript = response.text
        audio_file.close()
        return transcript
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error during transcription: {e}")
        return None
  • This module defines the transcribe_audio function, which takes the path to an audio file as input and uses OpenAI's Whisper API to generate a text transcription.
  • The function opens the audio file in binary read mode ("rb").
  • It calls openai.audio.transcriptions.create() to perform the transcription, specifying the "whisper-1" model.
  • It extracts the transcribed text from the API response.
  • It includes error handling using a try...except block to catch potential openai.error.OpenAIError exceptions (specific to OpenAI) and general Exception for other errors. If an error occurs, it logs the error and returns None.
  • It logs the file path before transcription and the length of the transcribed text after successful transcription.
  • The audio file is closed after transcription.

utils/generate_prompt.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def create_image_prompt(transcription: str) -> Optional[str]:
    """
    Generates a detailed image prompt from a text transcription using OpenAI's Chat Completion API.

    Args:
        transcription (str): The text transcription of the audio.

    Returns:
        Optional[str]: A detailed text prompt suitable for image generation, or None on error.
    """
    try:
        logger.info("Generating image prompt from transcription")
        response = openai.chat.completions.create(
            model="gpt-4o",  #  Use a powerful chat model
            messages=[
                {
                    "role": "system",
                    "content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content.  Do not include any phrases like 'based on the audio' or 'from the user audio'.  Incorporate scene lighting, time of day, weather, and camera angle into the description.",
                },
                {"role": "user", "content": transcription},
            ],
        )
        prompt = response.choices[0].message.content
        logger.info(f"Generated prompt: {prompt}")
        return prompt
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image prompt: {e}")
        return None
  • This module defines the create_image_prompt function, which takes the transcribed text as input and uses OpenAI's Chat Completion API to generate a detailed text prompt for image generation.
  • The system message instructs the model to act as a creative assistant and to generate a vivid scene description. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle.
  • The user message provides the transcribed text as the content for the model to work with.
  • The function extracts the generated prompt from the API response.
  • It includes error handling.

utils/generate_image.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def generate_dalle_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024", response_format: str = "url") -> Optional[str]:
    """
    Generates an image using OpenAI's DALL·E API.

    Args:
        prompt (str): The text prompt to generate the image from.
        model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
        size (str, optional): The size of the generated image. Defaults to "1024x1024".
        response_format (str, optional): The format of the response. Defaults to "url".

    Returns:
        Optional[str]: The URL of the generated image, or None on error.
    """
    try:
        logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}")
        response = openai.images.generate(
            prompt=prompt,
            model=model,
            size=size,
            response_format=response_format,
        )
        image_url = response.data[0].url
        logger.info(f"Image URL: {image_url}")
        return image_url
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image: {e}")
        return None
  • This module defines the generate_dalle_image function, which takes a text prompt as input and uses OpenAI's DALL·E API to generate an image.
  • It calls the openai.images.generate() method to generate the image.
  • It extracts the URL of the generated image from the API response.
  • It includes error handling.

Step 4: Create the Main App (app.py)

Create a Python file named app.py in the root directory of your project and add the following code:

from flask import Flask, request, render_template, jsonify, make_response, redirect, url_for
import os
from dotenv import load_dotenv
import logging
from typing import Optional
from werkzeug.utils import secure_filename
from werkzeug.datastructures import FileStorage

# Import the utility functions from the utils directory
from utils.transcribe import transcribe_audio
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads'  # Store uploaded files
app.config['MAX_CONTENT_LENGTH'] = 25 * 1024 * 1024  # 25MB max file size - increased for larger audio files
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)  # Create the upload folder if it doesn't exist

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'}  # Allowed audio file extensions


def allowed_file(filename: str) -> bool:
    """
    Checks if the uploaded file has an allowed extension.

    Args:
        filename (str): The name of the file.

    Returns:
        bool: True if the file has an allowed extension, False otherwise.
    """
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


@app.route("/", methods=["GET", "POST"])
def index():
    """
    Handles the main route for the web application.
    Processes audio uploads, transcribes them, generates image prompts, and displays images.
    """
    transcript = None
    image_url = None
    prompt_summary = None
    error_message = None

    if request.method == "POST":
        if 'audio_file' not in request.files:
            error_message = "No file part"
            logger.warning(error_message)
            return render_template("index.html", error=error_message, error_message=error_message)

        file: FileStorage = request.files['audio_file']  # Use type hinting
        if file.filename == '':
            error_message = "No file selected"
            logger.warning(error_message)
            return render_template("index.html", error=error_message, error_message=error_message)

        if file and allowed_file(file.filename):
            try:
                # Secure the filename and construct a safe path
                filename = secure_filename(file.filename)
                file_path = os.path.join(app.config['UPLOAD_FOLDER'], filename)
                file.save(file_path)  # Save the uploaded file

                transcript = transcribe_audio(file_path)  # Transcribe audio
                if not transcript:
                    error_message = "Audio transcription failed. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error=error_message, error_message=error_message)

                prompt_summary = create_image_prompt(transcript)  # Generate prompt
                if not prompt_summary:
                    error_message = "Failed to generate image prompt. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error=error_message, error_message=error_message)

                image_url = generate_dalle_image(prompt_summary)  # Generate image
                if not image_url:
                    error_message = "Failed to generate image. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error=error_message, error_message=error_message)

                # Optionally, delete the uploaded file after processing
                os.remove(file_path)
                logger.info(f"Successfully processed audio file and generated image.")
                return render_template("index.html", transcript=transcript, image_url=image_url, prompt_summary=prompt_summary)

            except Exception as e:
                error_message = f"An error occurred: {e}"
                logger.error(error_message)
                return render_template("index.html", error=error_message, error_message=error_message)
        else:
            error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
            logger.warning(error_message)
            return render_template("index.html", error=error_message, error_message=error_message)

    return render_template("index.html", transcript=transcript, image_url=image_url, prompt_summary=prompt_summary,
                           error=error_message)



@app.errorhandler(500)
def internal_server_error(e):
    """Handles internal server errors."""
    logger.error(f"Internal Server Error: {e}")
    return render_template("error.html", error="Internal Server Error"), 500


if __name__ == "__main__":
    app.run(debug=True)

Code Breakdown:

  • Import Statements: Imports necessary Flask modules, OpenAI library, osdotenvloggingOptional and Dict for type hinting, and secure_filename
  • Environment Variables: Loads the OpenAI API key from the .env file.
  • Flask Application:
    • Creates a Flask application instance.
    • Configures an upload folder and maximum file size. The UPLOAD_FOLDER is set to 'uploads', and MAX_CONTENT_LENGTH is set to 25MB. The upload folder is created if it does not exist.
  • Logging Configuration: Configures logging.
  • allowed_file Function: Checks if the uploaded file has an allowed audio extension.
  • transcribe_audio Function:
    • Takes the audio file path as input.
    • Opens the audio file in binary mode ("rb").
    • Calls the OpenAI API's openai.Audio.transcriptions.create() method to transcribe the audio.
    • Extracts the transcribed text from the API response.
    • Logs the file path before transcription and the length of the transcribed text after successful transcription.
    • Includes error handling for OpenAI API errors and other exceptions. The audio file is closed after transcription.
  • generate_image_prompt Function:
    • Takes the transcribed text as input.
    • Uses the OpenAI Chat Completion API (openai.chat.completions.create()) with the gpt-4o model to generate a text prompt suitable for image generation.
    • The system message instructs the model to act as a creative assistant and provide a vivid and detailed description of a scene that could be used to generate an image with an AI image generation model.
    • Extracts the generated prompt from the API response.
    • Includes error handling.
  • generate_image Function:
    • Takes the image prompt as input.
    • Calls the OpenAI API's openai.Image.create() method to generate an image using DALL·E 3.
    • Extracts the image URL from the API response.
    • Includes error handling.
  • index Route:
    • Handles both GET and POST requests.
    • For GET requests, it renders the initial HTML page.
    • For POST requests (when the user uploads an audio file):
      • It validates the uploaded file:
        • Checks if the file part exists in the request.
        • Checks if a file was selected.
        • Checks if the file type is allowed using the allowed_file function.
      • It saves the uploaded file to a temporary location using a secure filename.
      • It calls the utility functions to:
        • Transcribe the audio using transcribe_audio().
        • Generate an image prompt from the transcription using create_image_prompt().
        • Generate an image from the prompt using generate_dalle_image().
      • It handles errors that may occur during any of these steps, logging the error and rendering the index.html template with an appropriate error message.
      • If all steps are successful, it renders the index.html template, passing the transcription text, image URL, and generated prompt to be displayed.
      • It deletes the uploaded file after processing
  • @app.errorhandler(500): Handles HTTP 500 errors (Internal Server Error) by logging the error and rendering a user-friendly error page.
  • if __name__ == "__main__":: Starts the Flask development server if the script is executed directly.

Step 5: Create the HTML Template (templates/index.html)

Create a folder named templates in the same directory as app.py. Inside the templates folder, create a file named index.html with the following HTML code:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Multimodal AI Assistant</title>
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
    <style>
        /* --- General Styles --- */
        body {
            font-family: 'Inter', sans-serif;
            padding: 40px;
            background-color: #f9fafb; /* Tailwind's gray-50 */
            display: flex;
            justify-content: center;
            align-items: center;
            min-height: 100vh;
            margin: 0;
            color: #374151; /* Tailwind's gray-700 */
        }
        .container {
            max-width: 800px; /* Increased max-width */
            width: 95%; /* Take up most of the viewport */
            background-color: #fff;
            padding: 2rem;
            border-radius: 0.75rem; /* Tailwind's rounded-lg */
            box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
            text-align: center;
        }
        h2 {
            font-size: 2.25rem; /* Tailwind's text-3xl */
            font-weight: 600;  /* Tailwind's font-semibold */
            margin-bottom: 1.5rem; /* Tailwind's mb-6 */
            color: #1e293b; /* Tailwind's gray-900 */
        }
        p{
            color: #6b7280; /* Tailwind's gray-500 */
            margin-bottom: 1rem;
        }

        /* --- Form Styles --- */
        form {
            margin-top: 1rem; /* Tailwind's mt-4 */
            margin-bottom: 1.5rem;
            display: flex;
            flex-direction: column;
            align-items: center; /* Center form elements */
            gap: 0.5rem; /* Tailwind's gap-2 */
        }
        label {
            font-size: 1rem; /* Tailwind's text-base */
            font-weight: 600;  /* Tailwind's font-semibold */
            color: #4b5563; /* Tailwind's gray-600 */
            margin-bottom: 0.25rem; /* Tailwind's mb-1 */
            display: block; /* Ensure label takes full width */
            text-align: left;
            width: 100%;
            max-width: 400px; /* Added max-width for label */
            margin-left: auto;
            margin-right: auto;
        }
        input[type="file"] {
            width: 100%;
            max-width: 400px; /* Added max-width for file input */
            padding: 0.75rem; /* Tailwind's p-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
            font-size: 1rem; /* Tailwind's text-base */
            margin-bottom: 0.25rem; /* Tailwind's mb-1 */
            margin-left: auto;
            margin-right: auto;
        }
        input[type="submit"] {
            padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            background-color: #4f46e5; /* Tailwind's bg-indigo-500 */
            color: #fff;
            font-size: 1rem; /* Tailwind's text-base */
            font-weight: 600; /* Tailwind's font-semibold */
            cursor: pointer;
            transition: background-color 0.3s ease; /* Smooth transition */
            border: none;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
            margin-top: 1rem;
        }
        input[type="submit"]:hover {
            background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
        }
        input[type="submit"]:focus {
            outline: none;
            box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
        }

        /* --- Result Styles --- */
        .result-container {
            margin-top: 2rem; /* Tailwind's mt-8 */
            padding: 1.5rem; /* Tailwind's p-6 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            background-color: #f8fafc; /* Tailwind's bg-gray-50 */
            border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
            box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
            text-align: left;
        }
        h3 {
            font-size: 1.5rem; /* Tailwind's text-2xl */
            font-weight: 600;  /* Tailwind's font-semibold */
            margin-bottom: 1rem; /* Tailwind's mb-4 */
            color: #1e293b; /* Tailwind's gray-900 */
        }
        textarea {
            width: 100%;
            padding: 0.75rem; /* Tailwind's p-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
            resize: none;
            font-size: 1rem; /* Tailwind's text-base */
            line-height: 1.5rem; /* Tailwind's leading-relaxed */
            margin-top: 0.5rem; /* Tailwind's mt-2 */
            margin-bottom: 0;
            box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
            min-height: 100px;
        }
        textarea:focus {
            outline: none;
            border-color: #3b82f6; /* Tailwind's border-blue-500 */
            box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
        }
        img {
            max-width: 100%;
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            margin-top: 1.5rem; /* Tailwind's mt-6 */
            box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
        }

        /* --- Error Styles --- */
        .error-message {
            color: #dc2626; /* Tailwind's text-red-600 */
            margin-top: 1rem; /* Tailwind's mt-4 */
            padding: 0.75rem;
            background-color: #fee2e2; /* Tailwind's bg-red-100 */
            border-radius: 0.375rem; /* Tailwind's rounded-md */
            border: 1px solid #fecaca; /* Tailwind's border-red-300 */
            text-align: center;
        }
    </style>
</head>
<body>
    <div class="container">
        <h2>🎤🧠🎨 Multimodal Assistant</h2>
        <p> Upload an audio file to transcribe and generate a corresponding image. </p>
        <form method="POST" enctype="multipart/form-data">
            <label for="audio_file">Upload your voice note:</label><br>
            <input type="file" name="audio_file" accept="audio/*" required><br><br>
            <input type="submit" value="Generate Visual Response">
        </form>

        {% if transcript %}
            <div class="result-container">
                <h3>📝 Transcript:</h3>
                <textarea readonly>{{ transcript }}</textarea>
            </div>
        {% endif %}

        {% if prompt_summary %}
            <div class="result-container">
                <h3>🎯 Scene Prompt:</h3>
                <p>{{ prompt_summary }}</p>
            </div>
        {% endif %}

        {% if image_url %}
            <div class="result-container">
                <h3>🖼️ Generated Image:</h3>
                <img src="{{ image_url }}" alt="Generated image">
            </div>
        {% endif %}
        {% if error %}
            <div class="error-message">{{ error }}</div>
        {% endif %}
    </div>
</body>
</html>

Key elements in the HTML template:

  • HTML Structure:
    • The <head> section defines the title, links the Inter font stylesheet, sets the viewport for responsiveness, and contains the page's CSS in an inline <style> block.
    • The <body> contains the visible content, including a form for uploading audio and sections to display the transcription and generated image.
  • CSS Styling:
    • Modern, responsive design.
    • Styled form and input elements.
    • Clear presentation of results (transcription and image).
    • User-friendly error message display.
  • Form:
    • <form> with enctype="multipart/form-data" is used to handle file uploads.
    • <label> and <input type="file"> allow the user to select an audio file. The accept="audio/*" attribute hints to the browser's file picker that audio files are expected; the server-side allowed_file check in app.py still enforces the permitted extensions.
    • <input type="submit"> button allows the user to submit the form.
  • Transcription and Image Display:
    • The template uses Jinja2 templating to conditionally display the transcription text and the generated image if they are available. The transcription is displayed in a textarea, and the image is displayed using an <img> tag.
  • Error Handling:
    • <div class="error-message"> is used to display any error messages to the user.

In this section, you've gained valuable insights into advanced AI integration techniques. Let's break down what you've learned:

  • Organize multimodal logic into reusable modules
    • Create clean, maintainable code structures
    • Develop modular components that can be easily updated and reused
    • Implement proper error handling and logging
  • Chain audio ➝ text ➝ prompt ➝ image cleanly
    • Process audio inputs using Whisper for accurate transcription
    • Transform transcribed text into meaningful prompts with GPT
    • Generate relevant images using DALL·E based on processed text
  • Build a professional Flask app that uses all three major OpenAI models in one flow
    • Set up proper routing and request handling
    • Manage API interactions efficiently
    • Create an intuitive user interface for seamless interaction

You now understand the power of chaining models to create sophisticated AI experiences. This knowledge opens up countless possibilities for innovation. Whether you're building an AI journaling tool that converts voice notes into illustrated entries, a voice-controlled design app that transforms spoken descriptions into visual art, or a multimodal content assistant that helps create rich media content, this foundational workflow can take you far. The skills you've learned here form the basis for creating complex, user-friendly AI applications that combine multiple modalities effectively.
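
The same chain is just as useful outside of Flask, for example in a batch job or a command-line tool. The sketch below shows one way to compose the three utility functions from this section into a single reusable call; the process_voice_note helper and the example file path are ours rather than part of the app above, and it assumes the utils package from Step 3 is importable from your working directory and that your OpenAI API key is available via the .env file.

import os
from typing import Optional

import openai
from dotenv import load_dotenv

from utils.transcribe import transcribe_audio
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image

# Mirror the app's configuration: load the API key from .env
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")


def process_voice_note(audio_path: str) -> Optional[dict]:
    """Run the audio -> transcript -> prompt -> image chain and return all intermediate results."""
    transcript = transcribe_audio(audio_path)
    if transcript is None:
        return None  # the utility already logged the failure

    prompt = create_image_prompt(transcript)
    if prompt is None:
        return None

    image_url = generate_dalle_image(prompt)
    if image_url is None:
        return None

    return {"transcript": transcript, "prompt": prompt, "image_url": image_url}


# Hypothetical usage:
# result = process_voice_note("uploads/voice-note.m4a")
# if result:
#     print(result["image_url"])

Because each utility returns None on failure, the wrapper can short-circuit at any stage while still surfacing every intermediate artifact on success, which is exactly the behavior the Flask route relies on.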


6.1 Combining GPT + DALL·E + Whisper

In previous chapters, you've explored the individual capabilities of three powerful AI models: GPT for natural language processing and text generation, DALL·E for creating detailed images from text descriptions, and Whisper for accurate speech-to-text conversion. You've learned to build standalone applications like chatbots that engage in natural conversations, visual generators that bring ideas to life, and transcription tools that convert spoken words to text. You've even dabbled in simple multimodal applications that use two of these technologies together.

Now, we're taking a significant leap forward: you'll learn to create cohesive AI suites — sophisticated systems that seamlessly integrate speech recognition, text processing, and image generation into unified, powerful applications. These suites can handle complex workflows, such as converting a spoken description into a written narrative and then transforming that narrative into a visual representation, all in one smooth process.

Consider this chapter your advancement from being an API-proficient developer to becoming an architect of modular, orchestrated AI workflows. This is the same technology that powers many of today's leading software solutions. Major companies implement these integrated AI systems in their productivity tools (like advanced document processors), customer experience platforms (such as intelligent support systems), and creative applications (including AI-powered design tools) that serve massive user bases ranging from thousands to millions.

Throughout this chapter, you'll master several crucial skills:

  • Chain models with dynamic logic - Learn to create intelligent decision trees that determine how different AI models should interact and when to invoke specific capabilities
  • Handle input across multiple formats - Develop robust systems that can process and validate various types of input, from audio files to text prompts to image data
  • Return meaningful output across modalities - Create sophisticated response handling that can deliver results in multiple formats while maintaining context and coherence
  • Build real-time or near real-time pipelines - Optimize your applications for performance, ensuring quick response times even when multiple AI models are working together

And we begin with a foundational section:

Building sophisticated AI applications requires the careful integration of multiple specialized models to create a seamless, intelligent experience. Each AI model serves as a master of its domain: GPT excels in understanding and generating human-like text, DALL·E specializes in creating stunning visual artwork from textual descriptions, and Whisper demonstrates remarkable accuracy in converting spoken words to text. However, the true innovation emerges when these individual powerhouses are orchestrated to work together in perfect harmony.

This sophisticated integration enables the creation of applications that mirror human cognitive processes by handling multiple types of information simultaneously. Consider the natural flow of human communication: we speak, understand context, and visualize concepts seamlessly. Now imagine an AI system that matches this natural process: you describe a scene verbally, the system processes your speech into text, understands the context and details of your description, and then transforms that understanding into a visual representation - all flowing smoothly from one step to the next, just as your brain would process the same information.

In this section, we'll explore the intricacies of building such a cross-model pipeline. You'll master essential concepts like efficient data transformation between models (ensuring that the output from one model is optimally formatted for the next), sophisticated asynchronous process management (allowing multiple models to work simultaneously when possible), and the implementation of clean, maintainable code architecture. We'll dive deep into handling edge cases, managing model-specific quirks, and ensuring smooth data flow throughout the entire pipeline. Through a comprehensive practical example, you'll gain hands-on experience with these concepts, preparing you to architect and deploy your own sophisticated multi-model AI applications that can scale efficiently and maintain high performance under real-world conditions.

In this section, you'll create a Flask-based web application that integrates multiple AI models to process audio input and generate a corresponding image.  Specifically, the application will:

  • Accept an audio file uploaded by the user.
  • Transcribe the audio content into text using OpenAI's Whisper API.
  • Analyze the transcribed text using GPT-4o to extract a descriptive scene representation.
  • Generate an image based on the scene description using OpenAI's DALL·E 3 API.
  • Display both the text transcription and the generated image on a single webpage.

This project demonstrates a basic multimodal AI pipeline, combining speech-to-text and text-to-image generation.  It establishes a foundation for building more sophisticated applications.

6.1.1 Step-by-Step Implementation

Step 1: Set Up Project Structure

Download the sample file: https://files.cuantum.tech/audio/gpt-dalle-whisper-sample.mp3

Organize your project files as follows:

/multimodal_app

├── app.py
├── .env
└── templates/
    └── index.html
└── utils/
    ├── transcribe.py
    ├── generate_prompt.py
    ├── generate_image.py
    └── audio_analysis.py  # New module for audio analysis
  • /multimodal_app: The root directory for your project.
  • app.py: The main Flask application file.
  • .env: A file to store your OpenAI API key.
  • templates/: A directory for HTML templates.
  • templates/index.html: The HTML template for the user interface.
  • utils/: A directory for Python modules containing reusable functions.
    • transcribe.py: Contains the function to transcribe audio using Whisper.
    • generate_prompt.py: Contains the function to generate an image prompt using GPT-4o.
    • generate_image.py: Contains the function to generate an image with DALL·E 3.
    • audio_analysis.py: New module to analyze audio.

Step 2: Install Required Packages

Install the necessary Python libraries:

pip install flask openai python-dotenv

Step 3: Create Utility Modules

Create the following Python files in the utils/ directory:

utils/transcribe.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def transcribe_audio(file_path: str) -> Optional[str]:
    """
    Transcribes an audio file using OpenAI's Whisper API.

    Args:
        file_path (str): The path to the audio file.

    Returns:
        Optional[str]: The transcribed text, or None on error.
    """
    try:
        logger.info(f"Transcribing audio: {file_path}")
        audio_file = open(file_path, "rb")
        response = openai.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
        transcript = response.text
        audio_file.close()
        return transcript
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error during transcription: {e}")
        return None
  • This module defines the transcribe_audio function, which takes the path to an audio file as input and uses OpenAI's Whisper API to generate a text transcription.
  • The function opens the audio file in binary read mode ("rb").
  • It calls openai.audio.transcriptions.create() to perform the transcription, specifying the "whisper-1" model.
  • It extracts the transcribed text from the API response.
  • It includes error handling using a try...except block to catch potential openai.error.OpenAIError exceptions (specific to OpenAI) and general Exception for other errors. If an error occurs, it logs the error and returns None.
  • It logs the file path before transcription and the length of the transcribed text after successful transcription.
  • The audio file is closed after transcription.

utils/generate_prompt.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def create_image_prompt(transcription: str) -> Optional[str]:
    """
    Generates a detailed image prompt from a text transcription using OpenAI's Chat Completion API.

    Args:
        transcription (str): The text transcription of the audio.

    Returns:
        Optional[str]: A detailed text prompt suitable for image generation, or None on error.
    """
    try:
        logger.info("Generating image prompt from transcription")
        response = openai.chat.completions.create(
            model="gpt-4o",  #  Use a powerful chat model
            messages=[
                {
                    "role": "system",
                    "content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content.  Do not include any phrases like 'based on the audio' or 'from the user audio'.  Incorporate scene lighting, time of day, weather, and camera angle into the description.",
                },
                {"role": "user", "content": transcription},
            ],
        )
        prompt = response.choices[0].message.content
        logger.info(f"Generated prompt: {prompt}")
        return prompt
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image prompt: {e}")
        return None
  • This module defines the create_image_prompt function, which takes the transcribed text as input and uses OpenAI's Chat Completion API to generate a detailed text prompt for image generation.
  • The system message instructs the model to act as a creative assistant and to generate a vivid scene description. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle.
  • The user message provides the transcribed text as the content for the model to work with.
  • The function extracts the generated prompt from the API response.
  • It includes error handling.

utils/generate_image.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def generate_dalle_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024", response_format: str = "url") -> Optional[str]:
    """
    Generates an image using OpenAI's DALL·E API.

    Args:
        prompt (str): The text prompt to generate the image from.
        model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
        size (str, optional): The size of the generated image. Defaults to "1024x1024".
        response_format (str, optional): The format of the response. Defaults to "url".

    Returns:
        Optional[str]: The URL of the generated image, or None on error.
    """
    try:
        logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}")
        response = openai.images.generate(
            prompt=prompt,
            model=model,
            size=size,
            response_format=response_format,
        )
        image_url = response.data[0].url
        logger.info(f"Image URL: {image_url}")
        return image_url
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image: {e}")
        return None
  • This module defines the generate_dalle_image function, which takes a text prompt as input and uses OpenAI's DALL·E API to generate an image.
  • It calls the openai.images.generate() method to generate the image.
  • It extracts the URL of the generated image from the API response.
  • It includes error handling.

Step 4: Create the Main App (app.py)

Create a Python file named app.py in the root directory of your project and add the following code:

from flask import Flask, request, render_template, jsonify, make_response, redirect, url_for
import os
from dotenv import load_dotenv
import logging
from typing import Optional
from werkzeug.utils import secure_filename
from werkzeug.datastructures import FileStorage

# Import the utility functions from the utils directory
from utils.transcribe import transcribe_audio
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads'  # Store uploaded files
app.config['MAX_CONTENT_LENGTH'] = 25 * 1024 * 1024  # 25MB max file size - increased for larger audio files
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)  # Create the upload folder if it doesn't exist

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'}  # Allowed audio file extensions


def allowed_file(filename: str) -> bool:
    """
    Checks if the uploaded file has an allowed extension.

    Args:
        filename (str): The name of the file.

    Returns:
        bool: True if the file has an allowed extension, False otherwise.
    """
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


@app.route("/", methods=["GET", "POST"])
def index():
    """
    Handles the main route for the web application.
    Processes audio uploads, transcribes them, generates image prompts, and displays images.
    """
    transcript = None
    image_url = None
    prompt_summary = None
    error_message = None

    if request.method == "POST":
        if 'audio_file' not in request.files:
            error_message = "No file part"
            logger.warning(error_message)
            return render_template("index.html", error=error_message, error_message=error_message)

        file: FileStorage = request.files['audio_file']  # Use type hinting
        if file.filename == '':
            error_message = "No file selected"
            logger.warning(error_message)
            return render_template("index.html", error=error_message, error_message=error_message)

        if file and allowed_file(file.filename):
            try:
                # Secure the filename and construct a safe path
                filename = secure_filename(file.filename)
                file_path = os.path.join(app.config['UPLOAD_FOLDER'], filename)
                file.save(file_path)  # Save the uploaded file

                transcript = transcribe_audio(file_path)  # Transcribe audio
                if not transcript:
                    error_message = "Audio transcription failed. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error=error_message, error_message=error_message)

                prompt_summary = create_image_prompt(transcript)  # Generate prompt
                if not prompt_summary:
                    error_message = "Failed to generate image prompt. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error=error_message, error_message=error_message)

                image_url = generate_dalle_image(prompt_summary)  # Generate image
                if not image_url:
                    error_message = "Failed to generate image. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error=error_message, error_message=error_message)

                # Optionally, delete the uploaded file after processing
                os.remove(file_path)
                logger.info(f"Successfully processed audio file and generated image.")
                return render_template("index.html", transcript=transcript, image_url=image_url, prompt_summary=prompt_summary)

            except Exception as e:
                error_message = f"An error occurred: {e}"
                logger.error(error_message)
                return render_template("index.html", error=error_message, error_message=error_message)
        else:
            error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
            logger.warning(error_message)
            return render_template("index.html", error=error_message, error_message=error_message)

    return render_template("index.html", transcript=transcript, image_url=image_url, prompt_summary=prompt_summary,
                           error=error_message)



@app.errorhandler(500)
def internal_server_error(e):
    """Handles internal server errors."""
    logger.error(f"Internal Server Error: {e}")
    return render_template("error.html", error="Internal Server Error"), 500


if __name__ == "__main__":
    app.run(debug=True)

Code Breakdown:

  • Import Statements: Imports necessary Flask modules, OpenAI library, osdotenvloggingOptional and Dict for type hinting, and secure_filename
  • Environment Variables: Loads the OpenAI API key from the .env file.
  • Flask Application:
    • Creates a Flask application instance.
    • Configures an upload folder and maximum file size. The UPLOAD_FOLDER is set to 'uploads', and MAX_CONTENT_LENGTH is set to 25MB. The upload folder is created if it does not exist.
  • Logging Configuration: Configures logging.
  • allowed_file Function: Checks if the uploaded file has an allowed audio extension.
  • transcribe_audio Function:
    • Takes the audio file path as input.
    • Opens the audio file in binary mode ("rb").
    • Calls the OpenAI API's openai.Audio.transcriptions.create() method to transcribe the audio.
    • Extracts the transcribed text from the API response.
    • Logs the file path before transcription and the length of the transcribed text after successful transcription.
    • Includes error handling for OpenAI API errors and other exceptions. The audio file is closed after transcription.
  • generate_image_prompt Function:
    • Takes the transcribed text as input.
    • Uses the OpenAI Chat Completion API (openai.chat.completions.create()) with the gpt-4o model to generate a text prompt suitable for image generation.
    • The system message instructs the model to act as a creative assistant and provide a vivid and detailed description of a scene that could be used to generate an image with an AI image generation model.
    • Extracts the generated prompt from the API response.
    • Includes error handling.
  • generate_image Function:
    • Takes the image prompt as input.
    • Calls the OpenAI API's openai.Image.create() method to generate an image using DALL·E 3.
    • Extracts the image URL from the API response.
    • Includes error handling.
  • index Route:
    • Handles both GET and POST requests.
    • For GET requests, it renders the initial HTML page.
    • For POST requests (when the user uploads an audio file):
      • It validates the uploaded file:
        • Checks if the file part exists in the request.
        • Checks if a file was selected.
        • Checks if the file type is allowed using the allowed_file function.
      • It saves the uploaded file to a temporary location using a secure filename.
      • It calls the utility functions to:
        • Transcribe the audio using transcribe_audio().
        • Generate an image prompt from the transcription using create_image_prompt().
        • Generate an image from the prompt using generate_dalle_image().
      • It handles errors that may occur during any of these steps, logging the error and rendering the index.html template with an appropriate error message.
      • If all steps are successful, it renders the index.html template, passing the transcription text, image URL, and generated prompt to be displayed.
      • It deletes the uploaded file after processing
  • @app.errorhandler(500): Handles HTTP 500 errors (Internal Server Error) by logging the error and rendering a user-friendly error page.
  • if __name__ == "__main__":: Starts the Flask development server if the script is executed directly.

Step 5: Create the HTML Template (templates/index.html)

Create a folder named templates in the same directory as app.py. Inside the templates folder, create a file named index.html with the following HTML code:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Multimodal AI Assistant</title>
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
    <style>
        /* --- General Styles --- */
        body {
            font-family: 'Inter', sans-serif;
            padding: 40px;
            background-color: #f9fafb; /* Tailwind's gray-50 */
            display: flex;
            justify-content: center;
            align-items: center;
            min-height: 100vh;
            margin: 0;
            color: #374151; /* Tailwind's gray-700 */
        }
        .container {
            max-width: 800px; /* Increased max-width */
            width: 95%; /* Take up most of the viewport */
            background-color: #fff;
            padding: 2rem;
            border-radius: 0.75rem; /* Tailwind's rounded-lg */
            box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
            text-align: center;
        }
        h2 {
            font-size: 2.25rem; /* Tailwind's text-3xl */
            font-weight: 600;  /* Tailwind's font-semibold */
            margin-bottom: 1.5rem; /* Tailwind's mb-6 */
            color: #1e293b; /* Tailwind's gray-900 */
        }
        p{
            color: #6b7280; /* Tailwind's gray-500 */
            margin-bottom: 1rem;
        }

        /* --- Form Styles --- */
        form {
            margin-top: 1rem; /* Tailwind's mt-4 */
            margin-bottom: 1.5rem;
            display: flex;
            flex-direction: column;
            align-items: center; /* Center form elements */
            gap: 0.5rem; /* Tailwind's gap-2 */
        }
        label {
            font-size: 1rem; /* Tailwind's text-base */
            font-weight: 600;  /* Tailwind's font-semibold */
            color: #4b5563; /* Tailwind's gray-600 */
            margin-bottom: 0.25rem; /* Tailwind's mb-1 */
            display: block; /* Ensure label takes full width */
            text-align: left;
            width: 100%;
            max-width: 400px; /* Added max-width for label */
            margin-left: auto;
            margin-right: auto;
        }
        input[type="file"] {
            width: 100%;
            max-width: 400px; /* Added max-width for file input */
            padding: 0.75rem; /* Tailwind's p-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
            font-size: 1rem; /* Tailwind's text-base */
            margin-bottom: 0.25rem; /* Tailwind's mb-1 */
            margin-left: auto;
            margin-right: auto;
        }
        input[type="submit"] {
            padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            background-color: #4f46e5; /* Tailwind's bg-indigo-600 */
            color: #fff;
            font-size: 1rem; /* Tailwind's text-base */
            font-weight: 600; /* Tailwind's font-semibold */
            cursor: pointer;
            transition: background-color 0.3s ease; /* Smooth transition */
            border: none;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
            margin-top: 1rem;
        }
        input[type="submit"]:hover {
            background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
        }
        input[type="submit"]:focus {
            outline: none;
            box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
        }

        /* --- Result Styles --- */
        .result-container {
            margin-top: 2rem; /* Tailwind's mt-8 */
            padding: 1.5rem; /* Tailwind's p-6 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            background-color: #f8fafc; /* Tailwind's bg-gray-50 */
            border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
            box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
            text-align: left;
        }
        h3 {
            font-size: 1.5rem; /* Tailwind's text-2xl */
            font-weight: 600;  /* Tailwind's font-semibold */
            margin-bottom: 1rem; /* Tailwind's mb-4 */
            color: #1e293b; /* Tailwind's gray-900 */
        }
        textarea {
            width: 100%;
            padding: 0.75rem; /* Tailwind's p-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
            resize: none;
            font-size: 1rem; /* Tailwind's text-base */
            line-height: 1.5rem; /* Tailwind's leading-relaxed */
            margin-top: 0.5rem; /* Tailwind's mt-2 */
            margin-bottom: 0;
            box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
            min-height: 100px;
        }
        textarea:focus {
            outline: none;
            border-color: #3b82f6; /* Tailwind's border-blue-500 */
            box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
        }
        img {
            max-width: 100%;
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            margin-top: 1.5rem; /* Tailwind's mt-6 */
            box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
        }

        /* --- Error Styles --- */
        .error-message {
            color: #dc2626; /* Tailwind's text-red-600 */
            margin-top: 1rem; /* Tailwind's mt-4 */
            padding: 0.75rem;
            background-color: #fee2e2; /* Tailwind's bg-red-100 */
            border-radius: 0.375rem; /* Tailwind's rounded-md */
            border: 1px solid #fecaca; /* Tailwind's border-red-300 */
            text-align: center;
        }
    </style>
</head>
<body>
    <div class="container">
        <h2>🎤🧠🎨 Multimodal Assistant</h2>
        <p> Upload an audio file to transcribe and generate a corresponding image. </p>
        <form method="POST" enctype="multipart/form-data">
            <label for="audio_file">Upload your voice note:</label><br>
            <input type="file" name="audio_file" accept="audio/*" required><br><br>
            <input type="submit" value="Generate Visual Response">
        </form>

        {% if transcript %}
            <div class="result-container">
                <h3>📝 Transcript:</h3>
                <textarea readonly>{{ transcript }}</textarea>
            </div>
        {% endif %}

        {% if prompt_summary %}
            <div class="result-container">
                <h3>🎯 Scene Prompt:</h3>
                <p>{{ prompt_summary }}</p>
            </div>
        {% endif %}

        {% if image_url %}
            <div class="result-container">
                <h3>🖼️ Generated Image:</h3>
                <img src="{{ image_url }}" alt="Generated image">
            </div>
        {% endif %}
        {% if error %}
            <div class="error-message">{{ error }}</div>
        {% endif %}
    </div>
</body>
</html>

Key elements in the HTML template:

  • HTML Structure:
    • The <head> section defines the title, sets the viewport for responsiveness, links the Inter font stylesheet from Google Fonts, and embeds the page's CSS in a <style> block.
    • The <body> contains the visible content, including a form for uploading audio and sections to display the transcription and generated image.
  • CSS Styling:
    • Modern, responsive design.
    • Styled form and input elements.
    • Clear presentation of results (transcription and image).
    • User-friendly error message display.
  • Form:
    • <form> with enctype="multipart/form-data" is used to handle file uploads.
    • <label> and <input type="file"> allow the user to select an audio file. The accept="audio/*" attribute tells the browser's file picker to show only audio files; because it is only a hint, the server still validates the upload with allowed_file().
    • <input type="submit"> button allows the user to submit the form.
  • Transcription and Image Display:
    • The template uses Jinja2 templating to conditionally display the transcription text and the generated image if they are available. The transcription is displayed in a textarea, and the image is displayed using an <img> tag.
  • Error Handling:
    • <div class="error-message"> is used to display any error messages to the user.

In this section, you've gained valuable insights into advanced AI integration techniques. Let's break down what you've learned:

  • Organize multimodal logic into reusable modules
    • Create clean, maintainable code structures
    • Develop modular components that can be easily updated and reused
    • Implement proper error handling and logging
  • Chain audio ➝ text ➝ prompt ➝ image cleanly
    • Process audio inputs using Whisper for accurate transcription
    • Transform transcribed text into meaningful prompts with GPT
    • Generate relevant images using DALL·E based on processed text
  • Build a professional Flask app that uses all three major OpenAI models in one flow
    • Set up proper routing and request handling
    • Manage API interactions efficiently
    • Create an intuitive user interface for seamless interaction

You now understand the power of chaining models to create sophisticated AI experiences. This knowledge opens up countless possibilities for innovation. Whether you're building an AI journaling tool that converts voice notes into illustrated entries, a voice-controlled design app that transforms spoken descriptions into visual art, or a multimodal content assistant that helps create rich media content, this foundational workflow can take you far. The skills you've learned here form the basis for creating complex, user-friendly AI applications that combine multiple modalities effectively.