OpenAI API Bible Volume 2

Chapter 6: Cross-Model AI Suites

6.2 Building a Creator Dashboard

This is where all the capabilities you've developed so far come together to create a powerful, unified system. By integrating multiple AI technologies, we can create applications that are greater than the sum of their parts. Let's explore these core capabilities in detail:

  • Transcription turns spoken words into written text: using advanced speech recognition models like Whisper, we can accurately convert audio recordings into text, preserving the speaker's intent and context. This forms the foundation for further processing.
  • Content generation creates new, contextually relevant material: large language models can analyze the transcribed text and generate new content that maintains consistency with the original message while adding valuable insights or expanding on key points.
  • Prompt engineering crafts precise instructions for AI models: through careful prompt construction, we can guide AI models to produce more accurate and relevant outputs. This involves understanding both the technical capabilities of the models and the nuanced ways to communicate with them.
  • Image creation transforms text descriptions into visual art: models like DALL·E can interpret textual descriptions and create corresponding images, adding a visual dimension to our applications and making abstract concepts more tangible.

These components don't just exist side by side - they form an interconnected pipeline where each step enhances the next. The output from transcription feeds into content generation, which informs prompt engineering, ultimately leading to image creation. This seamless integration creates a fluid workflow where users can start with a simple voice recording and end with a rich multimedia output, all within a single, cohesive system. By eliminating the need to switch between different tools or interfaces, users can focus on their creative process rather than technical implementation details.
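
In code terms, the pipeline is just a chain of four function calls. The sketch below is illustrative only; the helper names (transcribe_audio, summarize_transcript, create_image_prompt, generate_dalle_image) are the utilities you will build in Step 2 of this section, and the file name is a placeholder:

# Illustrative pipeline sketch - each helper is implemented later in this section
transcript = transcribe_audio("voice_note.mp3")      # speech -> text
summary = summarize_transcript(transcript)           # text -> concise summary
prompt = create_image_prompt(transcript)             # text -> DALL·E-ready prompt
image_url = generate_dalle_image(prompt)             # prompt -> generated image URL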

6.2.1 What You'll Build

In this section, you'll design and implement a Creator Dashboard - a sophisticated web interface that transforms how creators work with AI. This comprehensive platform serves as a central hub for content creation, combining multiple AI technologies into one seamless experience. Let's explore the key features that make this dashboard powerful:

  • Upload a voice recording: creators can easily upload audio files in various formats, making it simple to start their creative process with spoken ideas or narration.
  • Transcribe the voice recording into text using AI: using advanced AI speech recognition technology, the system accurately converts spoken words into written text, maintaining the nuances and context of the original recording.
  • Turn that transcription into an editable prompt: the system intelligently processes the transcribed text to create structured, AI-ready prompts that can be customized to achieve the desired creative output.
  • Generate images using DALL·E based on the prompt: leveraging DALL·E's powerful image generation capabilities, the system creates visual representations that match the specified prompts, bringing ideas to life through AI-generated artwork.
  • Summarize the transcript: the dashboard employs AI to distill long transcriptions into concise, meaningful summaries, helping creators quickly grasp the core concepts and themes.
  • Display all the results for review and further use in content production: all generated content, from transcripts to images, is presented in an organized, easy-to-review format, allowing creators to efficiently manage and utilize their assets.

To build this robust system, you'll implement a modern tech stack using Flask for the backend operations and a clean, responsive combination of HTML and CSS for the frontend interface. This architecture ensures both modularity and maintainability, making it easy to update and scale the dashboard as needed.

6.2.2 Step-by-Step Implementation

Step 1: Project Setup

Download the audio sample: https://files.cuantum.tech/audio/dashboard-project.mp3

Create a new directory for your project and navigate into it:

mkdir creator_dashboard
cd creator_dashboard

It's recommended to set up a virtual environment:

python -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\Scripts\activate  # On Windows

Install the required Python packages:

pip install flask openai python-dotenv

Organize your project files as follows:

/creator_dashboard
├── app.py
├── .env
├── templates/
│   └── dashboard.html
└── utils/
    ├── __init__.py
    ├── transcribe.py
    ├── summarize.py
    ├── generate_prompt.py
    └── generate_image.py
  • app.py: The main Flask application file.
  • .env: A file to store your OpenAI API key.
  • templates/: A directory for HTML templates.
  • templates/dashboard.html: The HTML template for the user interface.
  • utils/: A directory for Python modules containing reusable functions.
    • __init__.py: Makes the utils directory a Python package.
    • transcribe.py: Contains the function to transcribe audio using Whisper.
    • summarize.py: Contains the function to summarize the transcription using a Large Language Model.
    • generate_prompt.py: Contains the function to generate an image prompt from the transcription using a Large Language Model.
    • generate_image.py: Contains the function to generate an image with DALL·E 3.
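
The .env file itself only needs a single line: the OPENAI_API_KEY variable that app.py reads through python-dotenv (the value below is a placeholder, not a real key):

OPENAI_API_KEY=sk-your-api-key-here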

Step 2: Create the Utility Modules

Create the following Python files in the utils/ directory:

utils/transcribe.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def transcribe_audio(file_path: str) -> Optional[str]:
    """
    Transcribes an audio file using OpenAI's Whisper API.

    Args:
        file_path (str): The path to the audio file.

    Returns:
        Optional[str]: The transcribed text, or None on error.
    """
    try:
        logger.info(f"Transcribing audio: {file_path}")
        with open(file_path, "rb") as audio_file:
            response = openai.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
            )
        transcript = response.text
        logger.info(f"Transcription length: {len(transcript)} characters")
        return transcript
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error during transcription: {e}")
        return None
  • This module defines the transcribe_audio function, which takes the path to an audio file as input and uses OpenAI's Whisper API to generate a text transcription.
  • The function opens the audio file in binary read mode ("rb") inside a with block, so the file is closed automatically even if an error occurs.
  • It calls openai.audio.transcriptions.create() to perform the transcription, specifying the "whisper-1" model.
  • It extracts the transcribed text from the API response.
  • It includes error handling using a try...except block to catch openai.OpenAIError exceptions (specific to OpenAI) and a general Exception for other errors. If an error occurs, it logs the error and returns None.
  • It logs the file path before transcription and the length of the transcribed text after a successful transcription.
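
If you want to sanity-check this module on its own before wiring up Flask, a small throwaway script along these lines should work. It assumes OPENAI_API_KEY is exported in your shell (the utility modules do not load the .env file themselves) and that the sample audio from Step 1 sits in the project root:

# quick_test_transcribe.py - standalone check of utils/transcribe.py (illustrative)
import logging
from utils.transcribe import transcribe_audio

logging.basicConfig(level=logging.INFO)

text = transcribe_audio("dashboard-project.mp3")
print(text[:200] if text else "Transcription failed")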

utils/summarize.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def summarize_transcript(text: str) -> Optional[str]:
    """
    Summarizes a text transcript using OpenAI's Chat Completion API.

    Args:
        text (str): The text transcript to summarize.

    Returns:
        Optional[str]: The summarized text, or None on error.
    """
    try:
        logger.info("Summarizing transcript")
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You are a helpful assistant.  Provide a concise summary of the text, suitable for generating a visual representation."},
                {"role": "user", "content": text}
            ],
        )
        summary = response.choices[0].message.content
        logger.info(f"Summary: {summary}")
        return summary
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating summary: {e}")
        return None
  • This module defines the summarize_transcript function, which takes a text transcript as input and uses OpenAI's Chat Completion API to generate a concise summary.
  • The system message instructs the model to act as a helpful assistant and to provide a concise summary of the text, suitable for generating a visual representation.
  • The user message provides the transcript as the content for the model to summarize.
  • The function extracts the summary from the API response.
  • It includes error handling.

utils/generate_prompt.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def create_image_prompt(transcription: str) -> Optional[str]:
    """
    Generates a detailed image prompt from a text transcription using OpenAI's Chat Completion API.

    Args:
        transcription (str): The text transcription of the audio.

    Returns:
        Optional[str]: A detailed text prompt suitable for image generation, or None on error.
    """
    try:
        logger.info("Generating image prompt from transcription")
        response = openai.chat.completions.create(
            model="gpt-4o",  # Use a powerful chat model
            messages=[
                {
                    "role": "system",
                    "content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content.  Do not include any phrases like 'based on the audio' or 'from the user audio'.  Incorporate scene lighting, time of day, weather, and camera angle into the description.  Limit the description to 200 words.",
                },
                {"role": "user", "content": transcription},
            ],
        )
        prompt = response.choices[0].message.content
        prompt = prompt.strip()  # Remove leading/trailing spaces
        logger.info(f"Generated prompt: {prompt}")
        return prompt
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image prompt: {e}")
        return None
  • This module defines the create_image_prompt function, which takes the transcribed text as input and uses OpenAI's Chat Completion API to generate a detailed text prompt for image generation.
  • The system message instructs the model to act as a creative assistant and to generate a vivid scene description. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle. We also limit the description length to 200 words.
  • The user message provides the transcribed text as the content for the model to work with.
  • The function extracts the generated prompt from the API response.
  • It strips any leading/trailing spaces from the generated prompt.
  • It includes error handling.

utils/generate_image.py:

import openai
import logging
from typing import Optional, Dict

logger = logging.getLogger(__name__)


def generate_dalle_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024",
                       response_format: str = "url", quality: str = "standard") -> Optional[str]:
    """
    Generates an image using OpenAI's DALL·E API.

    Args:
        prompt (str): The text prompt to generate the image from.
        model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
        size (str, optional): The size of the generated image. Defaults to "1024x1024".
        response_format (str, optional): The format of the response. Defaults to "url".
        quality (str, optional): The quality of the image. Defaults to "standard".

    Returns:
        Optional[str]: The URL of the generated image, or None on error.
    """
    try:
        logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}, quality: {quality}")
        response = openai.images.generate(
            prompt=prompt,
            model=model,
            size=size,
            response_format=response_format,
            quality=quality
        )
        image_url = response.data[0].url
        logger.info(f"Image URL: {image_url}")
        return image_url
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image: {e}")
        return None
  • This module defines the generate_dalle_image function, which takes a text prompt as input and uses OpenAI's DALL·E API to generate an image.
  • It calls the openai.images.generate() method to generate the image.
  • It accepts optional model, size, response_format, and quality parameters, allowing the user to configure the image generation.
  • It extracts the URL of the generated image from the API response.
  • It includes error handling.
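
Before moving on to the Flask app, you can exercise the whole utility chain end to end with another throwaway script. Again, this is only a convenience sketch that assumes an exported OPENAI_API_KEY and the sample audio file from Step 1:

# pipeline_test.py - run the full utility chain outside Flask (illustrative)
from utils.transcribe import transcribe_audio
from utils.summarize import summarize_transcript
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image

transcript = transcribe_audio("dashboard-project.mp3")
if transcript:
    print("Summary:", summarize_transcript(transcript))
    prompt = create_image_prompt(transcript)
    print("Prompt:", prompt)
    print("Image URL:", generate_dalle_image(prompt))
else:
    print("Transcription failed")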

Step 3: Create the Main App (app.py)

Create a Python file named app.py in the root directory of your project and add the following code:

from flask import Flask, request, render_template, jsonify, make_response, redirect, url_for
import os
import openai
from dotenv import load_dotenv
import logging
from typing import Optional, Dict
from werkzeug.utils import secure_filename
from werkzeug.datastructures import FileStorage

# Import the utility functions from the utils directory
from utils.transcribe import transcribe_audio
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image
from utils.summarize import summarize_transcript

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads'  # Store uploaded files
app.config['MAX_CONTENT_LENGTH'] = 25 * 1024 * 1024  # 25MB max file size - increased for larger audio files
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)  # Create the upload folder if it doesn't exist

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'}  # Allowed audio file extensions


def allowed_file(filename: str) -> bool:
    """
    Checks if the uploaded file has an allowed extension.

    Args:
        filename (str): The name of the file.

    Returns:
        bool: True if the file has an allowed extension, False otherwise.
    """
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


@app.route("/", methods=["GET", "POST"])
def index():
    """
    Handles the main route for the web application.
    Processes audio uploads, transcribes them, generates image prompts, and displays images.
    """
    transcript = None
    image_url = None
    prompt_summary = None
    error_message = None
    summary = None # Initialize summary

    if request.method == "POST":
        if 'audio_file' not in request.files:
            error_message = "No file part"
            logger.warning(error_message)
            return render_template("index.html", error_message=error_message)

        file: FileStorage = request.files['audio_file']  # Use type hinting
        if file.filename == '':
            error_message = "No file selected"
            logger.warning(request)
            return render_template("index.html", error_message=error_message)

        if file and allowed_file(file.filename):
            try:
                # Secure the filename and construct a safe path
                filename = secure_filename(file.filename)
                file_path = os.path.join(app.config['UPLOAD_FOLDER'], filename)
                file.save(file_path)  # Save the uploaded file

                transcript = transcribe_audio(file_path)  # Transcribe audio
                if not transcript:
                    error_message = "Audio transcription failed. Please try again."
                    os.remove(file_path)
                    return render_template("dashboard.html", error=error_message)

                summary = summarize_transcript(transcript)  # Summarize the transcript
                if not summary:
                    error_message = "Audio summary failed. Please try again."
                    os.remove(file_path)
                    return render_template("dashboard.html", error=error_message)

                prompt_summary = create_image_prompt(transcript)  # Generate image prompt
                if not prompt_summary:
                    error_message = "Failed to generate image prompt. Please try again."
                    os.remove(file_path)
                    return render_template("dashboard.html", error=error_message)

                image_url = generate_dalle_image(prompt_summary, model=request.form.get('model', 'dall-e-3'),
                                                size=request.form.get('size', '1024x1024'),
                                                response_format=request.form.get('format', 'url'),
                                                quality=request.form.get('quality', 'standard'))  # Generate image
                if not image_url:
                    error_message = "Failed to generate image. Please try again."
                    os.remove(file_path)
                    return render_template("dashboard.html", error=error_message)

                # Optionally, delete the uploaded file after processing
                os.remove(file_path)
                logger.info("Successfully processed audio file and generated image.")
                return render_template("dashboard.html", transcript=transcript, image_url=image_url, prompt=prompt_summary, summary=summary)

            except Exception as e:
                error_message = f"An error occurred: {e}"
                logger.error(error_message)
                return render_template("dashboard.html", error=error_message)
        else:
            error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
            logger.warning(error_message)
            return render_template("dashboard.html", error=error_message)

    return render_template("dashboard.html", transcript=transcript, image_url=image_url, prompt=prompt_summary,
                           error=error_message, summary=summary)



@app.errorhandler(500)
def internal_server_error(e):
    """Handles internal server errors."""
    logger.error(f"Internal Server Error: {e}")
    return render_template("error.html", error="Internal Server Error"), 500


if __name__ == "__main__":
    app.run(debug=True)

Code Breakdown:

  • Import Statements: Imports the necessary Flask modules, the OpenAI library, os, dotenv, logging, Optional and Dict for type hinting, and secure_filename and FileStorage from Werkzeug.
  • Environment Variables: Loads the OpenAI API key from the .env file.
  • Flask Application:
    • Creates a Flask application instance.
    • Configures an upload folder and maximum file size. The UPLOAD_FOLDER is set to 'uploads', and MAX_CONTENT_LENGTH is set to 25MB. The upload folder is created if it does not exist.
  • Logging Configuration: Configures logging.
  • allowed_file Function: Checks if the uploaded file has an allowed audio extension.
  • transcribe_audio Function:
    • Takes the audio file path as input.
    • Opens the audio file in binary read mode ("rb").
    • Calls the OpenAI API's openai.audio.transcriptions.create() method to transcribe the audio.
    • Extracts the transcribed text from the API response.
    • Logs the file path before transcription and the length of the transcribed text after a successful transcription.
    • Includes error handling for OpenAI API errors and other exceptions. The audio file is closed automatically by the with block.
  • create_image_prompt Function:
    • Takes the transcribed text as input.
    • Uses the OpenAI Chat Completion API (openai.chat.completions.create()) with the gpt-4o model to generate a detailed text prompt suitable for image generation.
    • The system message instructs the model to act as a creative assistant and provide a vivid and detailed description of a scene that could be used to generate an image with an AI image generation model. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle. We also limit the description length to 200 words.
    • Extracts the generated prompt from the API response.
    • It strips any leading/trailing spaces from the generated prompt.
    • Includes error handling.
  • generate_dalle_image Function:
    • Takes the image prompt as input.
    • Calls the OpenAI API's openai.images.generate() method to generate an image using DALL·E 3.
    • Accepts optional model, size, response_format, and quality parameters, allowing the user to configure the image generation.
    • Extracts the URL of the generated image from the API response.
    • Includes error handling.
  • index Route:
    • Handles both GET and POST requests.
    • For GET requests, it renders the initial HTML page.
    • For POST requests (when the user uploads an audio file):
      • It validates the uploaded file:
        • Checks if the file part exists in the request.
        • Checks if a file was selected.
        • Checks if the file type is allowed using the allowed_file function.
      • It saves the uploaded file to a temporary location using a secure filename.
      • It calls the utility functions to:
        • Transcribe the audio using transcribe_audio().
        • Summarize the transcript using summarize_transcript().
        • Generate an image prompt from the transcription using create_image_prompt().
        • Generate an image from the prompt using generate_dalle_image().
      • It handles errors that may occur during any of these steps, logging the error and rendering the dashboard.html template with an appropriate error message. The temporary file is deleted before rendering the error page.
      • If all steps are successful, it renders the dashboard.html template, passing the transcription text, summary, generated prompt, and image URL to be displayed.
  • @app.errorhandler(500): Handles HTTP 500 errors (Internal Server Error) by logging the error and rendering a user-friendly error page.
  • if __name__ == "__main__":: Starts the Flask development server if the script is executed directly.
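
One small gap worth closing: the 500 handler above renders templates/error.html, which is not part of the Step 1 file layout. A minimal version along the following lines (purely a sketch) keeps that handler from failing:

<!-- templates/error.html - minimal error page (illustrative) -->
<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8"><title>Error</title></head>
<body>
    <h2>Something went wrong</h2>
    <p>{{ error }}</p>
    <a href="/">Back to the dashboard</a>
</body>
</html>

With that in place, you can start the development server with python app.py and open http://127.0.0.1:5000 (Flask's default address) once the HTML template from the next step exists.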

Step 4: Create the HTML Template (templates/dashboard.html)

Create a folder named templates in the same directory as app.py. Inside the templates folder, create a file named dashboard.html with the following HTML code:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Creator Dashboard</title>
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
    <style>
        /* --- General Styles --- */
        body {
            font-family: 'Inter', sans-serif;
            padding: 40px;
            background-color: #f9fafb; /* Tailwind's gray-50 */
            display: flex;
            justify-content: center;
            align-items: center;
            min-height: 100vh;
            margin: 0;
            color: #374151; /* Tailwind's gray-700 */
        }
        .container {
            max-width: 800px; /* Increased max-width */
            width: 95%; /* Take up most of the viewport */
            background-color: #fff;
            padding: 2rem;
            border-radius: 0.75rem; /* Tailwind's rounded-lg */
            box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
            text-align: center;
        }
        h2 {
            font-size: 2.25rem; /* Tailwind's text-3xl */
            font-weight: 600;  /* Tailwind's font-semibold */
            margin-bottom: 1.5rem; /* Tailwind's mb-6 */
            color: #1e293b; /* Tailwind's gray-900 */
        }
        p{
            color: #6b7280; /* Tailwind's gray-500 */
            margin-bottom: 1rem;
        }

        /* --- Form Styles --- */
        form {
            margin-top: 1rem; /* Tailwind's mt-4 */
            margin-bottom: 1.5rem;
            display: flex;
            flex-direction: column;
            align-items: center; /* Center form elements */
            gap: 0.5rem; /* Tailwind's gap-2 */
        }
        label {
            font-size: 1rem; /* Tailwind's text-base */
            font-weight: 600;  /* Tailwind's font-semibold */
            color: #4b5563; /* Tailwind's gray-600 */
            margin-bottom: 0.25rem; /* Tailwind's mb-1 */
            display: block; /* Ensure label takes full width */
            text-align: left;
            width: 100%;
            max-width: 400px; /* Added max-width for label */
            margin-left: auto;
            margin-right: auto;
        }
        input[type="file"] {
            width: 100%;
            max-width: 400px; /* Added max-width for file input */
            padding: 0.75rem; /* Tailwind's p-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
            font-size: 1rem; /* Tailwind's text-base */
            margin-bottom: 0.25rem; /* Tailwind's mb-1 */
            margin-left: auto;
            margin-right: auto;
        }
        input[type="submit"] {
            padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            background-color: #4f46e5; /* Tailwind's bg-indigo-500 */
            color: #fff;
            font-size: 1rem; /* Tailwind's text-base */
            font-weight: 600; /* Tailwind's font-semibold */
            cursor: pointer;
            transition: background-color 0.3s ease; /* Smooth transition */
            border: none;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
            margin-top: 1rem;
        }
        input[type="submit"]:hover {
            background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
        }
        input[type="submit"]:focus {
            outline: none;
            box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
        }

        /* --- Result Styles --- */
        .result-container {
            margin-top: 2rem; /* Tailwind's mt-8 */
            padding: 1.5rem; /* Tailwind's p-6 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            background-color: #f8fafc; /* Tailwind's bg-gray-50 */
            border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
            box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
            text-align: left;
        }
        h3 {
            font-size: 1.5rem; /* Tailwind's text-2xl */
            font-weight: 600;  /* Tailwind's font-semibold */
            margin-bottom: 1rem; /* Tailwind's mb-4 */
            color: #1e293b; /* Tailwind's gray-900 */
        }
        textarea {
            width: 100%;
            padding: 0.75rem; /* Tailwind's p-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
            resize: none;
            font-size: 1rem; /* Tailwind's text-base */
            line-height: 1.5rem; /* Tailwind's leading-relaxed */
            margin-top: 0.5rem; /* Tailwind's mt-2 */
            margin-bottom: 0;
            box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
            min-height: 100px;
        }
        textarea:focus {
            outline: none;
            border-color: #3b82f6; /* Tailwind's border-blue-500 */
            box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
        }
        img {
            max-width: 100%;
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            margin-top: 1.5rem; /* Tailwind's mt-6 */
            box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
        }

        /* --- Error Styles --- */
        .error-message {
            color: #dc2626; /* Tailwind's text-red-600 */
            margin-top: 1rem; /* Tailwind's mt-4 */
            padding: 0.75rem;
            background-color: #fee2e2; /* Tailwind's bg-red-100 */
            border-radius: 0.375rem; /* Tailwind's rounded-md */
            border: 1px solid #fecaca; /* Tailwind's border-red-300 */
            text-align: center;
        }

        .prompt-select {
            margin-top: 1rem; /* Tailwind's mt-4 */
            display: flex;
            flex-direction: column;
            align-items: center;
            gap: 0.5rem;
            width: 100%;
        }

        .prompt-select label {
            font-size: 1rem;
            font-weight: 600;
            color: #4b5563;
            margin-bottom: 0.25rem;
            display: block; /* Ensure label takes full width */
            text-align: left;
            width: 100%;
            max-width: 400px; /* Added max-width for label */
            margin-left: auto;
            margin-right: auto;
        }

        .prompt-select select {
            width: 100%;
            max-width: 400px;
            padding: 0.75rem;
            border-radius: 0.5rem;
            border: 1px solid #d1d5db;
            font-size: 1rem;
            margin-bottom: 0.25rem;
            margin-left: auto;
            margin-right: auto;
            appearance: none;  /* Remove default arrow */
            background-image: url("data:image/svg+xml,%3Csvgxmlns='http://www.w3.org/2000/svg' viewBox='0 0 20 20' fill='none' stroke='currentColor' stroke-width='1.5' stroke-linecap='round' stroke-linejoin='round'%3E%3Cpath d='M6 9l4 4 4-4'%3E%3C/path%3E%3C/svg%3E"); /* Add custom arrow */
            background-repeat: no-repeat;
            background-position: right 0.75rem center;
            background-size: 1rem;
            padding-right: 2.5rem; /* Make space for the arrow */
        }

        .prompt-select select:focus {
            outline: none;
            border-color: #3b82f6;
            box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15);
        }


    </style>
</head>
<body>
    <div class="container">
        <h2>🎤🧠🎨 Multimodal Assistant</h2>
        <p> Upload an audio file to transcribe and generate a corresponding image. </p>
        <form method="POST" enctype="multipart/form-data">
            <label for="audio_file">Upload your voice note:</label><br>
            <input type="file" name="audio_file" accept="audio/*" required><br><br>

            <div class = "prompt-select">
                <label for="prompt_mode">Image Prompt Mode:</label>
                <select id="prompt_mode" name="prompt_mode">
                    <option value="detailed">Detailed Scene Description</option>
                    <option value="keywords">Keywords</option>
                    <option value="creative">Creative Interpretation</option>
                </select>
            </div>

            <input type="submit" value="Generate Visual Response">
        </form>

        {% if transcript %}
            <div class="result-container">
                <h3>📝 Transcript:</h3>
                <textarea readonly>{{ transcript }}</textarea>
            </div>
        {% endif %}

        {% if summary %}
            <div class="result-container">
                <h3>🔎 Summary:</h3>
                <p>{{ summary }}</p>
            </div>
        {% endif %}

        {% if prompt %}
            <div class="result-container">
                <h3>🎯 Scene Prompt:</h3>
                <p>{{ prompt }}</p>
            </div>
        {% endif %}

        {% if image_url %}
            <div class="result-container">
                <h3>🖼️ Generated Image:</h3>
                <img src="{{ image_url }}" alt="Generated image">
            </div>
        {% endif %}
        {% if error %}
            <div class="error-message">{{ error }}</div>
        {% endif %}
    </div>
</body>
</html>


6.2.3 What Makes This a Dashboard?

This layout combines several key elements that transform it from a simple interface into a comprehensive dashboard:

  • Multiple output zones (text, summary, prompt, image): the interface is divided into distinct sections, each dedicated to displaying different types of processed data. This organization allows users to easily track the progression from speech input to visual output.
  • Simple user interaction (one-click processing): despite the complex processing happening behind the scenes, users only need to perform one action to initiate the entire workflow. This simplicity makes the tool accessible to users of all technical levels.
  • Clean, readable formatting: the interface uses consistent spacing, typography, and visual hierarchy to ensure information is easily digestible. Each section is clearly labeled and visually separated from others.
  • Visual feedback to reinforce model output: the dashboard provides immediate visual confirmation at each step of the process, helping users understand how their input is being transformed across different AI models.
  • Reusable architecture, thanks to the utils/ structure: the modular design separates core functionality into utility functions, making the code easier to maintain and adapt for different use cases.

6.2.4 Use Case Ideas

This versatile dashboard has numerous potential applications. Let's explore some key use cases in detail:

  • Content creator's AI toolkit (turn thoughts into blogs + visuals)
    • Record brainstorming sessions and convert them into structured blog posts
    • Generate matching illustrations for key concepts
    • Create social media content bundles with matching visuals
  • Teacher's assistant (record voice ➝ summarize ➝ illustrate)
    • Transform lesson plans into visual learning materials
    • Create engaging educational content with matching illustrations
    • Generate visual aids for complex concepts
  • Journaling tool (log voice entries ➝ summarize + visualize)
    • Convert daily voice memos into organized written entries
    • Create mood boards based on journal content
    • Track emotional patterns through visual representations

Summary

In this section, you elevated your multimodal assistant into a professional-grade dashboard. Here's what you accomplished:

  • Broke down your logic into reusable utilities
    • Created a modular, maintainable code structure
    • Implemented a clean separation of concerns
  • Accepted audio input and processed it across models
    • Seamless integration of multiple AI technologies
    • Efficient processing pipeline
  • Presented everything clearly in a cohesive UI
    • User-friendly interface design
    • Intuitive information hierarchy
  • Moved from "demo" to tool
    • Production-ready implementation
    • Scalable architecture

This dashboard represents a professional-grade interface that delivers real value to users. With its robust architecture and intuitive design, it's ready to be transformed into a full-fledged product with minimal additional development.

6.2 Building a Creator Dashboard

This is where all the capabilities you've developed so far come together to create a powerful, unified system. By integrating multiple AI technologies, we can create applications that are greater than the sum of their parts. Let's explore these core capabilities in detail:

  • Transcription turns spoken words into written textUsing advanced speech recognition models like Whisper, we can accurately convert audio recordings into text, preserving the speaker's intent and context. This forms the foundation for further processing.
  • Content generation creates new, contextually relevant materialLarge language models can analyze the transcribed text and generate new content that maintains consistency with the original message while adding valuable insights or expanding on key points.
  • Prompt engineering crafts precise instructions for AI modelsThrough careful prompt construction, we can guide AI models to produce more accurate and relevant outputs. This involves understanding both the technical capabilities of the models and the nuanced ways to communicate with them.
  • Image creation transforms text descriptions into visual artModels like DALL·E can interpret textual descriptions and create corresponding images, adding a visual dimension to our applications and making abstract concepts more tangible.

These components don't just exist side by side - they form an interconnected pipeline where each step enhances the next. The output from transcription feeds into content generation, which informs prompt engineering, ultimately leading to image creation. This seamless integration creates a fluid workflow where users can start with a simple voice recording and end with a rich multimedia output, all within a single, cohesive system. By eliminating the need to switch between different tools or interfaces, users can focus on their creative process rather than technical implementation details.

6.2.1 What You'll Build

In this section, you'll design and implement a Creator Dashboard - a sophisticated web interface that transforms how creators work with AI. This comprehensive platform serves as a central hub for content creation, combining multiple AI technologies into one seamless experience. Let's explore the key features that make this dashboard powerful:

  • Upload a voice recordingCreators can easily upload audio files in various formats, making it simple to start their creative process with spoken ideas or narration.
  • Transcribe the voice recording into text using AIUsing advanced AI speech recognition technology, the system accurately converts spoken words into written text, maintaining the nuances and context of the original recording.
  • Turn that transcription into an editable promptThe system intelligently processes the transcribed text to create structured, AI-ready prompts that can be customized to achieve the desired creative output.
  • Generate images using DALL·E based on the promptLeveraging DALL·E's powerful image generation capabilities, the system creates visual representations that match the specified prompts, bringing ideas to life through AI-generated artwork.
  • Summarize the transcriptThe dashboard employs AI to distill long transcriptions into concise, meaningful summaries, helping creators quickly grasp the core concepts and themes.
  • Display all the results for review and further use in content productionAll generated content - from transcripts to images - is presented in an organized, easy-to-review format, allowing creators to efficiently manage and utilize their assets.

To build this robust system, you'll implement a modern tech stack using Flask for the backend operations and a clean, responsive combination of HTML and CSS for the frontend interface. This architecture ensures both modularity and maintainability, making it easy to update and scale the dashboard as needed.

6.2.2 Step-by-Step Implementation

Step 1: Project Setup

Download the audio sample: https://files.cuantum.tech/audio/dashboard-project.mp3

Create a new directory for your project and navigate into it:

mkdir creator_dashboard
cd creator_dashboard

It's recommended to set up a virtual environment:

python -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\\Scripts\\activate  # On Windows

Install the required Python packages:

pip install flask openai python-dotenv

Organize your project files as follows:

/creator_dashboard

├── app.py
├── .env
└── templates/
    └── dashboard.html
└── utils/
    ├── __init__.py
    ├── transcribe.py
    ├── summarize.py
    ├── generate_prompt.py
    └── generate_image.py
  • app.py: The main Flask application file.
  • .env: A file to store your OpenAI API key.
  • templates/: A directory for HTML templates.
  • templates/dashboard.html: The HTML template for the user interface.
  • utils/: A directory for Python modules containing reusable functions.
    • __init__.py: Makes the utils directory a Python package.
    • transcribe.py: Contains the function to transcribe audio using Whisper.
    • summarize.py: Contains the function to summarize the transcription using a Large Language Model.
    • generate_prompt.py: Contains the function to generate an image prompt from the summary using a Large Language Model.
    • generate_image.py: Contains the function to generate an image with DALL·E 3.

Step 2: Create the Utility Modules

Create the following Python files in the utils/ directory:

utils/transcribe.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def transcribe_audio(file_path: str) -> Optional[str]:
    """
    Transcribes an audio file using OpenAI's Whisper API.

    Args:
        file_path (str): The path to the audio file.

    Returns:
        Optional[str]: The transcribed text, or None on error.
    """
    try:
        logger.info(f"Transcribing audio: {file_path}")
        audio_file = open(file_path, "rb")
        response = openai.Audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
        transcript = response.text
        audio_file.close()
        return transcript
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error during transcription: {e}")
        return None
  • This module defines the transcribe_audio function, which takes the path to an audio file as input and uses OpenAI's Whisper API to generate a text transcription.
  • The function opens the audio file in binary read mode ("rb").
  • It calls openai.Audio.transcriptions.create() to perform the transcription, specifying the "whisper-1" model.
  • It extracts the transcribed text from the API response.
  • It includes error handling using a try...except block to catch potential openai.error.OpenAIError exceptions (specific to OpenAI) and general Exception for other errors. If an error occurs, it logs the error and returns None.
  • It logs the file path before transcription and the length of the transcribed text after successful transcription.
  • The audio file is closed after transcription.

utils/summarize.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def summarize_transcript(text: str) -> Optional[str]:
    """
    Summarizes a text transcript using OpenAI's Chat Completion API.

    Args:
        text (str): The text transcript to summarize.

    Returns:
        Optional[str]: The summarized text, or None on error.
    """
    try:
        logger.info("Summarizing transcript")
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You are a helpful assistant.  Provide a concise summary of the text, suitable for generating a visual representation."},
                {"role": "user", "content": text}
            ],
        )
        summary = response.choices[0].message.content
        logger.info(f"Summary: {summary}")
        return summary
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating summary: {e}")
        return None
  • This module defines the summarize_transcript function, which takes a text transcript as input and uses OpenAI's Chat Completion API to generate a concise summary.
  • The system message instructs the model to act as a helpful assistant and to provide a concise summary of the text, suitable for generating a visual representation.
  • The user message provides the transcript as the content for the model to summarize.
  • The function extracts the summary from the API response.
  • It includes error handling.

utils/generate_prompt.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def create_image_prompt(transcription: str) -> Optional[str]:
    """
    Generates a detailed image prompt from a text transcription using OpenAI's Chat Completion API.

    Args:
        transcription (str): The text transcription of the audio.

    Returns:
        Optional[str]: A detailed text prompt suitable for image generation, or None on error.
    """
    try:
        logger.info("Generating image prompt from transcription")
        response = openai.chat.completions.create(
            model="gpt-4o",  # Use a powerful chat model
            messages=[
                {
                    "role": "system",
                    "content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content.  Do not include any phrases like 'based on the audio' or 'from the user audio'.  Incorporate scene lighting, time of day, weather, and camera angle into the description.  Limit the description to 200 words.",
                },
                {"role": "user", "content": transcription},
            ],
        )
        prompt = response.choices[0].message.content
        prompt = prompt.strip()  # Remove leading/trailing spaces
        logger.info(f"Generated prompt: {prompt}")
        return prompt
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image prompt: {e}")
        return None
  • This module defines the create_image_prompt function, which takes the transcribed text as input and uses OpenAI's Chat Completion API to generate a detailed text prompt for image generation.
  • The system message instructs the model to act as a creative assistant and to generate a vivid scene description. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle. We also limit the description length to 200 words.
  • The user message provides the transcribed text as the content for the model to work with.
  • The function extracts the generated prompt from the API response.
  • It strips any leading/trailing spaces from the generated prompt.
  • It includes error handling.

utils/generate_image.py:

import openai
import logging
from typing import Optional, Dict

logger = logging.getLogger(__name__)


def generate_dalle_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024",
                       response_format: str = "url", quality: str = "standard") -> Optional[str]:
    """
    Generates an image using OpenAI's DALL·E API.

    Args:
        prompt (str): The text prompt to generate the image from.
        model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
        size (str, optional): The size of the generated image. Defaults to "1024x1024".
        response_format (str, optional): The format of the response. Defaults to "url".
        quality (str, optional): The quality of the image. Defaults to "standard".

    Returns:
        Optional[str]: The URL of the generated image, or None on error.
    """
    try:
        logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}, quality: {quality}")
        response = openai.images.generate(
            prompt=prompt,
            model=model,
            size=size,
            response_format=response_format,
            quality=quality
        )
        image_url = response.data[0].url
        logger.info(f"Image URL: {image_url}")
        return image_url
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image: {e}")
        return None
  • This module defines the generate_dalle_image function, which takes a text prompt as input and uses OpenAI's DALL·E API to generate an image.
  • It calls the openai.images.generate() method to generate the image.
  • It accepts optional modelsizeresponse_format, and quality parameters, allowing the user to configure the image generation.
  • It extracts the URL of the generated image from the API response.
  • It includes error handling.

Step 5: Create the Main App (app.py)

Create a Python file named app.py in the root directory of your project and add the following code:

from flask import Flask, request, render_template, jsonify, make_response, redirect, url_for
import os
from dotenv import load_dotenv
import logging
from typing import Optional, Dict
from werkzeug.utils import secure_filename
from werkzeug.datastructures import FileStorage

# Import the utility functions from the utils directory
from utils.transcribe import transcribe_audio
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image
from utils.summarize import summarize_transcript

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads'  # Store uploaded files
app.config['MAX_CONTENT_LENGTH'] = 25 * 1024 * 1024  # 25MB max file size - increased for larger audio files
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)  # Create the upload folder if it doesn't exist

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'}  # Allowed audio file extensions


def allowed_file(filename: str) -> bool:
    """
    Checks if the uploaded file has an allowed extension.

    Args:
        filename (str): The name of the file.

    Returns:
        bool: True if the file has an allowed extension, False otherwise.
    """
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


@app.route("/", methods=["GET", "POST"])
def index():
    """
    Handles the main route for the web application.
    Processes audio uploads, transcribes them, generates image prompts, and displays images.
    """
    transcript = None
    image_url = None
    prompt_summary = None
    error_message = None
    summary = None # Initialize summary

    if request.method == "POST":
        if 'audio_file' not in request.files:
            error_message = "No file part"
            logger.warning(error_message)
            return render_template("index.html", error_message=error_message)

        file: FileStorage = request.files['audio_file']  # Use type hinting
        if file.filename == '':
            error_message = "No file selected"
            logger.warning(request)
            return render_template("index.html", error_message=error_message)

        if file and allowed_file(file.filename):
            try:
                # Secure the filename and construct a safe path
                filename = secure_filename(file.filename)
                file_path = os.path.join(app.config['UPLOAD_FOLDER'], filename)
                file.save(file_path)  # Save the uploaded file

                transcript = transcribe_audio(file_path)  # Transcribe audio
                if not transcript:
                    error_message = "Audio transcription failed. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error_message=error_message)

                summary = summarize_transcript(transcript) # Summarize the transcript
                if not summary:
                    error_message = "Audio summary failed. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error_message=error_message)

                prompt_summary = generate_image_prompt(transcript)  # Generate prompt
                if not prompt_summary:
                    error_message = "Failed to generate image prompt. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error_message=error_message)

                image_url = generate_dalle_image(prompt_summary, model=request.form.get('model', 'dall-e-3'),
                                                size=request.form.get('size', '1024x1024'),
                                                response_format=request.form.get('format', 'url'),
                                                quality=request.form.get('quality', 'standard'))  # Generate image
                if not image_url:
                    error_message = "Failed to generate image. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error_message=error_message)

                # Optionally, delete the uploaded file after processing
                os.remove(file_path)
                logger.info(f"Successfully processed audio file and generated image.")
                return render_template("index.html", transcript=transcript, image_url=image_url, prompt=prompt_summary, summary=summary)

            except Exception as e:
                error_message = f"An error occurred: {e}"
                logger.error(error_message)
                return render_template("index.html",  error_message=error_message)
        else:
            error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
            logger.warning(request)
            return render_template("index.html",  error_message=error_message)

    return render_template("index.html", transcript=transcript, image_url=image_url, prompt=prompt_summary,
                           error=error_message, summary=summary)



@app.errorhandler(500)
def internal_server_error(e):
    """Handles internal server errors."""
    logger.error(f"Internal Server Error: {e}")
    return render_template("error.html", error="Internal Server Error"), 500


if __name__ == "__main__":
    app.run(debug=True)

Code Breakdown:

  • Import Statements: Imports necessary Flask modules, OpenAI library, osdotenvloggingOptional and Dict for type hinting, and secure_filename and FileStorage from Werkzeug.
  • Environment Variables: Loads the OpenAI API key from the .env file.
  • Flask Application:
    • Creates a Flask application instance.
    • Configures an upload folder and maximum file size. The UPLOAD_FOLDER is set to 'uploads', and MAX_CONTENT_LENGTH is set to 25MB. The upload folder is created if it does not exist.
  • Logging Configuration: Configures logging.
  • allowed_file Function: Checks if the uploaded file has an allowed audio extension.
  • transcribe_audio Function:
    • Takes the audio file path as input.
    • Opens the audio file in binary read mode ("rb").
    • Calls the OpenAI API's openai.Audio.transcriptions.create() method to transcribe the audio.
    • Extracts the transcribed text from the API response.
    • Logs the file path before transcription and the length of the transcribed text after successful transcription.
    • Includes error handling for OpenAI API errors and other exceptions. The audio file is closed after transcription.
  • generate_image_prompt Function:
    • Takes the transcribed text as input.
    • Uses the OpenAI Chat Completion API (openai.chat.completions.create()) with the gpt-4o model to generate a detailed text prompt suitable for image generation.
    • The system message instructs the model to act as a creative assistant and provide a vivid and detailed description of a scene that could be used to generate an image with an AI image generation model. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle. We also limit the description length to 200 words.
    • Extracts the generated prompt from the API response.
    • It strips any leading/trailing spaces from the generated prompt.
    • Includes error handling.
  • generate_dalle_image Function (imported from utils/generate_image.py):
    • Takes the image prompt as input.
    • Calls the OpenAI API's openai.images.generate() method to generate an image using DALL·E 3.
    • Accepts optional model, size, response_format, and quality parameters, allowing the user to configure the image generation.
    • Extracts the URL of the generated image from the API response.
    • Includes error handling.
  • summarize_transcript Function (imported from utils/summarize.py):
    • Takes the transcribed text as input and calls the Chat Completions API to produce a concise summary for display alongside the image.
    • Includes error handling and returns None on failure.
  • index Route:
    • Handles both GET and POST requests.
    • For GET requests, it renders the initial HTML page.
    • For POST requests (when the user uploads an audio file):
      • It validates the uploaded file:
        • Checks if the file part exists in the request.
        • Checks if a file was selected.
        • Checks if the file type is allowed using the allowed_file function.
      • It saves the uploaded file to a temporary location using a secure filename.
      • It calls the utility functions to:
        • Transcribe the audio using transcribe_audio().
        • Generate an image prompt from the transcription using create_image_prompt().
        • Generate an image from the prompt using generate_dalle_image().
      • It summarizes the transcript using summarize_transcript(), which calls the OpenAI Chat Completions API.
      • It handles errors that may occur during any of these steps, logging the error and rendering the dashboard.html template with an appropriate error message. The temporary file is deleted before rendering the error page (see the cleanup sketch after this breakdown).
      • If all steps are successful, it renders the dashboard.html template, passing the transcript, summary, generated prompt, and image URL to be displayed.
  • @app.errorhandler(500): Handles HTTP 500 errors (Internal Server Error) by logging the error and rendering a user-friendly error page. Note that this handler renders error.html, which this walkthrough does not otherwise create, so add a simple template with that name alongside dashboard.html.
  • if __name__ == "__main__":: Starts the Flask development server if the script is executed directly.
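The route deletes the temporary upload in every branch, which becomes easy to get wrong as the pipeline grows. Below is a minimal sketch of an alternative: a hypothetical process_upload() helper (not part of the code above) that runs the same four utility calls and funnels cleanup through a single try/finally block.

import os
from typing import Optional, Tuple

from utils.transcribe import transcribe_audio
from utils.summarize import summarize_transcript
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image


def process_upload(file_path: str) -> Tuple[Optional[dict], Optional[str]]:
    """Runs the full pipeline and returns (results, error_message)."""
    try:
        transcript = transcribe_audio(file_path)
        if not transcript:
            return None, "Audio transcription failed. Please try again."

        summary = summarize_transcript(transcript)
        if not summary:
            return None, "Audio summary failed. Please try again."

        prompt = create_image_prompt(transcript)
        if not prompt:
            return None, "Failed to generate image prompt. Please try again."

        image_url = generate_dalle_image(prompt)
        if not image_url:
            return None, "Failed to generate image. Please try again."

        return {"transcript": transcript, "summary": summary,
                "prompt": prompt, "image_url": image_url}, None
    finally:
        # Remove the temporary upload no matter which step failed.
        if os.path.exists(file_path):
            os.remove(file_path)

With a helper like this, the POST branch of index() would only need to call process_upload(file_path) and render dashboard.html with whichever of the two return values is set.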

Step 6: Create the HTML Template (templates/dashboard.html)

Create a folder named templates in the same directory as app.py. Inside the templates folder, create a file named dashboard.html with the following HTML code:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Creator Dashboard</title>
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
    <style>
        /* --- General Styles --- */
        body {
            font-family: 'Inter', sans-serif;
            padding: 40px;
            background-color: #f9fafb; /* Tailwind's gray-50 */
            display: flex;
            justify-content: center;
            align-items: center;
            min-height: 100vh;
            margin: 0;
            color: #374151; /* Tailwind's gray-700 */
        }
        .container {
            max-width: 800px; /* Increased max-width */
            width: 95%; /* Take up most of the viewport */
            background-color: #fff;
            padding: 2rem;
            border-radius: 0.75rem; /* Tailwind's rounded-lg */
            box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
            text-align: center;
        }
        h2 {
            font-size: 2.25rem; /* Tailwind's text-3xl */
            font-weight: 600;  /* Tailwind's font-semibold */
            margin-bottom: 1.5rem; /* Tailwind's mb-6 */
            color: #1e293b; /* Tailwind's gray-900 */
        }
        p{
            color: #6b7280; /* Tailwind's gray-500 */
            margin-bottom: 1rem;
        }

        /* --- Form Styles --- */
        form {
            margin-top: 1rem; /* Tailwind's mt-4 */
            margin-bottom: 1.5rem;
            display: flex;
            flex-direction: column;
            align-items: center; /* Center form elements */
            gap: 0.5rem; /* Tailwind's gap-2 */
        }
        label {
            font-size: 1rem; /* Tailwind's text-base */
            font-weight: 600;  /* Tailwind's font-semibold */
            color: #4b5563; /* Tailwind's gray-600 */
            margin-bottom: 0.25rem; /* Tailwind's mb-1 */
            display: block; /* Ensure label takes full width */
            text-align: left;
            width: 100%;
            max-width: 400px; /* Added max-width for label */
            margin-left: auto;
            margin-right: auto;
        }
        input[type="file"] {
            width: 100%;
            max-width: 400px; /* Added max-width for file input */
            padding: 0.75rem; /* Tailwind's p-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
            font-size: 1rem; /* Tailwind's text-base */
            margin-bottom: 0.25rem; /* Tailwind's mb-1 */
            margin-left: auto;
            margin-right: auto;
        }
        input[type="submit"] {
            padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            background-color: #4f46e5; /* Tailwind's bg-indigo-500 */
            color: #fff;
            font-size: 1rem; /* Tailwind's text-base */
            font-weight: 600; /* Tailwind's font-semibold */
            cursor: pointer;
            transition: background-color 0.3s ease; /* Smooth transition */
            border: none;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
            margin-top: 1rem;
        }
        input[type="submit"]:hover {
            background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
        }
        input[type="submit"]:focus {
            outline: none;
            box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
        }

        /* --- Result Styles --- */
        .result-container {
            margin-top: 2rem; /* Tailwind's mt-8 */
            padding: 1.5rem; /* Tailwind's p-6 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            background-color: #f8fafc; /* Tailwind's bg-gray-50 */
            border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
            box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
            text-align: left;
        }
        h3 {
            font-size: 1.5rem; /* Tailwind's text-2xl */
            font-weight: 600;  /* Tailwind's font-semibold */
            margin-bottom: 1rem; /* Tailwind's mb-4 */
            color: #1e293b; /* Tailwind's gray-900 */
        }
        textarea {
            width: 100%;
            padding: 0.75rem; /* Tailwind's p-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
            resize: none;
            font-size: 1rem; /* Tailwind's text-base */
            line-height: 1.5rem; /* Tailwind's leading-relaxed */
            margin-top: 0.5rem; /* Tailwind's mt-2 */
            margin-bottom: 0;
            box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
            min-height: 100px;
        }
        textarea:focus {
            outline: none;
            border-color: #3b82f6; /* Tailwind's border-blue-500 */
            box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
        }
        img {
            max-width: 100%;
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            margin-top: 1.5rem; /* Tailwind's mt-6 */
            box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
        }

        /* --- Error Styles --- */
        .error-message {
            color: #dc2626; /* Tailwind's text-red-600 */
            margin-top: 1rem; /* Tailwind's mt-4 */
            padding: 0.75rem;
            background-color: #fee2e2; /* Tailwind's bg-red-100 */
            border-radius: 0.375rem; /* Tailwind's rounded-md */
            border: 1px solid #fecaca; /* Tailwind's border-red-300 */
            text-align: center;
        }

        .prompt-select {
            margin-top: 1rem; /* Tailwind's mt-4 */
            display: flex;
            flex-direction: column;
            align-items: center;
            gap: 0.5rem;
            width: 100%;
        }

        .prompt-select label {
            font-size: 1rem;
            font-weight: 600;
            color: #4b5563;
            margin-bottom: 0.25rem;
            display: block; /* Ensure label takes full width */
            text-align: left;
            width: 100%;
            max-width: 400px; /* Added max-width for label */
            margin-left: auto;
            margin-right: auto;
        }

        .prompt-select select {
            width: 100%;
            max-width: 400px;
            padding: 0.75rem;
            border-radius: 0.5rem;
            border: 1px solid #d1d5db;
            font-size: 1rem;
            margin-bottom: 0.25rem;
            margin-left: auto;
            margin-right: auto;
            appearance: none;  /* Remove default arrow */
            background-image: url("data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg' viewBox='0 0 20 20' fill='none' stroke='currentColor' stroke-width='1.5' stroke-linecap='round' stroke-linejoin='round'%3E%3Cpath d='M6 9l4 4 4-4'%3E%3C/path%3E%3C/svg%3E"); /* Add custom arrow */
            background-repeat: no-repeat;
            background-position: right 0.75rem center;
            background-size: 1rem;
            padding-right: 2.5rem; /* Make space for the arrow */
        }

        .prompt-select select:focus {
            outline: none;
            border-color: #3b82f6;
            box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15);
        }


    </style>
</head>
<body>
    <div class="container">
        <h2>🎤🧠🎨 Multimodal Assistant</h2>
        <p> Upload an audio file to transcribe and generate a corresponding image. </p>
        <form method="POST" enctype="multipart/form-data">
            <label for="audio_file">Upload your voice note:</label><br>
            <input type="file" id="audio_file" name="audio_file" accept="audio/*" required><br><br>

            <div class="prompt-select">
                <label for="prompt_mode">Image Prompt Mode:</label>
                <select id="prompt_mode" name="prompt_mode">
                    <option value="detailed">Detailed Scene Description</option>
                    <option value="keywords">Keywords</option>
                    <option value="creative">Creative Interpretation</option>
                </select>
            </div>

            <input type="submit" value="Generate Visual Response">
        </form>

        {% if transcript %}
            <div class="result-container">
                <h3>📝 Transcript:</h3>
                <textarea readonly>{{ transcript }}</textarea>
            </div>
        {% endif %}

        {% if summary %}
            <div class="result-container">
                <h3>🔎 Summary:</h3>
                <p>{{ summary }}</p>
            </div>
        {% endif %}

        {% if prompt %}
            <div class="result-container">
                <h3>🎯 Scene Prompt:</h3>
                <p>{{ prompt }}</p>
            </div>
        {% endif %}

        {% if image_url %}
            <div class="result-container">
                <h3>🖼️ Generated Image:</h3>
                <img src="{{ image_url }}" alt="Generated image">
            </div>
        {% endif %}
        {% if error %}
            <div class="error-message">{{ error }}</div>
        {% endif %}
    </div>
</body>
</html>
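Note that the form above includes a prompt_mode selector that the Flask route shown earlier never reads (it only looks at the model, size, format, and quality form fields). The sketch below is one hypothetical way to wire it in; the dictionary and the instruction_for_mode() helper are illustrations, not part of the tutorial code.

# Hypothetical wiring for the form's prompt_mode select (not part of the tutorial code).
# Each option value maps to a different system instruction for the prompt-building step.
PROMPT_MODE_INSTRUCTIONS = {
    "detailed": "Describe the scene vividly, including lighting, time of day, weather, and camera angle.",
    "keywords": "Return a short, comma-separated list of visual keywords that capture the scene.",
    "creative": "Reinterpret the content as an imaginative, stylized illustration concept.",
}


def instruction_for_mode(mode: str) -> str:
    """Falls back to the detailed description when the mode is missing or unknown."""
    return PROMPT_MODE_INSTRUCTIONS.get(mode, PROMPT_MODE_INSTRUCTIONS["detailed"])


# Inside the POST branch of index(), before building the image prompt:
#   mode = request.form.get("prompt_mode", "detailed")
#   instruction = instruction_for_mode(mode)
# create_image_prompt() would then need an optional parameter so this
# instruction can replace its hard-coded system message.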


6.2.3 What Makes This a Dashboard?

This layout combines several key elements that transform it from a simple interface into a comprehensive dashboard:

  • Multiple output zones (text, summary, prompt, image): The interface is divided into distinct sections, each dedicated to displaying a different type of processed data. This organization allows users to easily track the progression from speech input to visual output.
  • Simple user interaction (one-click processing): Despite the complex processing happening behind the scenes, users only need to perform one action to initiate the entire workflow. This simplicity makes the tool accessible to users of all technical levels.
  • Clean, readable formatting: The interface uses consistent spacing, typography, and visual hierarchy to ensure information is easily digestible. Each section is clearly labeled and visually separated from the others.
  • Visual feedback to reinforce model output: The dashboard provides immediate visual confirmation at each step of the process, helping users understand how their input is being transformed across different AI models.
  • Reusable architecture, thanks to the utils/ structure: The modular design separates core functionality into utility functions, making the code easier to maintain and adapt for different use cases - see the sketch after this list.
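To illustrate that last point, here is a minimal sketch of how a new capability could slot into the same structure. The module name (utils/translate.py) and the translate_transcript() function are hypothetical, not part of the tutorial; the pattern simply mirrors the existing utilities.

# Hypothetical utils/translate.py - a new utility following the same pattern
# as transcribe.py, summarize.py, generate_prompt.py, and generate_image.py.
import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def translate_transcript(text: str, target_language: str = "Spanish") -> Optional[str]:
    """Translates a transcript into the target language, or returns None on error."""
    try:
        logger.info(f"Translating transcript to {target_language}")
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": f"You are a translator. Translate the user's text into {target_language}."},
                {"role": "user", "content": text},
            ],
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        logger.error(f"Error translating transcript: {e}")
        return None

Dropping a module like this into utils/ and importing it from app.py adds another output zone without touching the existing pipeline.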

6.2.4 Use Case Ideas

This versatile dashboard has numerous potential applications. Let's explore some key use cases in detail:

  • Content creator's AI toolkit (turn thoughts into blogs + visuals)
    • Record brainstorming sessions and convert them into structured blog posts
    • Generate matching illustrations for key concepts
    • Create social media content bundles with matching visuals
  • Teacher's assistant (record voice ➝ summarize ➝ illustrate)
    • Transform lesson plans into visual learning materials
    • Create engaging educational content with matching illustrations
    • Generate visual aids for complex concepts
  • Journaling tool (log voice entries ➝ summarize + visualize)
    • Convert daily voice memos into organized written entries
    • Create mood boards based on journal content
    • Track emotional patterns through visual representations

Summary

In this section, you elevated your multimodal assistant into a professional-grade dashboard. Here's what you accomplished:

  • Broke down your logic into reusable utilities
    • Created a modular, maintainable code structure
    • Implemented a clean separation of concerns
  • Accepted audio input and processed it across models
    • Seamless integration of multiple AI technologies
    • Efficient processing pipeline
  • Presented everything clearly in a cohesive UI
    • User-friendly interface design
    • Intuitive information hierarchy
  • Moved from "demo" to tool
    • Production-ready implementation
    • Scalable architecture

This dashboard represents a professional-grade interface that delivers real value to users. With its robust architecture and intuitive design, it's ready to be transformed into a full-fledged product with minimal additional development.

6.2 Building a Creator Dashboard

This is where all the capabilities you've developed so far come together to create a powerful, unified system. By integrating multiple AI technologies, we can create applications that are greater than the sum of their parts. Let's explore these core capabilities in detail:

  • Transcription turns spoken words into written textUsing advanced speech recognition models like Whisper, we can accurately convert audio recordings into text, preserving the speaker's intent and context. This forms the foundation for further processing.
  • Content generation creates new, contextually relevant materialLarge language models can analyze the transcribed text and generate new content that maintains consistency with the original message while adding valuable insights or expanding on key points.
  • Prompt engineering crafts precise instructions for AI modelsThrough careful prompt construction, we can guide AI models to produce more accurate and relevant outputs. This involves understanding both the technical capabilities of the models and the nuanced ways to communicate with them.
  • Image creation transforms text descriptions into visual artModels like DALL·E can interpret textual descriptions and create corresponding images, adding a visual dimension to our applications and making abstract concepts more tangible.

These components don't just exist side by side - they form an interconnected pipeline where each step enhances the next. The output from transcription feeds into content generation, which informs prompt engineering, ultimately leading to image creation. This seamless integration creates a fluid workflow where users can start with a simple voice recording and end with a rich multimedia output, all within a single, cohesive system. By eliminating the need to switch between different tools or interfaces, users can focus on their creative process rather than technical implementation details.

6.2.1 What You'll Build

In this section, you'll design and implement a Creator Dashboard - a sophisticated web interface that transforms how creators work with AI. This comprehensive platform serves as a central hub for content creation, combining multiple AI technologies into one seamless experience. Let's explore the key features that make this dashboard powerful:

  • Upload a voice recordingCreators can easily upload audio files in various formats, making it simple to start their creative process with spoken ideas or narration.
  • Transcribe the voice recording into text using AIUsing advanced AI speech recognition technology, the system accurately converts spoken words into written text, maintaining the nuances and context of the original recording.
  • Turn that transcription into an editable promptThe system intelligently processes the transcribed text to create structured, AI-ready prompts that can be customized to achieve the desired creative output.
  • Generate images using DALL·E based on the promptLeveraging DALL·E's powerful image generation capabilities, the system creates visual representations that match the specified prompts, bringing ideas to life through AI-generated artwork.
  • Summarize the transcriptThe dashboard employs AI to distill long transcriptions into concise, meaningful summaries, helping creators quickly grasp the core concepts and themes.
  • Display all the results for review and further use in content productionAll generated content - from transcripts to images - is presented in an organized, easy-to-review format, allowing creators to efficiently manage and utilize their assets.

To build this robust system, you'll implement a modern tech stack using Flask for the backend operations and a clean, responsive combination of HTML and CSS for the frontend interface. This architecture ensures both modularity and maintainability, making it easy to update and scale the dashboard as needed.

6.2.2 Step-by-Step Implementation

Step 1: Project Setup

Download the audio sample: https://files.cuantum.tech/audio/dashboard-project.mp3

Create a new directory for your project and navigate into it:

mkdir creator_dashboard
cd creator_dashboard

It's recommended to set up a virtual environment:

python -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\\Scripts\\activate  # On Windows

Install the required Python packages:

pip install flask openai python-dotenv

Organize your project files as follows:

/creator_dashboard

├── app.py
├── .env
└── templates/
    └── dashboard.html
└── utils/
    ├── __init__.py
    ├── transcribe.py
    ├── summarize.py
    ├── generate_prompt.py
    └── generate_image.py
  • app.py: The main Flask application file.
  • .env: A file to store your OpenAI API key.
  • templates/: A directory for HTML templates.
  • templates/dashboard.html: The HTML template for the user interface.
  • utils/: A directory for Python modules containing reusable functions.
    • __init__.py: Makes the utils directory a Python package.
    • transcribe.py: Contains the function to transcribe audio using Whisper.
    • summarize.py: Contains the function to summarize the transcription using a Large Language Model.
    • generate_prompt.py: Contains the function to generate an image prompt from the summary using a Large Language Model.
    • generate_image.py: Contains the function to generate an image with DALL·E 3.

Step 2: Create the Utility Modules

Create the following Python files in the utils/ directory:

utils/transcribe.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def transcribe_audio(file_path: str) -> Optional[str]:
    """
    Transcribes an audio file using OpenAI's Whisper API.

    Args:
        file_path (str): The path to the audio file.

    Returns:
        Optional[str]: The transcribed text, or None on error.
    """
    try:
        logger.info(f"Transcribing audio: {file_path}")
        audio_file = open(file_path, "rb")
        response = openai.Audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
        transcript = response.text
        audio_file.close()
        return transcript
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error during transcription: {e}")
        return None
  • This module defines the transcribe_audio function, which takes the path to an audio file as input and uses OpenAI's Whisper API to generate a text transcription.
  • The function opens the audio file in binary read mode ("rb").
  • It calls openai.Audio.transcriptions.create() to perform the transcription, specifying the "whisper-1" model.
  • It extracts the transcribed text from the API response.
  • It includes error handling using a try...except block to catch potential openai.error.OpenAIError exceptions (specific to OpenAI) and general Exception for other errors. If an error occurs, it logs the error and returns None.
  • It logs the file path before transcription and the length of the transcribed text after successful transcription.
  • The audio file is closed after transcription.

utils/summarize.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def summarize_transcript(text: str) -> Optional[str]:
    """
    Summarizes a text transcript using OpenAI's Chat Completion API.

    Args:
        text (str): The text transcript to summarize.

    Returns:
        Optional[str]: The summarized text, or None on error.
    """
    try:
        logger.info("Summarizing transcript")
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You are a helpful assistant.  Provide a concise summary of the text, suitable for generating a visual representation."},
                {"role": "user", "content": text}
            ],
        )
        summary = response.choices[0].message.content
        logger.info(f"Summary: {summary}")
        return summary
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating summary: {e}")
        return None
  • This module defines the summarize_transcript function, which takes a text transcript as input and uses OpenAI's Chat Completion API to generate a concise summary.
  • The system message instructs the model to act as a helpful assistant and to provide a concise summary of the text, suitable for generating a visual representation.
  • The user message provides the transcript as the content for the model to summarize.
  • The function extracts the summary from the API response.
  • It includes error handling.

utils/generate_prompt.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def create_image_prompt(transcription: str) -> Optional[str]:
    """
    Generates a detailed image prompt from a text transcription using OpenAI's Chat Completion API.

    Args:
        transcription (str): The text transcription of the audio.

    Returns:
        Optional[str]: A detailed text prompt suitable for image generation, or None on error.
    """
    try:
        logger.info("Generating image prompt from transcription")
        response = openai.chat.completions.create(
            model="gpt-4o",  # Use a powerful chat model
            messages=[
                {
                    "role": "system",
                    "content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content.  Do not include any phrases like 'based on the audio' or 'from the user audio'.  Incorporate scene lighting, time of day, weather, and camera angle into the description.  Limit the description to 200 words.",
                },
                {"role": "user", "content": transcription},
            ],
        )
        prompt = response.choices[0].message.content
        prompt = prompt.strip()  # Remove leading/trailing spaces
        logger.info(f"Generated prompt: {prompt}")
        return prompt
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image prompt: {e}")
        return None
  • This module defines the create_image_prompt function, which takes the transcribed text as input and uses OpenAI's Chat Completion API to generate a detailed text prompt for image generation.
  • The system message instructs the model to act as a creative assistant and to generate a vivid scene description. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle. We also limit the description length to 200 words.
  • The user message provides the transcribed text as the content for the model to work with.
  • The function extracts the generated prompt from the API response.
  • It strips any leading/trailing spaces from the generated prompt.
  • It includes error handling.

utils/generate_image.py:

import openai
import logging
from typing import Optional, Dict

logger = logging.getLogger(__name__)


def generate_dalle_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024",
                       response_format: str = "url", quality: str = "standard") -> Optional[str]:
    """
    Generates an image using OpenAI's DALL·E API.

    Args:
        prompt (str): The text prompt to generate the image from.
        model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
        size (str, optional): The size of the generated image. Defaults to "1024x1024".
        response_format (str, optional): The format of the response. Defaults to "url".
        quality (str, optional): The quality of the image. Defaults to "standard".

    Returns:
        Optional[str]: The URL of the generated image, or None on error.
    """
    try:
        logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}, quality: {quality}")
        response = openai.images.generate(
            prompt=prompt,
            model=model,
            size=size,
            response_format=response_format,
            quality=quality
        )
        image_url = response.data[0].url
        logger.info(f"Image URL: {image_url}")
        return image_url
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image: {e}")
        return None
  • This module defines the generate_dalle_image function, which takes a text prompt as input and uses OpenAI's DALL·E API to generate an image.
  • It calls the openai.images.generate() method to generate the image.
  • It accepts optional modelsizeresponse_format, and quality parameters, allowing the user to configure the image generation.
  • It extracts the URL of the generated image from the API response.
  • It includes error handling.

Step 5: Create the Main App (app.py)

Create a Python file named app.py in the root directory of your project and add the following code:

from flask import Flask, request, render_template, jsonify, make_response, redirect, url_for
import os
from dotenv import load_dotenv
import logging
from typing import Optional, Dict
from werkzeug.utils import secure_filename
from werkzeug.datastructures import FileStorage

# Import the utility functions from the utils directory
from utils.transcribe import transcribe_audio
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image
from utils.summarize import summarize_transcript

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads'  # Store uploaded files
app.config['MAX_CONTENT_LENGTH'] = 25 * 1024 * 1024  # 25MB max file size - increased for larger audio files
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)  # Create the upload folder if it doesn't exist

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'}  # Allowed audio file extensions


def allowed_file(filename: str) -> bool:
    """
    Checks if the uploaded file has an allowed extension.

    Args:
        filename (str): The name of the file.

    Returns:
        bool: True if the file has an allowed extension, False otherwise.
    """
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


@app.route("/", methods=["GET", "POST"])
def index():
    """
    Handles the main route for the web application.
    Processes audio uploads, transcribes them, generates image prompts, and displays images.
    """
    transcript = None
    image_url = None
    prompt_summary = None
    error_message = None
    summary = None # Initialize summary

    if request.method == "POST":
        if 'audio_file' not in request.files:
            error_message = "No file part"
            logger.warning(error_message)
            return render_template("index.html", error_message=error_message)

        file: FileStorage = request.files['audio_file']  # Use type hinting
        if file.filename == '':
            error_message = "No file selected"
            logger.warning(request)
            return render_template("index.html", error_message=error_message)

        if file and allowed_file(file.filename):
            try:
                # Secure the filename and construct a safe path
                filename = secure_filename(file.filename)
                file_path = os.path.join(app.config['UPLOAD_FOLDER'], filename)
                file.save(file_path)  # Save the uploaded file

                transcript = transcribe_audio(file_path)  # Transcribe audio
                if not transcript:
                    error_message = "Audio transcription failed. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error_message=error_message)

                summary = summarize_transcript(transcript) # Summarize the transcript
                if not summary:
                    error_message = "Audio summary failed. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error_message=error_message)

                prompt_summary = generate_image_prompt(transcript)  # Generate prompt
                if not prompt_summary:
                    error_message = "Failed to generate image prompt. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error_message=error_message)

                image_url = generate_dalle_image(prompt_summary, model=request.form.get('model', 'dall-e-3'),
                                                size=request.form.get('size', '1024x1024'),
                                                response_format=request.form.get('format', 'url'),
                                                quality=request.form.get('quality', 'standard'))  # Generate image
                if not image_url:
                    error_message = "Failed to generate image. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error_message=error_message)

                # Optionally, delete the uploaded file after processing
                os.remove(file_path)
                logger.info(f"Successfully processed audio file and generated image.")
                return render_template("index.html", transcript=transcript, image_url=image_url, prompt=prompt_summary, summary=summary)

            except Exception as e:
                error_message = f"An error occurred: {e}"
                logger.error(error_message)
                return render_template("index.html",  error_message=error_message)
        else:
            error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
            logger.warning(request)
            return render_template("index.html",  error_message=error_message)

    return render_template("index.html", transcript=transcript, image_url=image_url, prompt=prompt_summary,
                           error=error_message, summary=summary)



@app.errorhandler(500)
def internal_server_error(e):
    """Handles internal server errors."""
    logger.error(f"Internal Server Error: {e}")
    return render_template("error.html", error="Internal Server Error"), 500


if __name__ == "__main__":
    app.run(debug=True)

Code Breakdown:

  • Import Statements: Imports necessary Flask modules, OpenAI library, osdotenvloggingOptional and Dict for type hinting, and secure_filename and FileStorage from Werkzeug.
  • Environment Variables: Loads the OpenAI API key from the .env file.
  • Flask Application:
    • Creates a Flask application instance.
    • Configures an upload folder and maximum file size. The UPLOAD_FOLDER is set to 'uploads', and MAX_CONTENT_LENGTH is set to 25MB. The upload folder is created if it does not exist.
  • Logging Configuration: Configures logging.
  • allowed_file Function: Checks if the uploaded file has an allowed audio extension.
  • transcribe_audio Function:
    • Takes the audio file path as input.
    • Opens the audio file in binary read mode ("rb").
    • Calls the OpenAI API's openai.Audio.transcriptions.create() method to transcribe the audio.
    • Extracts the transcribed text from the API response.
    • Logs the file path before transcription and the length of the transcribed text after successful transcription.
    • Includes error handling for OpenAI API errors and other exceptions. The audio file is closed after transcription.
  • generate_image_prompt Function:
    • Takes the transcribed text as input.
    • Uses the OpenAI Chat Completion API (openai.chat.completions.create()) with the gpt-4o model to generate a detailed text prompt suitable for image generation.
    • The system message instructs the model to act as a creative assistant and provide a vivid and detailed description of a scene that could be used to generate an image with an AI image generation model. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle. We also limit the description length to 200 words.
    • Extracts the generated prompt from the API response.
    • It strips any leading/trailing spaces from the generated prompt.
    • Includes error handling.
  • generate_image Function:
    • Takes the image prompt as input.
    • Calls the OpenAI API's openai.Image.create() method to generate an image using DALL·E 3.
    • Accepts optional modelsizeresponse_format, and quality parameters, allowing the user to configure the image generation.
    • Extracts the URL of the generated image from the API response.
    • Includes error handling.
  • index Route:
    • Handles both GET and POST requests.
    • For GET requests, it renders the initial HTML page.
    • For POST requests (when the user uploads an audio file):
      • It validates the uploaded file:
        • Checks if the file part exists in the request.
        • Checks if a file was selected.
        • Checks if the file type is allowed using the allowed_file function.
      • It saves the uploaded file to a temporary location using a secure filename.
      • It calls the utility functions to:
        • Transcribe the audio using transcribe_audio().
        • Generate an image prompt from the transcription using create_image_prompt().
        • Generate an image from the prompt using generate_dalle_image().
      • It summarizes the transcript using openai chat completions api.
      • It handles errors that may occur during any of these steps, logging the error and rendering the index.html template with an appropriate error message. The temporary file is deleted before rendering the error page.
      • If all steps are successful, it renders the index.html template, passing the transcription text, image URL, and generated prompt to be displayed.
  • @app.errorhandler(500): Handles HTTP 500 errors (Internal Server Error) by logging the error and rendering a user-friendly error page.
  • if __name__ == "__main__":: Starts the Flask development server if the script is executed directly.

Step 6: Create the HTML Template (templates/dashboard.html)

Create a folder named templates in the same directory as app.py. Inside the templates folder, create a file named dashboard.html with the following HTML code:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Creator Dashboard</title>
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
    <style>
        /* --- General Styles --- */
        body {
            font-family: 'Inter', sans-serif;
            padding: 40px;
            background-color: #f9fafb; /* Tailwind's gray-50 */
            display: flex;
            justify-content: center;
            align-items: center;
            min-height: 100vh;
            margin: 0;
            color: #374151; /* Tailwind's gray-700 */
        }
        .container {
            max-width: 800px; /* Increased max-width */
            width: 95%; /* Take up most of the viewport */
            background-color: #fff;
            padding: 2rem;
            border-radius: 0.75rem; /* Tailwind's rounded-lg */
            box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
            text-align: center;
        }
        h2 {
            font-size: 2.25rem; /* Tailwind's text-3xl */
            font-weight: 600;  /* Tailwind's font-semibold */
            margin-bottom: 1.5rem; /* Tailwind's mb-6 */
            color: #1e293b; /* Tailwind's gray-900 */
        }
        p{
            color: #6b7280; /* Tailwind's gray-500 */
            margin-bottom: 1rem;
        }

        /* --- Form Styles --- */
        form {
            margin-top: 1rem; /* Tailwind's mt-4 */
            margin-bottom: 1.5rem;
            display: flex;
            flex-direction: column;
            align-items: center; /* Center form elements */
            gap: 0.5rem; /* Tailwind's gap-2 */
        }
        label {
            font-size: 1rem; /* Tailwind's text-base */
            font-weight: 600;  /* Tailwind's font-semibold */
            color: #4b5563; /* Tailwind's gray-600 */
            margin-bottom: 0.25rem; /* Tailwind's mb-1 */
            display: block; /* Ensure label takes full width */
            text-align: left;
            width: 100%;
            max-width: 400px; /* Added max-width for label */
            margin-left: auto;
            margin-right: auto;
        }
        input[type="file"] {
            width: 100%;
            max-width: 400px; /* Added max-width for file input */
            padding: 0.75rem; /* Tailwind's p-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
            font-size: 1rem; /* Tailwind's text-base */
            margin-bottom: 0.25rem; /* Tailwind's mb-1 */
            margin-left: auto;
            margin-right: auto;
        }
        input[type="submit"] {
            padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            background-color: #4f46e5; /* Tailwind's bg-indigo-500 */
            color: #fff;
            font-size: 1rem; /* Tailwind's text-base */
            font-weight: 600; /* Tailwind's font-semibold */
            cursor: pointer;
            transition: background-color 0.3s ease; /* Smooth transition */
            border: none;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
            margin-top: 1rem;
        }
        input[type="submit"]:hover {
            background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
        }
        input[type="submit"]:focus {
            outline: none;
            box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
        }

        /* --- Result Styles --- */
        .result-container {
            margin-top: 2rem; /* Tailwind's mt-8 */
            padding: 1.5rem; /* Tailwind's p-6 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            background-color: #f8fafc; /* Tailwind's bg-gray-50 */
            border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
            box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
            text-align: left;
        }
        h3 {
            font-size: 1.5rem; /* Tailwind's text-2xl */
            font-weight: 600;  /* Tailwind's font-semibold */
            margin-bottom: 1rem; /* Tailwind's mb-4 */
            color: #1e293b; /* Tailwind's gray-900 */
        }
        textarea {
            width: 100%;
            padding: 0.75rem; /* Tailwind's p-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
            resize: none;
            font-size: 1rem; /* Tailwind's text-base */
            line-height: 1.5rem; /* Tailwind's leading-relaxed */
            margin-top: 0.5rem; /* Tailwind's mt-2 */
            margin-bottom: 0;
            box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
            min-height: 100px;
        }
        textarea:focus {
            outline: none;
            border-color: #3b82f6; /* Tailwind's border-blue-500 */
            box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
        }
        img {
            max-width: 100%;
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            margin-top: 1.5rem; /* Tailwind's mt-6 */
            box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
        }

        /* --- Error Styles --- */
        .error-message {
            color: #dc2626; /* Tailwind's text-red-600 */
            margin-top: 1rem; /* Tailwind's mt-4 */
            padding: 0.75rem;
            background-color: #fee2e2; /* Tailwind's bg-red-100 */
            border-radius: 0.375rem; /* Tailwind's rounded-md */
            border: 1px solid #fecaca; /* Tailwind's border-red-300 */
            text-align: center;
        }

        .prompt-select {
            margin-top: 1rem; /* Tailwind's mt-4 */
            display: flex;
            flex-direction: column;
            align-items: center;
            gap: 0.5rem;
            width: 100%;
        }

        .prompt-select label {
            font-size: 1rem;
            font-weight: 600;
            color: #4b5563;
            margin-bottom: 0.25rem;
            display: block; /* Ensure label takes full width */
            text-align: left;
            width: 100%;
            max-width: 400px; /* Added max-width for label */
            margin-left: auto;
            margin-right: auto;
        }

        .prompt-select select {
            width: 100%;
            max-width: 400px;
            padding: 0.75rem;
            border-radius: 0.5rem;
            border: 1px solid #d1d5db;
            font-size: 1rem;
            margin-bottom: 0.25rem;
            margin-left: auto;
            margin-right: auto;
            appearance: none;  /* Remove default arrow */
            background-image: url("data:image/svg+xml,%3Csvgxmlns='http://www.w3.org/2000/svg' viewBox='0 0 20 20' fill='none' stroke='currentColor' stroke-width='1.5' stroke-linecap='round' stroke-linejoin='round'%3E%3Cpath d='M6 9l4 4 4-4'%3E%3C/path%3E%3C/svg%3E"); /* Add custom arrow */
            background-repeat: no-repeat;
            background-position: right 0.75rem center;
            background-size: 1rem;
            padding-right: 2.5rem; /* Make space for the arrow */
        }

        .prompt-select select:focus {
            outline: none;
            border-color: #3b82f6;
            box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15);
        }


    </style>
</head>
<body>
    <div class="container">
        <h2>🎤🧠🎨 Multimodal Assistant</h2>
        <p> Upload an audio file to transcribe and generate a corresponding image. </p>
        <form method="POST" enctype="multipart/form-data">
            <label for="audio_file">Upload your voice note:</label><br>
            <input type="file" name="audio_file" accept="audio/*" required><br><br>

            <div class = "prompt-select">
                <label for="prompt_mode">Image Prompt Mode:</label>
                <select id="prompt_mode" name="prompt_mode">
                    <option value="detailed">Detailed Scene Description</option>
                    <option value="keywords">Keywords</option>
                    <option value="creative">Creative Interpretation</option>
                </select>
            </div>

            <input type="submit" value="Generate Visual Response">
        </form>

        {% if transcript %}
            <div class="result-container">
                <h3>📝 Transcript:</h3>
                <textarea readonly>{{ transcript }}</textarea>
            </div>
        {% endif %}

        {% if summary %}
            <div class="result-container">
                <h3>🔎 Summary:</h3>
                <p>{{ summary }}</p>
            </div>
        {% endif %}

        {% if prompt %}
            <div class="result-container">
                <h3>🎯 Scene Prompt:</h3>
                <p>{{ prompt }}</p>
            </div>
        {% endif %}

        {% if image_url %}
            <div class="result-container">
                <h3>🖼️ Generated Image:</h3>
                <img src="{{ image_url }}" alt="Generated image">
            </div>
        {% endif %}
        {% if error %}
            <div class="error-message">{{ error }}</div>
        {% endif %}
    </div>
</body>
</html>


6.2.3 What Makes This a Dashboard?

This layout combines several key elements that transform it from a simple interface into a comprehensive dashboard:

  • Multiple output zones (text, summary, prompt, image)The interface is divided into distinct sections, each dedicated to displaying different types of processed data. This organization allows users to easily track the progression from speech input to visual output.
  • Simple user interaction (one-click processing)Despite the complex processing happening behind the scenes, users only need to perform one action to initiate the entire workflow. This simplicity makes the tool accessible to users of all technical levels.
  • Clean, readable formattingThe interface uses consistent spacing, typography, and visual hierarchy to ensure information is easily digestible. Each section is clearly labeled and visually separated from others.
  • Visual feedback to reinforce model outputThe dashboard provides immediate visual confirmation at each step of the process, helping users understand how their input is being transformed across different AI models.
  • Reusable architecture, thanks to the utils/ structureThe modular design separates core functionality into utility functions, making the code easier to maintain and adapt for different use cases.

6.2.4 Use Case Ideas

This versatile dashboard has numerous potential applications. Let's explore some key use cases in detail:

  • content creator's AI toolkit (turn thoughts into blogs + visuals)
    • Record brainstorming sessions and convert them into structured blog posts
    • Generate matching illustrations for key concepts
    • Create social media content bundles with matching visuals
  • teacher's assistant (record voice ➝ summarize ➝ illustrate)
    • Transform lesson plans into visual learning materials
    • Create engaging educational content with matching illustrations
    • Generate visual aids for complex concepts
  • journaling tool (log voice entries ➝ summarize + visualize)
    • Convert daily voice memos into organized written entries
    • Create mood boards based on journal content
    • Track emotional patterns through visual representations

Summary

In this section, you elevated your multimodal assistant into a professional-grade dashboard. Here's what you accomplished:

  • Break down your logic into reusable utilities
    • Created modular, maintainable code structure
    • Implemented clean separation of concerns
  • Accept audio input and process it across models
    • Seamless integration of multiple AI technologies
    • Efficient processing pipeline
  • Present everything clearly in a cohesive UI
    • User-friendly interface design
    • Intuitive information hierarchy
  • Move from "demo" to tool
    • Production-ready implementation
    • Scalable architecture

This dashboard represents a professional-grade interface that delivers real value to users. With its robust architecture and intuitive design, it's ready to be transformed into a full-fledged product with minimal additional development.

6.2 Building a Creator Dashboard

This is where all the capabilities you've developed so far come together to create a powerful, unified system. By integrating multiple AI technologies, we can create applications that are greater than the sum of their parts. Let's explore these core capabilities in detail:

  • Transcription turns spoken words into written textUsing advanced speech recognition models like Whisper, we can accurately convert audio recordings into text, preserving the speaker's intent and context. This forms the foundation for further processing.
  • Content generation creates new, contextually relevant materialLarge language models can analyze the transcribed text and generate new content that maintains consistency with the original message while adding valuable insights or expanding on key points.
  • Prompt engineering crafts precise instructions for AI modelsThrough careful prompt construction, we can guide AI models to produce more accurate and relevant outputs. This involves understanding both the technical capabilities of the models and the nuanced ways to communicate with them.
  • Image creation transforms text descriptions into visual artModels like DALL·E can interpret textual descriptions and create corresponding images, adding a visual dimension to our applications and making abstract concepts more tangible.

These components don't just exist side by side - they form an interconnected pipeline where each step enhances the next. The output from transcription feeds into content generation, which informs prompt engineering, ultimately leading to image creation. This seamless integration creates a fluid workflow where users can start with a simple voice recording and end with a rich multimedia output, all within a single, cohesive system. By eliminating the need to switch between different tools or interfaces, users can focus on their creative process rather than technical implementation details.

6.2.1 What You'll Build

In this section, you'll design and implement a Creator Dashboard - a sophisticated web interface that transforms how creators work with AI. This comprehensive platform serves as a central hub for content creation, combining multiple AI technologies into one seamless experience. Let's explore the key features that make this dashboard powerful:

  • Upload a voice recordingCreators can easily upload audio files in various formats, making it simple to start their creative process with spoken ideas or narration.
  • Transcribe the voice recording into text using AIUsing advanced AI speech recognition technology, the system accurately converts spoken words into written text, maintaining the nuances and context of the original recording.
  • Turn that transcription into an editable promptThe system intelligently processes the transcribed text to create structured, AI-ready prompts that can be customized to achieve the desired creative output.
  • Generate images using DALL·E based on the promptLeveraging DALL·E's powerful image generation capabilities, the system creates visual representations that match the specified prompts, bringing ideas to life through AI-generated artwork.
  • Summarize the transcriptThe dashboard employs AI to distill long transcriptions into concise, meaningful summaries, helping creators quickly grasp the core concepts and themes.
  • Display all the results for review and further use in content productionAll generated content - from transcripts to images - is presented in an organized, easy-to-review format, allowing creators to efficiently manage and utilize their assets.

To build this robust system, you'll implement a modern tech stack using Flask for the backend operations and a clean, responsive combination of HTML and CSS for the frontend interface. This architecture ensures both modularity and maintainability, making it easy to update and scale the dashboard as needed.

6.2.2 Step-by-Step Implementation

Step 1: Project Setup

Download the audio sample: https://files.cuantum.tech/audio/dashboard-project.mp3

Create a new directory for your project and navigate into it:

mkdir creator_dashboard
cd creator_dashboard

It's recommended to set up a virtual environment:

python -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\\Scripts\\activate  # On Windows

Install the required Python packages:

pip install flask openai python-dotenv

Organize your project files as follows:

/creator_dashboard

├── app.py
├── .env
└── templates/
    └── dashboard.html
└── utils/
    ├── __init__.py
    ├── transcribe.py
    ├── summarize.py
    ├── generate_prompt.py
    └── generate_image.py
  • app.py: The main Flask application file.
  • .env: A file to store your OpenAI API key.
  • templates/: A directory for HTML templates.
  • templates/dashboard.html: The HTML template for the user interface.
  • utils/: A directory for Python modules containing reusable functions.
    • __init__.py: Makes the utils directory a Python package.
    • transcribe.py: Contains the function to transcribe audio using Whisper.
    • summarize.py: Contains the function to summarize the transcription using a Large Language Model.
    • generate_prompt.py: Contains the function to generate an image prompt from the summary using a Large Language Model.
    • generate_image.py: Contains the function to generate an image with DALL·E 3.

Step 2: Create the Utility Modules

Create the following Python files in the utils/ directory:

utils/transcribe.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def transcribe_audio(file_path: str) -> Optional[str]:
    """
    Transcribes an audio file using OpenAI's Whisper API.

    Args:
        file_path (str): The path to the audio file.

    Returns:
        Optional[str]: The transcribed text, or None on error.
    """
    try:
        logger.info(f"Transcribing audio: {file_path}")
        with open(file_path, "rb") as audio_file:
            response = openai.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
            )
        transcript = response.text
        logger.info(f"Transcription complete ({len(transcript)} characters)")
        return transcript
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error during transcription: {e}")
        return None
  • This module defines the transcribe_audio function, which takes the path to an audio file as input and uses OpenAI's Whisper API to generate a text transcription.
  • The function opens the audio file in binary read mode ("rb") using a context manager, so the file is closed automatically once transcription finishes.
  • It calls openai.audio.transcriptions.create() to perform the transcription, specifying the "whisper-1" model.
  • It extracts the transcribed text from the API response.
  • It includes error handling using a try...except block to catch openai.OpenAIError exceptions (specific to OpenAI) and a general Exception for other errors. If an error occurs, it logs the error and returns None.
  • It logs the file path before transcription and the length of the transcribed text after successful transcription.
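
Before wiring this into Flask, you may want to confirm the module works on its own. The script below is a minimal, hypothetical sanity check run from the project root; it assumes your .env file is already populated and that sample.mp3 is any local audio file you have on hand:

# check_transcribe.py - hypothetical standalone test for utils/transcribe.py
import os
import logging

import openai
from dotenv import load_dotenv

from utils.transcribe import transcribe_audio

logging.basicConfig(level=logging.INFO)

# Load the API key exactly as app.py will do later
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Replace "sample.mp3" with the path to any audio file you want to test
text = transcribe_audio("sample.mp3")
print(text if text else "Transcription failed - check the logs above.")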

utils/summarize.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def summarize_transcript(text: str) -> Optional[str]:
    """
    Summarizes a text transcript using OpenAI's Chat Completion API.

    Args:
        text (str): The text transcript to summarize.

    Returns:
        Optional[str]: The summarized text, or None on error.
    """
    try:
        logger.info("Summarizing transcript")
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You are a helpful assistant.  Provide a concise summary of the text, suitable for generating a visual representation."},
                {"role": "user", "content": text}
            ],
        )
        summary = response.choices[0].message.content
        logger.info(f"Summary: {summary}")
        return summary
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating summary: {e}")
        return None
  • This module defines the summarize_transcript function, which takes a text transcript as input and uses OpenAI's Chat Completion API to generate a concise summary.
  • The system message instructs the model to act as a helpful assistant and to provide a concise summary of the text, suitable for generating a visual representation.
  • The user message provides the transcript as the content for the model to summarize.
  • The function extracts the summary from the API response.
  • It includes error handling.
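
If you find the summaries running long, the Chat Completion endpoint also accepts a max_tokens cap. The variation below is a sketch rather than part of the project files; the function name summarize_transcript_capped and the 150-token default are illustrative choices:

# Optional variation of utils/summarize.py with an explicit length cap (illustrative)
import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def summarize_transcript_capped(text: str, max_tokens: int = 150) -> Optional[str]:
    """Like summarize_transcript, but bounds the summary length with max_tokens."""
    try:
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You are a helpful assistant. Provide a concise summary of the text, suitable for generating a visual representation."},
                {"role": "user", "content": text},
            ],
            max_tokens=max_tokens,  # hard upper bound on the length of the summary
        )
        return response.choices[0].message.content
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None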

utils/generate_prompt.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def create_image_prompt(transcription: str) -> Optional[str]:
    """
    Generates a detailed image prompt from a text transcription using OpenAI's Chat Completion API.

    Args:
        transcription (str): The text transcription of the audio.

    Returns:
        Optional[str]: A detailed text prompt suitable for image generation, or None on error.
    """
    try:
        logger.info("Generating image prompt from transcription")
        response = openai.chat.completions.create(
            model="gpt-4o",  # Use a powerful chat model
            messages=[
                {
                    "role": "system",
                    "content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content.  Do not include any phrases like 'based on the audio' or 'from the user audio'.  Incorporate scene lighting, time of day, weather, and camera angle into the description.  Limit the description to 200 words.",
                },
                {"role": "user", "content": transcription},
            ],
        )
        prompt = response.choices[0].message.content
        prompt = prompt.strip()  # Remove leading/trailing spaces
        logger.info(f"Generated prompt: {prompt}")
        return prompt
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image prompt: {e}")
        return None
  • This module defines the create_image_prompt function, which takes the transcribed text as input and uses OpenAI's Chat Completion API to generate a detailed text prompt for image generation.
  • The system message instructs the model to act as a creative assistant and to generate a vivid scene description. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle. We also limit the description length to 200 words.
  • The user message provides the transcribed text as the content for the model to work with.
  • The function extracts the generated prompt from the API response.
  • It strips any leading/trailing spaces from the generated prompt.
  • It includes error handling.
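
The dashboard template in Step 6 includes an "Image Prompt Mode" selector (detailed, keywords, creative), and one natural extension is to let that selection swap the system message before this call. The sketch below shows the idea; the create_image_prompt_with_mode function and the alternative prompt texts are assumptions and are not wired into app.py in this chapter:

# Sketch: varying the system prompt by mode (assumed extension, not part of the project files)
import openai
from typing import Optional

SYSTEM_PROMPTS = {
    "detailed": "You are a creative assistant. Describe a vivid, detailed scene, including lighting, time of day, weather, and camera angle. Limit the description to 200 words.",
    "keywords": "You are a creative assistant. Return a comma-separated list of 10-15 visual keywords that capture the scene described in the text.",
    "creative": "You are a creative assistant. Reimagine the text as a striking, artistic scene description suitable for image generation, in under 200 words.",
}


def create_image_prompt_with_mode(transcription: str, mode: str = "detailed") -> Optional[str]:
    """Generates an image prompt, choosing the system message by the selected mode."""
    system_prompt = SYSTEM_PROMPTS.get(mode, SYSTEM_PROMPTS["detailed"])  # fall back to "detailed"
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcription},
        ],
    )
    return response.choices[0].message.content.strip()

In app.py, the mode could be read with request.form.get('prompt_mode', 'detailed') and passed through to a function like this.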

utils/generate_image.py:

import openai
import logging
from typing import Optional, Dict

logger = logging.getLogger(__name__)


def generate_dalle_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024",
                       response_format: str = "url", quality: str = "standard") -> Optional[str]:
    """
    Generates an image using OpenAI's DALL·E API.

    Args:
        prompt (str): The text prompt to generate the image from.
        model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
        size (str, optional): The size of the generated image. Defaults to "1024x1024".
        response_format (str, optional): The format of the response. Defaults to "url".
        quality (str, optional): The quality of the image. Defaults to "standard".

    Returns:
        Optional[str]: The URL of the generated image, or None on error.
    """
    try:
        logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}, quality: {quality}")
        response = openai.images.generate(
            prompt=prompt,
            model=model,
            size=size,
            response_format=response_format,
            quality=quality
        )
        image_url = response.data[0].url
        logger.info(f"Image URL: {image_url}")
        return image_url
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image: {e}")
        return None
  • This module defines the generate_dalle_image function, which takes a text prompt as input and uses OpenAI's DALL·E API to generate an image.
  • It calls the openai.images.generate() method to generate the image.
  • It accepts optional model, size, response_format, and quality parameters, allowing the user to configure the image generation.
  • It extracts the URL of the generated image from the API response.
  • It includes error handling.
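
One practical note: the URLs returned by the Images API are temporary, so if you want to keep a generated image you should download it promptly. A small helper like the one below can do that; it uses the requests library (an extra dependency, installed with pip install requests), and the default file name is just an example:

# Optional helper: download a generated image before its temporary URL expires
import requests


def save_image_from_url(image_url: str, output_path: str = "generated_image.png") -> bool:
    """Downloads the image at image_url and writes the bytes to output_path."""
    response = requests.get(image_url, timeout=30)
    if response.status_code == 200:
        with open(output_path, "wb") as f:
            f.write(response.content)  # raw image bytes
        return True
    return False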

Step 5: Create the Main App (app.py)

Create a Python file named app.py in the root directory of your project and add the following code:

from flask import Flask, request, render_template, jsonify, make_response, redirect, url_for
import os
import openai
from dotenv import load_dotenv
import logging
from typing import Optional, Dict
from werkzeug.utils import secure_filename
from werkzeug.datastructures import FileStorage

# Import the utility functions from the utils directory
from utils.transcribe import transcribe_audio
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image
from utils.summarize import summarize_transcript

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads'  # Store uploaded files
app.config['MAX_CONTENT_LENGTH'] = 25 * 1024 * 1024  # 25MB max file size - increased for larger audio files
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)  # Create the upload folder if it doesn't exist

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'}  # Allowed audio file extensions


def allowed_file(filename: str) -> bool:
    """
    Checks if the uploaded file has an allowed extension.

    Args:
        filename (str): The name of the file.

    Returns:
        bool: True if the file has an allowed extension, False otherwise.
    """
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


@app.route("/", methods=["GET", "POST"])
def index():
    """
    Handles the main route for the web application.
    Processes audio uploads, transcribes them, generates image prompts, and displays images.
    """
    transcript = None
    image_url = None
    prompt_summary = None
    error_message = None
    summary = None # Initialize summary

    if request.method == "POST":
        if 'audio_file' not in request.files:
            error_message = "No file part"
            logger.warning(error_message)
            return render_template("index.html", error_message=error_message)

        file: FileStorage = request.files['audio_file']  # Use type hinting
        if file.filename == '':
            error_message = "No file selected"
            logger.warning(error_message)
            return render_template("dashboard.html", error=error_message)

        if file and allowed_file(file.filename):
            try:
                # Secure the filename and construct a safe path
                filename = secure_filename(file.filename)
                file_path = os.path.join(app.config['UPLOAD_FOLDER'], filename)
                file.save(file_path)  # Save the uploaded file

                transcript = transcribe_audio(file_path)  # Transcribe audio
                if not transcript:
                    error_message = "Audio transcription failed. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error_message=error_message)

                summary = summarize_transcript(transcript) # Summarize the transcript
                if not summary:
                    error_message = "Audio summary failed. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error_message=error_message)

                prompt_summary = create_image_prompt(transcript)  # Generate prompt
                if not prompt_summary:
                    error_message = "Failed to generate image prompt. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error_message=error_message)

                image_url = generate_dalle_image(prompt_summary, model=request.form.get('model', 'dall-e-3'),
                                                size=request.form.get('size', '1024x1024'),
                                                response_format=request.form.get('format', 'url'),
                                                quality=request.form.get('quality', 'standard'))  # Generate image
                if not image_url:
                    error_message = "Failed to generate image. Please try again."
                    os.remove(file_path)
                    return render_template("index.html", error_message=error_message)

                # Optionally, delete the uploaded file after processing
                os.remove(file_path)
                logger.info(f"Successfully processed audio file and generated image.")
                return render_template("index.html", transcript=transcript, image_url=image_url, prompt=prompt_summary, summary=summary)

            except Exception as e:
                error_message = f"An error occurred: {e}"
                logger.error(error_message)
                return render_template("index.html",  error_message=error_message)
        else:
            error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
            logger.warning(error_message)
            return render_template("dashboard.html", error=error_message)

    return render_template("index.html", transcript=transcript, image_url=image_url, prompt=prompt_summary,
                           error=error_message, summary=summary)



@app.errorhandler(500)
def internal_server_error(e):
    """Handles internal server errors."""
    logger.error(f"Internal Server Error: {e}")
    return render_template("error.html", error="Internal Server Error"), 500


if __name__ == "__main__":
    app.run(debug=True)

Code Breakdown:

  • Import Statements: Imports necessary Flask modules, the OpenAI library, os, dotenv, logging, Optional and Dict for type hinting, and secure_filename and FileStorage from Werkzeug, along with the utility functions from the utils package.
  • Environment Variables: Loads the OpenAI API key from the .env file.
  • Flask Application:
    • Creates a Flask application instance.
    • Configures an upload folder and maximum file size. The UPLOAD_FOLDER is set to 'uploads', and MAX_CONTENT_LENGTH is set to 25MB. The upload folder is created if it does not exist.
  • Logging Configuration: Configures logging.
  • allowed_file Function: Checks if the uploaded file has an allowed audio extension.
  • transcribe_audio Function (imported from utils/transcribe.py):
    • Takes the audio file path as input and opens the file in binary read mode ("rb").
    • Calls the OpenAI API's openai.audio.transcriptions.create() method with the "whisper-1" model to transcribe the audio.
    • Extracts the transcribed text from the API response, logging the file path before transcription and the length of the transcript after success.
    • Includes error handling for OpenAI API errors and other exceptions; the audio file is closed automatically after transcription.
  • summarize_transcript Function (imported from utils/summarize.py):
    • Takes the transcribed text as input and uses the Chat Completion API to produce a concise summary suitable for display on the dashboard.
    • Includes error handling.
  • create_image_prompt Function (imported from utils/generate_prompt.py):
    • Takes the transcribed text as input.
    • Uses the OpenAI Chat Completion API (openai.chat.completions.create()) with the gpt-4o model to generate a detailed text prompt suitable for image generation.
    • The system message instructs the model to act as a creative assistant and produce a vivid, detailed scene description. The system prompt is crucial in guiding the LLM: it focuses the model on visual elements such as lighting, time of day, weather, and camera angle, and limits the description to 200 words.
    • Extracts the generated prompt from the API response, strips any leading/trailing spaces, and includes error handling.
  • generate_dalle_image Function (imported from utils/generate_image.py):
    • Takes the image prompt as input.
    • Calls the OpenAI API's openai.images.generate() method to generate an image using DALL·E 3.
    • Accepts optional model, size, response_format, and quality parameters, allowing the user to configure the image generation.
    • Extracts the URL of the generated image from the API response and includes error handling.
  • index Route:
    • Handles both GET and POST requests.
    • For GET requests, it renders the initial HTML page.
    • For POST requests (when the user uploads an audio file):
      • It validates the uploaded file:
        • Checks if the file part exists in the request.
        • Checks if a file was selected.
        • Checks if the file type is allowed using the allowed_file function.
      • It saves the uploaded file to a temporary location using a secure filename.
      • It calls the utility functions to:
        • Transcribe the audio using transcribe_audio().
        • Summarize the transcript using summarize_transcript().
        • Generate an image prompt from the transcription using create_image_prompt().
        • Generate an image from the prompt using generate_dalle_image().
      • It handles errors that may occur during any of these steps, logging the error and rendering the dashboard.html template with an appropriate error message. The temporary file is deleted before rendering the error page.
      • If all steps are successful, it renders the dashboard.html template, passing the transcription text, summary, generated prompt, and image URL to be displayed.
  • @app.errorhandler(500): Handles HTTP 500 errors (Internal Server Error) by logging the error and rendering the dashboard template with a user-friendly error message.
  • if __name__ == "__main__":: Starts the Flask development server if the script is executed directly.

Step 6: Create the HTML Template (templates/dashboard.html)

Create a folder named templates in the same directory as app.py. Inside the templates folder, create a file named dashboard.html with the following HTML code:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Creator Dashboard</title>
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
    <style>
        /* --- General Styles --- */
        body {
            font-family: 'Inter', sans-serif;
            padding: 40px;
            background-color: #f9fafb; /* Tailwind's gray-50 */
            display: flex;
            justify-content: center;
            align-items: center;
            min-height: 100vh;
            margin: 0;
            color: #374151; /* Tailwind's gray-700 */
        }
        .container {
            max-width: 800px; /* Increased max-width */
            width: 95%; /* Take up most of the viewport */
            background-color: #fff;
            padding: 2rem;
            border-radius: 0.75rem; /* Tailwind's rounded-lg */
            box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
            text-align: center;
        }
        h2 {
            font-size: 2.25rem; /* Tailwind's text-3xl */
            font-weight: 600;  /* Tailwind's font-semibold */
            margin-bottom: 1.5rem; /* Tailwind's mb-6 */
            color: #1e293b; /* Tailwind's gray-900 */
        }
        p{
            color: #6b7280; /* Tailwind's gray-500 */
            margin-bottom: 1rem;
        }

        /* --- Form Styles --- */
        form {
            margin-top: 1rem; /* Tailwind's mt-4 */
            margin-bottom: 1.5rem;
            display: flex;
            flex-direction: column;
            align-items: center; /* Center form elements */
            gap: 0.5rem; /* Tailwind's gap-2 */
        }
        label {
            font-size: 1rem; /* Tailwind's text-base */
            font-weight: 600;  /* Tailwind's font-semibold */
            color: #4b5563; /* Tailwind's gray-600 */
            margin-bottom: 0.25rem; /* Tailwind's mb-1 */
            display: block; /* Ensure label takes full width */
            text-align: left;
            width: 100%;
            max-width: 400px; /* Added max-width for label */
            margin-left: auto;
            margin-right: auto;
        }
        input[type="file"] {
            width: 100%;
            max-width: 400px; /* Added max-width for file input */
            padding: 0.75rem; /* Tailwind's p-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
            font-size: 1rem; /* Tailwind's text-base */
            margin-bottom: 0.25rem; /* Tailwind's mb-1 */
            margin-left: auto;
            margin-right: auto;
        }
        input[type="submit"] {
            padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            background-color: #4f46e5; /* Tailwind's bg-indigo-500 */
            color: #fff;
            font-size: 1rem; /* Tailwind's text-base */
            font-weight: 600; /* Tailwind's font-semibold */
            cursor: pointer;
            transition: background-color 0.3s ease; /* Smooth transition */
            border: none;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
            margin-top: 1rem;
        }
        input[type="submit"]:hover {
            background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
        }
        input[type="submit"]:focus {
            outline: none;
            box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
        }

        /* --- Result Styles --- */
        .result-container {
            margin-top: 2rem; /* Tailwind's mt-8 */
            padding: 1.5rem; /* Tailwind's p-6 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            background-color: #f8fafc; /* Tailwind's bg-gray-50 */
            border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
            box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
            text-align: left;
        }
        h3 {
            font-size: 1.5rem; /* Tailwind's text-2xl */
            font-weight: 600;  /* Tailwind's font-semibold */
            margin-bottom: 1rem; /* Tailwind's mb-4 */
            color: #1e293b; /* Tailwind's gray-900 */
        }
        textarea {
            width: 100%;
            padding: 0.75rem; /* Tailwind's p-3 */
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
            resize: none;
            font-size: 1rem; /* Tailwind's text-base */
            line-height: 1.5rem; /* Tailwind's leading-relaxed */
            margin-top: 0.5rem; /* Tailwind's mt-2 */
            margin-bottom: 0;
            box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
            min-height: 100px;
        }
        textarea:focus {
            outline: none;
            border-color: #3b82f6; /* Tailwind's border-blue-500 */
            box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
        }
        img {
            max-width: 100%;
            border-radius: 0.5rem; /* Tailwind's rounded-md */
            margin-top: 1.5rem; /* Tailwind's mt-6 */
            box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
        }

        /* --- Error Styles --- */
        .error-message {
            color: #dc2626; /* Tailwind's text-red-600 */
            margin-top: 1rem; /* Tailwind's mt-4 */
            padding: 0.75rem;
            background-color: #fee2e2; /* Tailwind's bg-red-100 */
            border-radius: 0.375rem; /* Tailwind's rounded-md */
            border: 1px solid #fecaca; /* Tailwind's border-red-300 */
            text-align: center;
        }

        .prompt-select {
            margin-top: 1rem; /* Tailwind's mt-4 */
            display: flex;
            flex-direction: column;
            align-items: center;
            gap: 0.5rem;
            width: 100%;
        }

        .prompt-select label {
            font-size: 1rem;
            font-weight: 600;
            color: #4b5563;
            margin-bottom: 0.25rem;
            display: block; /* Ensure label takes full width */
            text-align: left;
            width: 100%;
            max-width: 400px; /* Added max-width for label */
            margin-left: auto;
            margin-right: auto;
        }

        .prompt-select select {
            width: 100%;
            max-width: 400px;
            padding: 0.75rem;
            border-radius: 0.5rem;
            border: 1px solid #d1d5db;
            font-size: 1rem;
            margin-bottom: 0.25rem;
            margin-left: auto;
            margin-right: auto;
            appearance: none;  /* Remove default arrow */
            background-image: url("data:image/svg+xml,%3Csvgxmlns='http://www.w3.org/2000/svg' viewBox='0 0 20 20' fill='none' stroke='currentColor' stroke-width='1.5' stroke-linecap='round' stroke-linejoin='round'%3E%3Cpath d='M6 9l4 4 4-4'%3E%3C/path%3E%3C/svg%3E"); /* Add custom arrow */
            background-repeat: no-repeat;
            background-position: right 0.75rem center;
            background-size: 1rem;
            padding-right: 2.5rem; /* Make space for the arrow */
        }

        .prompt-select select:focus {
            outline: none;
            border-color: #3b82f6;
            box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15);
        }


    </style>
</head>
<body>
    <div class="container">
        <h2>🎤🧠🎨 Multimodal Assistant</h2>
        <p> Upload an audio file to transcribe and generate a corresponding image. </p>
        <form method="POST" enctype="multipart/form-data">
            <label for="audio_file">Upload your voice note:</label><br>
            <input type="file" name="audio_file" accept="audio/*" required><br><br>

            <div class = "prompt-select">
                <label for="prompt_mode">Image Prompt Mode:</label>
                <select id="prompt_mode" name="prompt_mode">
                    <option value="detailed">Detailed Scene Description</option>
                    <option value="keywords">Keywords</option>
                    <option value="creative">Creative Interpretation</option>
                </select>
            </div>

            <input type="submit" value="Generate Visual Response">
        </form>

        {% if transcript %}
            <div class="result-container">
                <h3>📝 Transcript:</h3>
                <textarea readonly>{{ transcript }}</textarea>
            </div>
        {% endif %}

        {% if summary %}
            <div class="result-container">
                <h3>🔎 Summary:</h3>
                <p>{{ summary }}</p>
            </div>
        {% endif %}

        {% if prompt %}
            <div class="result-container">
                <h3>🎯 Scene Prompt:</h3>
                <p>{{ prompt }}</p>
            </div>
        {% endif %}

        {% if image_url %}
            <div class="result-container">
                <h3>🖼️ Generated Image:</h3>
                <img src="{{ image_url }}" alt="Generated image">
            </div>
        {% endif %}
        {% if error %}
            <div class="error-message">{{ error }}</div>
        {% endif %}
    </div>
</body>
</html>
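
With app.py and templates/dashboard.html in place, you can start the development server and exercise the full pipeline in your browser (5000 is Flask's default port):

python app.py
# Then open http://127.0.0.1:5000, upload an audio file, and click "Generate Visual Response".

Expect the first request to take a little while, since the route chains transcription, summarization, prompt generation, and image generation before the page re-renders.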


6.2.3 What Makes This a Dashboard?

This layout combines several key elements that transform it from a simple interface into a comprehensive dashboard:

  • Multiple output zones (text, summary, prompt, image): The interface is divided into distinct sections, each dedicated to displaying different types of processed data. This organization allows users to easily track the progression from speech input to visual output.
  • Simple user interaction (one-click processing): Despite the complex processing happening behind the scenes, users only need to perform one action to initiate the entire workflow. This simplicity makes the tool accessible to users of all technical levels.
  • Clean, readable formatting: The interface uses consistent spacing, typography, and visual hierarchy to ensure information is easily digestible. Each section is clearly labeled and visually separated from others.
  • Visual feedback to reinforce model output: The dashboard provides immediate visual confirmation at each step of the process, helping users understand how their input is being transformed across different AI models.
  • Reusable architecture, thanks to the utils/ structure: The modular design separates core functionality into utility functions, making the code easier to maintain and adapt for different use cases.

6.2.4 Use Case Ideas

This versatile dashboard has numerous potential applications. Let's explore some key use cases in detail:

  • Content creator's AI toolkit (turn thoughts into blogs + visuals)
    • Record brainstorming sessions and convert them into structured blog posts
    • Generate matching illustrations for key concepts
    • Create social media content bundles with matching visuals
  • Teacher's assistant (record voice ➝ summarize ➝ illustrate)
    • Transform lesson plans into visual learning materials
    • Create engaging educational content with matching illustrations
    • Generate visual aids for complex concepts
  • Journaling tool (log voice entries ➝ summarize + visualize)
    • Convert daily voice memos into organized written entries
    • Create mood boards based on journal content
    • Track emotional patterns through visual representations

Summary

In this section, you elevated your multimodal assistant into a professional-grade dashboard. Here's what you accomplished:

  • Broke down your logic into reusable utilities
    • Created modular, maintainable code structure
    • Implemented clean separation of concerns
  • Accepted audio input and processed it across models
    • Seamless integration of multiple AI technologies
    • Efficient processing pipeline
  • Presented everything clearly in a cohesive UI
    • User-friendly interface design
    • Intuitive information hierarchy
  • Move from "demo" to tool
    • Production-ready implementation
    • Scalable architecture

This dashboard represents a professional-grade interface that delivers real value to users. With its robust architecture and intuitive design, it's ready to be transformed into a full-fledged product with minimal additional development.