Chapter 5: Image and Audio Integration Projects
5.5 Basic Integration of Multiple Modalities
In this section, we'll explore how to combine multiple AI modalities - speech, language understanding, and image generation - into a single cohesive application. While previous sections focused on working with individual technologies, here we'll learn how these powerful tools can work together to create more sophisticated and engaging user experiences.
This integration represents a significant step forward in AI application development, moving beyond simple single-purpose tools to create systems that can process and respond across different forms of communication. By combining Whisper's audio transcription capabilities, GPT-4o's natural language understanding, and DALL·E's image generation, we can build applications that truly demonstrate the potential of modern AI technologies.
The project we'll build serves as an excellent introduction to multimodal AI integration, demonstrating how different AI models can be orchestrated to create a seamless experience. This approach opens up exciting possibilities for developers looking to create more natural and intuitive human-AI interactions.
5.5.1 What You'll Build
In this section, you'll create a sophisticated Flask-based web application that functions as a basic multimodal assistant. This application demonstrates the seamless integration of multiple AI technologies to create an interactive and intelligent user experience. The assistant is capable of:
- Accepting an audio message from the user through a clean web interface, supporting various audio formats
- Leveraging OpenAI's Whisper technology to accurately transcribe the audio message into text, handling different accents and languages
- Utilizing GPT-4o's advanced natural language processing capabilities to analyze the transcribed text, understanding context, intent, and key themes
- Employing DALL·E 3's sophisticated image generation abilities to create relevant, high-quality visuals based on the understood context
- Presenting a cohesive user experience by displaying both the original transcription and the AI-generated image on a single, well-designed webpage
This project serves as an excellent example of a multimodal assistant that seamlessly combines three different types of AI processing: audio processing, natural language understanding, and image generation. By processing audio input, extracting meaningful context, and creating visual representations, it demonstrates the potential of integrated AI technologies. This foundation opens up exciting possibilities for various real-world applications, such as:
- Spoken design prompts for visual creators - enabling artists and designers to verbally describe their vision and instantly see it rendered
- Audio journaling with illustrated output - transforming spoken diary entries into visual memories with matching AI-generated artwork
- Voice-controlled storytelling applications - creating interactive narratives where spoken words come to life through instant visual generation
5.5.2 Step-by-Step Implementation
Step 1: Install Required Packages
Download the sample audio: https://files.cuantum.tech/audio/audio-file-sample.mp3
Ensure you have the necessary Python libraries installed. Open your terminal and execute the following command:
pip install flask openai python-dotenv
This command installs:
- flask: A micro web framework for building the web application.
- openai: The OpenAI Python library for interacting with the Whisper, Chat Completion, and DALL·E 3 APIs.
- python-dotenv: A library to load environment variables from a .env file.
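With the packages installed, create the .env file referenced above. A minimal example is shown below; the value is a placeholder, so replace it with your own OpenAI API key and keep the file out of version control:

# .env (placeholder value; replace with your real key)
OPENAI_API_KEY=sk-your-api-key-here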
Step 2: Set Up Project Structure
Create the following folder structure for your project:
/multimodal_app
│
├── app.py
├── .env
└── templates/
└── index.html
- /multimodal_app: The root directory for your project.
- app.py: The Python file containing the Flask application code.
- .env: A file to store your OpenAI API key.
- templates/: A directory to store your HTML templates.
- templates/index.html: The HTML template for the main page of your application.
Step 3: Create the Flask App (app.py)
Create a Python file named app.py in the root directory of your project and add the following code:
from flask import Flask, request, render_template
import openai
import os
from dotenv import load_dotenv
import logging
from typing import Optional

# The openai>=1.0 library exposes a module-level client configured via openai.api_key.
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

app = Flask(__name__)

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'}  # Allowed audio file extensions


def allowed_file(filename: str) -> bool:
    """
    Checks if the uploaded file has an allowed extension.

    Args:
        filename (str): The name of the file.

    Returns:
        bool: True if the file has an allowed extension, False otherwise.
    """
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


def transcribe_audio(file_path: str) -> Optional[str]:
    """
    Transcribes an audio file using OpenAI's Whisper API.

    Args:
        file_path (str): The path to the audio file.

    Returns:
        Optional[str]: The transcribed text, or None on error.
    """
    try:
        logger.info(f"Transcribing audio file: {file_path}")
        with open(file_path, "rb") as audio_file:
            response = openai.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
            )
        transcript = response.text
        logger.info(f"Transcription successful. Length: {len(transcript)} characters.")
        return transcript
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error during transcription: {e}")
        return None


def generate_image_prompt(text: str) -> Optional[str]:
    """
    Generates a prompt for DALL·E 3 based on the transcribed text using GPT-4o.

    Args:
        text (str): The transcribed text.

    Returns:
        Optional[str]: The generated image prompt, or None on error.
    """
    try:
        logger.info("Generating image prompt using GPT-4o")
        response = openai.chat.completions.create(
            model="gpt-4o",  # You can also experiment with other chat models
            messages=[
                {
                    "role": "system",
                    "content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content. Do not include any phrases like 'based on the audio' or 'from the user audio'.",
                },
                {"role": "user", "content": text},
            ],
        )
        prompt = response.choices[0].message.content
        logger.info(f"Generated image prompt: {prompt}")
        return prompt
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image prompt: {e}")
        return None


def generate_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024", response_format: str = "url") -> Optional[str]:
    """
    Generates an image using OpenAI's DALL·E API.

    Args:
        prompt (str): The text prompt to generate the image from.
        model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
        size (str, optional): The size of the generated image. Defaults to "1024x1024".
        response_format (str, optional): The format of the response. Defaults to "url".

    Returns:
        Optional[str]: The URL of the generated image, or None on error.
    """
    try:
        logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}")
        response = openai.images.generate(
            prompt=prompt,
            model=model,
            size=size,
            response_format=response_format,
        )
        image_url = response.data[0].url
        logger.info(f"Image URL: {image_url}")
        return image_url
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image: {e}")
        return None


@app.route("/", methods=["GET", "POST"])
def index():
    """
    Handles the main route for the web application.
    Processes audio uploads, transcribes them, generates image prompts, and displays images.
    """
    transcript = None
    image_url = None
    prompt_summary = None
    error_message = None

    if request.method == "POST":
        if 'audio_file' not in request.files:
            error_message = "No file part"
            logger.warning(error_message)
            return render_template("index.html", error=error_message)

        file = request.files['audio_file']
        if file.filename == '':
            error_message = "No file selected"
            logger.warning(error_message)
            return render_template("index.html", error=error_message)

        if file and allowed_file(file.filename):
            try:
                # Save the uploaded file to a temporary location, keeping only its extension
                temp_file_path = os.path.join(app.root_path, "temp_audio." + file.filename.rsplit('.', 1)[1].lower())
                file.save(temp_file_path)

                transcript = transcribe_audio(temp_file_path)  # Transcribe audio
                if not transcript:
                    error_message = "Audio transcription failed. Please try again."
                    return render_template("index.html", error=error_message)

                prompt_summary = generate_image_prompt(transcript)  # Generate prompt
                if not prompt_summary:
                    error_message = "Failed to generate image prompt. Please try again."
                    return render_template("index.html", error=error_message)

                image_url = generate_image(prompt_summary)  # Generate image
                if not image_url:
                    error_message = "Failed to generate image. Please try again."
                    return render_template("index.html", error=error_message)

                # Delete the temporary file after processing
                os.remove(temp_file_path)
            except Exception as e:
                error_message = f"An error occurred: {e}"
                logger.error(error_message)
                return render_template("index.html", error=error_message)
        else:
            error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
            logger.warning(error_message)
            return render_template("index.html", error=error_message)

    return render_template("index.html", transcript=transcript, image_url=image_url, prompt_summary=prompt_summary, error=error_message)


@app.errorhandler(500)
def internal_server_error(e):
    """Handles internal server errors."""
    logger.error(f"Internal Server Error: {e}")
    return render_template("index.html", error="Internal Server Error"), 500


if __name__ == "__main__":
    app.run(debug=True)
Code Breakdown:
- Import Statements: Imports the necessary Flask modules, the OpenAI library, os, dotenv, logging, and Optional for type hinting.
- Environment Variables: Loads the OpenAI API key from the .env file.
- Flask Application: Creates a Flask application instance.
- Logging Configuration: Configures logging.
- allowed_file Function: Checks if the uploaded file has an allowed audio extension.
- transcribe_audio Function:
  - Takes the audio file path as input.
  - Opens the audio file in binary mode ("rb").
  - Calls openai.audio.transcriptions.create() with the whisper-1 model to transcribe the audio.
  - Extracts the transcribed text from the response.
  - Logs the file path before transcription and the length of the transcribed text after successful transcription.
  - Includes error handling for OpenAI API errors and other exceptions.
- generate_image_prompt Function:
  - Takes the transcribed text as input.
  - Uses the Chat Completions API (openai.chat.completions.create()) with the gpt-4o model to generate a text prompt suitable for image generation.
  - The system message instructs the model to act as a creative assistant and provide a vivid description of a scene based on the audio.
  - Extracts the generated prompt from the API response.
  - Includes error handling.
- generate_image Function:
  - Takes the image prompt as input.
  - Calls openai.images.generate() to generate an image using DALL·E 3.
  - Extracts the image URL from the API response.
  - Includes error handling.
- index Route:
  - Handles both GET and POST requests.
  - For GET requests, it renders the initial HTML page.
  - For POST requests (when the user uploads an audio file):
    - It validates the uploaded file.
    - It saves the uploaded file temporarily.
    - It calls transcribe_audio() to transcribe the audio.
    - It calls generate_image_prompt() to generate an image prompt from the transcription.
    - It calls generate_image() to generate an image from the prompt.
    - It renders the index.html template, passing the transcription text, the generated prompt, and the image URL.
  - Includes comprehensive error handling to catch potential issues during file upload, transcription, prompt generation, and image generation.
- @app.errorhandler(500): Handles HTTP 500 errors (Internal Server Error) by logging the error and re-rendering the main page with an error message.
- if __name__ == "__main__": Starts the Flask development server if the script is executed directly.
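One practical note on the temporary file handling: the fixed temp_audio.<extension> filename is fine for a single-user demo, but two simultaneous uploads would overwrite each other. A minimal sketch of a safer alternative using Python's tempfile module is shown below; the helper name save_upload_to_temp is illustrative and not part of the app above:

# Sketch: save an uploaded Werkzeug FileStorage object to a uniquely named temp file.
# The helper name is illustrative; the tutorial app uses a fixed filename instead.
import os
import tempfile

def save_upload_to_temp(file_storage, extension: str) -> str:
    """Write the upload to a unique temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix="." + extension)
    with os.fdopen(fd, "wb") as tmp:
        file_storage.save(tmp)  # FileStorage.save also accepts an open file object
    return path

# Usage inside the POST handler (the caller removes the file with os.remove afterwards):
# temp_file_path = save_upload_to_temp(file, file.filename.rsplit('.', 1)[1].lower())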
Step 4: Create HTML Template (templates/index.html)
Create a folder named templates in the same directory as app.py. Inside the templates folder, create a file named index.html with the following HTML code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Multimodal Assistant</title>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
<style>
/* --- General Styles --- */
body {
font-family: 'Inter', sans-serif;
padding: 40px;
background-color: #f9fafb; /* Tailwind's gray-50 */
display: flex;
justify-content: center;
align-items: center;
min-height: 100vh;
margin: 0;
color: #374151; /* Tailwind's gray-700 */
}
.container {
max-width: 800px; /* Increased max-width */
width: 95%; /* Take up most of the viewport */
background-color: #fff;
padding: 2rem;
border-radius: 0.75rem; /* Tailwind's rounded-lg */
box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
text-align: center;
}
h2 {
font-size: 2.25rem; /* Tailwind's text-3xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1.5rem; /* Tailwind's mb-6 */
color: #1e293b; /* Tailwind's gray-900 */
}
p{
color: #6b7280; /* Tailwind's gray-500 */
margin-bottom: 1rem;
}
/* --- Form Styles --- */
form {
margin-top: 1rem; /* Tailwind's mt-4 */
display: flex;
flex-direction: column;
align-items: center; /* Center form elements */
gap: 0.5rem; /* Tailwind's gap-2 */
}
label {
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
color: #4b5563; /* Tailwind's gray-600 */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
display: block; /* Ensure label takes full width */
text-align: left;
width: 100%;
max-width: 400px; /* Added max-width for label */
margin-left: auto;
margin-right: auto;
}
input[type="file"] {
width: 100%;
max-width: 400px; /* Added max-width for file input */
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
font-size: 1rem; /* Tailwind's text-base */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
margin-left: auto;
margin-right: auto;
}
input[type="submit"] {
padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #4f46e5; /* Tailwind's bg-indigo-500 */
color: #fff;
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
cursor: pointer;
transition: background-color 0.3s ease; /* Smooth transition */
border: none;
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
margin-top: 1rem;
}
input[type="submit"]:hover {
background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
}
input[type="submit"]:focus {
outline: none;
box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
}
/* --- Result Styles --- */
.result-container {
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
}
.transcript-container{
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
}
h3 {
font-size: 1.5rem; /* Tailwind's text-2xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1rem; /* Tailwind's mb-4 */
color: #1e293b; /* Tailwind's gray-900 */
}
textarea {
width: 100%;
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
resize: none;
font-size: 1rem; /* Tailwind's text-base */
line-height: 1.5rem; /* Tailwind's leading-relaxed */
margin-top: 0.5rem; /* Tailwind's mt-2 */
margin-bottom: 0;
box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
}
textarea:focus {
outline: none;
border-color: #3b82f6; /* Tailwind's border-blue-500 */
box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
}
img {
max-width: 100%;
border-radius: 0.5rem; /* Tailwind's rounded-md */
margin-top: 1.5rem; /* Tailwind's mt-6 */
box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
}
/* --- Error Styles --- */
.error-message {
color: #dc2626; /* Tailwind's text-red-600 */
margin-top: 1rem; /* Tailwind's mt-4 */
padding: 0.75rem;
background-color: #fee2e2; /* Tailwind's bg-red-100 */
border-radius: 0.375rem; /* Tailwind's rounded-md */
border: 1px solid #fecaca; /* Tailwind's border-red-300 */
text-align: center;
}
</style>
</head>
<body>
<div class="container">
<h2>🎤🧠🎨 Multimodal Assistant</h2>
<p> Upload an audio file to transcribe and generate a corresponding image. </p>
<form method="POST" enctype="multipart/form-data">
<label for="audio_file">Upload your voice note:</label><br>
<input type="file" name="audio_file" accept="audio/*" required><br><br>
<input type="submit" value="Generate Visual Response">
</form>
{% if transcript %}
<div class = "transcript-container">
<h3>📝 Transcript:</h3>
<textarea readonly>{{ transcript }}</textarea>
</div>
{% endif %}
{% if prompt_summary %}
<div class = "result-container">
<h3>🎯 Prompt Used for Image:</h3>
<p>{{ prompt_summary }}</p>
</div>
{% endif %}
{% if image_url %}
<div class = "result-container">
<h3>🖼️ Generated Image:</h3>
<img src="{{ image_url }}" alt="Generated image">
</div>
{% endif %}
{% if error %}
<div class="error-message">{{ error }}</div>
{% endif %}
</div>
</body>
</html>
Key elements in the HTML template:
- HTML Structure:
  - The <head> section defines the title, links a CSS stylesheet, and sets the viewport for responsiveness.
  - The <body> contains the visible content, including a form for uploading audio and sections to display the transcription and generated image.
- CSS Styling:
  - Modern Design: The CSS uses a modern design, similar to Tailwind CSS.
  - Responsive Layout: The layout adapts well, especially to smaller screens.
  - User Experience: Improved form and input styling.
  - Clear Error Display: Error messages are styled to be clearly visible.
- Form:
  - A <form> with enctype="multipart/form-data" is used to handle file uploads.
  - A <label> and <input type="file"> allow the user to select an audio file. The accept="audio/*" attribute restricts uploads to audio files.
  - An <input type="submit"> button allows the user to submit the form.
- Transcription and Image Display:
  - The template uses Jinja2 templating to conditionally display the transcription text and the generated image if they are available. The transcription is displayed in a textarea, and the image is displayed using an <img> tag.
- Error Handling:
  - A <div class="error-message"> is used to display any error messages to the user.
Try It Out
- Save the files as app.py and templates/index.html.
- Ensure you have your OpenAI API key in the .env file.
- Run the application: python app.py
- Open http://localhost:5000 in your browser.
- Upload an audio file (you can use the provided sample .mp3 file).
- View the transcription and the generated image on the page.
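If you prefer to exercise the endpoint from a script rather than the browser, a small sanity check along the following lines should work. It assumes the requests package is installed (pip install requests), the Flask server is running locally on port 5000, and a sample.mp3 file exists in the current directory:

# Sketch: post an audio file to the running app and check the response.
# Assumes: pip install requests, app running at localhost:5000, sample.mp3 present.
import requests

with open("sample.mp3", "rb") as f:
    resp = requests.post("http://localhost:5000/", files={"audio_file": f})

print(resp.status_code)                 # 200 on success
print("Generated Image" in resp.text)   # True if the rendered page includes the image section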
5.5.3 How This Works (Behind the Scenes)
This demonstrates multimodal orchestration at work - a sophisticated process where different AI models collaborate through your application's logic layer. Each model specializes in a different form of data processing (audio, text, and image), and together they create a seamless experience. The application coordinates these models, handling the data transformation between each step and ensuring proper communication flow.
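To see the orchestration without the web layer, the same three calls can be chained in a short standalone script. The sketch below reuses the functions defined in app.py; it assumes that file is importable from the current directory and that a local voice_note.mp3 recording exists:

# Sketch: the Whisper -> GPT-4o -> DALL·E pipeline, run outside Flask.
# Assumes app.py is importable and voice_note.mp3 exists in the current directory.
from app import transcribe_audio, generate_image_prompt, generate_image

transcript = transcribe_audio("voice_note.mp3")        # audio -> text
if transcript:
    prompt = generate_image_prompt(transcript)         # text -> visual description
    if prompt:
        image_url = generate_image(prompt)             # description -> image URL
        print(transcript, prompt, image_url, sep="\n\n")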
Example Flow: A Detailed Walkthrough
To better understand how this multimodal system works, let's walk through a complete example. When a user uploads a voice note saying:
"I had the most peaceful morning — sitting by a lake with birds singing and the sun rising behind the trees."
The system processes this input through three distinct stages:
- Audio Transcription: First, Whisper converts the audio to text, maintaining accuracy even with background noise or accent variations. Result: "I had the most peaceful morning…"
- Scene Analysis and Enhancement: GPT-4o analyzes the transcribed text, identifying key visual elements and spatial relationships to create an optimized image prompt. Result: "A sunrise over a tranquil lake, with birds in the sky and trees reflecting in the water"
- Visual Creation: DALL·E takes this refined prompt and generates a photorealistic image, carefully balancing all the described elements into a cohesive scene
The Power of Integration
In this final section of the chapter, you created a multimodal mini-assistant that demonstrates the seamless integration of three distinct AI capabilities:
- Whisper: Advanced speech recognition that handles various accents, languages, and audio qualities with remarkable accuracy
- GPT-4o: Sophisticated language processing that understands context, emotion, and scene composition to create detailed image descriptions
- DALL·E: State-of-the-art image generation that translates text descriptions into vivid, coherent visual scenes
This integration showcases the future of AI applications where multiple models work in concert to process, understand, and respond to user input in rich, meaningful ways. By orchestrating these models together, you've created an intuitive interface that bridges the gap between human communication and AI capabilities.
5.5 Basic Integration of Multiple Modalities
In this section, we'll explore how to combine multiple AI modalities - speech, language understanding, and image generation - into a single cohesive application. While previous sections focused on working with individual technologies, here we'll learn how these powerful tools can work together to create more sophisticated and engaging user experiences.
This integration represents a significant step forward in AI application development, moving beyond simple single-purpose tools to create systems that can process and respond across different forms of communication. By combining Whisper's audio transcription capabilities, GPT-4o's natural language understanding, and DALL·E's image generation, we can build applications that truly demonstrate the potential of modern AI technologies.
The project we'll build serves as an excellent introduction to multimodal AI integration, demonstrating how different AI models can be orchestrated to create a seamless experience. This approach opens up exciting possibilities for developers looking to create more natural and intuitive human-AI interactions.
5.5.1 What You'll Build
In this section, you'll create a sophisticated Flask-based web application that functions as a basic multimodal assistant. This application demonstrates the seamless integration of multiple AI technologies to create an interactive and intelligent user experience. The assistant is capable of:
- Accepting an audio message from the user through a clean web interface, supporting various audio formats
- Leveraging OpenAI's Whisper technology to accurately transcribe the audio message into text, handling different accents and languages
- Utilizing GPT-4o's advanced natural language processing capabilities to analyze the transcribed text, understanding context, intent, and key themes
- Employing DALL·E 3's sophisticated image generation abilities to create relevant, high-quality visuals based on the understood context
- Presenting a cohesive user experience by displaying both the original transcription and the AI-generated image on a single, well-designed webpage
This project serves as an excellent example of a multimodal assistant that seamlessly combines three different types of AI processing: audio processing, natural language understanding, and image generation. By processing audio input, extracting meaningful context, and creating visual representations, it demonstrates the potential of integrated AI technologies. This foundation opens up exciting possibilities for various real-world applications, such as:
- Spoken design prompts for visual creators - enabling artists and designers to verbally describe their vision and instantly see it rendered
- Audio journaling with illustrated output - transforming spoken diary entries into visual memories with matching AI-generated artwork
- Voice-controlled storytelling applications - creating interactive narratives where spoken words come to life through instant visual generation
5.5.2 Step-by-Step Implementation
Step 1: Install Required Packages
Download the sample audio: https://files.cuantum.tech/audio/audio-file-sample.mp3
Ensure you have the necessary Python libraries installed. Open your terminal and execute the following command:
pip install flask openai python-dotenv
This command installs:
flask
: A micro web framework for building the web application.openai
: The OpenAI Python library for interacting with the Whisper, Chat Completion, and DALL·E 3 APIs.python-dotenv
: A library to load environment variables from a.env
file.
Step 2: Set Up Project Structure
Create the following folder structure for your project:
/multimodal_app
│
├── app.py
├── .env
└── templates/
└── index.html
/multimodal_app
: The root directory for your project.app.py
: The Python file containing the Flask application code..env
: A file to store your OpenAI API key.templates/
: A directory to store your HTML templates.templates/index.html
: The HTML template for the main page of your application.
Step 3: Create the Flask App (app.py)
Create a Python file named app.py
in the root directory of your project and add the following code:
from flask import Flask, request, render_template, jsonify, make_response
import openai
import os
from dotenv import load_dotenv
import logging
from typing import Optional, Dict
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
app = Flask(__name__)
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'} # Allowed audio file extensions
def allowed_file(filename: str) -> bool:
"""
Checks if the uploaded file has an allowed extension.
Args:
filename (str): The name of the file.
Returns:
bool: True if the file has an allowed extension, False otherwise.
"""
return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
def transcribe_audio(file_path: str) -> Optional[str]:
"""
Transcribes an audio file using OpenAI's Whisper API.
Args:
file_path (str): The path to the audio file.
Returns:
Optional[str]: The transcribed text, or None on error.
"""
try:
logger.info(f"Transcribing audio file: {file_path}")
audio_file = open(file_path, "rb")
response = openai.Audio.transcriptions.create(
model="whisper-1",
file=audio_file,
)
transcript = response.text
logger.info(f"Transcription successful. Length: {len(transcript)} characters.")
return transcript
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error during transcription: {e}")
return None
def generate_image_prompt(text: str) -> Optional[str]:
"""
Generates a prompt for DALL·E 3 based on the transcribed text using GPT-4o.
Args:
text (str): The transcribed text.
Returns:
Optional[str]: The generated image prompt, or None on error.
"""
try:
logger.info("Generating image prompt using GPT-4o")
response = openai.chat.completions.create(
model="gpt-4o", # You can also experiment with other chat models
messages=[
{
"role": "system",
"content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content. Do not include any phrases like 'based on the audio' or 'from the user audio'.",
},
{"role": "user", "content": text},
],
)
prompt = response.choices[0].message.content
logger.info(f"Generated image prompt: {prompt}")
return prompt
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error generating image prompt: {e}")
return None
def generate_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024", response_format: str = "url") -> Optional[str]:
"""
Generates an image using OpenAI's DALL·E API.
Args:
prompt (str): The text prompt to generate the image from.
model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
size (str, optional): The size of the generated image. Defaults to "1024x1024".
response_format (str, optional): The format of the response. Defaults to "url".
Returns:
Optional[str]: The URL of the generated image, or None on error.
"""
try:
logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}")
response = openai.Image.create(
prompt=prompt,
model=model,
size=size,
response_format=response_format,
)
image_url = response.data[0].url
logger.info(f"Image URL: {image_url}")
return image_url
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error generating image: {e}")
return None
@app.route("/", methods=["GET", "POST"])
def index():
"""
Handles the main route for the web application.
Processes audio uploads, transcribes them, generates image prompts, and displays images.
"""
transcript = None
image_url = None
prompt_summary = None
error_message = None
if request.method == "POST":
if 'audio_file' not in request.files:
error_message = "No file part"
logger.warning(error_message)
return render_template("index.html", error=error_message)
file = request.files['audio_file']
if file.filename == '':
error_message = "No file selected"
logger.warning(error_message)
return render_template("index.html", error=error_message)
if file and allowed_file(file.filename):
try:
# Securely save the uploaded file to a temporary location
temp_file_path = os.path.join(app.root_path, "temp_audio." + file.filename.rsplit('.', 1)[1].lower())
file.save(temp_file_path)
transcript = transcribe_audio(temp_file_path) # Transcribe audio
if not transcript:
error_message = "Audio transcription failed. Please try again."
return render_template("index.html", error=error_message)
prompt_summary = generate_image_prompt(transcript) # Generate prompt
if not prompt_summary:
error_message = "Failed to generate image prompt. Please try again."
return render_template("index.html", error=error_message)
image_url = generate_image(prompt_summary) # Generate image
if not image_url:
error_message = "Failed to generate image. Please try again."
return render_template("index.html", error=error_message)
# Optionally, delete the temporary file after processing
os.remove(temp_file_path)
except Exception as e:
error_message = f"An error occurred: {e}"
logger.error(error_message)
return render_template("index.html", error=error_message)
else:
error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
logger.warning(error_message)
return render_template("index.html", error=error_message)
return render_template("index.html", transcript=transcript, image_url=image_url, prompt_summary=prompt_summary, error=error_message)
@app.errorhandler(500)
def internal_server_error(e):
"""Handles internal server errors."""
logger.error(f"Internal Server Error: {e}")
return render_template("error.html", error="Internal Server Error"), 500
if __name__ == "__main__":
app.run(debug=True)
Code Breakdown:
- Import Statements: Imports the necessary Flask modules, OpenAI library,
os
,dotenv
,logging
, andOptional
andDict
for type hinting. - Environment Variables: Loads the OpenAI API key from the
.env
file. - Flask Application: Creates a Flask application instance.
- Logging Configuration: Configures logging.
allowed_file
Function: Checks if the uploaded file has an allowed audio extension.transcribe_audio
Function:- Takes the audio file path as input.
- Opens the audio file in binary mode (
"rb"
). - Calls the OpenAI API's
openai.Audio.transcribe()
method to transcribe the audio. - Extracts the transcribed text from the response.
- Logs the file path before transcription and the length of the transcribed text after successful transcription.
- Includes error handling for OpenAI API errors and other exceptions.
generate_image_prompt
Function:- Takes the transcribed text as input.
- Uses the OpenAI Chat Completion API (
openai.chat.completions.create()
) with thegpt-4o
model to generate a text prompt suitable for image generation. - The system message instructs the model to act as a creative assistant and provide a vivid description of a scene based on the audio.
- Extracts the generated prompt from the API response.
- Includes error handling.
generate_image
Function:- Takes the image prompt as input.
- Calls the OpenAI API's
openai.Image.create()
method to generate an image using DALL·E 3. - Extracts the image URL from the API response.
- Includes error handling.
index
Route:- Handles both GET and POST requests.
- For GET requests, it renders the initial HTML page.
- For POST requests (when the user uploads an audio file):
- It validates the uploaded file.
- It saves the uploaded file temporarily.
- It calls
transcribe_audio()
to transcribe the audio. - It calls
generate_image_prompt()
to generate an image prompt from the transcription. - It calls
generate_image()
to generate an image from the prompt. - It renders the
index.html
template, passing the transcription text and the image URL.
- Includes comprehensive error handling to catch potential issues during file upload, transcription, prompt generation, and image generation.
@app.errorhandler(500)
: Handles HTTP 500 errors (Internal Server Error) by logging the error and rendering a user-friendly error page.if __name__ == "__main__":
: Starts the Flask development server if the script is executed directly.
Step 4: Create HTML Template (templates/index.html)
Create a folder named templates
in the same directory as app.py
. Inside the templates
folder, create a file named index.html
with the following HTML code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Multimodal Assistant</title>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
<style>
/* --- General Styles --- */
body {
font-family: 'Inter', sans-serif;
padding: 40px;
background-color: #f9fafb; /* Tailwind's gray-50 */
display: flex;
justify-content: center;
align-items: center;
min-height: 100vh;
margin: 0;
color: #374151; /* Tailwind's gray-700 */
}
.container {
max-width: 800px; /* Increased max-width */
width: 95%; /* Take up most of the viewport */
background-color: #fff;
padding: 2rem;
border-radius: 0.75rem; /* Tailwind's rounded-lg */
box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
text-align: center;
}
h2 {
font-size: 2.25rem; /* Tailwind's text-3xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1.5rem; /* Tailwind's mb-6 */
color: #1e293b; /* Tailwind's gray-900 */
}
p{
color: #6b7280; /* Tailwind's gray-500 */
margin-bottom: 1rem;
}
/* --- Form Styles --- */
form {
margin-top: 1rem; /* Tailwind's mt-4 */
display: flex;
flex-direction: column;
align-items: center; /* Center form elements */
gap: 0.5rem; /* Tailwind's gap-2 */
}
label {
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
color: #4b5563; /* Tailwind's gray-600 */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
display: block; /* Ensure label takes full width */
text-align: left;
width: 100%;
max-width: 400px; /* Added max-width for label */
margin-left: auto;
margin-right: auto;
}
input[type="file"] {
width: 100%;
max-width: 400px; /* Added max-width for file input */
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
font-size: 1rem; /* Tailwind's text-base */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
margin-left: auto;
margin-right: auto;
}
input[type="submit"] {
padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #4f46e5; /* Tailwind's bg-indigo-500 */
color: #fff;
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
cursor: pointer;
transition: background-color 0.3s ease; /* Smooth transition */
border: none;
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
margin-top: 1rem;
}
input[type="submit"]:hover {
background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
}
input[type="submit"]:focus {
outline: none;
box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
}
/* --- Result Styles --- */
.result-container {
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
}
.transcript-container{
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
}
h3 {
font-size: 1.5rem; /* Tailwind's text-2xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1rem; /* Tailwind's mb-4 */
color: #1e293b; /* Tailwind's gray-900 */
}
textarea {
width: 100%;
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
resize: none;
font-size: 1rem; /* Tailwind's text-base */
line-height: 1.5rem; /* Tailwind's leading-relaxed */
margin-top: 0.5rem; /* Tailwind's mt-2 */
margin-bottom: 0;
box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
}
textarea:focus {
outline: none;
border-color: #3b82f6; /* Tailwind's border-blue-500 */
box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
}
img {
max-width: 100%;
border-radius: 0.5rem; /* Tailwind's rounded-md */
margin-top: 1.5rem; /* Tailwind's mt-6 */
box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
}
/* --- Error Styles --- */
.error-message {
color: #dc2626; /* Tailwind's text-red-600 */
margin-top: 1rem; /* Tailwind's mt-4 */
padding: 0.75rem;
background-color: #fee2e2; /* Tailwind's bg-red-100 */
border-radius: 0.375rem; /* Tailwind's rounded-md */
border: 1px solid #fecaca; /* Tailwind's border-red-300 */
text-align: center;
}
</style>
</head>
<body>
<div class="container">
<h2>🎤🧠🎨 Multimodal Assistant</h2>
<p> Upload an audio file to transcribe and generate a corresponding image. </p>
<form method="POST" enctype="multipart/form-data">
<label for="audio_file">Upload your voice note:</label><br>
<input type="file" name="audio_file" accept="audio/*" required><br><br>
<input type="submit" value="Generate Visual Response">
</form>
{% if transcript %}
<div class = "transcript-container">
<h3>📝 Transcript:</h3>
<textarea readonly>{{ transcript }}</textarea>
</div>
{% endif %}
{% if prompt_summary %}
<div class = "result-container">
<h3>🎯 Prompt Used for Image:</h3>
<p>{{ prompt_summary }}</p>
</div>
{% endif %}
{% if image_url %}
<div class = "result-container">
<h3>🖼️ Generated Image:</h3>
<img src="{{ image_url }}" alt="Generated image">
</div>
{% endif %}
{% if error %}
<div class="error-message">{{ error }}</div>
{% endif %}
</div>
</body>
</html>
Key elements in the HTML template:
- HTML Structure:
- The
<head>
section defines the title, links a CSS stylesheet, and sets the viewport for responsiveness. - The
<body>
contains the visible content, including a form for uploading audio and sections to display the transcription and generated image.
- The
- CSS Styling:
- Modern Design: The CSS is updated to use a modern design, similar to Tailwind CSS.
- Responsive Layout: The layout is more responsive, especially for smaller screens.
- User Experience: Improved form and input styling.
- Clear Error Display: Error messages are styled to be clearly visible.
- Form:
- A
<form>
withenctype="multipart/form-data"
is used to handle file uploads. - A
<label>
and<input type="file">
allow the user to select an audio file. Theaccept="audio/*"
attribute restricts the user to uploading audio files. - A
<input type="submit">
button allows the user to submit the form.
- A
- Transcription and Image Display:
- The template uses Jinja2 templating to conditionally display the transcription text and the generated image if they are available. The transcription is displayed in a
textarea
, and the image is displayed using an<img>
tag.
- The template uses Jinja2 templating to conditionally display the transcription text and the generated image if they are available. The transcription is displayed in a
- Error Handling:
- A
<div class="error-message">
is used to display any error messages to the user.
- A
Try It Out
- Save the files as
app.py
andtemplates/index.html
. - Ensure you have your OpenAI API key in the
.env
file. - Run the application:
python app.py
- Open
http://localhost:5000
in your browser. - Upload an audio file (you can use the provided sample .mp3 file).
- View the transcription and the generated image on the page.
5.5.3 How This Works (Behind the Scenes)
This demonstrates multimodal orchestration at work - a sophisticated process where different AI models collaborate through your application's logic layer. Each model specializes in a different form of data processing (audio, text, and image), and together they create a seamless experience. The application coordinates these models, handling the data transformation between each step and ensuring proper communication flow.
Example Flow: A Detailed Walkthrough
To better understand how this multimodal system works, let's walk through a complete example. When a user uploads a voice note saying:
"I had the most peaceful morning — sitting by a lake with birds singing and the sun rising behind the trees."
The system processes this input through three distinct stages:
- Audio Processing with Transcript: First, Whisper converts the audio to text, maintaining accuracy even with background noise or accent variations. Result: "I had the most peaceful morning…"
- Scene Analysis and Enhancement: GPT-4o analyzes the transcribed text, identifying key visual elements and spatial relationships to create an optimized image prompt. Result: "A sunrise over a tranquil lake, with birds in the sky and trees reflecting in the water"
- Visual Creation: DALL·E takes this refined prompt and generates a photorealistic image, carefully balancing all the described elements into a cohesive scene
The Power of Integration
In this final section of the chapter, you created a multimodal mini-assistant that demonstrates the seamless integration of three distinct AI capabilities:
- Whisper: Advanced speech recognition that handles various accents, languages, and audio qualities with remarkable accuracy
- GPT-4o: Sophisticated language processing that understands context, emotion, and scene composition to create detailed image descriptions
- DALL·E: State-of-the-art image generation that translates text descriptions into vivid, coherent visual scenes
This integration showcases the future of AI applications where multiple models work in concert to process, understand, and respond to user input in rich, meaningful ways. By orchestrating these models together, you've created an intuitive interface that bridges the gap between human communication and AI capabilities.
5.5 Basic Integration of Multiple Modalities
In this section, we'll explore how to combine multiple AI modalities - speech, language understanding, and image generation - into a single cohesive application. While previous sections focused on working with individual technologies, here we'll learn how these powerful tools can work together to create more sophisticated and engaging user experiences.
This integration represents a significant step forward in AI application development, moving beyond simple single-purpose tools to create systems that can process and respond across different forms of communication. By combining Whisper's audio transcription capabilities, GPT-4o's natural language understanding, and DALL·E's image generation, we can build applications that truly demonstrate the potential of modern AI technologies.
The project we'll build serves as an excellent introduction to multimodal AI integration, demonstrating how different AI models can be orchestrated to create a seamless experience. This approach opens up exciting possibilities for developers looking to create more natural and intuitive human-AI interactions.
5.5.1 What You'll Build
In this section, you'll create a sophisticated Flask-based web application that functions as a basic multimodal assistant. This application demonstrates the seamless integration of multiple AI technologies to create an interactive and intelligent user experience. The assistant is capable of:
- Accepting an audio message from the user through a clean web interface, supporting various audio formats
- Leveraging OpenAI's Whisper technology to accurately transcribe the audio message into text, handling different accents and languages
- Utilizing GPT-4o's advanced natural language processing capabilities to analyze the transcribed text, understanding context, intent, and key themes
- Employing DALL·E 3's sophisticated image generation abilities to create relevant, high-quality visuals based on the understood context
- Presenting a cohesive user experience by displaying both the original transcription and the AI-generated image on a single, well-designed webpage
This project serves as an excellent example of a multimodal assistant that seamlessly combines three different types of AI processing: audio processing, natural language understanding, and image generation. By processing audio input, extracting meaningful context, and creating visual representations, it demonstrates the potential of integrated AI technologies. This foundation opens up exciting possibilities for various real-world applications, such as:
- Spoken design prompts for visual creators - enabling artists and designers to verbally describe their vision and instantly see it rendered
- Audio journaling with illustrated output - transforming spoken diary entries into visual memories with matching AI-generated artwork
- Voice-controlled storytelling applications - creating interactive narratives where spoken words come to life through instant visual generation
5.5.2 Step-by-Step Implementation
Step 1: Install Required Packages
Download the sample audio: https://files.cuantum.tech/audio/audio-file-sample.mp3
Ensure you have the necessary Python libraries installed. Open your terminal and execute the following command:
pip install flask openai python-dotenv
This command installs:
flask
: A micro web framework for building the web application.openai
: The OpenAI Python library for interacting with the Whisper, Chat Completion, and DALL·E 3 APIs.python-dotenv
: A library to load environment variables from a.env
file.
Step 2: Set Up Project Structure
Create the following folder structure for your project:
/multimodal_app
│
├── app.py
├── .env
└── templates/
└── index.html
/multimodal_app
: The root directory for your project.app.py
: The Python file containing the Flask application code..env
: A file to store your OpenAI API key.templates/
: A directory to store your HTML templates.templates/index.html
: The HTML template for the main page of your application.
Step 3: Create the Flask App (app.py)
Create a Python file named app.py
in the root directory of your project and add the following code:
from flask import Flask, request, render_template, jsonify, make_response
import openai
import os
from dotenv import load_dotenv
import logging
from typing import Optional, Dict
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
app = Flask(__name__)
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'} # Allowed audio file extensions
def allowed_file(filename: str) -> bool:
"""
Checks if the uploaded file has an allowed extension.
Args:
filename (str): The name of the file.
Returns:
bool: True if the file has an allowed extension, False otherwise.
"""
return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
def transcribe_audio(file_path: str) -> Optional[str]:
"""
Transcribes an audio file using OpenAI's Whisper API.
Args:
file_path (str): The path to the audio file.
Returns:
Optional[str]: The transcribed text, or None on error.
"""
try:
logger.info(f"Transcribing audio file: {file_path}")
audio_file = open(file_path, "rb")
response = openai.Audio.transcriptions.create(
model="whisper-1",
file=audio_file,
)
transcript = response.text
logger.info(f"Transcription successful. Length: {len(transcript)} characters.")
return transcript
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error during transcription: {e}")
return None
def generate_image_prompt(text: str) -> Optional[str]:
"""
Generates a prompt for DALL·E 3 based on the transcribed text using GPT-4o.
Args:
text (str): The transcribed text.
Returns:
Optional[str]: The generated image prompt, or None on error.
"""
try:
logger.info("Generating image prompt using GPT-4o")
response = openai.chat.completions.create(
model="gpt-4o", # You can also experiment with other chat models
messages=[
{
"role": "system",
"content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content. Do not include any phrases like 'based on the audio' or 'from the user audio'.",
},
{"role": "user", "content": text},
],
)
prompt = response.choices[0].message.content
logger.info(f"Generated image prompt: {prompt}")
return prompt
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error generating image prompt: {e}")
return None
def generate_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024", response_format: str = "url") -> Optional[str]:
"""
Generates an image using OpenAI's DALL·E API.
Args:
prompt (str): The text prompt to generate the image from.
model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
size (str, optional): The size of the generated image. Defaults to "1024x1024".
response_format (str, optional): The format of the response. Defaults to "url".
Returns:
Optional[str]: The URL of the generated image, or None on error.
"""
try:
logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}")
response = openai.Image.create(
prompt=prompt,
model=model,
size=size,
response_format=response_format,
)
image_url = response.data[0].url
logger.info(f"Image URL: {image_url}")
return image_url
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error generating image: {e}")
return None
@app.route("/", methods=["GET", "POST"])
def index():
"""
Handles the main route for the web application.
Processes audio uploads, transcribes them, generates image prompts, and displays images.
"""
transcript = None
image_url = None
prompt_summary = None
error_message = None
if request.method == "POST":
if 'audio_file' not in request.files:
error_message = "No file part"
logger.warning(error_message)
return render_template("index.html", error=error_message)
file = request.files['audio_file']
if file.filename == '':
error_message = "No file selected"
logger.warning(error_message)
return render_template("index.html", error=error_message)
if file and allowed_file(file.filename):
try:
# Securely save the uploaded file to a temporary location
temp_file_path = os.path.join(app.root_path, "temp_audio." + file.filename.rsplit('.', 1)[1].lower())
file.save(temp_file_path)
transcript = transcribe_audio(temp_file_path) # Transcribe audio
if not transcript:
error_message = "Audio transcription failed. Please try again."
return render_template("index.html", error=error_message)
prompt_summary = generate_image_prompt(transcript) # Generate prompt
if not prompt_summary:
error_message = "Failed to generate image prompt. Please try again."
return render_template("index.html", error=error_message)
image_url = generate_image(prompt_summary) # Generate image
if not image_url:
error_message = "Failed to generate image. Please try again."
return render_template("index.html", error=error_message)
# Optionally, delete the temporary file after processing
os.remove(temp_file_path)
except Exception as e:
error_message = f"An error occurred: {e}"
logger.error(error_message)
return render_template("index.html", error=error_message)
else:
error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
logger.warning(error_message)
return render_template("index.html", error=error_message)
return render_template("index.html", transcript=transcript, image_url=image_url, prompt_summary=prompt_summary, error=error_message)
@app.errorhandler(500)
def internal_server_error(e):
"""Handles internal server errors."""
logger.error(f"Internal Server Error: {e}")
return render_template("error.html", error="Internal Server Error"), 500
if __name__ == "__main__":
app.run(debug=True)
Code Breakdown:
- Import Statements: Imports the necessary Flask modules, the OpenAI library, `os`, `dotenv`, `logging`, and `Optional` and `Dict` for type hinting.
- Environment Variables: Loads the OpenAI API key from the `.env` file.
- Flask Application: Creates a Flask application instance.
- Logging Configuration: Configures logging so each processing step is traceable.
- `allowed_file` Function: Checks whether the uploaded file has an allowed audio extension.
- `transcribe_audio` Function:
  - Takes the audio file path as input.
  - Opens the audio file in binary mode (`"rb"`).
  - Calls `openai.Audio.transcriptions.create()` with the `whisper-1` model to transcribe the audio (see the SDK version note after this breakdown).
  - Extracts the transcribed text from the response.
  - Logs the file path before transcription and the length of the transcribed text after a successful transcription.
  - Includes error handling for OpenAI API errors and other exceptions.
- `generate_image_prompt` Function:
  - Takes the transcribed text as input.
  - Uses the Chat Completions API (`openai.chat.completions.create()`) with the `gpt-4o` model to generate a text prompt suitable for image generation.
  - The system message instructs the model to act as a creative assistant and produce a vivid description of a scene based on the transcribed audio.
  - Extracts the generated prompt from the API response.
  - Includes error handling.
- `generate_image` Function:
  - Takes the image prompt as input.
  - Calls `openai.Image.create()` to generate an image with DALL·E 3.
  - Extracts the image URL from the API response.
  - Includes error handling.
- `index` Route:
  - Handles both GET and POST requests.
  - For GET requests, it renders the initial HTML page.
  - For POST requests (when the user uploads an audio file), it validates the uploaded file, saves it temporarily, calls `transcribe_audio()` to transcribe the audio, calls `generate_image_prompt()` to turn the transcription into an image prompt, calls `generate_image()` to generate the image, and finally renders the `index.html` template with the transcription, the prompt, and the image URL.
  - Includes error handling to catch problems during file upload, transcription, prompt generation, and image generation.
- `@app.errorhandler(500)`: Handles HTTP 500 errors (Internal Server Error) by logging the error and rendering a user-friendly error page.
- `if __name__ == "__main__":`: Starts the Flask development server when the script is executed directly.
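A note on SDK versions: the listing above mixes the module-level calls of the pre-1.0 `openai` package (`openai.Audio.transcriptions.create()`, `openai.Image.create()`, `openai.error.OpenAIError`) with the newer `openai.chat.completions.create()` style. If you have the 1.x SDK installed, the equivalent calls go through an explicit client object instead. The sketch below shows the two helpers most affected, under that assumption; it also closes the audio file with a context manager, which the original `transcribe_audio` does not.

```python
# Minimal sketch assuming the OpenAI Python SDK >= 1.0 is installed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_audio_v1(file_path: str) -> str:
    """Transcribe an audio file with Whisper through the 1.x client interface."""
    with open(file_path, "rb") as audio_file:  # context manager closes the handle
        result = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
    return result.text

def generate_image_v1(prompt: str) -> str:
    """Generate an image with DALL·E 3 through the 1.x client interface."""
    result = client.images.generate(model="dall-e-3", prompt=prompt, size="1024x1024")
    return result.data[0].url
```

A related cleanup applies to the temporary upload in the `index` route: the current code only reaches `os.remove(temp_file_path)` on the success path, so wrapping the three processing calls in `try`/`finally` (or using `tempfile.NamedTemporaryFile`) would guarantee the temporary audio file is deleted even when transcription or image generation fails.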
Step 4: Create HTML Template (templates/index.html)
Create a folder named `templates` in the same directory as `app.py`. Inside the `templates` folder, create a file named `index.html` with the following HTML code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Multimodal Assistant</title>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
<style>
/* --- General Styles --- */
body {
font-family: 'Inter', sans-serif;
padding: 40px;
background-color: #f9fafb; /* Tailwind's gray-50 */
display: flex;
justify-content: center;
align-items: center;
min-height: 100vh;
margin: 0;
color: #374151; /* Tailwind's gray-700 */
}
.container {
max-width: 800px; /* Increased max-width */
width: 95%; /* Take up most of the viewport */
background-color: #fff;
padding: 2rem;
border-radius: 0.75rem; /* Tailwind's rounded-lg */
box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
text-align: center;
}
h2 {
font-size: 2.25rem; /* Tailwind's text-3xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1.5rem; /* Tailwind's mb-6 */
color: #1e293b; /* Tailwind's gray-900 */
}
p{
color: #6b7280; /* Tailwind's gray-500 */
margin-bottom: 1rem;
}
/* --- Form Styles --- */
form {
margin-top: 1rem; /* Tailwind's mt-4 */
display: flex;
flex-direction: column;
align-items: center; /* Center form elements */
gap: 0.5rem; /* Tailwind's gap-2 */
}
label {
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
color: #4b5563; /* Tailwind's gray-600 */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
display: block; /* Ensure label takes full width */
text-align: left;
width: 100%;
max-width: 400px; /* Added max-width for label */
margin-left: auto;
margin-right: auto;
}
input[type="file"] {
width: 100%;
max-width: 400px; /* Added max-width for file input */
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
font-size: 1rem; /* Tailwind's text-base */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
margin-left: auto;
margin-right: auto;
}
input[type="submit"] {
padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #4f46e5; /* Tailwind's bg-indigo-500 */
color: #fff;
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
cursor: pointer;
transition: background-color 0.3s ease; /* Smooth transition */
border: none;
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
margin-top: 1rem;
}
input[type="submit"]:hover {
background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
}
input[type="submit"]:focus {
outline: none;
box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
}
/* --- Result Styles --- */
.result-container {
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
}
.transcript-container{
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
}
h3 {
font-size: 1.5rem; /* Tailwind's text-2xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1rem; /* Tailwind's mb-4 */
color: #1e293b; /* Tailwind's gray-900 */
}
textarea {
width: 100%;
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
resize: none;
font-size: 1rem; /* Tailwind's text-base */
line-height: 1.5rem; /* Tailwind's leading-relaxed */
margin-top: 0.5rem; /* Tailwind's mt-2 */
margin-bottom: 0;
box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
}
textarea:focus {
outline: none;
border-color: #3b82f6; /* Tailwind's border-blue-500 */
box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
}
img {
max-width: 100%;
border-radius: 0.5rem; /* Tailwind's rounded-md */
margin-top: 1.5rem; /* Tailwind's mt-6 */
box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
}
/* --- Error Styles --- */
.error-message {
color: #dc2626; /* Tailwind's text-red-600 */
margin-top: 1rem; /* Tailwind's mt-4 */
padding: 0.75rem;
background-color: #fee2e2; /* Tailwind's bg-red-100 */
border-radius: 0.375rem; /* Tailwind's rounded-md */
border: 1px solid #fecaca; /* Tailwind's border-red-300 */
text-align: center;
}
</style>
</head>
<body>
<div class="container">
<h2>🎤🧠🎨 Multimodal Assistant</h2>
<p> Upload an audio file to transcribe and generate a corresponding image. </p>
<form method="POST" enctype="multipart/form-data">
<label for="audio_file">Upload your voice note:</label><br>
<input type="file" name="audio_file" accept="audio/*" required><br><br>
<input type="submit" value="Generate Visual Response">
</form>
{% if transcript %}
<div class = "transcript-container">
<h3>📝 Transcript:</h3>
<textarea readonly>{{ transcript }}</textarea>
</div>
{% endif %}
{% if prompt_summary %}
<div class = "result-container">
<h3>🎯 Prompt Used for Image:</h3>
<p>{{ prompt_summary }}</p>
</div>
{% endif %}
{% if image_url %}
<div class = "result-container">
<h3>🖼️ Generated Image:</h3>
<img src="{{ image_url }}" alt="Generated image">
</div>
{% endif %}
{% if error %}
<div class="error-message">{{ error }}</div>
{% endif %}
</div>
</body>
</html>
Key elements in the HTML template:
- HTML Structure:
  - The `<head>` section defines the title, loads the Inter font stylesheet, and sets the viewport for responsiveness.
  - The `<body>` contains the visible content: a form for uploading audio and sections that display the transcription, the prompt used for the image, and the generated image.
- CSS Styling:
  - Modern design: the embedded CSS uses Tailwind-inspired spacing, colors, and shadows.
  - Responsive layout: the container adapts to smaller screens.
  - User experience: clear, consistent form and input styling.
  - Clear error display: error messages are styled to stand out.
- Form:
  - A `<form>` with `enctype="multipart/form-data"` handles the file upload.
  - A `<label>` and an `<input type="file">` let the user select an audio file; the `accept="audio/*"` attribute restricts the picker to audio files.
  - An `<input type="submit">` button submits the form.
- Transcription and Image Display:
  - Jinja2 templating conditionally shows the transcription, the prompt, and the generated image when they are available. The transcription is displayed in a `textarea`, and the image is displayed with an `<img>` tag.
- Error Handling:
  - A `<div class="error-message">` displays any error message returned by the server. (A minimal `error.html` for the 500 handler, which the Flask code references but the steps above do not create, is sketched after this list.)
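One loose end: the `@app.errorhandler(500)` handler in `app.py` renders `templates/error.html`, which the steps above never create. A minimal version in the same spirit as `index.html` might look like the sketch below; the exact markup is up to you.

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Error - Multimodal Assistant</title>
</head>
<body>
    <div class="container">
        <h2>⚠️ Something went wrong</h2>
        <p>{{ error }}</p>
        <p><a href="/">Back to the assistant</a></p>
    </div>
</body>
</html>
```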
Try It Out
- Save the files as `app.py` and `templates/index.html`.
- Ensure your OpenAI API key is in the `.env` file.
- Run the application: `python app.py`
- Open http://localhost:5000 in your browser.
- Upload an audio file (you can use the provided sample .mp3 file).
- View the transcription and the generated image on the page. (A script-based smoke test is sketched below if you prefer testing outside the browser.)
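If you would rather exercise the endpoint from a script than from the browser, a quick smoke test could look like the following. It assumes the `requests` package is installed (`pip install requests`) and that the sample MP3 sits in the working directory; neither is part of the project setup above.

```python
import requests

# Post the sample audio file to the running Flask app as a multipart form upload.
with open("audio-file-sample.mp3", "rb") as f:
    response = requests.post("http://localhost:5000/", files={"audio_file": f})

print("Status:", response.status_code)
# The route returns rendered HTML, so check for the generated-image section heading.
print("Image generated:", "Generated Image" in response.text)
```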
5.5.3 How This Works (Behind the Scenes)
This demonstrates multimodal orchestration at work - a sophisticated process where different AI models collaborate through your application's logic layer. Each model specializes in a different form of data processing (audio, text, and image), and together they create a seamless experience. The application coordinates these models, handling the data transformation between each step and ensuring proper communication flow.
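Stripped of Flask's request handling, the orchestration in this application reduces to a straight hand-off between the three helper functions defined in `app.py`. The sketch below is illustrative rather than part of the app itself:

```python
def run_pipeline(audio_path: str) -> dict:
    """Chain the three models: audio -> transcript -> prompt -> image URL."""
    transcript = transcribe_audio(audio_path)       # Whisper: speech to text
    if not transcript:
        return {"error": "transcription failed"}

    prompt = generate_image_prompt(transcript)      # GPT-4o: text to visual prompt
    if not prompt:
        return {"error": "prompt generation failed"}

    image_url = generate_image(prompt)              # DALL·E 3: prompt to image
    if not image_url:
        return {"error": "image generation failed"}

    return {"transcript": transcript, "prompt": prompt, "image_url": image_url}
```

Each stage consumes the previous stage's output, which is exactly what the walkthrough below traces with a concrete voice note.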
Example Flow: A Detailed Walkthrough
To better understand how this multimodal system works, let's walk through a complete example. When a user uploads a voice note saying:
"I had the most peaceful morning — sitting by a lake with birds singing and the sun rising behind the trees."
The system processes this input through three distinct stages:
- Audio Transcription: First, Whisper converts the audio to text, maintaining accuracy even with background noise or accent variations. Result: "I had the most peaceful morning…"
- Scene Analysis and Enhancement: GPT-4o analyzes the transcribed text, identifying key visual elements and spatial relationships to create an optimized image prompt. Result: "A sunrise over a tranquil lake, with birds in the sky and trees reflecting in the water"
- Visual Creation: DALL·E takes this refined prompt and generates a photorealistic image, carefully balancing all the described elements into a cohesive scene
The Power of Integration
In this final section of the chapter, you created a multimodal mini-assistant that demonstrates the seamless integration of three distinct AI capabilities:
- Whisper: Advanced speech recognition that handles various accents, languages, and audio qualities with remarkable accuracy
- GPT-4o: Sophisticated language processing that understands context, emotion, and scene composition to create detailed image descriptions
- DALL·E: State-of-the-art image generation that translates text descriptions into vivid, coherent visual scenes
This integration showcases the future of AI applications where multiple models work in concert to process, understand, and respond to user input in rich, meaningful ways. By orchestrating these models together, you've created an intuitive interface that bridges the gap between human communication and AI capabilities.