Chapter 6: Cross-Model AI Suites
6.2 Building a Creator Dashboard
This is where all the capabilities you've developed so far come together to create a powerful, unified system. By integrating multiple AI technologies, we can create applications that are greater than the sum of their parts. Let's explore these core capabilities in detail:
- Transcription turns spoken words into written text. Using advanced speech recognition models like Whisper, we can accurately convert audio recordings into text, preserving the speaker's intent and context. This forms the foundation for further processing.
- Content generation creates new, contextually relevant material. Large language models can analyze the transcribed text and generate new content that maintains consistency with the original message while adding valuable insights or expanding on key points.
- Prompt engineering crafts precise instructions for AI models. Through careful prompt construction, we can guide AI models to produce more accurate and relevant outputs. This involves understanding both the technical capabilities of the models and the nuanced ways to communicate with them.
- Image creation transforms text descriptions into visual art. Models like DALL·E can interpret textual descriptions and create corresponding images, adding a visual dimension to our applications and making abstract concepts more tangible.
These components don't just exist side by side - they form an interconnected pipeline where each step enhances the next. The output from transcription feeds into content generation, which informs prompt engineering, ultimately leading to image creation. This seamless integration creates a fluid workflow where users can start with a simple voice recording and end with a rich multimedia output, all within a single, cohesive system. By eliminating the need to switch between different tools or interfaces, users can focus on their creative process rather than technical implementation details.
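To make the pipeline concrete, here is a minimal sketch of how these stages chain together in code. It assumes the four helper functions you'll build later in this section (in the utils/ package) and omits error handling for brevity:

# pipeline_sketch.py - a minimal sketch of the transcription -> summary -> prompt -> image chain
from utils.transcribe import transcribe_audio
from utils.summarize import summarize_transcript
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image

def run_pipeline(audio_path: str) -> dict:
    """Chain all four stages for a single voice recording."""
    transcript = transcribe_audio(audio_path)      # speech to text
    summary = summarize_transcript(transcript)     # condense the transcript
    prompt = create_image_prompt(transcript)       # craft a visual prompt
    image_url = generate_dalle_image(prompt)       # text to image
    return {"transcript": transcript, "summary": summary,
            "prompt": prompt, "image_url": image_url}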
6.2.1 What You'll Build
In this section, you'll design and implement a Creator Dashboard - a sophisticated web interface that transforms how creators work with AI. This comprehensive platform serves as a central hub for content creation, combining multiple AI technologies into one seamless experience. Let's explore the key features that make this dashboard powerful:
- Upload a voice recording. Creators can easily upload audio files in various formats, making it simple to start their creative process with spoken ideas or narration.
- Transcribe the voice recording into text using AI. Using advanced speech recognition technology, the system accurately converts spoken words into written text, maintaining the nuances and context of the original recording.
- Turn that transcription into an editable prompt. The system intelligently processes the transcribed text to create structured, AI-ready prompts that can be customized to achieve the desired creative output.
- Generate images using DALL·E based on the prompt. Leveraging DALL·E's powerful image generation capabilities, the system creates visual representations that match the specified prompts, bringing ideas to life through AI-generated artwork.
- Summarize the transcript. The dashboard employs AI to distill long transcriptions into concise, meaningful summaries, helping creators quickly grasp the core concepts and themes.
- Display all the results for review and further use in content production. All generated content, from transcripts to images, is presented in an organized, easy-to-review format, allowing creators to efficiently manage and utilize their assets.
To build this robust system, you'll implement a modern tech stack using Flask for the backend operations and a clean, responsive combination of HTML and CSS for the frontend interface. This architecture ensures both modularity and maintainability, making it easy to update and scale the dashboard as needed.
6.2.2 Step-by-Step Implementation
Step 1: Project Setup
Download the audio sample: https://files.cuantum.tech/audio/dashboard-project.mp3
Create a new directory for your project and navigate into it:
mkdir creator_dashboard
cd creator_dashboard
It's recommended to set up a virtual environment:
python -m venv venv
source venv/bin/activate   # On macOS/Linux
venv\Scripts\activate      # On Windows
Install the required Python packages:
pip install flask openai python-dotenv
Organize your project files as follows:
/creator_dashboard
│
├── app.py
├── .env
├── templates/
│   └── dashboard.html
└── utils/
    ├── __init__.py
    ├── transcribe.py
    ├── summarize.py
    ├── generate_prompt.py
    └── generate_image.py
- app.py: The main Flask application file.
- .env: A file to store your OpenAI API key.
- templates/: A directory for HTML templates.
- templates/dashboard.html: The HTML template for the user interface.
- utils/: A directory for Python modules containing reusable functions.
- utils/__init__.py: Makes the utils directory a Python package.
- utils/transcribe.py: Contains the function to transcribe audio using Whisper.
- utils/summarize.py: Contains the function to summarize the transcription using a large language model.
- utils/generate_prompt.py: Contains the function to generate an image prompt from the transcription using a large language model.
- utils/generate_image.py: Contains the function to generate an image with DALL·E 3.
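The .env file needs only a single line; the value below is a placeholder for your actual OpenAI API key:

OPENAI_API_KEY=your-openai-api-key-here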
Step 2: Create the Utility Modules
Create the following Python files in the utils/ directory:
utils/transcribe.py:
import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def transcribe_audio(file_path: str) -> Optional[str]:
    """
    Transcribes an audio file using OpenAI's Whisper API.

    Args:
        file_path (str): The path to the audio file.

    Returns:
        Optional[str]: The transcribed text, or None on error.
    """
    try:
        logger.info(f"Transcribing audio: {file_path}")
        # Open the file in binary read mode; the with block closes it automatically.
        with open(file_path, "rb") as audio_file:
            response = openai.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
            )
        transcript = response.text
        logger.info(f"Transcription complete ({len(transcript)} characters)")
        return transcript
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error during transcription: {e}")
        return None
- This module defines the transcribe_audio function, which takes the path to an audio file as input and uses OpenAI's Whisper API to generate a text transcription.
- The function opens the audio file in binary read mode ("rb") inside a with block, so the file is closed automatically once transcription is complete.
- It calls openai.audio.transcriptions.create() to perform the transcription, specifying the "whisper-1" model.
- It extracts the transcribed text from the API response.
- It includes error handling using a try...except block to catch openai.OpenAIError exceptions (specific to OpenAI) and general Exception for other errors. If an error occurs, it logs the error and returns None.
- It logs the file path before transcription and the length of the transcribed text after successful transcription.
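If you want to verify this module in isolation before building the rest of the app, a small script like the one below works. This is a sketch: the file name sample.mp3 is a placeholder, and the API key is read from your shell environment.

# test_transcribe.py - a minimal, illustrative check of utils/transcribe.py
import logging
import os

import openai

from utils.transcribe import transcribe_audio

logging.basicConfig(level=logging.INFO)
openai.api_key = os.getenv("OPENAI_API_KEY")  # assumes the key is exported in your shell

text = transcribe_audio("sample.mp3")  # "sample.mp3" is a placeholder path
print(text if text else "Transcription failed.")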
utils/summarize.py:
import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def summarize_transcript(text: str) -> Optional[str]:
    """
    Summarizes a text transcript using OpenAI's Chat Completions API.

    Args:
        text (str): The text transcript to summarize.

    Returns:
        Optional[str]: The summarized text, or None on error.
    """
    try:
        logger.info("Summarizing transcript")
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You are a helpful assistant. Provide a concise summary of the text, suitable for generating a visual representation."},
                {"role": "user", "content": text}
            ],
        )
        summary = response.choices[0].message.content
        logger.info(f"Summary: {summary}")
        return summary
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating summary: {e}")
        return None
- This module defines the summarize_transcript function, which takes a text transcript as input and uses OpenAI's Chat Completions API to generate a concise summary.
- The system message instructs the model to act as a helpful assistant and provide a concise summary of the text, suitable for generating a visual representation.
- The user message provides the transcript as the content for the model to summarize.
- The function extracts the summary from the API response.
- It includes error handling.
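You can try the summarizer on its own the same way; the sample text below is purely illustrative.

# test_summarize.py - illustrative only; the sample transcript is made up
import os

import openai

from utils.summarize import summarize_transcript

openai.api_key = os.getenv("OPENAI_API_KEY")

sample = ("Today I recorded a walk along the harbor at sunset, with fishing "
          "boats returning and gulls circling the pier.")
print(summarize_transcript(sample))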
utils/generate_prompt.py:
import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def create_image_prompt(transcription: str) -> Optional[str]:
    """
    Generates a detailed image prompt from a text transcription using OpenAI's Chat Completions API.

    Args:
        transcription (str): The text transcription of the audio.

    Returns:
        Optional[str]: A detailed text prompt suitable for image generation, or None on error.
    """
    try:
        logger.info("Generating image prompt from transcription")
        response = openai.chat.completions.create(
            model="gpt-4o",  # Use a powerful chat model
            messages=[
                {
                    "role": "system",
                    "content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content. Do not include any phrases like 'based on the audio' or 'from the user audio'. Incorporate scene lighting, time of day, weather, and camera angle into the description. Limit the description to 200 words.",
                },
                {"role": "user", "content": transcription},
            ],
        )
        prompt = response.choices[0].message.content
        prompt = prompt.strip()  # Remove leading/trailing spaces
        logger.info(f"Generated prompt: {prompt}")
        return prompt
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image prompt: {e}")
        return None
- This module defines the create_image_prompt function, which takes the transcribed text as input and uses OpenAI's Chat Completions API to generate a detailed text prompt for image generation.
- The system message instructs the model to act as a creative assistant and generate a vivid scene description. This system prompt is crucial in guiding the LLM to produce a high-quality prompt: it focuses the model on visual elements, asks it to incorporate details like lighting, time of day, weather, and camera angle, and limits the description to 200 words.
- The user message provides the transcribed text as the content for the model to work with.
- The function extracts the generated prompt from the API response and strips any leading/trailing spaces.
- It includes error handling.
utils/generate_image.py:
import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def generate_dalle_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024",
                         response_format: str = "url", quality: str = "standard") -> Optional[str]:
    """
    Generates an image using OpenAI's DALL·E API.

    Args:
        prompt (str): The text prompt to generate the image from.
        model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
        size (str, optional): The size of the generated image. Defaults to "1024x1024".
        response_format (str, optional): The format of the response. Defaults to "url".
        quality (str, optional): The quality of the image. Defaults to "standard".

    Returns:
        Optional[str]: The URL of the generated image, or None on error.
    """
    try:
        logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, "
                    f"format: {response_format}, quality: {quality}")
        response = openai.images.generate(
            prompt=prompt,
            model=model,
            size=size,
            response_format=response_format,
            quality=quality
        )
        image_url = response.data[0].url
        logger.info(f"Image URL: {image_url}")
        return image_url
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image: {e}")
        return None
- This module defines the generate_dalle_image function, which takes a text prompt as input and uses OpenAI's DALL·E API to generate an image.
- It calls the openai.images.generate() method to generate the image.
- It accepts optional model, size, response_format, and quality parameters, allowing the caller to configure the image generation.
- It extracts the URL of the generated image from the API response.
- It includes error handling.
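With all four utilities written, you can smoke-test the whole chain from the command line before wiring up Flask. The sketch below assumes the Step 1 sample audio (dashboard-project.mp3) sits in the project root and that your API key is stored in .env; adjust the path as needed.

# smoke_test.py - a rough end-to-end check of the utility modules
import logging
import os

import openai
from dotenv import load_dotenv

from utils.transcribe import transcribe_audio
from utils.summarize import summarize_transcript
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image

logging.basicConfig(level=logging.INFO)
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

transcript = transcribe_audio("dashboard-project.mp3")  # file name is an assumption
if transcript:
    print("Summary:", summarize_transcript(transcript))
    prompt = create_image_prompt(transcript)
    print("Prompt:", prompt)
    if prompt:
        print("Image URL:", generate_dalle_image(prompt))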
Step 3: Create the Main App (app.py)
Create a Python file named app.py in the root directory of your project and add the following code:
from flask import Flask, request, render_template
import openai
import os
from dotenv import load_dotenv
import logging
from werkzeug.utils import secure_filename
from werkzeug.datastructures import FileStorage

# Import the utility functions from the utils directory
from utils.transcribe import transcribe_audio
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image
from utils.summarize import summarize_transcript

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads'  # Store uploaded files
app.config['MAX_CONTENT_LENGTH'] = 25 * 1024 * 1024  # 25MB max file size - increased for larger audio files
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)  # Create the upload folder if it doesn't exist

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'}  # Allowed audio file extensions


def allowed_file(filename: str) -> bool:
    """
    Checks if the uploaded file has an allowed extension.

    Args:
        filename (str): The name of the file.

    Returns:
        bool: True if the file has an allowed extension, False otherwise.
    """
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


@app.route("/", methods=["GET", "POST"])
def index():
    """
    Handles the main route for the web application.
    Processes audio uploads, transcribes them, generates image prompts, and displays images.
    """
    transcript = None
    image_url = None
    prompt_summary = None
    error_message = None
    summary = None  # Initialize summary

    if request.method == "POST":
        if 'audio_file' not in request.files:
            error_message = "No file part"
            logger.warning(error_message)
            return render_template("dashboard.html", error=error_message)

        file: FileStorage = request.files['audio_file']  # Use type hinting
        if file.filename == '':
            error_message = "No file selected"
            logger.warning(error_message)
            return render_template("dashboard.html", error=error_message)

        if file and allowed_file(file.filename):
            try:
                # Secure the filename and construct a safe path
                filename = secure_filename(file.filename)
                file_path = os.path.join(app.config['UPLOAD_FOLDER'], filename)
                file.save(file_path)  # Save the uploaded file

                transcript = transcribe_audio(file_path)  # Transcribe audio
                if not transcript:
                    error_message = "Audio transcription failed. Please try again."
                    os.remove(file_path)
                    return render_template("dashboard.html", error=error_message)

                summary = summarize_transcript(transcript)  # Summarize the transcript
                if not summary:
                    error_message = "Audio summary failed. Please try again."
                    os.remove(file_path)
                    return render_template("dashboard.html", error=error_message)

                prompt_summary = create_image_prompt(transcript)  # Generate prompt
                if not prompt_summary:
                    error_message = "Failed to generate image prompt. Please try again."
                    os.remove(file_path)
                    return render_template("dashboard.html", error=error_message)

                image_url = generate_dalle_image(prompt_summary,
                                                 model=request.form.get('model', 'dall-e-3'),
                                                 size=request.form.get('size', '1024x1024'),
                                                 response_format=request.form.get('format', 'url'),
                                                 quality=request.form.get('quality', 'standard'))  # Generate image
                if not image_url:
                    error_message = "Failed to generate image. Please try again."
                    os.remove(file_path)
                    return render_template("dashboard.html", error=error_message)

                # Optionally, delete the uploaded file after processing
                os.remove(file_path)
                logger.info("Successfully processed audio file and generated image.")
                return render_template("dashboard.html", transcript=transcript, image_url=image_url,
                                       prompt=prompt_summary, summary=summary)
            except Exception as e:
                error_message = f"An error occurred: {e}"
                logger.error(error_message)
                return render_template("dashboard.html", error=error_message)
        else:
            error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
            logger.warning(error_message)
            return render_template("dashboard.html", error=error_message)

    return render_template("dashboard.html", transcript=transcript, image_url=image_url,
                           prompt=prompt_summary, error=error_message, summary=summary)


@app.errorhandler(500)
def internal_server_error(e):
    """Handles internal server errors."""
    logger.error(f"Internal Server Error: {e}")
    return render_template("error.html", error="Internal Server Error"), 500


if __name__ == "__main__":
    app.run(debug=True)
Code Breakdown:
- Import Statements: Imports the Flask components used by the app (Flask, request, render_template), the OpenAI library, os, dotenv, logging, secure_filename and FileStorage from Werkzeug, and the four utility functions from the utils package.
- Environment Variables: Loads the OpenAI API key from the .env file and assigns it to openai.api_key.
- Flask Application:
  - Creates a Flask application instance.
  - Configures an upload folder and maximum file size. UPLOAD_FOLDER is set to 'uploads', and MAX_CONTENT_LENGTH is set to 25MB. The upload folder is created if it does not exist.
- Logging Configuration: Configures logging.
- allowed_file Function: Checks if the uploaded file has an allowed audio extension.
- Imported utility functions:
  - transcribe_audio(): opens the audio file in binary read mode ("rb"), calls openai.audio.transcriptions.create() with the "whisper-1" model, extracts the transcribed text, logs progress, and handles OpenAI and general exceptions.
  - summarize_transcript(): uses the Chat Completions API to produce a concise summary of the transcript.
  - create_image_prompt(): uses openai.chat.completions.create() with the gpt-4o model to generate a detailed, visually oriented scene description. The system message instructs the model to focus on visual elements, incorporate lighting, time of day, weather, and camera angle, and stay under 200 words; the result is stripped of leading/trailing spaces.
  - generate_dalle_image(): calls openai.images.generate() to create an image with DALL·E 3, accepting optional model, size, response_format, and quality parameters, and returns the URL of the generated image.
- index Route:
  - Handles both GET and POST requests.
  - For GET requests, it renders the initial HTML page.
  - For POST requests (when the user uploads an audio file):
    - It validates the uploaded file: checks that the file part exists in the request, that a file was selected, and that the file type is allowed using the allowed_file function.
    - It saves the uploaded file to the upload folder using a secure filename.
    - It calls the utility functions in sequence: transcribe_audio() to transcribe the audio, summarize_transcript() to summarize the transcript, create_image_prompt() to generate an image prompt from the transcription, and generate_dalle_image() to generate an image from the prompt.
    - It handles errors that may occur during any of these steps, logging the error and rendering the dashboard.html template with an appropriate error message. The uploaded file is deleted before rendering the error page.
    - If all steps are successful, it deletes the uploaded file and renders dashboard.html, passing the transcript, summary, generated prompt, and image URL to be displayed.
- @app.errorhandler(500): Handles HTTP 500 errors (Internal Server Error) by logging the error and rendering a user-friendly error page (you'll also need a simple templates/error.html for this handler to render).
- if __name__ == "__main__": Starts the Flask development server if the script is executed directly.
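To try the dashboard locally, start the Flask development server and open it in your browser (port 5000 is Flask's default):

python app.py

Then visit http://127.0.0.1:5000, upload an audio file, and wait for the transcript, summary, prompt, and image to appear.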
Step 4: Create the HTML Template (templates/dashboard.html)
Create a folder named templates in the same directory as app.py. Inside the templates folder, create a file named dashboard.html with the following HTML code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Creator Dashboard</title>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
<style>
/* --- General Styles --- */
body {
font-family: 'Inter', sans-serif;
padding: 40px;
background-color: #f9fafb; /* Tailwind's gray-50 */
display: flex;
justify-content: center;
align-items: center;
min-height: 100vh;
margin: 0;
color: #374151; /* Tailwind's gray-700 */
}
.container {
max-width: 800px; /* Increased max-width */
width: 95%; /* Take up most of the viewport */
background-color: #fff;
padding: 2rem;
border-radius: 0.75rem; /* Tailwind's rounded-lg */
box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
text-align: center;
}
h2 {
font-size: 2.25rem; /* Tailwind's text-3xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1.5rem; /* Tailwind's mb-6 */
color: #1e293b; /* Tailwind's gray-900 */
}
p{
color: #6b7280; /* Tailwind's gray-500 */
margin-bottom: 1rem;
}
/* --- Form Styles --- */
form {
margin-top: 1rem; /* Tailwind's mt-4 */
margin-bottom: 1.5rem;
display: flex;
flex-direction: column;
align-items: center; /* Center form elements */
gap: 0.5rem; /* Tailwind's gap-2 */
}
label {
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
color: #4b5563; /* Tailwind's gray-600 */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
display: block; /* Ensure label takes full width */
text-align: left;
width: 100%;
max-width: 400px; /* Added max-width for label */
margin-left: auto;
margin-right: auto;
}
input[type="file"] {
width: 100%;
max-width: 400px; /* Added max-width for file input */
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
font-size: 1rem; /* Tailwind's text-base */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
margin-left: auto;
margin-right: auto;
}
input[type="submit"] {
padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #4f46e5; /* Tailwind's bg-indigo-500 */
color: #fff;
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
cursor: pointer;
transition: background-color 0.3s ease; /* Smooth transition */
border: none;
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
margin-top: 1rem;
}
input[type="submit"]:hover {
background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
}
input[type="submit"]:focus {
outline: none;
box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
}
/* --- Result Styles --- */
.result-container {
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
text-align: left;
}
h3 {
font-size: 1.5rem; /* Tailwind's text-2xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1rem; /* Tailwind's mb-4 */
color: #1e293b; /* Tailwind's gray-900 */
}
textarea {
width: 100%;
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
resize: none;
font-size: 1rem; /* Tailwind's text-base */
line-height: 1.5rem; /* Tailwind's leading-relaxed */
margin-top: 0.5rem; /* Tailwind's mt-2 */
margin-bottom: 0;
box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
min-height: 100px;
}
textarea:focus {
outline: none;
border-color: #3b82f6; /* Tailwind's border-blue-500 */
box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
}
img {
max-width: 100%;
border-radius: 0.5rem; /* Tailwind's rounded-md */
margin-top: 1.5rem; /* Tailwind's mt-6 */
box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
}
/* --- Error Styles --- */
.error-message {
color: #dc2626; /* Tailwind's text-red-600 */
margin-top: 1rem; /* Tailwind's mt-4 */
padding: 0.75rem;
background-color: #fee2e2; /* Tailwind's bg-red-100 */
border-radius: 0.375rem; /* Tailwind's rounded-md */
border: 1px solid #fecaca; /* Tailwind's border-red-300 */
text-align: center;
}
.prompt-select {
margin-top: 1rem; /* Tailwind's mt-4 */
display: flex;
flex-direction: column;
align-items: center;
gap: 0.5rem;
width: 100%;
}
.prompt-select label {
font-size: 1rem;
font-weight: 600;
color: #4b5563;
margin-bottom: 0.25rem;
display: block; /* Ensure label takes full width */
text-align: left;
width: 100%;
max-width: 400px; /* Added max-width for label */
margin-left: auto;
margin-right: auto;
}
.prompt-select select {
width: 100%;
max-width: 400px;
padding: 0.75rem;
border-radius: 0.5rem;
border: 1px solid #d1d5db;
font-size: 1rem;
margin-bottom: 0.25rem;
margin-left: auto;
margin-right: auto;
appearance: none; /* Remove default arrow */
background-image: url("data:image/svg+xml,%3Csvgxmlns='http://www.w3.org/2000/svg' viewBox='0 0 20 20' fill='none' stroke='currentColor' stroke-width='1.5' stroke-linecap='round' stroke-linejoin='round'%3E%3Cpath d='M6 9l4 4 4-4'%3E%3C/path%3E%3C/svg%3E"); /* Add custom arrow */
background-repeat: no-repeat;
background-position: right 0.75rem center;
background-size: 1rem;
padding-right: 2.5rem; /* Make space for the arrow */
}
.prompt-select select:focus {
outline: none;
border-color: #3b82f6;
box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15);
}
</style>
</head>
<body>
<div class="container">
<h2>🎤🧠🎨 Multimodal Assistant</h2>
<p> Upload an audio file to transcribe and generate a corresponding image. </p>
<form method="POST" enctype="multipart/form-data">
<label for="audio_file">Upload your voice note:</label><br>
<input type="file" name="audio_file" accept="audio/*" required><br><br>
<div class = "prompt-select">
<label for="prompt_mode">Image Prompt Mode:</label>
<select id="prompt_mode" name="prompt_mode">
<option value="detailed">Detailed Scene Description</option>
<option value="keywords">Keywords</option>
<option value="creative">Creative Interpretation</option>
</select>
</div>
<input type="submit" value="Generate Visual Response">
</form>
{% if transcript %}
<div class="result-container">
<h3>📝 Transcript:</h3>
<textarea readonly>{{ transcript }}</textarea>
</div>
{% endif %}
{% if summary %}
<div class="result-container">
<h3>🔎 Summary:</h3>
<p>{{ summary }}</p>
</div>
{% endif %}
{% if prompt %}
<div class="result-container">
<h3>🎯 Scene Prompt:</h3>
<p>{{ prompt }}</p>
</div>
{% endif %}
{% if image_url %}
<div class="result-container">
<h3>🖼️ Generated Image:</h3>
<img src="{{ image_url }}" alt="Generated image">
</div>
{% endif %}
{% if error %}
<div class="error-message">{{ error }}</div>
{% endif %}
</div>
</body>
</html>
6.2.3 What Makes This a Dashboard?
This layout combines several key elements that transform it from a simple interface into a comprehensive dashboard:
- Multiple output zones (text, summary, prompt, image). The interface is divided into distinct sections, each dedicated to displaying a different type of processed data. This organization allows users to easily track the progression from speech input to visual output.
- Simple user interaction (one-click processing). Despite the complex processing happening behind the scenes, users only need to perform one action to initiate the entire workflow. This simplicity makes the tool accessible to users of all technical levels.
- Clean, readable formatting. The interface uses consistent spacing, typography, and visual hierarchy to ensure information is easily digestible. Each section is clearly labeled and visually separated from the others.
- Visual feedback to reinforce model output. The dashboard provides immediate visual confirmation at each step of the process, helping users understand how their input is being transformed across different AI models.
- Reusable architecture, thanks to the utils/ structure. The modular design separates core functionality into utility functions, making the code easier to maintain and adapt for different use cases.
6.2.4 Use Case Ideas
This versatile dashboard has numerous potential applications. Let's explore some key use cases in detail:
- A content creator's AI toolkit (turn thoughts into blogs + visuals)
- Record brainstorming sessions and convert them into structured blog posts
- Generate matching illustrations for key concepts
- Create social media content bundles with matching visuals
- A teacher's assistant (record voice ➝ summarize ➝ illustrate)
- Transform lesson plans into visual learning materials
- Create engaging educational content with matching illustrations
- Generate visual aids for complex concepts
- A journaling tool (log voice entries ➝ summarize + visualize)
- Convert daily voice memos into organized written entries
- Create mood boards based on journal content
- Track emotional patterns through visual representations
Summary
In this section, you elevated your multimodal assistant into a professional-grade dashboard. Here's what you accomplished:
- Broke down your logic into reusable utilities
- Created modular, maintainable code structure
- Implemented clean separation of concerns
- Accepted audio input and processed it across models
- Seamless integration of multiple AI technologies
- Efficient processing pipeline
- Presented everything clearly in a cohesive UI
- User-friendly interface design
- Intuitive information hierarchy
- Move from "demo" to tool
- Production-ready implementation
- Scalable architecture
This dashboard represents a professional-grade interface that delivers real value to users. With its robust architecture and intuitive design, it's ready to be transformed into a full-fledged product with minimal additional development.
6.2 Building a Creator Dashboard
This is where all the capabilities you've developed so far come together to create a powerful, unified system. By integrating multiple AI technologies, we can create applications that are greater than the sum of their parts. Let's explore these core capabilities in detail:
- Transcription turns spoken words into written textUsing advanced speech recognition models like Whisper, we can accurately convert audio recordings into text, preserving the speaker's intent and context. This forms the foundation for further processing.
- Content generation creates new, contextually relevant materialLarge language models can analyze the transcribed text and generate new content that maintains consistency with the original message while adding valuable insights or expanding on key points.
- Prompt engineering crafts precise instructions for AI modelsThrough careful prompt construction, we can guide AI models to produce more accurate and relevant outputs. This involves understanding both the technical capabilities of the models and the nuanced ways to communicate with them.
- Image creation transforms text descriptions into visual artModels like DALL·E can interpret textual descriptions and create corresponding images, adding a visual dimension to our applications and making abstract concepts more tangible.
These components don't just exist side by side - they form an interconnected pipeline where each step enhances the next. The output from transcription feeds into content generation, which informs prompt engineering, ultimately leading to image creation. This seamless integration creates a fluid workflow where users can start with a simple voice recording and end with a rich multimedia output, all within a single, cohesive system. By eliminating the need to switch between different tools or interfaces, users can focus on their creative process rather than technical implementation details.
6.2.1 What You'll Build
In this section, you'll design and implement a Creator Dashboard - a sophisticated web interface that transforms how creators work with AI. This comprehensive platform serves as a central hub for content creation, combining multiple AI technologies into one seamless experience. Let's explore the key features that make this dashboard powerful:
- Upload a voice recordingCreators can easily upload audio files in various formats, making it simple to start their creative process with spoken ideas or narration.
- Transcribe the voice recording into text using AIUsing advanced AI speech recognition technology, the system accurately converts spoken words into written text, maintaining the nuances and context of the original recording.
- Turn that transcription into an editable promptThe system intelligently processes the transcribed text to create structured, AI-ready prompts that can be customized to achieve the desired creative output.
- Generate images using DALL·E based on the promptLeveraging DALL·E's powerful image generation capabilities, the system creates visual representations that match the specified prompts, bringing ideas to life through AI-generated artwork.
- Summarize the transcriptThe dashboard employs AI to distill long transcriptions into concise, meaningful summaries, helping creators quickly grasp the core concepts and themes.
- Display all the results for review and further use in content productionAll generated content - from transcripts to images - is presented in an organized, easy-to-review format, allowing creators to efficiently manage and utilize their assets.
To build this robust system, you'll implement a modern tech stack using Flask for the backend operations and a clean, responsive combination of HTML and CSS for the frontend interface. This architecture ensures both modularity and maintainability, making it easy to update and scale the dashboard as needed.
6.2.2 Step-by-Step Implementation
Step 1: Project Setup
Download the audio sample: https://files.cuantum.tech/audio/dashboard-project.mp3
Create a new directory for your project and navigate into it:
mkdir creator_dashboard
cd creator_dashboard
It's recommended to set up a virtual environment:
python -m venv venv
source venv/bin/activate # On macOS/Linux
venv\\Scripts\\activate # On Windows
Install the required Python packages:
pip install flask openai python-dotenv
Organize your project files as follows:
/creator_dashboard
│
├── app.py
├── .env
└── templates/
└── dashboard.html
└── utils/
├── __init__.py
├── transcribe.py
├── summarize.py
├── generate_prompt.py
└── generate_image.py
app.py
: The main Flask application file..env
: A file to store your OpenAI API key.templates/
: A directory for HTML templates.templates/dashboard.html
: The HTML template for the user interface.utils/
: A directory for Python modules containing reusable functions.__init__.py
: Makes the utils directory a Python package.transcribe.py
: Contains the function to transcribe audio using Whisper.summarize.py
: Contains the function to summarize the transcription using a Large Language Model.generate_prompt.py
: Contains the function to generate an image prompt from the summary using a Large Language Model.generate_image.py
: Contains the function to generate an image with DALL·E 3.
Step 2: Create the Utility Modules
Create the following Python files in the utils/
directory:
utils/transcribe.py
:
import openai
import logging
from typing import Optional
logger = logging.getLogger(__name__)
def transcribe_audio(file_path: str) -> Optional[str]:
"""
Transcribes an audio file using OpenAI's Whisper API.
Args:
file_path (str): The path to the audio file.
Returns:
Optional[str]: The transcribed text, or None on error.
"""
try:
logger.info(f"Transcribing audio: {file_path}")
audio_file = open(file_path, "rb")
response = openai.Audio.transcriptions.create(
model="whisper-1",
file=audio_file,
)
transcript = response.text
audio_file.close()
return transcript
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error during transcription: {e}")
return None
- This module defines the
transcribe_audio
function, which takes the path to an audio file as input and uses OpenAI's Whisper API to generate a text transcription. - The function opens the audio file in binary read mode (
"rb"
). - It calls
openai.Audio.transcriptions.create()
to perform the transcription, specifying the "whisper-1" model. - It extracts the transcribed text from the API response.
- It includes error handling using a
try...except
block to catch potentialopenai.error.OpenAIError
exceptions (specific to OpenAI) and generalException
for other errors. If an error occurs, it logs the error and returnsNone
. - It logs the file path before transcription and the length of the transcribed text after successful transcription.
- The audio file is closed after transcription.
utils/summarize.py
:
import openai
import logging
from typing import Optional
logger = logging.getLogger(__name__)
def summarize_transcript(text: str) -> Optional[str]:
"""
Summarizes a text transcript using OpenAI's Chat Completion API.
Args:
text (str): The text transcript to summarize.
Returns:
Optional[str]: The summarized text, or None on error.
"""
try:
logger.info("Summarizing transcript")
response = openai.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system",
"content": "You are a helpful assistant. Provide a concise summary of the text, suitable for generating a visual representation."},
{"role": "user", "content": text}
],
)
summary = response.choices[0].message.content
logger.info(f"Summary: {summary}")
return summary
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error generating summary: {e}")
return None
- This module defines the
summarize_transcript
function, which takes a text transcript as input and uses OpenAI's Chat Completion API to generate a concise summary. - The system message instructs the model to act as a helpful assistant and to provide a concise summary of the text, suitable for generating a visual representation.
- The user message provides the transcript as the content for the model to summarize.
- The function extracts the summary from the API response.
- It includes error handling.
utils/generate_prompt.py
:
import openai
import logging
from typing import Optional
logger = logging.getLogger(__name__)
def create_image_prompt(transcription: str) -> Optional[str]:
"""
Generates a detailed image prompt from a text transcription using OpenAI's Chat Completion API.
Args:
transcription (str): The text transcription of the audio.
Returns:
Optional[str]: A detailed text prompt suitable for image generation, or None on error.
"""
try:
logger.info("Generating image prompt from transcription")
response = openai.chat.completions.create(
model="gpt-4o", # Use a powerful chat model
messages=[
{
"role": "system",
"content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content. Do not include any phrases like 'based on the audio' or 'from the user audio'. Incorporate scene lighting, time of day, weather, and camera angle into the description. Limit the description to 200 words.",
},
{"role": "user", "content": transcription},
],
)
prompt = response.choices[0].message.content
prompt = prompt.strip() # Remove leading/trailing spaces
logger.info(f"Generated prompt: {prompt}")
return prompt
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error generating image prompt: {e}")
return None
- This module defines the
create_image_prompt
function, which takes the transcribed text as input and uses OpenAI's Chat Completion API to generate a detailed text prompt for image generation. - The system message instructs the model to act as a creative assistant and to generate a vivid scene description. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle. We also limit the description length to 200 words.
- The user message provides the transcribed text as the content for the model to work with.
- The function extracts the generated prompt from the API response.
- It strips any leading/trailing spaces from the generated prompt.
- It includes error handling.
utils/generate_image.py
:
import openai
import logging
from typing import Optional, Dict
logger = logging.getLogger(__name__)
def generate_dalle_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024",
response_format: str = "url", quality: str = "standard") -> Optional[str]:
"""
Generates an image using OpenAI's DALL·E API.
Args:
prompt (str): The text prompt to generate the image from.
model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
size (str, optional): The size of the generated image. Defaults to "1024x1024".
response_format (str, optional): The format of the response. Defaults to "url".
quality (str, optional): The quality of the image. Defaults to "standard".
Returns:
Optional[str]: The URL of the generated image, or None on error.
"""
try:
logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}, quality: {quality}")
response = openai.images.generate(
prompt=prompt,
model=model,
size=size,
response_format=response_format,
quality=quality
)
image_url = response.data[0].url
logger.info(f"Image URL: {image_url}")
return image_url
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error generating image: {e}")
return None
- This module defines the
generate_dalle_image
function, which takes a text prompt as input and uses OpenAI's DALL·E API to generate an image. - It calls the
openai.images.generate()
method to generate the image. - It accepts optional
model
,size
,response_format
, andquality
parameters, allowing the user to configure the image generation. - It extracts the URL of the generated image from the API response.
- It includes error handling.
Step 5: Create the Main App (app.py)
Create a Python file named app.py
in the root directory of your project and add the following code:
from flask import Flask, request, render_template, jsonify, make_response, redirect, url_for
import os
from dotenv import load_dotenv
import logging
from typing import Optional, Dict
from werkzeug.utils import secure_filename
from werkzeug.datastructures import FileStorage
# Import the utility functions from the utils directory
from utils.transcribe import transcribe_audio
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image
from utils.summarize import summarize_transcript
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads' # Store uploaded files
app.config['MAX_CONTENT_LENGTH'] = 25 * 1024 * 1024 # 25MB max file size - increased for larger audio files
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True) # Create the upload folder if it doesn't exist
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'} # Allowed audio file extensions
def allowed_file(filename: str) -> bool:
"""
Checks if the uploaded file has an allowed extension.
Args:
filename (str): The name of the file.
Returns:
bool: True if the file has an allowed extension, False otherwise.
"""
return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
@app.route("/", methods=["GET", "POST"])
def index():
"""
Handles the main route for the web application.
Processes audio uploads, transcribes them, generates image prompts, and displays images.
"""
transcript = None
image_url = None
prompt_summary = None
error_message = None
summary = None # Initialize summary
if request.method == "POST":
if 'audio_file' not in request.files:
error_message = "No file part"
logger.warning(error_message)
return render_template("index.html", error_message=error_message)
file: FileStorage = request.files['audio_file'] # Use type hinting
if file.filename == '':
error_message = "No file selected"
logger.warning(request)
return render_template("index.html", error_message=error_message)
if file and allowed_file(file.filename):
try:
# Secure the filename and construct a safe path
filename = secure_filename(file.filename)
file_path = os.path.join(app.config['UPLOAD_FOLDER'], filename)
file.save(file_path) # Save the uploaded file
transcript = transcribe_audio(file_path) # Transcribe audio
if not transcript:
error_message = "Audio transcription failed. Please try again."
os.remove(file_path)
return render_template("index.html", error_message=error_message)
summary = summarize_transcript(transcript) # Summarize the transcript
if not summary:
error_message = "Audio summary failed. Please try again."
os.remove(file_path)
return render_template("index.html", error_message=error_message)
prompt_summary = generate_image_prompt(transcript) # Generate prompt
if not prompt_summary:
error_message = "Failed to generate image prompt. Please try again."
os.remove(file_path)
return render_template("index.html", error_message=error_message)
image_url = generate_dalle_image(prompt_summary, model=request.form.get('model', 'dall-e-3'),
size=request.form.get('size', '1024x1024'),
response_format=request.form.get('format', 'url'),
quality=request.form.get('quality', 'standard')) # Generate image
if not image_url:
error_message = "Failed to generate image. Please try again."
os.remove(file_path)
return render_template("index.html", error_message=error_message)
# Optionally, delete the uploaded file after processing
os.remove(file_path)
logger.info(f"Successfully processed audio file and generated image.")
return render_template("index.html", transcript=transcript, image_url=image_url, prompt=prompt_summary, summary=summary)
except Exception as e:
error_message = f"An error occurred: {e}"
logger.error(error_message)
return render_template("index.html", error_message=error_message)
else:
error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
logger.warning(request)
return render_template("index.html", error_message=error_message)
return render_template("index.html", transcript=transcript, image_url=image_url, prompt=prompt_summary,
error=error_message, summary=summary)
@app.errorhandler(500)
def internal_server_error(e):
"""Handles internal server errors."""
logger.error(f"Internal Server Error: {e}")
return render_template("error.html", error="Internal Server Error"), 500
if __name__ == "__main__":
app.run(debug=True)
Code Breakdown:
- Import Statements: Imports necessary Flask modules, OpenAI library,
os
,dotenv
,logging
,Optional
andDict
for type hinting, andsecure_filename
andFileStorage
from Werkzeug. - Environment Variables: Loads the OpenAI API key from the
.env
file. - Flask Application:
- Creates a Flask application instance.
- Configures an upload folder and maximum file size. The
UPLOAD_FOLDER
is set to 'uploads', andMAX_CONTENT_LENGTH
is set to 25MB. The upload folder is created if it does not exist.
- Logging Configuration: Configures logging.
allowed_file
Function: Checks if the uploaded file has an allowed audio extension.transcribe_audio
Function:- Takes the audio file path as input.
- Opens the audio file in binary read mode (
"rb"
). - Calls the OpenAI API's
openai.Audio.transcriptions.create()
method to transcribe the audio. - Extracts the transcribed text from the API response.
- Logs the file path before transcription and the length of the transcribed text after successful transcription.
- Includes error handling for OpenAI API errors and other exceptions. The audio file is closed after transcription.
generate_image_prompt
Function:- Takes the transcribed text as input.
- Uses the OpenAI Chat Completion API (
openai.chat.completions.create()
) with thegpt-4o
model to generate a detailed text prompt suitable for image generation. - The system message instructs the model to act as a creative assistant and provide a vivid and detailed description of a scene that could be used to generate an image with an AI image generation model. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle. We also limit the description length to 200 words.
- Extracts the generated prompt from the API response.
- It strips any leading/trailing spaces from the generated prompt.
- Includes error handling.
generate_image
Function:- Takes the image prompt as input.
- Calls the OpenAI API's
openai.Image.create()
method to generate an image using DALL·E 3. - Accepts optional
model
,size
,response_format
, andquality
parameters, allowing the user to configure the image generation. - Extracts the URL of the generated image from the API response.
- Includes error handling.
index
Route:- Handles both GET and POST requests.
- For GET requests, it renders the initial HTML page.
- For POST requests (when the user uploads an audio file):
- It validates the uploaded file:
- Checks if the file part exists in the request.
- Checks if a file was selected.
- Checks if the file type is allowed using the
allowed_file
function.
- It saves the uploaded file to a temporary location using a secure filename.
- It calls the utility functions to:
- Transcribe the audio using
transcribe_audio()
. - Generate an image prompt from the transcription using
create_image_prompt()
. - Generate an image from the prompt using
generate_dalle_image()
.
- Transcribe the audio using
- It summarizes the transcript using openai chat completions api.
- It handles errors that may occur during any of these steps, logging the error and rendering the
index.html
template with an appropriate error message. The temporary file is deleted before rendering the error page. - If all steps are successful, it renders the
index.html
template, passing the transcription text, image URL, and generated prompt to be displayed.
- It validates the uploaded file:
@app.errorhandler(500)
: Handles HTTP 500 errors (Internal Server Error) by logging the error and rendering a user-friendly error page.if __name__ == "__main__":
: Starts the Flask development server if the script is executed directly.
Step 6: Create the HTML Template (templates/dashboard.html)
Create a folder named templates
in the same directory as app.py
. Inside the templates
folder, create a file named dashboard.html
with the following HTML code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Creator Dashboard</title>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
<style>
/* --- General Styles --- */
body {
font-family: 'Inter', sans-serif;
padding: 40px;
background-color: #f9fafb; /* Tailwind's gray-50 */
display: flex;
justify-content: center;
align-items: center;
min-height: 100vh;
margin: 0;
color: #374151; /* Tailwind's gray-700 */
}
.container {
max-width: 800px; /* Increased max-width */
width: 95%; /* Take up most of the viewport */
background-color: #fff;
padding: 2rem;
border-radius: 0.75rem; /* Tailwind's rounded-lg */
box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
text-align: center;
}
h2 {
font-size: 2.25rem; /* Tailwind's text-3xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1.5rem; /* Tailwind's mb-6 */
color: #1e293b; /* Tailwind's gray-900 */
}
p{
color: #6b7280; /* Tailwind's gray-500 */
margin-bottom: 1rem;
}
/* --- Form Styles --- */
form {
margin-top: 1rem; /* Tailwind's mt-4 */
margin-bottom: 1.5rem;
display: flex;
flex-direction: column;
align-items: center; /* Center form elements */
gap: 0.5rem; /* Tailwind's gap-2 */
}
label {
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
color: #4b5563; /* Tailwind's gray-600 */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
display: block; /* Ensure label takes full width */
text-align: left;
width: 100%;
max-width: 400px; /* Added max-width for label */
margin-left: auto;
margin-right: auto;
}
input[type="file"] {
width: 100%;
max-width: 400px; /* Added max-width for file input */
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
font-size: 1rem; /* Tailwind's text-base */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
margin-left: auto;
margin-right: auto;
}
input[type="submit"] {
padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #4f46e5; /* Tailwind's bg-indigo-500 */
color: #fff;
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
cursor: pointer;
transition: background-color 0.3s ease; /* Smooth transition */
border: none;
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
margin-top: 1rem;
}
input[type="submit"]:hover {
background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
}
input[type="submit"]:focus {
outline: none;
box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
}
/* --- Result Styles --- */
.result-container {
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
text-align: left;
}
h3 {
font-size: 1.5rem; /* Tailwind's text-2xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1rem; /* Tailwind's mb-4 */
color: #1e293b; /* Tailwind's gray-900 */
}
textarea {
width: 100%;
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
resize: none;
font-size: 1rem; /* Tailwind's text-base */
line-height: 1.5rem; /* Tailwind's leading-relaxed */
margin-top: 0.5rem; /* Tailwind's mt-2 */
margin-bottom: 0;
box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
min-height: 100px;
}
textarea:focus {
outline: none;
border-color: #3b82f6; /* Tailwind's border-blue-500 */
box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
}
img {
max-width: 100%;
border-radius: 0.5rem; /* Tailwind's rounded-md */
margin-top: 1.5rem; /* Tailwind's mt-6 */
box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
}
/* --- Error Styles --- */
.error-message {
color: #dc2626; /* Tailwind's text-red-600 */
margin-top: 1rem; /* Tailwind's mt-4 */
padding: 0.75rem;
background-color: #fee2e2; /* Tailwind's bg-red-100 */
border-radius: 0.375rem; /* Tailwind's rounded-md */
border: 1px solid #fecaca; /* Tailwind's border-red-300 */
text-align: center;
}
.prompt-select {
margin-top: 1rem; /* Tailwind's mt-4 */
display: flex;
flex-direction: column;
align-items: center;
gap: 0.5rem;
width: 100%;
}
.prompt-select label {
font-size: 1rem;
font-weight: 600;
color: #4b5563;
margin-bottom: 0.25rem;
display: block; /* Ensure label takes full width */
text-align: left;
width: 100%;
max-width: 400px; /* Added max-width for label */
margin-left: auto;
margin-right: auto;
}
.prompt-select select {
width: 100%;
max-width: 400px;
padding: 0.75rem;
border-radius: 0.5rem;
border: 1px solid #d1d5db;
font-size: 1rem;
margin-bottom: 0.25rem;
margin-left: auto;
margin-right: auto;
appearance: none; /* Remove default arrow */
background-image: url("data:image/svg+xml,%3Csvgxmlns='http://www.w3.org/2000/svg' viewBox='0 0 20 20' fill='none' stroke='currentColor' stroke-width='1.5' stroke-linecap='round' stroke-linejoin='round'%3E%3Cpath d='M6 9l4 4 4-4'%3E%3C/path%3E%3C/svg%3E"); /* Add custom arrow */
background-repeat: no-repeat;
background-position: right 0.75rem center;
background-size: 1rem;
padding-right: 2.5rem; /* Make space for the arrow */
}
.prompt-select select:focus {
outline: none;
border-color: #3b82f6;
box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15);
}
</style>
</head>
<body>
<div class="container">
<h2>🎤🧠🎨 Multimodal Assistant</h2>
<p> Upload an audio file to transcribe and generate a corresponding image. </p>
<form method="POST" enctype="multipart/form-data">
<label for="audio_file">Upload your voice note:</label><br>
<input type="file" name="audio_file" accept="audio/*" required><br><br>
<div class = "prompt-select">
<label for="prompt_mode">Image Prompt Mode:</label>
<select id="prompt_mode" name="prompt_mode">
<option value="detailed">Detailed Scene Description</option>
<option value="keywords">Keywords</option>
<option value="creative">Creative Interpretation</option>
</select>
</div>
<input type="submit" value="Generate Visual Response">
</form>
{% if transcript %}
<div class="result-container">
<h3>📝 Transcript:</h3>
<textarea readonly>{{ transcript }}</textarea>
</div>
{% endif %}
{% if summary %}
<div class="result-container">
<h3>🔎 Summary:</h3>
<p>{{ summary }}</p>
</div>
{% endif %}
{% if prompt %}
<div class="result-container">
<h3>🎯 Scene Prompt:</h3>
<p>{{ prompt }}</p>
</div>
{% endif %}
{% if image_url %}
<div class="result-container">
<h3>🖼️ Generated Image:</h3>
<img src="{{ image_url }}" alt="Generated image">
</div>
{% endif %}
{% if error %}
<div class="error-message">{{ error }}</div>
{% endif %}
</div>
</body>
</html>
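Before moving on, it's worth doing a quick end-to-end check. Assuming you are inside the creator_dashboard directory with the virtual environment activated and your OPENAI_API_KEY stored in .env, you can start the development server directly (app.py ends with app.run(debug=True)):
python app.py
Then open http://127.0.0.1:5000 in your browser, upload a short audio clip, and confirm that the transcript, summary, scene prompt, and generated image each appear in their result panels.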
6.2.3 What Makes This a Dashboard?
This layout combines several key elements that transform it from a simple interface into a comprehensive dashboard:
- Multiple output zones (text, summary, prompt, image): The interface is divided into distinct sections, each dedicated to displaying a different type of processed data. This organization allows users to easily track the progression from speech input to visual output.
- Simple user interaction (one-click processing): Despite the complex processing happening behind the scenes, users only need to perform one action to initiate the entire workflow. This simplicity makes the tool accessible to users of all technical levels.
- Clean, readable formatting: The interface uses consistent spacing, typography, and visual hierarchy to ensure information is easily digestible. Each section is clearly labeled and visually separated from the others.
- Visual feedback to reinforce model output: The dashboard provides immediate visual confirmation at each step of the process, helping users understand how their input is being transformed across different AI models.
- Reusable architecture, thanks to the utils/ structure: The modular design separates core functionality into utility functions, making the code easier to maintain and adapt for different use cases (see the pipeline sketch after this list).
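To make that reusability concrete, here is a minimal sketch of how the same utils/ functions can be chained outside of Flask, for example in a batch script or a unit test. It relies only on the functions defined earlier (transcribe_audio, summarize_transcript, create_image_prompt, generate_dalle_image); the process_recording wrapper and the sample file path are hypothetical and not part of the dashboard code.
import os
import openai
from dotenv import load_dotenv
from typing import Optional, Dict

from utils.transcribe import transcribe_audio
from utils.summarize import summarize_transcript
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")  # same configuration app.py uses

def process_recording(audio_path: str) -> Optional[Dict[str, str]]:
    """Runs the audio -> transcript -> summary -> prompt -> image pipeline and collects the results."""
    transcript = transcribe_audio(audio_path)
    if not transcript:
        return None  # transcription failed; the utility already logged the error
    summary = summarize_transcript(transcript)
    prompt = create_image_prompt(transcript)
    image_url = generate_dalle_image(prompt) if prompt else None
    return {
        "transcript": transcript,
        "summary": summary or "",
        "prompt": prompt or "",
        "image_url": image_url or "",
    }

if __name__ == "__main__":
    # Hypothetical example path: point this at any local audio file.
    print(process_recording("uploads/dashboard-project.mp3"))
The Flask route performs essentially the same chaining; the only difference is that it renders the results into the HTML template and turns each failed step into an error message for the user.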
6.2.4 Use Case Ideas
This versatile dashboard has numerous potential applications. Let's explore some key use cases in detail:
- A content creator's AI toolkit (turn thoughts into blogs + visuals)
  - Record brainstorming sessions and convert them into structured blog posts
  - Generate matching illustrations for key concepts
  - Create social media content bundles with matching visuals
- A teacher's assistant (record voice ➝ summarize ➝ illustrate)
  - Transform lesson plans into visual learning materials
  - Create engaging educational content with matching illustrations
  - Generate visual aids for complex concepts
- A journaling tool (log voice entries ➝ summarize + visualize)
  - Convert daily voice memos into organized written entries
  - Create mood boards based on journal content
  - Track emotional patterns through visual representations
Summary
In this section, you elevated your multimodal assistant into a professional-grade dashboard. Here's what you accomplished:
- Broke down your logic into reusable utilities
  - Created a modular, maintainable code structure
  - Implemented a clean separation of concerns
- Accepted audio input and processed it across models
  - Integrated multiple AI technologies seamlessly
  - Built an efficient processing pipeline
- Presented everything clearly in a cohesive UI
  - User-friendly interface design
  - Intuitive information hierarchy
- Moved from "demo" to tool
  - Production-ready implementation
  - Scalable architecture
This dashboard represents a professional-grade interface that delivers real value to users. With its robust architecture and intuitive design, it's ready to be transformed into a full-fledged product with minimal additional development.
6.2 Building a Creator Dashboard
This is where all the capabilities you've developed so far come together to create a powerful, unified system. By integrating multiple AI technologies, we can create applications that are greater than the sum of their parts. Let's explore these core capabilities in detail:
- Transcription turns spoken words into written textUsing advanced speech recognition models like Whisper, we can accurately convert audio recordings into text, preserving the speaker's intent and context. This forms the foundation for further processing.
- Content generation creates new, contextually relevant materialLarge language models can analyze the transcribed text and generate new content that maintains consistency with the original message while adding valuable insights or expanding on key points.
- Prompt engineering crafts precise instructions for AI modelsThrough careful prompt construction, we can guide AI models to produce more accurate and relevant outputs. This involves understanding both the technical capabilities of the models and the nuanced ways to communicate with them.
- Image creation transforms text descriptions into visual artModels like DALL·E can interpret textual descriptions and create corresponding images, adding a visual dimension to our applications and making abstract concepts more tangible.
These components don't just exist side by side - they form an interconnected pipeline where each step enhances the next. The output from transcription feeds into content generation, which informs prompt engineering, ultimately leading to image creation. This seamless integration creates a fluid workflow where users can start with a simple voice recording and end with a rich multimedia output, all within a single, cohesive system. By eliminating the need to switch between different tools or interfaces, users can focus on their creative process rather than technical implementation details.
6.2.1 What You'll Build
In this section, you'll design and implement a Creator Dashboard - a sophisticated web interface that transforms how creators work with AI. This comprehensive platform serves as a central hub for content creation, combining multiple AI technologies into one seamless experience. Let's explore the key features that make this dashboard powerful:
- Upload a voice recordingCreators can easily upload audio files in various formats, making it simple to start their creative process with spoken ideas or narration.
- Transcribe the voice recording into text using AIUsing advanced AI speech recognition technology, the system accurately converts spoken words into written text, maintaining the nuances and context of the original recording.
- Turn that transcription into an editable promptThe system intelligently processes the transcribed text to create structured, AI-ready prompts that can be customized to achieve the desired creative output.
- Generate images using DALL·E based on the promptLeveraging DALL·E's powerful image generation capabilities, the system creates visual representations that match the specified prompts, bringing ideas to life through AI-generated artwork.
- Summarize the transcriptThe dashboard employs AI to distill long transcriptions into concise, meaningful summaries, helping creators quickly grasp the core concepts and themes.
- Display all the results for review and further use in content productionAll generated content - from transcripts to images - is presented in an organized, easy-to-review format, allowing creators to efficiently manage and utilize their assets.
To build this robust system, you'll implement a modern tech stack using Flask for the backend operations and a clean, responsive combination of HTML and CSS for the frontend interface. This architecture ensures both modularity and maintainability, making it easy to update and scale the dashboard as needed.
6.2.2 Step-by-Step Implementation
Step 1: Project Setup
Download the audio sample: https://files.cuantum.tech/audio/dashboard-project.mp3
Create a new directory for your project and navigate into it:
mkdir creator_dashboard
cd creator_dashboard
It's recommended to set up a virtual environment:
python -m venv venv
source venv/bin/activate # On macOS/Linux
venv\\Scripts\\activate # On Windows
Install the required Python packages:
pip install flask openai python-dotenv
Organize your project files as follows:
/creator_dashboard
│
├── app.py
├── .env
└── templates/
└── dashboard.html
└── utils/
├── __init__.py
├── transcribe.py
├── summarize.py
├── generate_prompt.py
└── generate_image.py
app.py
: The main Flask application file..env
: A file to store your OpenAI API key.templates/
: A directory for HTML templates.templates/dashboard.html
: The HTML template for the user interface.utils/
: A directory for Python modules containing reusable functions.__init__.py
: Makes the utils directory a Python package.transcribe.py
: Contains the function to transcribe audio using Whisper.summarize.py
: Contains the function to summarize the transcription using a Large Language Model.generate_prompt.py
: Contains the function to generate an image prompt from the summary using a Large Language Model.generate_image.py
: Contains the function to generate an image with DALL·E 3.
Step 2: Create the Utility Modules
Create the following Python files in the utils/
directory:
utils/transcribe.py
:
import openai
import logging
from typing import Optional
logger = logging.getLogger(__name__)
def transcribe_audio(file_path: str) -> Optional[str]:
"""
Transcribes an audio file using OpenAI's Whisper API.
Args:
file_path (str): The path to the audio file.
Returns:
Optional[str]: The transcribed text, or None on error.
"""
try:
logger.info(f"Transcribing audio: {file_path}")
audio_file = open(file_path, "rb")
response = openai.Audio.transcriptions.create(
model="whisper-1",
file=audio_file,
)
transcript = response.text
audio_file.close()
return transcript
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error during transcription: {e}")
return None
- This module defines the
transcribe_audio
function, which takes the path to an audio file as input and uses OpenAI's Whisper API to generate a text transcription. - The function opens the audio file in binary read mode (
"rb"
). - It calls
openai.Audio.transcriptions.create()
to perform the transcription, specifying the "whisper-1" model. - It extracts the transcribed text from the API response.
- It includes error handling using a
try...except
block to catch potentialopenai.error.OpenAIError
exceptions (specific to OpenAI) and generalException
for other errors. If an error occurs, it logs the error and returnsNone
. - It logs the file path before transcription and the length of the transcribed text after successful transcription.
- The audio file is closed after transcription.
utils/summarize.py
:
import openai
import logging
from typing import Optional
logger = logging.getLogger(__name__)
def summarize_transcript(text: str) -> Optional[str]:
"""
Summarizes a text transcript using OpenAI's Chat Completion API.
Args:
text (str): The text transcript to summarize.
Returns:
Optional[str]: The summarized text, or None on error.
"""
try:
logger.info("Summarizing transcript")
response = openai.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system",
"content": "You are a helpful assistant. Provide a concise summary of the text, suitable for generating a visual representation."},
{"role": "user", "content": text}
],
)
summary = response.choices[0].message.content
logger.info(f"Summary: {summary}")
return summary
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error generating summary: {e}")
return None
- This module defines the
summarize_transcript
function, which takes a text transcript as input and uses OpenAI's Chat Completion API to generate a concise summary. - The system message instructs the model to act as a helpful assistant and to provide a concise summary of the text, suitable for generating a visual representation.
- The user message provides the transcript as the content for the model to summarize.
- The function extracts the summary from the API response.
- It includes error handling.
utils/generate_prompt.py
:
import openai
import logging
from typing import Optional
logger = logging.getLogger(__name__)
def create_image_prompt(transcription: str) -> Optional[str]:
"""
Generates a detailed image prompt from a text transcription using OpenAI's Chat Completion API.
Args:
transcription (str): The text transcription of the audio.
Returns:
Optional[str]: A detailed text prompt suitable for image generation, or None on error.
"""
try:
logger.info("Generating image prompt from transcription")
response = openai.chat.completions.create(
model="gpt-4o", # Use a powerful chat model
messages=[
{
"role": "system",
"content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content. Do not include any phrases like 'based on the audio' or 'from the user audio'. Incorporate scene lighting, time of day, weather, and camera angle into the description. Limit the description to 200 words.",
},
{"role": "user", "content": transcription},
],
)
prompt = response.choices[0].message.content
prompt = prompt.strip() # Remove leading/trailing spaces
logger.info(f"Generated prompt: {prompt}")
return prompt
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error generating image prompt: {e}")
return None
- This module defines the
create_image_prompt
function, which takes the transcribed text as input and uses OpenAI's Chat Completion API to generate a detailed text prompt for image generation. - The system message instructs the model to act as a creative assistant and to generate a vivid scene description. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle. We also limit the description length to 200 words.
- The user message provides the transcribed text as the content for the model to work with.
- The function extracts the generated prompt from the API response.
- It strips any leading/trailing spaces from the generated prompt.
- It includes error handling.
utils/generate_image.py
:
import openai
import logging
from typing import Optional, Dict
logger = logging.getLogger(__name__)
def generate_dalle_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024",
response_format: str = "url", quality: str = "standard") -> Optional[str]:
"""
Generates an image using OpenAI's DALL·E API.
Args:
prompt (str): The text prompt to generate the image from.
model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
size (str, optional): The size of the generated image. Defaults to "1024x1024".
response_format (str, optional): The format of the response. Defaults to "url".
quality (str, optional): The quality of the image. Defaults to "standard".
Returns:
Optional[str]: The URL of the generated image, or None on error.
"""
try:
logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}, quality: {quality}")
response = openai.images.generate(
prompt=prompt,
model=model,
size=size,
response_format=response_format,
quality=quality
)
image_url = response.data[0].url
logger.info(f"Image URL: {image_url}")
return image_url
except openai.error.OpenAIError as e:
logger.error(f"OpenAI API Error: {e}")
return None
except Exception as e:
logger.error(f"Error generating image: {e}")
return None
- This module defines the
generate_dalle_image
function, which takes a text prompt as input and uses OpenAI's DALL·E API to generate an image. - It calls the
openai.images.generate()
method to generate the image. - It accepts optional
model
,size
,response_format
, andquality
parameters, allowing the user to configure the image generation. - It extracts the URL of the generated image from the API response.
- It includes error handling.
Step 5: Create the Main App (app.py)
Create a Python file named app.py
in the root directory of your project and add the following code:
from flask import Flask, request, render_template, jsonify, make_response, redirect, url_for
import os
from dotenv import load_dotenv
import logging
from typing import Optional, Dict
from werkzeug.utils import secure_filename
from werkzeug.datastructures import FileStorage
# Import the utility functions from the utils directory
from utils.transcribe import transcribe_audio
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image
from utils.summarize import summarize_transcript
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads' # Store uploaded files
app.config['MAX_CONTENT_LENGTH'] = 25 * 1024 * 1024 # 25MB max file size - increased for larger audio files
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True) # Create the upload folder if it doesn't exist
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'} # Allowed audio file extensions
def allowed_file(filename: str) -> bool:
"""
Checks if the uploaded file has an allowed extension.
Args:
filename (str): The name of the file.
Returns:
bool: True if the file has an allowed extension, False otherwise.
"""
return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
@app.route("/", methods=["GET", "POST"])
def index():
"""
Handles the main route for the web application.
Processes audio uploads, transcribes them, generates image prompts, and displays images.
"""
transcript = None
image_url = None
prompt_summary = None
error_message = None
summary = None # Initialize summary
if request.method == "POST":
if 'audio_file' not in request.files:
error_message = "No file part"
logger.warning(error_message)
return render_template("index.html", error_message=error_message)
file: FileStorage = request.files['audio_file'] # Use type hinting
if file.filename == '':
error_message = "No file selected"
logger.warning(request)
return render_template("index.html", error_message=error_message)
if file and allowed_file(file.filename):
try:
# Secure the filename and construct a safe path
filename = secure_filename(file.filename)
file_path = os.path.join(app.config['UPLOAD_FOLDER'], filename)
file.save(file_path) # Save the uploaded file
transcript = transcribe_audio(file_path) # Transcribe audio
if not transcript:
error_message = "Audio transcription failed. Please try again."
os.remove(file_path)
return render_template("index.html", error_message=error_message)
summary = summarize_transcript(transcript) # Summarize the transcript
if not summary:
error_message = "Audio summary failed. Please try again."
os.remove(file_path)
return render_template("index.html", error_message=error_message)
prompt_summary = generate_image_prompt(transcript) # Generate prompt
if not prompt_summary:
error_message = "Failed to generate image prompt. Please try again."
os.remove(file_path)
return render_template("index.html", error_message=error_message)
image_url = generate_dalle_image(prompt_summary, model=request.form.get('model', 'dall-e-3'),
size=request.form.get('size', '1024x1024'),
response_format=request.form.get('format', 'url'),
quality=request.form.get('quality', 'standard')) # Generate image
if not image_url:
error_message = "Failed to generate image. Please try again."
os.remove(file_path)
return render_template("index.html", error_message=error_message)
# Optionally, delete the uploaded file after processing
os.remove(file_path)
logger.info(f"Successfully processed audio file and generated image.")
return render_template("index.html", transcript=transcript, image_url=image_url, prompt=prompt_summary, summary=summary)
except Exception as e:
error_message = f"An error occurred: {e}"
logger.error(error_message)
return render_template("index.html", error_message=error_message)
else:
error_message = "Invalid file type. Please upload a valid audio file (MP3, MP4, WAV, M4A)."
logger.warning(request)
return render_template("index.html", error_message=error_message)
return render_template("index.html", transcript=transcript, image_url=image_url, prompt=prompt_summary,
error=error_message, summary=summary)
@app.errorhandler(500)
def internal_server_error(e):
"""Handles internal server errors."""
logger.error(f"Internal Server Error: {e}")
return render_template("error.html", error="Internal Server Error"), 500
if __name__ == "__main__":
app.run(debug=True)
Code Breakdown:
- Import Statements: Imports necessary Flask modules, OpenAI library,
os
,dotenv
,logging
,Optional
andDict
for type hinting, andsecure_filename
andFileStorage
from Werkzeug. - Environment Variables: Loads the OpenAI API key from the
.env
file. - Flask Application:
- Creates a Flask application instance.
- Configures an upload folder and maximum file size. The
UPLOAD_FOLDER
is set to 'uploads', andMAX_CONTENT_LENGTH
is set to 25MB. The upload folder is created if it does not exist.
- Logging Configuration: Configures logging.
allowed_file
Function: Checks if the uploaded file has an allowed audio extension.transcribe_audio
Function:- Takes the audio file path as input.
- Opens the audio file in binary read mode (
"rb"
). - Calls the OpenAI API's
openai.Audio.transcriptions.create()
method to transcribe the audio. - Extracts the transcribed text from the API response.
- Logs the file path before transcription and the length of the transcribed text after successful transcription.
- Includes error handling for OpenAI API errors and other exceptions. The audio file is closed after transcription.
generate_image_prompt
Function:- Takes the transcribed text as input.
- Uses the OpenAI Chat Completion API (
openai.chat.completions.create()
) with thegpt-4o
model to generate a detailed text prompt suitable for image generation. - The system message instructs the model to act as a creative assistant and provide a vivid and detailed description of a scene that could be used to generate an image with an AI image generation model. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle. We also limit the description length to 200 words.
- Extracts the generated prompt from the API response.
- It strips any leading/trailing spaces from the generated prompt.
- Includes error handling.
generate_image
Function:- Takes the image prompt as input.
- Calls the OpenAI API's
openai.Image.create()
method to generate an image using DALL·E 3. - Accepts optional
model
,size
,response_format
, andquality
parameters, allowing the user to configure the image generation. - Extracts the URL of the generated image from the API response.
- Includes error handling.
index
Route:- Handles both GET and POST requests.
- For GET requests, it renders the initial HTML page.
- For POST requests (when the user uploads an audio file):
- It validates the uploaded file:
- Checks if the file part exists in the request.
- Checks if a file was selected.
- Checks if the file type is allowed using the
allowed_file
function.
- It saves the uploaded file to a temporary location using a secure filename.
- It calls the utility functions to:
- Transcribe the audio using
transcribe_audio()
. - Generate an image prompt from the transcription using
create_image_prompt()
. - Generate an image from the prompt using
generate_dalle_image()
.
- Transcribe the audio using
- It summarizes the transcript using openai chat completions api.
- It handles errors that may occur during any of these steps, logging the error and rendering the
index.html
template with an appropriate error message. The temporary file is deleted before rendering the error page. - If all steps are successful, it renders the
index.html
template, passing the transcription text, image URL, and generated prompt to be displayed.
- It validates the uploaded file:
@app.errorhandler(500)
: Handles HTTP 500 errors (Internal Server Error) by logging the error and rendering a user-friendly error page.if __name__ == "__main__":
: Starts the Flask development server if the script is executed directly.
Step 6: Create the HTML Template (templates/dashboard.html)
Create a folder named templates in the same directory as app.py. Inside the templates folder, create a file named dashboard.html with the following HTML code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Creator Dashboard</title>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
<style>
/* --- General Styles --- */
body {
font-family: 'Inter', sans-serif;
padding: 40px;
background-color: #f9fafb; /* Tailwind's gray-50 */
display: flex;
justify-content: center;
align-items: center;
min-height: 100vh;
margin: 0;
color: #374151; /* Tailwind's gray-700 */
}
.container {
max-width: 800px; /* Increased max-width */
width: 95%; /* Take up most of the viewport */
background-color: #fff;
padding: 2rem;
border-radius: 0.75rem; /* Tailwind's rounded-lg */
box-shadow: 0 10px 25px -5px rgba(0, 0, 0, 0.1), 0 8px 10px -6px rgba(0, 0, 0, 0.05); /* Tailwind's shadow-xl */
text-align: center;
}
h2 {
font-size: 2.25rem; /* Tailwind's text-3xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1.5rem; /* Tailwind's mb-6 */
color: #1e293b; /* Tailwind's slate-800 */
}
p{
color: #6b7280; /* Tailwind's gray-500 */
margin-bottom: 1rem;
}
/* --- Form Styles --- */
form {
margin-top: 1rem; /* Tailwind's mt-4 */
margin-bottom: 1.5rem;
display: flex;
flex-direction: column;
align-items: center; /* Center form elements */
gap: 0.5rem; /* Tailwind's gap-2 */
}
label {
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
color: #4b5563; /* Tailwind's gray-600 */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
display: block; /* Ensure label takes full width */
text-align: left;
width: 100%;
max-width: 400px; /* Added max-width for label */
margin-left: auto;
margin-right: auto;
}
input[type="file"] {
width: 100%;
max-width: 400px; /* Added max-width for file input */
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
font-size: 1rem; /* Tailwind's text-base */
margin-bottom: 0.25rem; /* Tailwind's mb-1 */
margin-left: auto;
margin-right: auto;
}
input[type="submit"] {
padding: 0.75rem 1.5rem; /* Tailwind's px-6 py-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #4f46e5; /* Tailwind's bg-indigo-600 */
color: #fff;
font-size: 1rem; /* Tailwind's text-base */
font-weight: 600; /* Tailwind's font-semibold */
cursor: pointer;
transition: background-color 0.3s ease; /* Smooth transition */
border: none;
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.2); /* Subtle shadow */
margin-top: 1rem;
}
input[type="submit"]:hover {
background-color: #4338ca; /* Tailwind's bg-indigo-700 on hover */
}
input[type="submit"]:focus {
outline: none;
box-shadow: 0 0 0 3px rgba(79, 70, 229, 0.3); /* Tailwind's ring-indigo-500 */
}
/* --- Result Styles --- */
.result-container {
margin-top: 2rem; /* Tailwind's mt-8 */
padding: 1.5rem; /* Tailwind's p-6 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
background-color: #f8fafc; /* Tailwind's bg-gray-50 */
border: 1px solid #e2e8f0; /* Tailwind's border-gray-200 */
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.05); /* Subtle shadow */
text-align: left;
}
h3 {
font-size: 1.5rem; /* Tailwind's text-2xl */
font-weight: 600; /* Tailwind's font-semibold */
margin-bottom: 1rem; /* Tailwind's mb-4 */
color: #1e293b; /* Tailwind's slate-800 */
}
textarea {
width: 100%;
padding: 0.75rem; /* Tailwind's p-3 */
border-radius: 0.5rem; /* Tailwind's rounded-md */
border: 1px solid #d1d5db; /* Tailwind's border-gray-300 */
resize: none;
font-size: 1rem; /* Tailwind's text-base */
line-height: 1.5rem; /* Tailwind's leading-relaxed */
margin-top: 0.5rem; /* Tailwind's mt-2 */
margin-bottom: 0;
box-shadow: inset 0 2px 4px rgba(0,0,0,0.06); /* Inner shadow */
min-height: 100px;
}
textarea:focus {
outline: none;
border-color: #3b82f6; /* Tailwind's border-blue-500 */
box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15); /* Tailwind's ring-blue-500 */
}
img {
max-width: 100%;
border-radius: 0.5rem; /* Tailwind's rounded-md */
margin-top: 1.5rem; /* Tailwind's mt-6 */
box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); /* Tailwind's shadow-md */
}
/* --- Error Styles --- */
.error-message {
color: #dc2626; /* Tailwind's text-red-600 */
margin-top: 1rem; /* Tailwind's mt-4 */
padding: 0.75rem;
background-color: #fee2e2; /* Tailwind's bg-red-100 */
border-radius: 0.375rem; /* Tailwind's rounded-md */
border: 1px solid #fecaca; /* Tailwind's border-red-300 */
text-align: center;
}
.prompt-select {
margin-top: 1rem; /* Tailwind's mt-4 */
display: flex;
flex-direction: column;
align-items: center;
gap: 0.5rem;
width: 100%;
}
.prompt-select label {
font-size: 1rem;
font-weight: 600;
color: #4b5563;
margin-bottom: 0.25rem;
display: block; /* Ensure label takes full width */
text-align: left;
width: 100%;
max-width: 400px; /* Added max-width for label */
margin-left: auto;
margin-right: auto;
}
.prompt-select select {
width: 100%;
max-width: 400px;
padding: 0.75rem;
border-radius: 0.5rem;
border: 1px solid #d1d5db;
font-size: 1rem;
margin-bottom: 0.25rem;
margin-left: auto;
margin-right: auto;
appearance: none; /* Remove default arrow */
background-image: url("data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 20 20' fill='none' stroke='currentColor' stroke-width='1.5' stroke-linecap='round' stroke-linejoin='round'%3E%3Cpath d='M6 9l4 4 4-4'%3E%3C/path%3E%3C/svg%3E"); /* Add custom arrow */
background-repeat: no-repeat;
background-position: right 0.75rem center;
background-size: 1rem;
padding-right: 2.5rem; /* Make space for the arrow */
}
.prompt-select select:focus {
outline: none;
border-color: #3b82f6;
box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.15);
}
</style>
</head>
<body>
<div class="container">
<h2>🎤🧠🎨 Multimodal Assistant</h2>
<p> Upload an audio file to transcribe and generate a corresponding image. </p>
<form method="POST" enctype="multipart/form-data">
<label for="audio_file">Upload your voice note:</label><br>
<input type="file" name="audio_file" accept="audio/*" required><br><br>
<div class="prompt-select">
<label for="prompt_mode">Image Prompt Mode:</label>
<select id="prompt_mode" name="prompt_mode">
<option value="detailed">Detailed Scene Description</option>
<option value="keywords">Keywords</option>
<option value="creative">Creative Interpretation</option>
</select>
</div>
<input type="submit" value="Generate Visual Response">
</form>
{% if transcript %}
<div class="result-container">
<h3>📝 Transcript:</h3>
<textarea readonly>{{ transcript }}</textarea>
</div>
{% endif %}
{% if summary %}
<div class="result-container">
<h3>🔎 Summary:</h3>
<p>{{ summary }}</p>
</div>
{% endif %}
{% if prompt %}
<div class="result-container">
<h3>🎯 Scene Prompt:</h3>
<p>{{ prompt }}</p>
</div>
{% endif %}
{% if image_url %}
<div class="result-container">
<h3>🖼️ Generated Image:</h3>
<img src="{{ image_url }}" alt="Generated image">
</div>
{% endif %}
{% if error %}
<div class="error-message">{{ error }}</div>
{% endif %}
</div>
</body>
</html>
6.2.3 What Makes This a Dashboard?
This layout combines several key elements that transform it from a simple interface into a comprehensive dashboard:
- Multiple output zones (text, summary, prompt, image): The interface is divided into distinct sections, each dedicated to a different type of processed data. This organization lets users easily track the progression from speech input to visual output.
- Simple user interaction (one-click processing): Despite the complex processing happening behind the scenes, users only need to perform one action to initiate the entire workflow. This simplicity makes the tool accessible to users of all technical levels.
- Clean, readable formatting: The interface uses consistent spacing, typography, and visual hierarchy so information is easy to digest. Each section is clearly labeled and visually separated from the others.
- Visual feedback to reinforce model output: The dashboard provides immediate visual confirmation at each step of the process, helping users understand how their input is transformed across different AI models.
- Reusable architecture, thanks to the utils/ structure: The modular design separates core functionality into utility functions, making the code easier to maintain and adapt for different use cases (see the layout sketch after this list).
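For reference, one plausible way that structure could be laid out on disk; the file and module names below are hypothetical and will depend on how you split the utilities in the previous sections:

creator-dashboard/
├── app.py              # Flask routes and request handling
├── .env                # OPENAI_API_KEY
├── uploads/            # temporary audio uploads
├── templates/
│   └── dashboard.html  # the UI shown above
└── utils/
    ├── transcribe.py   # transcribe_audio(), allowed_file()
    ├── prompts.py      # generate_image_prompt(), summarization helper
    └── images.py       # generate_image()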
6.2.4 Use Case Ideas
This versatile dashboard has numerous potential applications. Let's explore some key use cases in detail:
- A content creator's AI toolkit (turn thoughts into blogs + visuals)
- Record brainstorming sessions and convert them into structured blog posts
- Generate matching illustrations for key concepts
- Create social media content bundles with matching visuals
- A teacher's assistant (record voice ➝ summarize ➝ illustrate)
- Transform lesson plans into visual learning materials
- Create engaging educational content with matching illustrations
- Generate visual aids for complex concepts
- A journaling tool (log voice entries ➝ summarize + visualize)
- Convert daily voice memos into organized written entries
- Create mood boards based on journal content
- Track emotional patterns through visual representations
Summary
In this section, you elevated your multimodal assistant into a professional-grade dashboard. Here's what you accomplished:
- Broke your logic down into reusable utilities
- Created modular, maintainable code structure
- Implemented clean separation of concerns
- Accepted audio input and processed it across models
- Seamless integration of multiple AI technologies
- Efficient processing pipeline
- Presented everything clearly in a cohesive UI
- User-friendly interface design
- Intuitive information hierarchy
- Moved from "demo" to tool
- Production-ready implementation
- Scalable architecture
This dashboard represents a professional-grade interface that delivers real value to users. With its robust architecture and intuitive design, it's ready to be transformed into a full-fledged product with minimal additional development.