OpenAI API Bible Volume 2

Chapter 6: Cross-Model AI Suites

6.3 Automating Summaries, Transcriptions, and Images

Now it's time to go one level deeper — to automation. This section will demonstrate how to create a sophisticated background processing workflow that revolutionizes how we handle audio content. Instead of requiring manual intervention, this system operates autonomously by:

  1. Continuously monitoring designated folders for new audio file uploads
  2. Automatically initiating the transcription process using advanced speech recognition
  3. Creating intelligent summaries that capture the key points and context
  4. Generating contextually relevant images based on the content
  5. Organizing and storing all outputs in a structured format

This automated pipeline eliminates the need for manual processing, saving considerable time and effort while maintaining consistency in output quality. The system can handle multiple files simultaneously and works 24/7, making it ideal for organizations that process large volumes of audio content.

6.3.1 What You'll Build

In this section, you'll create a sophisticated Python automation script that transforms how audio files are processed. This automation pipeline watches a specified directory and springs into action whenever new audio content appears. Let's break down exactly what this powerful system accomplishes:

  • Transcribe the audio using Whisper - The script leverages OpenAI's advanced speech recognition model to convert spoken words into accurate text, handling multiple accents and languages with impressive precision.
  • Summarize the transcription using GPT-4o - After transcription, the system employs GPT-4o's advanced language understanding to distill the key points and main ideas into a concise, coherent summary.
  • Generate an image prompt from the transcription using GPT-4o - The script then analyzes the content and context of the transcription to craft detailed, vivid prompts that capture the essence of the audio in visual terms.
  • Generate an image from the prompt using DALL·E 3 - Using these carefully crafted prompts, DALL·E 3 creates high-quality, contextually relevant images that complement the audio content.
  • Save the transcription, summary, prompt, and generated image to separate files - The system automatically organizes all outputs in a structured way, making it easy to access and use the generated content.

This comprehensive automation solution transforms raw audio into a rich collection of digital assets. The versatility of this system allows it to be integrated into various production environments:

  • A background task manager (e.g., Celery) for asynchronous processing - Perfect for handling multiple files simultaneously without blocking other operations, ensuring smooth scalability.
  • A cloud function (e.g., AWS Lambda or Google Cloud Functions) for serverless execution - Enables cost-effective, on-demand processing without maintaining constant server infrastructure.
  • A scheduled local script for batch processing - Ideal for regular, automated processing of accumulated audio files at specified intervals.

For this chapter, we'll focus on building a local prototype, which serves as an excellent starting point for creators, researchers, or developers who need a reliable and efficient way to process audio files. This approach allows for easy testing and iteration before scaling to larger deployments.

6.3.2 Step-by-Step Implementation

Step 1: Project Setup

Download the sample audio: https://files.cuantum.tech/audio/automating-summaries.mp3

Create a new directory for your project and navigate into it:

mkdir audio_processing_automation
cd audio_processing_automation

It's recommended to set up a virtual environment:

python -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\Scripts\activate  # On Windows

Install the required Python packages:

pip install openai python-dotenv requests

Organize your project files as follows:

/audio_processing_automation
├── main.py
├── .env
└── utils/
    ├── __init__.py
    ├── transcribe.py
    ├── summarize.py
    ├── generate_prompt.py
    └── generate_image.py
  • /audio_processing_automation: The root directory for your project.
  • main.py: The Python script that will automate the audio processing.
  • .env: A file to store your OpenAI API key.
  • utils/: A directory for Python modules containing reusable functions.
    • __init__.py: Makes the utils directory a Python package.
    • transcribe.py: Contains the function to transcribe audio using Whisper.
    • summarize.py: Contains the function to summarize the transcription using a Large Language Model.
    • generate_prompt.py: Contains the function to generate an image prompt from the transcription using a Large Language Model.
    • generate_image.py: Contains the function to generate an image with DALL·E 3.

Step 2: Create the Utility Modules

Create the following Python files in the utils/ directory:

utils/transcribe.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def transcribe_audio(file_path: str) -> Optional[str]:
    """
    Transcribes an audio file using OpenAI's Whisper API.

    Args:
        file_path (str): The path to the audio file.

    Returns:
        Optional[str]: The transcribed text, or None on error.
    """
    try:
        logger.info(f"Transcribing audio: {file_path}")
        with open(file_path, "rb") as audio_file:
            response = openai.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
            )
        transcript = response.text
        logger.info(f"Transcription complete: {len(transcript)} characters")
        return transcript
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error during transcription: {e}")
        return None
  • This module defines the transcribe_audio function, which takes the path to an audio file as input and uses OpenAI's Whisper API to generate a text transcription.
  • The function opens the audio file in binary read mode ("rb") inside a with block, so the file handle is closed even if the API call fails.
  • It calls openai.audio.transcriptions.create() to perform the transcription, specifying the "whisper-1" model.
  • It extracts the transcribed text from the API response.
  • It includes error handling using a try...except block to catch openai.OpenAIError exceptions (specific to OpenAI) and a general Exception for other errors. If an error occurs, it logs the error and returns None.
  • It logs the file path before transcription and the length of the transcribed text after successful transcription.
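
As a quick sanity check before wiring up the full pipeline, you can call this function directly from the project root. This assumes OPENAI_API_KEY is available in your environment (or loaded from .env) and that the sample audio has been downloaded into uploads/:

from utils.transcribe import transcribe_audio

# Hypothetical smoke test; adjust the path to wherever you saved the file.
text = transcribe_audio("uploads/automating-summaries.mp3")
if text:
    print(text[:200])  # preview the first 200 characters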

utils/summarize.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def summarize_transcript(text: str) -> Optional[str]:
    """
    Summarizes a text transcript using OpenAI's Chat Completion API.

    Args:
        text (str): The text transcript to summarize.

    Returns:
        Optional[str]: The summarized text, or None on error.
    """
    try:
        logger.info("Summarizing transcript")
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You are a helpful assistant.  Provide a concise summary of the text, suitable for generating a visual representation."},
                {"role": "user", "content": text}
            ],
        )
        summary = response.choices[0].message.content
        logger.info(f"Summary: {summary}")
        return summary
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating summary: {e}")
        return None
  • This module defines the summarize_transcript function, which takes a text transcript as input and uses OpenAI's Chat Completion API to generate a concise summary.
  • The system message instructs the model to act as a helpful assistant and to provide a concise summary of the text, suitable for generating a visual representation.
  • The user message provides the transcript as the content for the model to summarize.
  • The function extracts the summary from the API response.
  • It includes error handling.
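
If you need tighter control over summary length or determinism, the same openai.chat.completions.create() call accepts optional max_tokens and temperature parameters. A minimal variant of the call above (the values shown are illustrative, not tuned):

# Sketch: cap the summary length and reduce randomness.
# max_tokens=200 and temperature=0.3 are illustrative values.
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Provide a concise summary of the text."},
        {"role": "user", "content": text},
    ],
    max_tokens=200,
    temperature=0.3,
)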

utils/generate_prompt.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def create_image_prompt(transcription: str) -> Optional[str]:
    """
    Generates a detailed image prompt from a text transcription using OpenAI's Chat Completion API.

    Args:
        transcription (str): The text transcription of the audio.

    Returns:
        Optional[str]: A detailed text prompt suitable for image generation, or None on error.
    """
    try:
        logger.info("Generating image prompt from transcription")
        response = openai.chat.completions.create(
            model="gpt-4o",  # Use a powerful chat model
            messages=[
                {
                    "role": "system",
                    "content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content.  Do not include any phrases like 'based on the audio' or 'from the user audio'.  Incorporate scene lighting, time of day, weather, and camera angle into the description.  Limit the description to 200 words.",
                },
                {"role": "user", "content": transcription},
            ],
        )
        prompt = response.choices[0].message.content
        prompt = prompt.strip()  # Remove leading/trailing spaces
        logger.info(f"Generated prompt: {prompt}")
        return prompt
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image prompt: {e}")
        return None
  • This module defines the create_image_prompt function, which takes the transcribed text as input and uses OpenAI's Chat Completion API to generate a detailed text prompt for image generation.
  • The system message instructs the model to act as a creative assistant and to generate a vivid scene description. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle. We also limit the description length to 200 words.
  • The user message provides the transcribed text as the content for the model to work with.
  • The function extracts the generated prompt from the API response.
  • It strips any leading/trailing spaces from the generated prompt.
  • It includes error handling.
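
One practical caveat: a very long transcript can exceed the model's context window or inflate costs. A simple guard is to truncate the text before generating the prompt; the helper below is a hypothetical sketch (the 8,000-character cap is arbitrary, not a measured token limit):

MAX_CHARS = 8000  # illustrative cap, not a measured token limit


def truncate_for_prompt(transcription: str, max_chars: int = MAX_CHARS) -> str:
    """Trim overly long transcripts before prompt generation."""
    if len(transcription) <= max_chars:
        return transcription
    return transcription[:max_chars] + " ..."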

utils/generate_image.py:

import openai
import logging
from typing import Optional, Dict

logger = logging.getLogger(__name__)


def generate_dalle_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024",
                       response_format: str = "url", quality: str = "standard") -> Optional[str]:
    """
    Generates an image using OpenAI's DALL·E API.

    Args:
        prompt (str): The text prompt to generate the image from.
        model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
        size (str, optional): The size of the generated image. Defaults to "1024x1024".
        response_format (str, optional): The format of the response. Defaults to "url".
        quality (str, optional): The quality of the image. Defaults to "standard".

    Returns:
        Optional[str]: The URL of the generated image, or None on error.
    """
    try:
        logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}, quality: {quality}")
        response = openai.images.generate(
            prompt=prompt,
            model=model,
            size=size,
            response_format=response_format,
            quality=quality
        )
        image_url = response.data[0].url
        logger.info(f"Image URL: {image_url}")
        return image_url
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image: {e}")
        return None
  • This module defines the generate_dalle_image function, which takes a text prompt as input and uses OpenAI's DALL·E API to generate an image.
  • It calls the openai.images.generate() method to generate the image.
  • It accepts optional model, size, response_format, and quality parameters, allowing the user to configure the image generation.
  • It extracts the URL of the generated image from the API response.
  • It includes error handling.
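
Because generate_dalle_image returns a URL, the main script has to download the image in a second step. If you would rather receive the image bytes directly, the Images API also accepts response_format="b64_json". A minimal sketch (the prompt string here is purely illustrative):

import base64
import openai

# Sketch: request base64-encoded image data instead of a URL, which
# avoids a separate HTTP download. The prompt below is illustrative.
response = openai.images.generate(
    model="dall-e-3",
    prompt="A lighthouse at dusk, warm side lighting, low camera angle",
    size="1024x1024",
    response_format="b64_json",
)
image_bytes = base64.b64decode(response.data[0].b64_json)
with open("lighthouse.png", "wb") as f:
    f.write(image_bytes)

Step 3: Configure Your API Key

Create the .env file in the project root and add your OpenAI API key. OPENAI_API_KEY is the variable name the openai library reads by default; replace the placeholder with your actual key:

OPENAI_API_KEY=your-api-key-here

The main script (created in the next step) calls load_dotenv() at startup, which exports this value into the environment before any API call is made.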

Step 4: Create the Main Script (main.py)

Create a Python file named main.py in the root directory of your project and add the following code:

import os
import time
import logging

import requests
from dotenv import load_dotenv

# Import the utility functions from the utils directory
from utils.transcribe import transcribe_audio
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image
from utils.summarize import summarize_transcript

# Load OPENAI_API_KEY from the .env file so the openai client can use it
load_dotenv()

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

UPLOAD_DIR = "uploads"  # Directory where audio files are placed
OUTPUT_DIR = "outputs"  # Directory where results are saved
ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'}

def process_audio(file_path: str) -> None:
    """
    Processes an audio file, transcribing, summarizing, generating a prompt, and generating an image.

    Args:
        file_path (str): The path to the audio file to process.
    """
    logger.info(f"Processing audio file: {file_path}")
    base_name = os.path.splitext(os.path.basename(file_path))[0]  # Get filename without extension

    # Create output directory for this file
    output_path = os.path.join(OUTPUT_DIR, base_name)
    os.makedirs(output_path, exist_ok=True)

    try:
        # Transcription
        transcript = transcribe_audio(file_path)
        if transcript:
            transcript_file_path = os.path.join(output_path, f"{base_name}_transcript.txt")
            with open(transcript_file_path, "w") as f:
                f.write(transcript)
            logger.info(f"Transcription saved to {transcript_file_path}")
        else:
            logger.error(f"Transcription failed for {file_path}")
            return  # Stop processing if transcription fails

        # Summary
        summary = summarize_transcript(transcript)
        if summary:
            summary_file_path = os.path.join(output_path, f"{base_name}_summary.txt")
            with open(summary_file_path, "w") as f:
                f.write(summary)
            logger.info(f"Summary saved to {summary_file_path}")
        else:
            logger.error(f"Summary failed for {file_path}")
            return

        # Image Prompt
        prompt = create_image_prompt(transcript)
        if prompt:
            prompt_file_path = os.path.join(output_path, f"{base_name}_prompt.txt")
            with open(prompt_file_path, "w") as f:
                f.write(prompt)
            logger.info(f"Prompt saved to {prompt_file_path}")
        else:
            logger.error(f"Prompt generation failed for {file_path}")
            return

        # Image Generation
        image_url = generate_dalle_image(prompt)
        if image_url:
            try:
                img_data = requests.get(image_url, timeout=60).content
                image_file_path = os.path.join(output_path, f"{base_name}_image.png")
                with open(image_file_path, "wb") as f:
                    f.write(img_data)
                logger.info(f"Image saved to {image_file_path}")
            except requests.exceptions.RequestException as e:
                logger.error(f"Error downloading image: {e}")
                return
        else:
            logger.error(f"Image generation failed for {file_path}")
            return

        logger.info(f"Successfully processed audio file: {file_path}")

    except Exception as e:
        logger.error(f"An error occurred while processing {file_path}: {e}")


def watch_folder(upload_dir: str, output_dir: str) -> None:
    """
    Monitors a folder for new audio files and processes them.

    Args:
        upload_dir (str): The path to the directory to watch for new audio files.
        output_dir (str): The path to the directory where results should be saved.
    """
    logger.info(f"Watching folder: {upload_dir} for new audio files...")
    processed_files = set()

    while True:
        try:
            files = [
                os.path.join(upload_dir, f)
                for f in os.listdir(upload_dir)
                if os.path.splitext(f)[1].lower().lstrip(".") in ALLOWED_EXTENSIONS
            ]
            for file_path in files:
                if file_path not in processed_files and os.path.isfile(file_path):
                    process_audio(file_path)
                    processed_files.add(file_path)
        except Exception as e:
            logger.error(f"Error while watching folder: {e}")
        time.sleep(5)  # Check for new files every 5 seconds



if __name__ == "__main__":
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    watch_folder(UPLOAD_DIR, OUTPUT_DIR)

Code Breakdown:

  • Import Statements: Imports the necessary libraries and utility functions:
    • os: For interacting with the operating system (file paths, directories).
    • time: For adding delays (pausing between checks).
    • logging: For logging events.
    • requests: For downloading the generated image from its URL.
    • dotenv: For loading the OpenAI API key from the .env file via load_dotenv().
    • The four pipeline functions from the utils package: transcribe_audio, summarize_transcript, create_image_prompt, and generate_dalle_image (each documented in Step 2).
  • Constants:
    • UPLOAD_DIR: The directory where audio files are expected to be uploaded.
    • OUTPUT_DIR: The directory where the processed results (transcripts, summaries, prompts, images) will be saved.
    • ALLOWED_EXTENSIONS: The set of audio file extensions the watcher will pick up.
  • process_audio Function:
    • Takes the path to an audio file as input.
    • Performs the following steps:
      1. Logs the start of the processing.
      2. Extracts the base filename (without extension) from the audio file path.
      3. Creates a directory within the OUTPUT_DIR using the base filename to store the results for this specific audio file.
      4. Calls transcribe_audio() to transcribe the audio. If transcription fails, it logs an error and returns.
      5. Saves the transcription to a text file in the output directory.
      6. Calls summarize_transcript() to summarize the transcription. If summarization fails, it logs an error and returns.
      7. Saves the summary to a text file in the output directory.
      8. Calls create_image_prompt() to generate an image prompt. If prompt generation fails, it logs an error and returns.
      9. Saves the prompt to a text file in the output directory.
      10. Calls generate_dalle_image() to generate an image from the prompt. If image generation fails, it logs an error and returns.
      11. Downloads the image from the URL and saves it as a PNG file in the output directory.
      12. Logs the successful completion of the processing.
    • Includes a broad try...except block to catch any exceptions during the processing of a single audio file. Any errors during the process are logged.
  • watch_folder Function:
    • Takes the paths to the upload and output directories as input.
    • Logs the start of the folder watching process.
    • Initializes an empty set processed_files to keep track of processed files.
    • Enters an infinite loop:
      1. Lists all files in the upload_dir that have allowed audio file extensions.
      2. Iterates through the files:
        • If a file has not been processed yet:
          • Calls process_audio() to process the file.
          • Adds the file path to the processed_files set.
      3. Pauses for 5 seconds using time.sleep(5) before checking for new files again.
    • Includes a try...except block to catch any exceptions that might occur during the folder watching process. Any errors are logged.
  • Main Execution Block:
    • if __name__ == "__main__":: Ensures that the following code is executed only when the script is run directly (not when imported as a module).
    • Creates the UPLOAD_DIR and OUTPUT_DIR directories if they don't exist.
    • Calls the watch_folder() function to start monitoring the upload directory.
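
Step 5: Run the Script

Start the watcher from the project root and drop an audio file (for example, the sample automating-summaries.mp3) into the uploads/ directory:

python main.py

For an input named automating-summaries.mp3, the script creates a per-file folder under outputs/; the file names follow the f"{base_name}_..." patterns in process_audio:

outputs/automating-summaries/
├── automating-summaries_transcript.txt
├── automating-summaries_summary.txt
├── automating-summaries_prompt.txt
└── automating-summaries_image.png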

6.3.3 Tips for Scaling This Workflow

  • Add email/SMS/Slack notifications when a new item is ready - Implement automated notifications to keep users informed when their content has been processed. This can be done using services like SendGrid for email, Twilio for SMS, or Slack's webhook API for instant messaging. This ensures users don't need to constantly check for completed items.
  • Store results in a database or cloud bucket - Instead of storing files locally, utilize cloud storage solutions like AWS S3, Google Cloud Storage, or Azure Blob Storage. This provides better scalability, backup protection, and easier access management. For structured data, consider using databases like PostgreSQL or MongoDB to enable efficient querying and management of processed results.
  • Use threading or async libraries (like watchdog or aiofiles) for better performance - Replace the basic file watching loop with more sophisticated solutions. The watchdog library provides reliable file system event monitoring, while aiofiles enables asynchronous file operations. This prevents blocking operations and improves overall system responsiveness (see the sketch after this list).
  • Connect to Google Drive or Dropbox APIs to watch for uploads remotely - Expand beyond local file monitoring by integrating with cloud storage APIs. This allows users to trigger processing by simply uploading files to their preferred cloud storage service. Implement webhook listeners or use API polling to detect new uploads.
  • Add a task queue (e.g., Celery or RQ) for concurrent processing - Replace synchronous processing with a distributed task queue system. Celery or Redis Queue (RQ) can manage multiple workers processing files simultaneously, handle retries on failures, and provide task prioritization. This is essential for handling high volumes of uploads and preventing system overload.
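
As a concrete starting point for the watchdog suggestion above, here is a minimal event-driven sketch. It assumes watchdog is installed (pip install watchdog) and that process_audio, UPLOAD_DIR, and ALLOWED_EXTENSIONS can be imported from main.py:

import os
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

from main import ALLOWED_EXTENSIONS, UPLOAD_DIR, process_audio


class AudioHandler(FileSystemEventHandler):
    def on_created(self, event):
        # React only to newly created files with an allowed extension.
        if event.is_directory:
            return
        ext = os.path.splitext(event.src_path)[1].lower().lstrip(".")
        if ext in ALLOWED_EXTENSIONS:
            process_audio(event.src_path)


if __name__ == "__main__":
    observer = Observer()
    observer.schedule(AudioHandler(), UPLOAD_DIR, recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)  # keep the main thread alive for the observer
    finally:
        observer.stop()
        observer.join()

Unlike the polling loop, this reacts to file-creation events immediately; in production you would still want to debounce partially written files before processing them.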

In this section, you transformed your collection of AI tools into a powerful automated creative engine. This system represents a significant advancement in multimodal AI processing - rather than manually feeding content through different models, you've created an intelligent pipeline that automatically handles the entire workflow. The script you built actively monitors for new files, seamlessly processes them through three sophisticated OpenAI models (Whisper, GPT, and DALL·E), and organizes the results into a structured collection of summaries, creative prompts, and generated images.

This type of automation revolutionizes content processing by eliminating manual steps and creating a continuous, hands-free workflow. The applications for this system are diverse and powerful, particularly beneficial for:

  • Content teams: Streamlining the creation of multimedia content by automatically generating complementary assets from audio recordings
  • Solo creators: Enabling individual content creators to multiply their creative output without additional manual effort
  • Researchers: Automating the transcription and visualization of interviews, field recordings, and research notes
  • Voice diary apps: Converting spoken journals into rich multimedia experiences with generated imagery
  • Educational tools: Transforming lectures and educational content into engaging visual and textual materials
  • AI-based documentation systems: Automatically generating comprehensive documentation with visual aids from voice recordings

Perhaps most remarkably, you've achieved this sophisticated automation with just a few Python files. This represents a democratization of technology that previously would have required a team of specialized engineers and complex infrastructure to implement. By leveraging modern AI APIs and smart programming practices, you've created an enterprise-grade creative automation system that can be maintained and modified by a single developer.

6.3 Automating Summaries, Transcriptions, and Images

Now it's time to go one level deeper — to automation. This section will demonstrate how to create a sophisticated background processing workflow that revolutionizes how we handle audio content. Instead of requiring manual intervention, this system operates autonomously by:

  1. Continuously monitoring designated folders for new audio file uploads
  2. Automatically initiating the transcription process using advanced speech recognition
  3. Creating intelligent summaries that capture the key points and context
  4. Generating contextually relevant images based on the content
  5. Organizing and storing all outputs in a structured format

This automated pipeline eliminates the need for manual processing, saving considerable time and effort while maintaining consistency in output quality. The system can handle multiple files simultaneously and works 24/7, making it ideal for organizations that process large volumes of audio content.

6.3.1 What You'll Build

In this section, you'll create a sophisticated Python automation script that transforms how audio files are processed. This automation pipeline watches a specified directory and springs into action whenever new audio content appears. Let's break down exactly what this powerful system accomplishes:

  • Transcribe the audio using Whisper - The script leverages OpenAI's advanced speech recognition model to convert spoken words into accurate text, handling multiple accents and languages with impressive precision.
  • Summarize the transcription using GPT-4o - After transcription, the system employs GPT-4o's advanced language understanding to distill the key points and main ideas into a concise, coherent summary.
  • Generate an image prompt from the transcription using GPT-4o - The script then analyzes the content and context of the transcription to craft detailed, vivid prompts that capture the essence of the audio in visual terms.
  • Generate an image from the prompt using DALL·E 3 - Using these carefully crafted prompts, DALL·E 3 creates high-quality, contextually relevant images that complement the audio content.
  • Save the transcription, summary, prompt, and generated image to separate files - The system automatically organizes all outputs in a structured way, making it easy to access and use the generated content.

This comprehensive automation solution transforms raw audio into a rich collection of digital assets. The versatility of this system allows it to be integrated into various production environments:

  • A background task manager (e.g., Celery) for asynchronous processing - Perfect for handling multiple files simultaneously without blocking other operations, ensuring smooth scalability.
  • A cloud function (e.g., AWS Lambda or Google Cloud Functions) for serverless execution - Enables cost-effective, on-demand processing without maintaining constant server infrastructure.
  • A scheduled local script for batch processing - Ideal for regular, automated processing of accumulated audio files at specified intervals.

For this chapter, we'll focus on building a local prototype, which serves as an excellent starting point for creators, researchers, or developers who need a reliable and efficient way to process audio files. This approach allows for easy testing and iteration before scaling to larger deployments.

6.3.2 Step-by-Step Implementation

Step 1: Project Setup

Download the sample audio: https://files.cuantum.tech/audio/automating-summaries.mp3

Create a new directory for your project and navigate into it:

mkdir audio_processing_automation
cd audio_processing_automation

It's recommended to set up a virtual environment:

python -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\\Scripts\\activate  # On Windows

Install the required Python packages:

pip install openai python-dotenv requests

Organize your project files as follows:

/audio_processing_automation

├── main.py
├── .env
└── utils/
    ├── __init__.py
    ├── transcribe.py
    ├── summarize.py
    ├── generate_prompt.py
    └── generate_image.py
  • /audio_processing_automation: The root directory for your project.
  • main.py: The Python script that will automate the audio processing.
  • .env: A file to store your OpenAI API key.
  • utils/: A directory for Python modules containing reusable functions.
    • __init__.py: Makes the utils directory a Python package.
    • transcribe.py: Contains the function to transcribe audio using Whisper.
    • summarize.py: Contains the function to summarize the transcription using a Large Language Model.
    • generate_prompt.py: Contains the function to generate an image prompt from the summary using a Large Language Model.
    • generate_image.py: Contains the function to generate an image with DALL·E 3.

Step 2: Create the Utility Modules

Create the following Python files in the utils/ directory:

utils/transcribe.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def transcribe_audio(file_path: str) -> Optional[str]:
    """
    Transcribes an audio file using OpenAI's Whisper API.

    Args:
        file_path (str): The path to the audio file.

    Returns:
        Optional[str]: The transcribed text, or None on error.
    """
    try:
        logger.info(f"Transcribing audio: {file_path}")
        audio_file = open(file_path, "rb")
        response = openai.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
        transcript = response.text
        audio_file.close()
        return transcript
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error during transcription: {e}")
        return None
  • This module defines the transcribe_audio function, which takes the path to an audio file as input and uses OpenAI's Whisper API to generate a text transcription.
  • The function opens the audio file in binary read mode ("rb").
  • It calls openai.audio.transcriptions.create() to perform the transcription, specifying the "whisper-1" model.
  • It extracts the transcribed text from the API response.
  • It includes error handling using a try...except block to catch potential openai.error.OpenAIError exceptions (specific to OpenAI) and general Exception for other errors. If an error occurs, it logs the error and returns None.
  • It logs the file path before transcription and the length of the transcribed text after successful transcription.
  • The audio file is closed after transcription.

utils/summarize.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def summarize_transcript(text: str) -> Optional[str]:
    """
    Summarizes a text transcript using OpenAI's Chat Completion API.

    Args:
        text (str): The text transcript to summarize.

    Returns:
        Optional[str]: The summarized text, or None on error.
    """
    try:
        logger.info("Summarizing transcript")
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You are a helpful assistant.  Provide a concise summary of the text, suitable for generating a visual representation."},
                {"role": "user", "content": text}
            ],
        )
        summary = response.choices[0].message.content
        logger.info(f"Summary: {summary}")
        return summary
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating summary: {e}")
        return None
  • This module defines the summarize_transcript function, which takes a text transcript as input and uses OpenAI's Chat Completion API to generate a concise summary.
  • The system message instructs the model to act as a helpful assistant and to provide a concise summary of the text, suitable for generating a visual representation.
  • The user message provides the transcript as the content for the model to summarize.
  • The function extracts the summary from the API response.
  • It includes error handling.

utils/generate_prompt.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def create_image_prompt(transcription: str) -> Optional[str]:
    """
    Generates a detailed image prompt from a text transcription using OpenAI's Chat Completion API.

    Args:
        transcription (str): The text transcription of the audio.

    Returns:
        Optional[str]: A detailed text prompt suitable for image generation, or None on error.
    """
    try:
        logger.info("Generating image prompt from transcription")
        response = openai.chat.completions.create(
            model="gpt-4o",  # Use a powerful chat model
            messages=[
                {
                    "role": "system",
                    "content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content.  Do not include any phrases like 'based on the audio' or 'from the user audio'.  Incorporate scene lighting, time of day, weather, and camera angle into the description.  Limit the description to 200 words.",
                },
                {"role": "user", "content": transcription},
            ],
        )
        prompt = response.choices[0].message.content
        prompt = prompt.strip()  # Remove leading/trailing spaces
        logger.info(f"Generated prompt: {prompt}")
        return prompt
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image prompt: {e}")
        return None
  • This module defines the create_image_prompt function, which takes the transcribed text as input and uses OpenAI's Chat Completion API to generate a detailed text prompt for image generation.
  • The system message instructs the model to act as a creative assistant and to generate a vivid scene description. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle. We also limit the description length to 200 words.
  • The user message provides the transcribed text as the content for the model to work with.
  • The function extracts the generated prompt from the API response.
  • It strips any leading/trailing spaces from the generated prompt.
  • It includes error handling.

utils/generate_image.py:

import openai
import logging
from typing import Optional, Dict

logger = logging.getLogger(__name__)


def generate_dalle_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024",
                       response_format: str = "url", quality: str = "standard") -> Optional[str]:
    """
    Generates an image using OpenAI's DALL·E API.

    Args:
        prompt (str): The text prompt to generate the image from.
        model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
        size (str, optional): The size of the generated image. Defaults to "1024x1024".
        response_format (str, optional): The format of the response. Defaults to "url".
        quality (str, optional): The quality of the image. Defaults to "standard".

    Returns:
        Optional[str]: The URL of the generated image, or None on error.
    """
    try:
        logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}, quality: {quality}")
        response = openai.images.generate(
            prompt=prompt,
            model=model,
            size=size,
            response_format=response_format,
            quality=quality
        )
        image_url = response.data[0].url
        logger.info(f"Image URL: {image_url}")
        return image_url
    except openai.error.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image: {e}")
        return None
  • This module defines the generate_dalle_image function, which takes a text prompt as input and uses OpenAI's DALL·E API to generate an image.
  • It calls the openai.images.generate() method to generate the image.
  • It accepts optional modelsizeresponse_format, and quality parameters, allowing the user to configure the image generation.
  • It extracts the URL of the generated image from the API response.
  • It includes error handling.

Step 4: Create the Main Script (main.py)

Create a Python file named main.py in the root directory of your project and add the following code:

import os
import time
import logging
from typing import Optional

# Import the utility functions from the utils directory
from utils.transcribe import transcribe_audio
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image
from utils.summarize import summarize_transcript

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

UPLOAD_DIR = "uploads"  # Directory where audio files are placed
OUTPUT_DIR = "outputs"  # Directory where results are saved
ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'}

def process_audio(file_path: str) -> None:
    """
    Processes an audio file, transcribing, summarizing, generating a prompt, and generating an image.

    Args:
        file_path (str): The path to the audio file to process.
    """
    logger.info(f"Processing audio file: {file_path}")
    base_name = os.path.splitext(os.path.basename(file_path))[0]  # Get filename without extension

    # Create output directory for this file
    output_path = os.path.join(OUTPUT_DIR, base_name)
    os.makedirs(output_path, exist_ok=True)

    try:
        # Transcription
        transcript = transcribe_audio(file_path)
        if transcript:
            transcript_file_path = os.path.join(output_path, f"{base_name}_transcript.txt")
            with open(transcript_file_path, "w") as f:
                f.write(transcript)
            logger.info(f"Transcription saved to {transcript_file_path}")
        else:
            logger.error(f"Transcription failed for {file_path}")
            return  # Stop processing if transcription fails

        # Summary
        summary = summarize_transcript(transcript)
        if summary:
            summary_file_path = os.path.join(output_path, f"{base_name}_summary.txt")
            with open(summary_file_path, "w") as f:
                f.write(summary)
            logger.info(f"Summary saved to {summary_file_path}")
        else:
            logger.error(f"Summary failed for {file_path}")
            return

        # Image Prompt
        prompt = create_image_prompt(transcript)
        if prompt:
            prompt_file_path = os.path.join(output_path, f"{base_name}_prompt.txt")
            with open(prompt_file_path, "w") as f:
                f.write(prompt)
            logger.info(f"Prompt saved to {prompt_file_path}")
        else:
            logger.error(f"Prompt generation failed for {file_path}")
            return

        # Image Generation
        image_url = generate_dalle_image(prompt)
        if image_url:
            try:
                import requests
                img_data = requests.get(image_url).content
                image_file_path = os.path.join(output_path, f"{base_name}_image.png")
                with open(image_file_path, "wb") as f:
                    f.write(img_data)
                logger.info(f"Image saved to {image_file_path}")
            except requests.exceptions.RequestException as e:
                logger.error(f"Error downloading image: {e}")
                return
        else:
            logger.error(f"Image generation failed for {file_path}")
            return

        logger.info(f"Successfully processed audio file: {file_path}")

    except Exception as e:
        logger.error(f"An error occurred while processing {file_path}: {e}")


def watch_folder(upload_dir: str, output_dir: str) -> None:
    """
    Monitors a folder for new audio files and processes them.

    Args:
        upload_dir (str): The path to the directory to watch for new audio files.
        output_dir (str): The path to the directory where results should be saved.
    """
    logger.info(f"Watching folder: {upload_dir} for new audio files...")
    processed_files = set()

    while True:
        try:
            files = [
                os.path.join(upload_dir, f)
                for f in os.listdir(upload_dir)
                if f.lower().endswith(tuple(ALLOWED_EXTENSIONS))
            ]
            for file_path in files:
                if file_path not in processed_files and os.path.isfile(file_path):
                    process_audio(file_path)
                    processed_files.add(file_path)
        except Exception as e:
            logger.error(f"Error while watching folder: {e}")
        time.sleep(5)  # Check for new files every 5 seconds



if __name__ == "__main__":
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    watch_folder(UPLOAD_DIR, OUTPUT_DIR)

Code Breakdown:

  • Import Statements: Imports the necessary libraries:
    • os: For interacting with the operating system (file paths, directories).
    • time: For adding delays (pausing between checks).
    • logging: For logging events.
    • typing: For type hinting.
    • werkzeug.utils: For secure_filename.
    • werkzeug.datastructures: For FileStorage.
  • Constants:
    • UPLOAD_DIR: The directory where audio files are expected to be uploaded.
    • OUTPUT_DIR: The directory where the processed results (transcripts, prompts, images) will be saved.
    • ALLOWED_EXTENSIONS: A set of allowed audio file extensions
  • allowed_file Function:
    • Checks if a file has an allowed extension.
  • transcribe_audio Function:
    • Takes the audio file path as input.
    • Opens the audio file in binary read mode ("rb").
    • Calls the OpenAI API's openai.Audio.transcriptions.create() method to transcribe the audio.
    • Extracts the transcribed text from the API response.
    • Logs the file path before transcription and the length of the transcribed text after successful transcription.
    • Includes error handling using a try...except block to catch potential openai.error.OpenAIError exceptions (specific to OpenAI) and general Exception for other errors.
  • generate_image_prompt Function:
    • Takes the transcribed text as input.
    • Uses the OpenAI Chat Completion API (openai.chat.completions.create()) with the gpt-4o model to generate a detailed text prompt suitable for image generation.
    • The system message instructs the model to act as a creative assistant and provide a vivid and detailed description of a scene that could be used to generate an image with an AI image generation model.
    • Extracts the generated prompt from the API response.
    • Includes error handling.
  • generate_image Function:
    • Takes the image prompt as input.
    • Calls the OpenAI API's openai.Image.create() method to generate an image using DALL·E 3.
    • Accepts optional parameters for modelsizeresponse_format, and quality.
    • Extracts the URL of the generated image from the API response.
    • Includes error handling.
  • process_audio Function:
    • Takes the path to an audio file as input.
    • Performs the following steps:
      1. Logs the start of the processing.
      2. Extracts the base filename (without extension) from the audio file path.
      3. Creates a directory within the OUTPUT_DIR using the base filename to store the results for this specific audio file.
      4. Calls transcribe_audio() to transcribe the audio. If transcription fails, it logs an error and returns.
      5. Saves the transcription to a text file in the output directory.
      6. Calls summarize_transcript() to summarize the transcription. If summarization fails, it logs an error and returns.
      7. Saves the summary to a text file in the output directory.
      8. Calls create_image_prompt() to generate an image prompt. If prompt generation fails, it logs an error and returns.
      9. Saves the prompt to a text file in the output directory.
      10. Calls generate_dalle_image() to generate an image from the prompt. If image generation fails, it logs an error and returns.
      11. Downloads the image from the URL and saves it as a PNG file in the output directory.
      12. Logs the successful completion of the processing.
    • Includes a broad try...except block to catch any exceptions during the processing of a single audio file. Any errors during the process are logged.
  • watch_folder Function:
    • Takes the paths to the upload and output directories as input.
    • Logs the start of the folder watching process.
    • Initializes an empty set processed_files to keep track of processed files.
    • Enters an infinite loop:
      1. Lists all files in the upload_dir that have allowed audio file extensions.
      2. Iterates through the files:
        • If a file has not been processed yet:
          • Calls process_audio() to process the file.
          • Adds the file path to the processed_files set.
      3. Pauses for 5 seconds using time.sleep(5) before checking for new files again.
    • Includes a try...except block to catch any exceptions that might occur during the folder watching process. Any errors are logged.
  • Main Execution Block:
    • if __name__ == "__main__":: Ensures that the following code is executed only when the script is run directly (not when imported as a module).
    • Creates the UPLOAD_DIR and OUTPUT_DIR directories if they don't exist.
    • Calls the watch_folder() function to start monitoring the upload directory.

6.3.3 Tips for Scaling This Workflow

  • Add email/SMS/Slack notifications when a new item is readyImplement automated notifications to keep users informed when their content has been processed. This can be done using services like SendGrid for email, Twilio for SMS, or Slack's webhook API for instant messaging. This ensures users don't need to constantly check for completed items.
  • Store results in a database or cloud bucketInstead of storing files locally, utilize cloud storage solutions like AWS S3, Google Cloud Storage, or Azure Blob Storage. This provides better scalability, backup protection, and easier access management. For structured data, consider using databases like PostgreSQL or MongoDB to enable efficient querying and management of processed results.
  • Use threading or async libraries (like watchdogaiofiles) for better performanceReplace the basic file watching loop with more sophisticated solutions. The watchdog library provides reliable file system events monitoring, while aiofiles enables asynchronous file operations. This prevents blocking operations and improves overall system responsiveness.
  • Connect to Google Drive or Dropbox APIs to watch for uploads remotelyExpand beyond local file monitoring by integrating with cloud storage APIs. This allows users to trigger processing by simply uploading files to their preferred cloud storage service. Implement webhook listeners or use API polling to detect new uploads.
  • Add a task queue (e.g., Celery or RQ) for concurrent processingReplace synchronous processing with a distributed task queue system. Celery or Redis Queue (RQ) can manage multiple workers processing files simultaneously, handle retries on failures, and provide task prioritization. This is essential for handling high volumes of uploads and preventing system overload.

In this section, you transformed your collection of AI tools into a powerful automated creative engine. This system represents a significant advancement in multimodal AI processing - rather than manually feeding content through different models, you've created an intelligent pipeline that automatically handles the entire workflow. The script you built actively monitors for new files, seamlessly processes them through three sophisticated OpenAI models (Whisper, GPT, and DALL·E), and organizes the results into a structured collection of summaries, creative prompts, and generated images.

This type of automation revolutionizes content processing by eliminating manual steps and creating a continuous, hands-free workflow. The applications for this system are diverse and powerful, particularly beneficial for:

  • Content teams: Streamlining the creation of multimedia content by automatically generating complementary assets from audio recordings
  • Solo creators: Enabling individual content creators to multiply their creative output without additional manual effort
  • Researchers: Automating the transcription and visualization of interviews, field recordings, and research notes
  • Voice diary apps: Converting spoken journals into rich multimedia experiences with generated imagery
  • Educational tools: Transforming lectures and educational content into engaging visual and textual materials
  • AI-based documentation systems: Automatically generating comprehensive documentation with visual aids from voice recordings

Perhaps most remarkably, you've achieved this sophisticated automation with just a few Python files. This represents a democratization of technology that previously would have required a team of specialized engineers and complex infrastructure to implement. By leveraging modern AI APIs and smart programming practices, you've created an enterprise-grade creative automation system that can be maintained and modified by a single developer.

6.3 Automating Summaries, Transcriptions, and Images

Now it's time to go one level deeper — to automation. This section will demonstrate how to create a sophisticated background processing workflow that revolutionizes how we handle audio content. Instead of requiring manual intervention, this system operates autonomously by:

  1. Continuously monitoring designated folders for new audio file uploads
  2. Automatically initiating the transcription process using advanced speech recognition
  3. Creating intelligent summaries that capture the key points and context
  4. Generating contextually relevant images based on the content
  5. Organizing and storing all outputs in a structured format

This automated pipeline eliminates the need for manual processing, saving considerable time and effort while maintaining consistency in output quality. The system can handle multiple files simultaneously and works 24/7, making it ideal for organizations that process large volumes of audio content.

6.3.1 What You'll Build

In this section, you'll create a sophisticated Python automation script that transforms how audio files are processed. This automation pipeline watches a specified directory and springs into action whenever new audio content appears. Let's break down exactly what this powerful system accomplishes:

  • Transcribe the audio using Whisper - The script leverages OpenAI's advanced speech recognition model to convert spoken words into accurate text, handling multiple accents and languages with impressive precision.
  • Summarize the transcription using GPT-4o - After transcription, the system employs GPT-4o's advanced language understanding to distill the key points and main ideas into a concise, coherent summary.
  • Generate an image prompt from the transcription using GPT-4o - The script then analyzes the content and context of the transcription to craft detailed, vivid prompts that capture the essence of the audio in visual terms.
  • Generate an image from the prompt using DALL·E 3 - Using these carefully crafted prompts, DALL·E 3 creates high-quality, contextually relevant images that complement the audio content.
  • Save the transcription, summary, prompt, and generated image to separate files - The system automatically organizes all outputs in a structured way, making it easy to access and use the generated content.

This comprehensive automation solution transforms raw audio into a rich collection of digital assets. The versatility of this system allows it to be integrated into various production environments:

  • A background task manager (e.g., Celery) for asynchronous processing - Perfect for handling multiple files simultaneously without blocking other operations, ensuring smooth scalability.
  • A cloud function (e.g., AWS Lambda or Google Cloud Functions) for serverless execution - Enables cost-effective, on-demand processing without maintaining constant server infrastructure.
  • A scheduled local script for batch processing - Ideal for regular, automated processing of accumulated audio files at specified intervals.

For this chapter, we'll focus on building a local prototype, which serves as an excellent starting point for creators, researchers, or developers who need a reliable and efficient way to process audio files. This approach allows for easy testing and iteration before scaling to larger deployments.

6.3.2 Step-by-Step Implementation

Step 1: Project Setup

Download the sample audio: https://files.cuantum.tech/audio/automating-summaries.mp3

Create a new directory for your project and navigate into it:

mkdir audio_processing_automation
cd audio_processing_automation

It's recommended to set up a virtual environment:

python -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\\Scripts\\activate  # On Windows

Install the required Python packages:

pip install openai python-dotenv requests

Organize your project files as follows:

/audio_processing_automation

├── main.py
├── .env
└── utils/
    ├── __init__.py
    ├── transcribe.py
    ├── summarize.py
    ├── generate_prompt.py
    └── generate_image.py

  • /audio_processing_automation: The root directory for your project.
  • main.py: The Python script that will automate the audio processing.
  • .env: A file to store your OpenAI API key.
  • utils/: A directory for Python modules containing reusable functions.
    • __init__.py: Makes the utils directory a Python package.
    • transcribe.py: Contains the function to transcribe audio using Whisper.
    • summarize.py: Contains the function to summarize the transcription using a Large Language Model.
    • generate_prompt.py: Contains the function to generate an image prompt from the transcription using a Large Language Model.
    • generate_image.py: Contains the function to generate an image with DALL·E 3.
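
Store your OpenAI API key in the .env file so the script can load it at startup (the value below is a placeholder — substitute your own key):

OPENAI_API_KEY=your-api-key-here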

Step 2: Create the Utility Modules

Create the following Python files in the utils/ directory:

utils/transcribe.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def transcribe_audio(file_path: str) -> Optional[str]:
    """
    Transcribes an audio file using OpenAI's Whisper API.

    Args:
        file_path (str): The path to the audio file.

    Returns:
        Optional[str]: The transcribed text, or None on error.
    """
    try:
        logger.info(f"Transcribing audio: {file_path}")
        with open(file_path, "rb") as audio_file:
            response = openai.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
            )
        transcript = response.text
        logger.info(f"Transcription complete ({len(transcript)} characters)")
        return transcript
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error during transcription: {e}")
        return None

  • This module defines the transcribe_audio function, which takes the path to an audio file as input and uses OpenAI's Whisper API to generate a text transcription.
  • The function opens the audio file in binary read mode ("rb") inside a with block, so the file is closed automatically even if an error occurs.
  • It calls openai.audio.transcriptions.create() to perform the transcription, specifying the "whisper-1" model.
  • It extracts the transcribed text from the API response.
  • It includes error handling using a try...except block to catch openai.OpenAIError exceptions (specific to OpenAI) and a general Exception for other errors. If an error occurs, it logs the error and returns None.
  • It logs the file path before transcription and the length of the transcribed text after successful transcription.
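
To sanity-check this module on its own, you can call it from a small throwaway script — a minimal sketch that assumes the sample audio file sits in the project root and your key is in .env (quick_test.py is a hypothetical name):

# quick_test.py (hypothetical) — run from the project root
import os

import openai
from dotenv import load_dotenv

from utils.transcribe import transcribe_audio

# Load the API key before calling any OpenAI endpoint
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

text = transcribe_audio("automating-summaries.mp3")
print(text[:200] if text else "Transcription failed")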

utils/summarize.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def summarize_transcript(text: str) -> Optional[str]:
    """
    Summarizes a text transcript using OpenAI's Chat Completion API.

    Args:
        text (str): The text transcript to summarize.

    Returns:
        Optional[str]: The summarized text, or None on error.
    """
    try:
        logger.info("Summarizing transcript")
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You are a helpful assistant.  Provide a concise summary of the text, suitable for generating a visual representation."},
                {"role": "user", "content": text}
            ],
        )
        summary = response.choices[0].message.content
        logger.info(f"Summary: {summary}")
        return summary
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating summary: {e}")
        return None

  • This module defines the summarize_transcript function, which takes a text transcript as input and uses OpenAI's Chat Completion API to generate a concise summary.
  • The system message instructs the model to act as a helpful assistant and to provide a concise summary of the text, suitable for generating a visual representation.
  • The user message provides the transcript as the content for the model to summarize.
  • The function extracts the summary from the API response.
  • It includes error handling.
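
If you need tighter control over summary length, the Chat Completions API also accepts a max_tokens cap — a drop-in variation of the call inside summarize_transcript (the 150-token limit is an arbitrary example value):

        response = openai.chat.completions.create(
            model="gpt-4o",
            max_tokens=150,  # hard cap on summary length; 150 is an arbitrary example
            messages=[
                {"role": "system",
                 "content": "You are a helpful assistant. Provide a concise summary of the text, suitable for generating a visual representation."},
                {"role": "user", "content": text}
            ],
        )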

utils/generate_prompt.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def create_image_prompt(transcription: str) -> Optional[str]:
    """
    Generates a detailed image prompt from a text transcription using OpenAI's Chat Completion API.

    Args:
        transcription (str): The text transcription of the audio.

    Returns:
        Optional[str]: A detailed text prompt suitable for image generation, or None on error.
    """
    try:
        logger.info("Generating image prompt from transcription")
        response = openai.chat.completions.create(
            model="gpt-4o",  # Use a powerful chat model
            messages=[
                {
                    "role": "system",
                    "content": "You are a creative assistant. Your task is to create a vivid and detailed text description of a scene that could be used to generate an image with an AI image generation model. Focus on capturing the essence and key visual elements of the audio content.  Do not include any phrases like 'based on the audio' or 'from the user audio'.  Incorporate scene lighting, time of day, weather, and camera angle into the description.  Limit the description to 200 words.",
                },
                {"role": "user", "content": transcription},
            ],
        )
        prompt = response.choices[0].message.content
        prompt = prompt.strip()  # Remove leading/trailing spaces
        logger.info(f"Generated prompt: {prompt}")
        return prompt
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image prompt: {e}")
        return None

  • This module defines the create_image_prompt function, which takes the transcribed text as input and uses OpenAI's Chat Completion API to generate a detailed text prompt for image generation.
  • The system message instructs the model to act as a creative assistant and to generate a vivid scene description. The system prompt is crucial in guiding the LLM to generate a high-quality prompt. We instruct the LLM to focus on visual elements and incorporate details like lighting, time of day, weather, and camera angle. We also limit the description length to 200 words.
  • The user message provides the transcribed text as the content for the model to work with.
  • The function extracts the generated prompt from the API response.
  • It strips any leading/trailing spaces from the generated prompt.
  • It includes error handling.
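
One practical caveat: DALL·E 3 rejects prompts longer than 4,000 characters, so even with the 200-word instruction it can be worth adding a defensive truncation — a one-line guard you could place right after the strip() call in create_image_prompt:

        prompt = prompt[:4000]  # defensive guard: DALL·E 3 caps prompts at 4,000 characters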

utils/generate_image.py:

import openai
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def generate_dalle_image(prompt: str, model: str = "dall-e-3", size: str = "1024x1024",
                       response_format: str = "url", quality: str = "standard") -> Optional[str]:
    """
    Generates an image using OpenAI's DALL·E API.

    Args:
        prompt (str): The text prompt to generate the image from.
        model (str, optional): The DALL·E model to use. Defaults to "dall-e-3".
        size (str, optional): The size of the generated image. Defaults to "1024x1024".
        response_format (str, optional): The format of the response. Defaults to "url".
        quality (str, optional): The quality of the image. Defaults to "standard".

    Returns:
        Optional[str]: The URL of the generated image, or None on error.
    """
    try:
        logger.info(f"Generating image with prompt: {prompt}, model: {model}, size: {size}, format: {response_format}, quality: {quality}")
        response = openai.images.generate(
            prompt=prompt,
            model=model,
            size=size,
            response_format=response_format,
            quality=quality
        )
        image_url = response.data[0].url
        logger.info(f"Image URL: {image_url}")
        return image_url
    except openai.OpenAIError as e:
        logger.error(f"OpenAI API Error: {e}")
        return None
    except Exception as e:
        logger.error(f"Error generating image: {e}")
        return None

  • This module defines the generate_dalle_image function, which takes a text prompt as input and uses OpenAI's DALL·E API to generate an image.
  • It calls the openai.images.generate() method to generate the image.
  • It accepts optional model, size, response_format, and quality parameters, allowing the user to configure the image generation.
  • It extracts the URL of the generated image from the API response.
  • It includes error handling.
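
Because response_format defaults to "url", main.py (next step) has to download the image in a second HTTP request. The Images API also supports response_format="b64_json", which returns the image bytes inline — a minimal sketch of that variant, calling the API directly rather than through the helper above:

import base64
import openai

response = openai.images.generate(
    prompt="A quiet harbor at dawn, golden light, wide angle",  # placeholder prompt
    model="dall-e-3",
    size="1024x1024",
    response_format="b64_json",
)

# data[0].b64_json holds the base64-encoded image instead of a URL
image_bytes = base64.b64decode(response.data[0].b64_json)
with open("image.png", "wb") as f:
    f.write(image_bytes)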

Step 3: Create the Main Script (main.py)

Create a Python file named main.py in the root directory of your project and add the following code:

import os
import time
import logging

import openai
import requests
from dotenv import load_dotenv

# Load the OpenAI API key from the .env file
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Import the utility functions from the utils directory
from utils.transcribe import transcribe_audio
from utils.summarize import summarize_transcript
from utils.generate_prompt import create_image_prompt
from utils.generate_image import generate_dalle_image

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

UPLOAD_DIR = "uploads"  # Directory where audio files are placed
OUTPUT_DIR = "outputs"  # Directory where results are saved
ALLOWED_EXTENSIONS = {'mp3', 'mp4', 'wav', 'm4a'}

def process_audio(file_path: str) -> None:
    """
    Processes an audio file, transcribing, summarizing, generating a prompt, and generating an image.

    Args:
        file_path (str): The path to the audio file to process.
    """
    logger.info(f"Processing audio file: {file_path}")
    base_name = os.path.splitext(os.path.basename(file_path))[0]  # Get filename without extension

    # Create output directory for this file
    output_path = os.path.join(OUTPUT_DIR, base_name)
    os.makedirs(output_path, exist_ok=True)

    try:
        # Transcription
        transcript = transcribe_audio(file_path)
        if transcript:
            transcript_file_path = os.path.join(output_path, f"{base_name}_transcript.txt")
            with open(transcript_file_path, "w") as f:
                f.write(transcript)
            logger.info(f"Transcription saved to {transcript_file_path}")
        else:
            logger.error(f"Transcription failed for {file_path}")
            return  # Stop processing if transcription fails

        # Summary
        summary = summarize_transcript(transcript)
        if summary:
            summary_file_path = os.path.join(output_path, f"{base_name}_summary.txt")
            with open(summary_file_path, "w") as f:
                f.write(summary)
            logger.info(f"Summary saved to {summary_file_path}")
        else:
            logger.error(f"Summary failed for {file_path}")
            return

        # Image Prompt
        prompt = create_image_prompt(transcript)
        if prompt:
            prompt_file_path = os.path.join(output_path, f"{base_name}_prompt.txt")
            with open(prompt_file_path, "w") as f:
                f.write(prompt)
            logger.info(f"Prompt saved to {prompt_file_path}")
        else:
            logger.error(f"Prompt generation failed for {file_path}")
            return

        # Image Generation
        image_url = generate_dalle_image(prompt)
        if image_url:
            try:
                img_response = requests.get(image_url, timeout=60)
                img_response.raise_for_status()  # Fail loudly on HTTP error statuses
                image_file_path = os.path.join(output_path, f"{base_name}_image.png")
                with open(image_file_path, "wb") as f:
                    f.write(img_response.content)
                logger.info(f"Image saved to {image_file_path}")
            except requests.exceptions.RequestException as e:
                logger.error(f"Error downloading image: {e}")
                return
        else:
            logger.error(f"Image generation failed for {file_path}")
            return

        logger.info(f"Successfully processed audio file: {file_path}")

    except Exception as e:
        logger.error(f"An error occurred while processing {file_path}: {e}")


def watch_folder(upload_dir: str, output_dir: str) -> None:
    """
    Monitors a folder for new audio files and processes them.

    Args:
        upload_dir (str): The path to the directory to watch for new audio files.
        output_dir (str): The path to the directory where results should be saved.
    """
    logger.info(f"Watching folder: {upload_dir} for new audio files...")
    processed_files = set()

    while True:
        try:
            files = [
                os.path.join(upload_dir, f)
                for f in os.listdir(upload_dir)
                if f.lower().endswith(tuple(f".{ext}" for ext in ALLOWED_EXTENSIONS))
            ]
            for file_path in files:
                if file_path not in processed_files and os.path.isfile(file_path):
                    process_audio(file_path)
                    processed_files.add(file_path)
        except Exception as e:
            logger.error(f"Error while watching folder: {e}")
        time.sleep(5)  # Check for new files every 5 seconds



if __name__ == "__main__":
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    watch_folder(UPLOAD_DIR, OUTPUT_DIR)

Code Breakdown:

  • Import Statements: Imports the necessary libraries:
    • os: For interacting with the operating system (file paths, directories).
    • time: For adding delays (pausing between checks).
    • logging: For logging events.
    • openai: For setting the API key that the utility modules rely on.
    • requests: For downloading the generated image from its URL.
    • dotenv: For loading the OpenAI API key from the .env file.
  • Constants:
    • UPLOAD_DIR: The directory where audio files are expected to be uploaded.
    • OUTPUT_DIR: The directory where the processed results (transcripts, summaries, prompts, images) will be saved.
    • ALLOWED_EXTENSIONS: A set of allowed audio file extensions.
  • transcribe_audio Function (from utils/transcribe.py):
    • Takes the audio file path as input.
    • Opens the audio file in binary read mode ("rb").
    • Calls the OpenAI API's openai.audio.transcriptions.create() method to transcribe the audio.
    • Extracts the transcribed text from the API response.
    • Logs the file path before transcription and the length of the transcribed text after successful transcription.
    • Includes error handling using a try...except block to catch openai.OpenAIError exceptions (specific to OpenAI) and a general Exception for other errors.
  • create_image_prompt Function (from utils/generate_prompt.py):
    • Takes the transcribed text as input.
    • Uses the OpenAI Chat Completion API (openai.chat.completions.create()) with the gpt-4o model to generate a detailed text prompt suitable for image generation.
    • The system message instructs the model to act as a creative assistant and provide a vivid and detailed description of a scene that could be used to generate an image with an AI image generation model.
    • Extracts the generated prompt from the API response.
    • Includes error handling.
  • generate_dalle_image Function (from utils/generate_image.py):
    • Takes the image prompt as input.
    • Calls the OpenAI API's openai.images.generate() method to generate an image using DALL·E 3.
    • Accepts optional parameters for model, size, response_format, and quality.
    • Extracts the URL of the generated image from the API response.
    • Includes error handling.
  • summarize_transcript Function (from utils/summarize.py):
    • Takes the transcribed text as input.
    • Uses the Chat Completion API with the gpt-4o model to produce a concise summary suitable for guiding a visual representation.
    • Extracts the summary from the API response and includes error handling.
  • process_audio Function:
    • Takes the path to an audio file as input.
    • Performs the following steps:
      1. Logs the start of the processing.
      2. Extracts the base filename (without extension) from the audio file path.
      3. Creates a directory within the OUTPUT_DIR using the base filename to store the results for this specific audio file.
      4. Calls transcribe_audio() to transcribe the audio. If transcription fails, it logs an error and returns.
      5. Saves the transcription to a text file in the output directory.
      6. Calls summarize_transcript() to summarize the transcription. If summarization fails, it logs an error and returns.
      7. Saves the summary to a text file in the output directory.
      8. Calls create_image_prompt() to generate an image prompt. If prompt generation fails, it logs an error and returns.
      9. Saves the prompt to a text file in the output directory.
      10. Calls generate_dalle_image() to generate an image from the prompt. If image generation fails, it logs an error and returns.
      11. Downloads the image from the URL and saves it as a PNG file in the output directory.
      12. Logs the successful completion of the processing.
    • Includes a broad try...except block to catch any exceptions during the processing of a single audio file. Any errors during the process are logged.
  • watch_folder Function:
    • Takes the paths to the upload and output directories as input.
    • Logs the start of the folder watching process.
    • Initializes an empty set processed_files to keep track of processed files.
    • Enters an infinite loop:
      1. Lists all files in the upload_dir that have allowed audio file extensions.
      2. Iterates through the files:
        • If a file has not been processed yet:
          • Calls process_audio() to process the file.
          • Adds the file path to the processed_files set.
      3. Pauses for 5 seconds using time.sleep(5) before checking for new files again.
    • Includes a try...except block to catch any exceptions that might occur during the folder watching process. Any errors are logged.
  • Main Execution Block:
    • if __name__ == "__main__":: Ensures that the following code is executed only when the script is run directly (not when imported as a module).
    • Creates the UPLOAD_DIR and OUTPUT_DIR directories if they don't exist.
    • Calls the watch_folder() function to start monitoring the upload directory.
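
To try the pipeline end to end, start the watcher and then drop the sample audio into uploads/ from a second terminal (the source path below assumes the file landed in your Downloads folder — adjust as needed):

python main.py

Then, in a second terminal, from the project root:

cp ~/Downloads/automating-summaries.mp3 uploads/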

6.3.3 Tips for Scaling This Workflow

  • Add email/SMS/Slack notifications when a new item is ready: Implement automated notifications to keep users informed when their content has been processed. This can be done using services like SendGrid for email, Twilio for SMS, or Slack's webhook API for instant messaging. This ensures users don't need to constantly check for completed items.
  • Store results in a database or cloud bucket: Instead of storing files locally, utilize cloud storage solutions like AWS S3, Google Cloud Storage, or Azure Blob Storage. This provides better scalability, backup protection, and easier access management. For structured data, consider using databases like PostgreSQL or MongoDB to enable efficient querying and management of processed results.
  • Use threading or async libraries (like watchdog, aiofiles) for better performance: Replace the basic file-watching loop with more sophisticated solutions. The watchdog library provides reliable file system event monitoring, while aiofiles enables asynchronous file operations. This prevents blocking operations and improves overall system responsiveness (see the sketch after this list).
  • Connect to Google Drive or Dropbox APIs to watch for uploads remotely: Expand beyond local file monitoring by integrating with cloud storage APIs. This allows users to trigger processing by simply uploading files to their preferred cloud storage service. Implement webhook listeners or use API polling to detect new uploads.
  • Add a task queue (e.g., Celery or RQ) for concurrent processing: Replace synchronous processing with a distributed task queue system. Celery or Redis Queue (RQ) can manage multiple workers processing files simultaneously, handle retries on failures, and provide task prioritization. This is essential for handling high volumes of uploads and preventing system overload.
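
As an illustration of the watchdog suggestion above, here is a minimal event-driven sketch that could replace the polling loop in watch_folder. It assumes pip install watchdog and reuses process_audio, UPLOAD_DIR, and ALLOWED_EXTENSIONS from main.py:

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler


class AudioHandler(FileSystemEventHandler):
    def on_created(self, event):
        # React only to newly created audio files, not directories
        if not event.is_directory and event.src_path.lower().endswith(
            tuple(f".{ext}" for ext in ALLOWED_EXTENSIONS)
        ):
            process_audio(event.src_path)


observer = Observer()
observer.schedule(AudioHandler(), UPLOAD_DIR, recursive=False)
observer.start()
try:
    observer.join()  # Block until interrupted (Ctrl+C)
except KeyboardInterrupt:
    observer.stop()
    observer.join()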

In this section, you transformed your collection of AI tools into a powerful automated creative engine. This system represents a significant advancement in multimodal AI processing - rather than manually feeding content through different models, you've created an intelligent pipeline that automatically handles the entire workflow. The script you built actively monitors for new files, seamlessly processes them through three sophisticated OpenAI models (Whisper, GPT, and DALL·E), and organizes the results into a structured collection of summaries, creative prompts, and generated images.

This type of automation revolutionizes content processing by eliminating manual steps and creating a continuous, hands-free workflow. The applications for this system are diverse and powerful, particularly beneficial for:

  • Content teams: Streamlining the creation of multimedia content by automatically generating complementary assets from audio recordings
  • Solo creators: Enabling individual content creators to multiply their creative output without additional manual effort
  • Researchers: Automating the transcription and visualization of interviews, field recordings, and research notes
  • Voice diary apps: Converting spoken journals into rich multimedia experiences with generated imagery
  • Educational tools: Transforming lectures and educational content into engaging visual and textual materials
  • AI-based documentation systems: Automatically generating comprehensive documentation with visual aids from voice recordings

Perhaps most remarkably, you've achieved this sophisticated automation with just a few Python files. This represents a democratization of technology that previously would have required a team of specialized engineers and complex infrastructure to implement. By leveraging modern AI APIs and smart programming practices, you've created an enterprise-grade creative automation system that can be maintained and modified by a single developer.
