Under the Hood of Large Language Models

Chapter 5: Beyond Text: Multimodal LLMs

5.2 Audio & Speech Integration (Whisper, SpeechLM)

Language is not only written but spoken. The ability to listen, transcribe, and respond to speech is essential if AI is to become a seamless assistant in daily life. Recent advances in speech recognition and speech-language modeling have made it possible to integrate audio directly into large-scale language systems. This integration represents a significant leap forward in AI capabilities, as it bridges the gap between written and spoken communication.

Speech is our most natural form of communication, and by enabling AI to process audio inputs, we create more intuitive interfaces that don't require users to type or read. This is particularly important for accessibility, allowing those with limited mobility, vision impairments, or literacy challenges to interact with technology. Furthermore, speech carries additional information through tone, pace, and emphasis that text alone cannot convey, providing richer context for AI systems to understand human intent.

Let's explore two key directions in speech-enabled AI:

Whisper – OpenAI's robust speech-to-text system. This open-source model represents a breakthrough in transcription technology with its ability to handle diverse accents, background noise, and technical vocabulary.

Unlike previous speech recognition systems that struggled with real-world audio conditions, Whisper demonstrates remarkable accuracy even with challenging inputs such as podcast conversations, lecture recordings, or phone calls.

SpeechLM / SpeechGPT – models that extend transformers to directly handle audio-text tasks. These advanced systems go beyond simple transcription by maintaining the connection between acoustic features and semantic meaning.

Rather than treating speech-to-text as a separate preprocessing step, they incorporate audio understanding directly into the language modeling process, enabling more nuanced responses that consider not just what was said, but how it was said.

5.2.1 Whisper: Universal Speech Recognition

Whisper is an open-source model from OpenAI designed for speech recognition, translation, and transcription across many languages. Released in September 2022, it represents a significant advancement in audio processing technology by providing robust performance across diverse acoustic environments and speaking styles. Unlike previous speech recognition systems that often struggled with accents, background noise, or specialized vocabulary, Whisper was trained on an extensive dataset of 680,000 hours of multilingual and multitask supervised data collected from the web, giving it remarkable versatility and accuracy.

The model architecture combines a transformer-based encoder that processes audio spectrograms with a decoder similar to GPT models that generates text output. This design allows Whisper to handle the complexities of human speech, including variations in pitch, tone, cadence, and pronunciation across different languages and dialects.

What makes Whisper particularly groundbreaking is its zero-shot capabilities: it can recognize and transcribe speech in languages it wasn't explicitly fine-tuned for. Additionally, Whisper can automatically detect the spoken language, translate speech directly into English, and even handle code-switching (when speakers alternate between multiple languages within a conversation). This versatility makes it valuable for applications ranging from automatic meeting transcription to cross-lingual communication tools and accessibility services for the hearing impaired.
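
As a quick illustration, the minimal sketch below uses the open-source whisper package to auto-detect the spoken language and then translate the same audio directly into English (the filename is an assumption; any audio file works):

import whisper

model = whisper.load_model("base")

# Transcribe in the original language; Whisper auto-detects it
result = model.transcribe("speech_sample.mp3")
print(result["language"], result["text"])

# Translate the same audio directly into English
translated = model.transcribe("speech_sample.mp3", task="translate")
print(translated["text"])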

Key features:

  • Trained on 680,000 hours of multilingual audio from the web, including a wide variety of accents, dialects, and background conditions. This massive and diverse training dataset enables Whisper to handle real-world audio that previous systems struggled with. The dataset's scale provides broad coverage across linguistic variations, regional accents, speaking styles, and acoustic environments, giving Whisper an unprecedented ability to understand speech in virtually any context. This extensive training directly translates to Whisper's ability to transcribe speech from speakers with accents or dialects traditionally underrepresented in AI training data.
  • Handles noisy, real-world audio (e.g., phone calls, lectures, podcasts, street recordings) with remarkable resilience. Unlike earlier models that performed well only in studio-quality conditions, Whisper maintains accuracy even with background noise, overlapping speakers, or varying microphone quality. This robustness stems from its exposure to diverse acoustic environments during training, allowing it to filter out irrelevant sounds and focus on the speech signal. Whether processing a recording from a busy café, a conference room with echoing acoustics, or an outdoor interview with wind interference, Whisper can extract the spoken content with surprising accuracy.
  • Supports transcription, translation, and language identification across 99 languages. This multilingual capability allows it to automatically detect the spoken language and process content from global sources without requiring manual language selection. Whisper can seamlessly transcribe content in languages ranging from widely-spoken ones like English, Spanish, and Mandarin to less common languages like Swahili, Lithuanian, and Nepali. This language versatility makes it an invaluable tool for global communication, international research, and cross-cultural content creation. Even more impressively, Whisper can identify when speakers switch between languages mid-conversation, a phenomenon known as code-switching.
  • Features zero-shot learning capabilities, meaning it can perform tasks it wasn't explicitly fine-tuned for, adapting to new scenarios without additional training. This remarkable ability allows Whisper to generalize its knowledge to unfamiliar contexts, speakers, and acoustic environments. For example, without specific fine-tuning, it can transcribe technical jargon in fields like medicine or engineering, understand regional dialects it hasn't explicitly seen before, or adapt to novel audio recording conditions. This zero-shot capability is particularly valuable in practical applications where the diversity of real-world speech would otherwise require countless specialized models for different scenarios.

At its core, Whisper combines a log-Mel spectrogram encoder with a decoder similar to GPT, allowing it to map raw audio to natural language text. The encoder transforms audio waveforms into spectrograms—visual representations of sound frequencies over time—which capture the acoustic patterns in speech. This process begins by converting the raw audio signal into a spectrogram using the Short-Time Fourier Transform (STFT), which breaks down the audio into its frequency components.

These components are then mapped to the Mel scale, which approximates how humans perceive sound frequencies, with greater sensitivity to lower frequencies than higher ones. The resulting log-Mel spectrogram provides a compact representation of the audio that emphasizes the most perceptually relevant features.
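
To make this concrete, here is a minimal sketch of computing a log-Mel spectrogram with librosa, using window and hop settings similar to Whisper's front end (25 ms windows, 10 ms hops, 80 Mel bins at 16 kHz); the filename is an assumption:

import numpy as np
import librosa

# Load audio at 16 kHz (the sampling rate Whisper expects)
y, sr = librosa.load("speech_sample.mp3", sr=16000)

# STFT -> Mel filter bank -> log compression
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80
)
log_mel = np.log10(np.maximum(mel, 1e-10))

print(log_mel.shape)  # (80, number_of_frames)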

These spectrograms are then processed through a transformer encoder that extracts meaningful features. The transformer architecture, with its self-attention mechanisms, allows the model to focus on different parts of the spectrogram simultaneously, capturing both local phonetic details and broader acoustic patterns. This is crucial for handling variations in speech like different accents, speaking rates, and background noise.

The GPT-style decoder then converts these features into text, treating transcription as a sequence prediction task similar to language modeling. This decoder works autoregressively, generating each word or token based on both the encoded audio features and the previously generated text. This approach enables Whisper to maintain contextual coherence throughout the transcription, correctly interpreting ambiguous sounds based on their surrounding context, and producing natural-sounding text that accurately reflects the original speech.
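
The lower-level whisper API exposes these stages explicitly. The sketch below (following the usage patterns of the openai-whisper package; the filename is an assumption) builds the log-Mel encoder input, detects the language, and runs the autoregressive decoder:

import whisper

model = whisper.load_model("base")

# Load audio and pad/trim it to the 30-second window the encoder expects
audio = whisper.load_audio("speech_sample.mp3")
audio = whisper.pad_or_trim(audio)

# Encoder input: log-Mel spectrogram on the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Language identification from the encoder features
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Autoregressive decoding to text
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)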

Example: Transcribing Audio with Whisper

# Comprehensive implementation of Whisper for audio transcription

# Install required libraries
# pip install git+https://github.com/openai/whisper.git
# pip install librosa matplotlib numpy

import whisper
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import time
import torch
from pathlib import Path

def visualize_audio(audio_path):
    """Visualize the audio waveform and spectrogram"""
    y, sr = librosa.load(audio_path)
    
    # Create a figure with two subplots
    plt.figure(figsize=(12, 8))
    
    # Plot waveform
    plt.subplot(2, 1, 1)
    librosa.display.waveshow(y, sr=sr)
    plt.title('Waveform')
    
    # Plot spectrogram
    plt.subplot(2, 1, 2)
    D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='log')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Log-frequency power spectrogram')
    
    plt.tight_layout()
    plt.show()

def transcribe_audio(audio_path, model_size="base", language=None, verbose=True):
    """
    Transcribe audio using OpenAI's Whisper model
    
    Parameters:
    - audio_path: Path to the audio file
    - model_size: Size of the Whisper model to use (tiny, base, small, medium, large)
    - language: Language code (e.g., "en" for English) or None for auto-detection
    - verbose: Whether to print progress information
    
    Returns:
    - Dictionary containing transcription results
    """
    start_time = time.time()
    
    if verbose:
        print(f"Loading Whisper model: {model_size}")
    
    # Load pre-trained Whisper model
    model = whisper.load_model(model_size)
    
    model_load_time = time.time()
    if verbose:
        print(f"Model loaded in {model_load_time - start_time:.2f} seconds")
        print(f"Transcribing: {audio_path}")
    
    # Set transcription options
    options = {}
    if language:
        options["language"] = language
    
    # Transcribe the audio file
    result = model.transcribe(audio_path, **options)
    
    end_time = time.time()
    if verbose:
        print(f"Transcription completed in {end_time - model_load_time:.2f} seconds")
        print(f"Detected language: {result['language']} (confidence: {result.get('language_probability', 0):.2f})")
        print(f"Total processing time: {end_time - start_time:.2f} seconds")
    
    return result

def save_transcription(result, output_file=None):
    """Save transcription results to a text file"""
    if output_file is None:
        output_file = "transcription_output.txt"
    
    with open(output_file, "w", encoding="utf-8") as f:
        # Write the full transcription
        f.write("FULL TRANSCRIPTION:\n")
        f.write(result["text"])
        f.write("\n\n")
        
        # Write segment-by-segment with timestamps
        f.write("SEGMENTS WITH TIMESTAMPS:\n")
        for segment in result["segments"]:
            start = segment["start"]
            end = segment["end"]
            text = segment["text"]
            f.write(f"[{start:.2f}s - {end:.2f}s] {text}\n")
    
    return output_file

def batch_transcribe(directory, extension=".mp3", output_dir=None):
    """Transcribe all audio files with the given extension in a directory"""
    if output_dir is None:
        output_dir = Path("transcription_results")
    
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)
    
    directory = Path(directory)
    audio_files = list(directory.glob(f"*{extension}"))
    
    print(f"Found {len(audio_files)} {extension} files in {directory}")
    
    for audio_file in audio_files:
        print(f"\nProcessing: {audio_file.name}")
        result = transcribe_audio(str(audio_file))
        
        output_file = output_dir / f"{audio_file.stem}_transcription.txt"
        save_transcription(result, output_file)
        print(f"Saved transcription to: {output_file}")

# Example usage
if __name__ == "__main__":
    # Set the path to your audio file
    audio_path = "speech_sample.mp3"
    
    # Check if CUDA is available for GPU acceleration
    cuda_available = torch.cuda.is_available()
    print(f"CUDA available: {cuda_available}")
    
    # Visualize the audio (optional)
    # visualize_audio(audio_path)
    
    # Transcribe the audio
    result = transcribe_audio(audio_path, model_size="base")
    
    # Print the transcription
    print("\nTRANSCRIPTION:")
    print(result["text"])
    
    # Save the transcription to a file
    output_file = save_transcription(result)
    print(f"\nSaved transcription to: {output_file}")
    
    # Example of batch processing
    # batch_transcribe("audio_folder", extension=".wav")

Download the speech sample here: https://files.cuantum.tech/audio/speech_sample.mp3

Note: Save the example audio in the same location as the Python script.

Breaking Down the Whisper Implementation:

1. Setup and Dependencies

The code begins by installing the necessary libraries: Whisper (directly from GitHub), librosa (for audio processing and visualization), matplotlib (for visualization), and numpy (for numerical operations). These libraries provide the foundation for audio processing and transcription.

2. Audio Visualization Function

The visualize_audio() function uses librosa to create two important visualizations:

  • A waveform display showing amplitude over time, which represents how the audio signal varies
  • A log-frequency spectrogram showing how energy is distributed across different frequencies over time, which helps analyze speech characteristics

These visualizations can help users understand the audio characteristics before transcription.

3. Core Transcription Function

The transcribe_audio() function is the heart of the implementation:

  • It accepts parameters for audio path, model size, language, and verbosity level
  • It loads the specified Whisper model (from tiny to large, with larger models being more accurate but slower; a quick size-comparison sketch follows this list)
  • It tracks processing time to provide performance metrics
  • It supports automatic language detection or allows specifying a language code
  • It returns a comprehensive result object containing the transcription and metadata
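
For example, an informal way to compare model sizes on your own audio is to reuse the transcribe_audio() helper defined above in a loop; larger models generally improve accuracy at the cost of load and inference time (the filename is an assumption):

# Compare Whisper model sizes on the same file using transcribe_audio() from above
for size in ["tiny", "base", "small"]:
    print(f"\n=== Model: {size} ===")
    result = transcribe_audio("speech_sample.mp3", model_size=size)
    print(result["text"][:200])  # first 200 characters of each transcription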

4. Results Processing

The save_transcription() function processes the Whisper results into user-friendly formats:

  • It saves the complete transcription text
  • It also extracts and formats individual segments with their timestamps, which is crucial for aligning transcription with audio timing
  • This enables applications like subtitle generation or time-synchronized content analysis
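
For instance, the segment timestamps map directly onto subtitle formats. The minimal sketch below (the output filename is an arbitrary choice) turns a Whisper result into an SRT subtitle file:

def to_srt_timestamp(seconds):
    """Convert seconds to the HH:MM:SS,mmm format used by SRT subtitles."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds - int(seconds)) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def save_srt(result, srt_path="transcription.srt"):
    """Write Whisper segments as a simple SRT subtitle file."""
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, segment in enumerate(result["segments"], start=1):
            f.write(f"{i}\n")
            f.write(f"{to_srt_timestamp(segment['start'])} --> {to_srt_timestamp(segment['end'])}\n")
            f.write(f"{segment['text'].strip()}\n\n")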

5. Batch Processing Capability

The batch_transcribe() function extends the utility to handle multiple audio files:

  • It processes all audio files with a specified extension in a directory
  • It organizes outputs into a dedicated directory structure
  • This is valuable for transcribing podcasts, interview series, or lecture collections

6. Example Usage

The main execution block demonstrates how to use these functions in practice:

  • It checks for GPU acceleration via CUDA, which can significantly improve performance for larger models
  • It offers options for audio visualization (commented out by default)
  • It performs transcription and displays the results
  • It saves the output to a file for future reference
  • It includes a commented example of batch processing

Advanced Features:

This implementation goes beyond basic transcription by including:

  • Performance timing to measure processing efficiency
  • Language detection reporting
  • Segment-level transcription with timestamps
  • Hardware acceleration detection
  • Audio analysis capabilities
  • Batch processing for multiple files

This example implementation provides a complete workflow for audio transcription, from preprocessing through visualization, transcription, and results management, making it suitable for both individual use cases and larger-scale applications.

Example: Advanced implementation of Whisper for real-time transcription with visualization

import whisper
import numpy as np
import pyaudio
import threading
import time
import queue
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from collections import deque
import torch
import os
from datetime import datetime

class WhisperRealtimeTranscriber:
    def __init__(self, model_size="base", language="en", energy_threshold=0.01,
                 record_timeout=2, phrase_timeout=3, max_sentences=10):
        """
        Initialize the real-time transcriber with Whisper
        
        Parameters:
        - model_size: Size of Whisper model ("tiny", "base", "small", "medium", "large")
        - language: Language code or None for auto-detection
        - energy_threshold: Minimum RMS energy (for float32 audio in [-1, 1]) to treat a chunk as speech
        - record_timeout: Time in seconds to recheck if audio is speech
        - phrase_timeout: Time in seconds of silence to consider a phrase complete
        - max_sentences: Maximum number of sentences to display in history
        """
        self.model_name = model_size
        self.language = language
        self.energy_threshold = energy_threshold
        self.record_timeout = record_timeout
        self.phrase_timeout = phrase_timeout
        self.max_sentences = max_sentences
        
        # Check for GPU
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {self.device}")
        
        # Load Whisper model
        print(f"Loading Whisper {model_size} model...")
        self.model = whisper.load_model(model_size).to(self.device)
        print("Model loaded!")
        
        # Initialize audio processing
        self.audio_queue = queue.Queue()
        self.result_queue = queue.Queue()
        self.audio_data = np.zeros(0, dtype=np.float32)
        
        # For visualization
        self.audio_buffer = deque(maxlen=64000)  # ~4 seconds of samples at 16 kHz
        self.waveform_data = np.zeros(64000)
        self.spectrogram_data = np.zeros((80, 200))  # (n_mels, frames) placeholder
        self.transcript_history = []
        self.recording = False
        self.terminated = False
        
        # Audio parameters
        self.sample_rate = 16000
        self.audio_format = pyaudio.paFloat32
        self.channels = 1
        self.chunk = 1024
        
        # Setup PyAudio
        self.p = pyaudio.PyAudio()
        
    def _get_audio_input_stream(self):
        """Create and return an input audio stream"""
        stream = self.p.open(
            format=self.audio_format,
            channels=self.channels,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk
        )
        return stream
    
    def _audio_capture_thread(self):
        """Thread function for capturing audio"""
        stream = self._get_audio_input_stream()
        phrase_time = None
        
        print("Listening for audio...")
        
        try:
            while not self.terminated:
                # Get new audio chunk
                current_sample = stream.read(self.chunk, exception_on_overflow=False)
                
                # Convert to numpy array
                data = np.frombuffer(current_sample, dtype=np.float32)
                
                # Update audio buffer for visualization
                self.audio_buffer.extend(data)
                self.waveform_data = np.array(list(self.audio_buffer))
                
                # Calculate audio energy
                energy = np.sqrt(np.mean(data**2))
                
                # Detect if audio is speech
                if energy > self.energy_threshold:
                    self.recording = True
                    
                    # Reset phrase timeout
                    phrase_time = None
                    
                    # Add audio to processing queue
                    self.audio_data = np.append(self.audio_data, data)
                
                # Handle phrase timeout
                elif self.recording:
                    if phrase_time is None:
                        phrase_time = time.time()
                    
                    # If enough silence, process the audio phrase
                    if time.time() - phrase_time > self.phrase_timeout:
                        if len(self.audio_data) > 0:
                            self.audio_queue.put(self.audio_data.copy())
                            self.audio_data = np.zeros(0, dtype=np.float32)
                        
                        self.recording = False
                        phrase_time = None
                
                # Process fixed chunks of audio regardless of speech detection
                if len(self.audio_data) > self.sample_rate * self.record_timeout:
                    self.audio_queue.put(self.audio_data.copy())
                    self.audio_data = self.audio_data[int(self.sample_rate * self.record_timeout):]
                
                time.sleep(0.01)
                
        finally:
            stream.stop_stream()
            stream.close()
    
    def _transcription_thread(self):
        """Thread function for processing audio with Whisper"""
        while not self.terminated:
            try:
                # Get audio data from queue
                if self.audio_queue.empty():
                    time.sleep(0.1)
                    continue
                
                audio_data = self.audio_queue.get()
                
                # Skip processing very short audio clips
                if len(audio_data) < 0.5 * self.sample_rate:
                    continue
                
                # Process audio with Whisper
                start_time = time.time()
                
                # Generate Mel spectrogram for visualization (shape: n_mels x frames)
                mel = whisper.log_mel_spectrogram(audio_data)
                self.spectrogram_data = mel.cpu().numpy()
                
                # Transcribe with Whisper
                options = {"language": self.language} if self.language else {}
                result = self.model.transcribe(audio_data, **options)
                
                # Get transcription result
                text = result["text"].strip()
                elapsed = time.time() - start_time
                
                # Skip empty results
                if len(text) == 0:
                    continue
                
                # Add timestamp and transcription to history
                timestamp = datetime.now().strftime("%H:%M:%S")
                entry = f"[{timestamp}] {text}"
                self.transcript_history.append(entry)
                
                # Keep only most recent entries
                if len(self.transcript_history) > self.max_sentences:
                    self.transcript_history = self.transcript_history[-self.max_sentences:]
                
                # Print result
                print(f"Transcribed ({elapsed:.2f}s): {text}")
                
            except Exception as e:
                print(f"Error in transcription thread: {e}")
                
    def _update_visualization(self, frame):
        """Update function for matplotlib animation"""
        # Clear previous plots
        plt.clf()
        
        # Plot audio waveform
        plt.subplot(3, 1, 1)
        plt.plot(self.waveform_data)
        plt.title("Audio Waveform")
        plt.ylim([-0.5, 0.5])
        
        # Plot status
        if self.recording:
            plt.gca().set_facecolor((0.9, 0.9, 1))
            plt.title("Audio Waveform - RECORDING")
        
        # Plot Mel spectrogram
        plt.subplot(3, 1, 2)
        plt.imshow(self.spectrogram_data, aspect='auto', origin='lower')
        plt.title("Mel Spectrogram")
        plt.tight_layout()
        
        # Show transcript history
        plt.subplot(3, 1, 3)
        plt.axis('off')
        history_text = "\n".join(self.transcript_history)
        plt.text(0.05, 0.95, history_text, 
                 verticalalignment='top', wrap=True, fontsize=9,
                 bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.2))
        plt.title("Transcript History")
        
        # Adjust layout
        plt.subplots_adjust(hspace=0.5)
        
    def start(self, visualize=True):
        """Start the real-time transcription system"""
        # Start audio capture thread
        audio_thread = threading.Thread(target=self._audio_capture_thread)
        audio_thread.daemon = True
        audio_thread.start()
        
        # Start transcription thread
        transcription_thread = threading.Thread(target=self._transcription_thread)
        transcription_thread.daemon = True
        transcription_thread.start()
        
        try:
            if visualize:
                # Set up visualization
                plt.figure(figsize=(10, 8))
                ani = FuncAnimation(plt.gcf(), self._update_visualization, interval=100)
                plt.show()
            else:
                # Just keep the main thread alive
                while True:
                    time.sleep(1)
        except KeyboardInterrupt:
            print("Stopping...")
        finally:
            self.terminated = True
            self.p.terminate()

    def save_transcript(self, filename=None):
        """Save the transcript history to a file"""
        if filename is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"transcript_{timestamp}.txt"
        
        with open(filename, "w", encoding="utf-8") as f:
            for entry in self.transcript_history:
                f.write(f"{entry}\n")
        
        print(f"Transcript saved to {filename}")

# Example usage
if __name__ == "__main__":
    # Create and start the transcriber
    transcriber = WhisperRealtimeTranscriber(
        model_size="base",
        language="en",
        energy_threshold=0.01,
        record_timeout=2,
        phrase_timeout=1
    )
    
    try:
        transcriber.start(visualize=True)
    except KeyboardInterrupt:
        pass
    finally:
        transcriber.save_transcript()

Note: To use this example, you'll need the pyaudio package installed and a working microphone, and you'll need to speak during program execution.

Breaking Down the Real-Time Whisper Implementation:

1. Overall Architecture

This advanced implementation creates a real-time speech transcription system using Whisper. Unlike the previous example that processes existing files, this version:

  • Captures live audio input from a microphone
  • Processes audio in chunks as it arrives
  • Provides real-time visualization of the audio signal and transcription
  • Runs Whisper inference continuously on a separate thread

2. Class Structure and Initialization

The WhisperRealtimeTranscriber class encapsulates the entire system:

  • Manages multiple threads for audio capture and processing
  • Maintains queues for communication between threads
  • Configures parameters like energy thresholds for speech detection
  • Initializes visualization components including waveform and spectrogram displays
  • Sets up the Whisper model with GPU acceleration when available

3. Audio Capture System

The _audio_capture_thread method handles continuous audio input:

  • Uses PyAudio to access the microphone stream
  • Implements energy-based voice activity detection to identify speech
  • Manages "phrases" by detecting pauses between speech segments
  • Updates a circular buffer for visualization purposes
  • Queues detected speech for transcription processing

4. Whisper Transcription Engine

The _transcription_thread implements the core speech-to-text functionality:

  • Retrieves audio segments from the queue when available
  • Filters out audio clips that are too short
  • Generates mel spectrograms for both transcription and visualization
  • Runs the Whisper model inference to convert speech to text
  • Maintains a transcript history with timestamps
  • Measures and reports processing time for performance monitoring

5. Real-Time Visualization

The _update_visualization method creates an interactive dashboard:

  • Displays the audio waveform with recording status indicator
  • Shows the mel spectrogram representation used by Whisper
  • Provides a scrolling transcript history panel
  • Updates dynamically using Matplotlib's animation functionality

6. User Interface and Control Flow

The start method orchestrates the system operation:

  • Launches audio capture and transcription threads
  • Sets up the visualization if enabled
  • Handles clean shutdown on user interruption

7. Practical Applications

This implementation offers several advantages over the previous example:

  • Live transcription: Process speech as it happens rather than from files
  • Continuous operation: Run indefinitely for real-time applications
  • Visual feedback: See both the audio signal and the corresponding transcription
  • Speech detection: Automatically identify when someone is speaking
  • Performance monitoring: Track processing times to optimize for real-time use

8. Use Cases for Real-Time Whisper

This implementation is particularly useful for:

  • Live captioning for presentations or meetings
  • Real-time transcription for accessibility purposes
  • Interactive voice-controlled applications
  • Speech analytics and monitoring systems
  • Educational tools showing the relationship between speech and its transcription

9. Technical Considerations

The implementation addresses several challenges:

  • Balancing latency vs. accuracy through parameter tuning (see the usage sketch at the end of this subsection)
  • Managing computational resources with threading
  • Providing visual feedback without affecting performance
  • Detecting speech vs. silence for efficient processing
  • Formatting and storing transcription results

This real-time implementation represents a significant enhancement over batch processing, enabling interactive applications where immediate transcription is required.
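
To make the latency/accuracy trade-off above concrete, here is a hedged usage sketch of the WhisperRealtimeTranscriber class defined earlier; the parameter values are illustrative starting points, not tuned recommendations:

# Low-latency setup: smallest model, short chunks, quick phrase cut-off
fast = WhisperRealtimeTranscriber(
    model_size="tiny", language="en",
    energy_threshold=0.01, record_timeout=1, phrase_timeout=0.5
)
fast.start(visualize=False)  # blocks until interrupted

# Higher-accuracy alternative (swap in instead of the above): model_size="small",
# record_timeout=3, phrase_timeout=1.5 -- slower, but fewer recognition errors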

5.2.2 SpeechLM and SpeechGPT: Language Models that Listen

While Whisper excels at ASR (Automatic Speech Recognition), models like SpeechLM and SpeechGPT go a step further: they integrate speech and text into a single transformer framework. This represents a fundamental shift from traditional approaches where speech processing and text understanding were handled by completely separate systems.

This integration is revolutionary because it allows these models to process both modalities simultaneously rather than treating them as separate processing pipelines. By unifying speech and text in the same architecture, these models can leverage contextual information across modalities, resulting in more coherent and contextually appropriate responses. The direct connection between acoustic patterns and semantic meaning enables these models to capture nuances like tone, emphasis, and rhythm that might be lost in a pipeline approach.

To understand the significance of this advancement, consider how traditional speech systems work: first, an ASR component converts audio to text transcripts, then a separate natural language processing (NLP) system analyzes the transcript. Each transition between systems creates an opportunity for information loss. Important acoustic cues, such as speaker emotion, sarcastic intonation, or emphasis on specific words, are typically stripped away during the initial transcription step.

SpeechLM and SpeechGPT, in contrast, maintain a continuous representation of the speech signal throughout the entire processing chain. This approach preserves crucial paralinguistic information—the non-verbal aspects of communication that often carry significant meaning. For instance, the same phrase spoken with different intonation patterns might convey completely different intentions, from sincere agreement to sarcastic dismissal. By keeping the acoustic signal and its linguistic interpretation linked throughout processing, these models can detect such subtleties.

The technical architecture enabling this integration typically involves specialized encoder modules that process raw audio waveforms or spectrograms into dense vector representations. These speech embeddings are then projected into the same latent space as text embeddings, allowing the transformer's attention mechanisms to establish connections between corresponding elements in both modalities. This cross-modal attention is the key innovation that enables these models to "listen" in a more human-like way.
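
The sketch below illustrates this idea in PyTorch: a toy projection maps speech-encoder outputs into the text embedding space, and a cross-attention layer lets text tokens attend to speech frames. All shapes and dimensions are invented for illustration; this is not the architecture of any specific released model:

import torch
import torch.nn as nn

d_speech, d_model = 512, 768        # illustrative dimensions
num_speech_frames, num_text_tokens = 200, 12

# Dense speech features, e.g., from a wav2vec2-style encoder (random stand-ins here)
speech_features = torch.randn(1, num_speech_frames, d_speech)
text_embeddings = torch.randn(1, num_text_tokens, d_model)

# Project speech features into the same latent space as text embeddings
speech_proj = nn.Linear(d_speech, d_model)
speech_embeddings = speech_proj(speech_features)

# Cross-modal attention: text tokens (queries) attend to speech frames (keys/values)
cross_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)
fused, attn_weights = cross_attention(
    query=text_embeddings, key=speech_embeddings, value=speech_embeddings
)

print(fused.shape)         # (1, 12, 768): text positions enriched with acoustic context
print(attn_weights.shape)  # (1, 12, 200): how much each token attends to each frame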

Unlike traditional systems where speech is first converted to text and then processed by a language model (creating potential information loss at each step), these unified models maintain the richness of the original speech signal throughout processing. This preserves important paralinguistic features such as emotion, speaker identity, and conversational dynamics that are crucial for truly understanding spoken language in context.

SpeechLM (Microsoft):

Pretrained on paired audio–text data, allowing it to develop rich representations that capture both acoustic and linguistic information. This dual-modality training approach enables the model to understand not just what words are being said, but also how they're being said, including tone, emphasis, and speaker characteristics. The model processes raw audio waveforms alongside corresponding transcripts, learning to associate specific acoustic patterns with their semantic meanings. For example, it can distinguish between a question and a statement based on rising or falling intonation, even when the words are identical.

Learns to align acoustic features with linguistic tokens through innovative cross-modal attention mechanisms that map speech patterns to their textual representations. This alignment process creates a shared semantic space where speech and text can interact seamlessly, enabling more accurate interpretation of spoken language. These mechanisms work by establishing bidirectional connections between audio segments and corresponding text tokens, allowing information to flow freely across modalities. When processing a sentence, the model can simultaneously attend to both the acoustic signal and the linguistic structure, creating a unified representation that preserves both aspects of communication.

Supports tasks like speech-to-text, spoken translation, and speech understanding, with superior performance compared to pipeline approaches due to its end-to-end training methodology. By training all components together, SpeechLM avoids error propagation issues common in traditional pipeline systems where mistakes in early stages cascade through the system. In conventional approaches, if the ASR component misrecognizes a word, all downstream components (like translation or understanding) inherit that error. SpeechLM's unified approach allows later processing stages to potentially compensate for earlier uncertainties by leveraging broader contextual information and cross-modal cues, similar to how humans can understand slightly mispronounced words in context.

Utilizes self-supervised learning techniques to maximize learning from limited paired data, enabling robust performance even with limited annotations. These techniques include masked language modeling adapted for speech inputs, contrastive learning between speech and text representations, and consistency regularization across modalities. During training, the model might be presented with an audio segment where certain portions are masked out, requiring it to predict the missing acoustic information based on surrounding context and any available text. Similarly, it learns to minimize the distance between representations of the same content expressed in different modalities, helping to align the speech and text embedding spaces. This approach allows SpeechLM to leverage large quantities of unpaired speech or text data alongside smaller amounts of parallel data.
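
One of these objectives, contrastive alignment between speech and text representations, can be sketched with a simple InfoNCE-style loss: matched speech/text pairs in a batch are pulled together while mismatched pairs are pushed apart. The tensors below are random stand-ins for pooled encoder outputs:

import torch
import torch.nn.functional as F

batch_size, d_model = 8, 768

# Pooled speech and text representations for the same batch of utterances
speech_repr = F.normalize(torch.randn(batch_size, d_model), dim=-1)
text_repr = F.normalize(torch.randn(batch_size, d_model), dim=-1)

# Similarity matrix: entry (i, j) compares speech i with text j
temperature = 0.07
logits = speech_repr @ text_repr.T / temperature

# Matched pairs lie on the diagonal; treat alignment as classification both ways
targets = torch.arange(batch_size)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())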

Incorporates advanced contextual understanding that allows it to better handle ambiguous speech, speaker variations, and noisy environments compared to traditional ASR systems. By maintaining a rich contextual representation throughout processing, SpeechLM can disambiguate homophones (words that sound alike but have different meanings) based on broader semantic context, adapt to different accents and speaking styles by recognizing patterns across larger segments of speech, and filter out background noise by distinguishing relevant speech patterns from irrelevant acoustic information. The model's attention mechanisms can focus on the most informative parts of the signal while de-emphasizing distracting elements, similar to how humans can follow a conversation in a crowded room—often called the "cocktail party effect."

SpeechGPT:

Extends LLMs to work directly with speech as input/output, eliminating the need for separate ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) systems in conversational applications. Traditional conversational AI systems typically require a pipeline approach where speech is first converted to text, processed by a language model, and then converted back to speech for the response. This multi-stage process introduces latency at each conversion point and often loses important acoustic information along the way.

SpeechGPT, however, integrates these components into a unified architecture that processes speech signals end-to-end. This direct integration enables smoother conversational flow, as the model processes speech signals directly without converting to intermediate text representations, reducing latency and preserving acoustic nuances that might be lost in traditional pipeline approaches. By maintaining the integrity of the original speech signal throughout processing, SpeechGPT can detect subtle variations in tone, rhythm, and emphasis that carry important communicative information beyond the literal words being spoken.

Can transcribe, understand, and even generate spoken dialogue in a unified framework, maintaining conversational context across multiple turns. Unlike traditional systems that process each utterance independently, SpeechGPT maintains a continuous memory of the conversation, allowing it to reference previous statements and generate contextually appropriate responses that acknowledge shared history between speakers.

This contextual awareness means the model can track topics across multiple exchanges, resolve ambiguous references, and respond appropriately to follow-up questions without requiring users to restate information. For example, if a user asks about the weather today and then follows up with "What about tomorrow?", SpeechGPT can understand that the second question is still about weather without explicit specification. This ability to maintain conversational state mirrors human dialogue patterns where context is implicitly understood and carried forward, creating more natural and efficient interactions.

Useful for conversational agents that naturally handle both modalities, creating more human-like interactions without modal switching delays. This seamless integration between speech and text processing mimics human communication patterns where we naturally shift between listening and speaking without conscious mode switching, enabling more fluid and natural dialogues with AI systems. In practice, this means users can speak directly to the system and receive spoken responses without perceiving any translation happening behind the scenes.

For applications like virtual assistants, customer service bots, or educational tutors, this creates a significantly more natural user experience that reduces cognitive load on users. The elimination of perceptible modal transitions also increases accessibility for users who may struggle with text interfaces, such as those with visual impairments, reading difficulties, or situations where looking at a screen is impractical (like while driving).

Demonstrates improved prosody and intonation in generated speech by leveraging the semantic understanding capabilities of the underlying LLM. By comprehending the meaning, emotion, and pragmatic intent behind responses, SpeechGPT can apply appropriate stress patterns, rhythm variations, and tonal shifts that convey not just what is being said, but how it should be said to effectively communicate meaning. This represents a significant advance over traditional TTS systems that often produce flat, monotonous speech that lacks the natural variations human speakers use to express meaning.

For instance, when expressing excitement, the system can increase pitch and speed; when conveying serious information, it can adopt a more measured pace with appropriate emphasis on key points. These prosodic features are crucial for effective communication, as they help listeners interpret the speaker's intentions, distinguish between questions and statements, identify important information, and understand emotional context. The ability to generate appropriately expressive speech makes interactions feel more natural and helps ensure that the intended meaning is accurately conveyed to users.

How They Work:

  1. Convert audio into speech embeddings using a feature extractor (like wav2vec2), which captures phonetic, prosodic, and speaker information from raw waveforms into dense vector representations. This process transforms complex audio signals into numerical matrices that preserve crucial linguistic features including pronunciation patterns, speech rhythm, emotional tone, and individual voice characteristics. The resulting embeddings create a mathematical representation of speech that models can process efficiently while maintaining the rich acoustic properties of the original audio.
  2. Align embeddings with text tokens in the transformer through cross-attention mechanisms, creating a joint representation space where acoustic and linguistic features can interact freely. These mechanisms allow the model to establish connections between corresponding elements in both modalities, mapping specific acoustic patterns to their textual counterparts. This alignment process creates bidirectional pathways that enable information to flow between speech and text representations, facilitating tasks like spoken language understanding where both the content and delivery of speech matter.
  3. Train on tasks that require both listening and understanding, such as answering questions about spoken content or following verbal instructions, to develop robust multimodal comprehension abilities. This training approach forces the model to process auditory and textual information simultaneously, extracting meaning from both channels and integrating them into a unified semantic representation. By presenting the model with increasingly complex spoken language understanding challenges, it learns to recognize not just what words are being said, but also how context, emphasis, and tone modify their meaning.
  4. Utilize specialized loss functions that encourage semantic consistency between speech and text representations, ensuring that information is preserved across modality boundaries. These loss functions compare the model's internal representations of the same content expressed in different modalities and penalize inconsistencies, driving the model to develop aligned feature spaces. By minimizing the distance between representations of equivalent content across modalities, these functions help the model build a cohesive understanding regardless of whether information arrives as text or speech.
  5. Employ curriculum learning strategies that gradually increase task complexity, starting with simple speech recognition before progressing to more complex understanding and generation tasks. This staged approach begins with basic transcription to establish fundamental audio-text mappings, then advances to more sophisticated tasks like identifying speaker intent, recognizing emotion, and generating contextually appropriate responses to spoken queries. The progressive difficulty helps the model develop a hierarchy of speech understanding capabilities, from low-level acoustic processing to high-level semantic interpretation.

Code Example: SpeechLM Implementation

from transformers import AutoProcessor, SpeechLMForSpeechToText
import torch
import soundfile as sf
import librosa

# Load pretrained SpeechLM model and processor
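# Note: the class name and checkpoint ID below are illustrative placeholders;
# published SpeechLM checkpoints may not be available under these exact names
# on the Hugging Face Hub, so check for a currently available checkpoint.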
model_id = "microsoft/speechlm-large-960h"
processor = AutoProcessor.from_pretrained(model_id)
model = SpeechLMForSpeechToText.from_pretrained(model_id)

# Load and preprocess audio file
audio_file = "speechlm-example.mp3"
speech, sample_rate = sf.read(audio_file)

# Resample if necessary
if sample_rate != 16000:
    speech = librosa.resample(speech, orig_sr=sample_rate, target_sr=16000)
    sample_rate = 16000

# Prepare inputs
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

# Generate transcription
with torch.no_grad():
    generated_ids = model.generate(
        inputs["input_features"],
        num_beams=5,
        max_length=100
    )

# Decode the output tokens
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

# For speech understanding tasks, we can also get embeddings
with torch.no_grad():
    outputs = model(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        decoder_input_ids=processor.get_decoder_prompt_ids(task="asr")
    )
    
    # Get speech embeddings from the encoder
    speech_embeddings = outputs.encoder_last_hidden_state
    print(f"Speech embeddings shape: {speech_embeddings.shape}")
    
    # These embeddings can be used for downstream tasks like
    # speaker identification, emotion recognition, or semantic analysis

Download the speech sample here: https://files.cuantum.tech/audio/speechlm-example.mp3

Note: Save the example audio in the same location as the Python script.

Code Breakdown: SpeechLM Implementation

This SpeechLM code example demonstrates how to use Microsoft's SpeechLM model for speech transcription and understanding. Let's examine each component:

  1. Imports and Model Loading: The code imports necessary libraries and loads the pretrained SpeechLM model and processor from Hugging Face. SpeechLM is a speech-language model that can process raw audio waveforms and perform tasks like transcription and understanding.
  2. Audio Processing: The audio file is loaded using soundfile and potentially resampled to 16kHz (the standard sampling rate expected by most speech models). This preprocessing ensures the audio input matches the format expected by the model regardless of the source recording conditions.
  3. Input Preparation: The processor converts the raw audio waveform into the model's expected input format. This includes extracting acoustic features (similar to spectrograms) and preparing attention masks to handle variable-length inputs. These features capture the phonetic and prosodic information from the speech signal.
  4. Transcription Generation: The model.generate() method performs beam search decoding to convert the audio features into text. This process uses the model's encoder-decoder architecture to map speech representations to text tokens. The num_beams parameter controls how many alternative hypotheses the model considers during decoding, while max_length limits the output length.
  5. Decoding: The processor.batch_decode() function converts the generated token IDs back into human-readable text, removing any special tokens (like padding or end-of-sequence markers) that are used internally by the model but aren't part of the actual transcription.
  6. Speech Embeddings Extraction: Beyond simple transcription, the code demonstrates how to access the model's internal representations of speech. The encoder_last_hidden_state contains rich contextual embeddings that capture both acoustic and linguistic properties of the speech. These embeddings preserve paralinguistic features (tone, emphasis, emotion) that might be lost in text transcription.

Technical Insights on SpeechLM's Architecture

SpeechLM represents a significant advancement in speech processing for several reasons:

  • Unified encoder-decoder architecture: Unlike pipeline approaches that separate ASR and language understanding, SpeechLM processes the entire speech-to-meaning pathway in a single model, reducing error propagation between components.
  • Contextual understanding: The transformer architecture allows the model to capture long-range dependencies in speech, helping it understand content based on the broader context rather than just isolated segments.
  • Cross-modal pretraining: SpeechLM is pretrained on paired speech-text data, allowing it to develop aligned representations between acoustic and linguistic features. This alignment enables more accurate transcription and understanding of spoken language.
  • Speech embeddings: The model's encoder produces contextualized speech embeddings that preserve both linguistic content and paralinguistic features (like speaker identity, emotion, and emphasis). These rich representations can be used for downstream tasks beyond basic transcription.

Practical Applications

The speech embeddings extracted in the example could be used for:

  • Speaker recognition: Identifying who is speaking based on voice characteristics preserved in the embeddings.
  • Emotion detection: Analyzing the emotional tone of speech from acoustic patterns.
  • Intent classification: Determining what the speaker wants to accomplish (ask a question, make a request, etc.).
  • Speech translation: Converting speech in one language to text in another by connecting the speech embeddings to a translation model.
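
As a minimal sketch of the classification use cases above (assuming the speech_embeddings tensor from the previous example, and an invented set of emotion labels), the embeddings can be mean-pooled over time and fed to a small classifier head:

import torch
import torch.nn as nn

# speech_embeddings: (batch, time_steps, hidden_size) from the encoder above
pooled = speech_embeddings.mean(dim=1)  # average over time -> (batch, hidden_size)

# Toy classifier head for a downstream task such as emotion recognition
num_classes = 4  # e.g., neutral / happy / sad / angry (illustrative labels)
classifier = nn.Sequential(
    nn.Linear(pooled.shape[-1], 256),
    nn.ReLU(),
    nn.Linear(256, num_classes),
)

with torch.no_grad():
    logits = classifier(pooled)
    predicted = logits.argmax(dim=-1)
print(predicted)  # untrained head: outputs are random until fine-tuned on labeled data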

SpeechLM represents an important step toward truly integrated speech-language models that process spoken language in a more human-like way, maintaining the rich acoustic information that gives speech its nuanced meaning beyond just the words being said.

Code Example: SpeechGPT Implementation

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Load pretrained SpeechGPT model and processor
model_id = "microsoft/speech_gpt2_oaitr"  # Note: This is an example model ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
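# Note: the processor/model calls below (the conversation_history arguments and
# generate_speech) illustrate the intended speech-in/speech-out interface rather
# than the exact API of a specific released checkpoint; adapt them to the
# checkpoint you actually use.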

# Function to handle conversational speech input and output
def speech_conversation(audio_path, conversation_history=None):
    # Load audio file
    waveform, sample_rate = torchaudio.load(audio_path)
    
    # Resample if necessary (SpeechGPT typically expects 16kHz audio)
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)
        sample_rate = 16000
    
    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)
    
    # Process audio input
    inputs = processor(
        audio=waveform.squeeze().numpy(),
        sampling_rate=sample_rate,
        return_tensors="pt",
        conversation_history=conversation_history
    )
    
    # Generate response
    with torch.no_grad():
        output = model.generate(
            input_features=inputs["input_features"],
            attention_mask=inputs.get("attention_mask"),
            max_length=100,
            num_beams=5,
            early_stopping=True,
            conversation_history=inputs.get("conversation_history")
        )
    
    # Process the output
    transcription = processor.decode(output[0], skip_special_tokens=True)
    
    # Optional: Convert response to speech
    speech_output = model.generate_speech(
        output,
        speaker_embeddings=inputs.get("speaker_embeddings")
    )
    
    # Save the generated speech
    torchaudio.save(
        "response.wav", 
        speech_output.squeeze().unsqueeze(0), 
        16000
    )
    
    # Update conversation history
    new_conversation_history = {
        "input_speech": waveform.squeeze().numpy(),
        "output_text": transcription,
        "output_speech": speech_output.squeeze().numpy()
    }
    
    if conversation_history:
        conversation_history.append(new_conversation_history)
    else:
        conversation_history = [new_conversation_history]
    
    return transcription, speech_output, conversation_history

# Example usage
if __name__ == "__main__":
    # Start a new conversation
    conversation_history = None
    
    # First interaction
    user_query = "user_question_.mp3"  # Path to audio file with user's question
    response_text, response_audio, conversation_history = speech_conversation(
        user_query, conversation_history
    )
    
    print(f"User (transcribed): {response_text}")
    
    # Second interaction (with conversation history for context)
    follow_up_query = "user_follow_up.mp3"  # Path to follow-up question audio
    response_text2, response_audio2, conversation_history = speech_conversation(
        follow_up_query, conversation_history
    )
    
    print(f"User follow-up (transcribed): {response_text2}")

Download the user question audio sample here: https://files.cuantum.tech/audio/user_question_.mp3

Download the user follow up audio sample here: https://files.cuantum.tech/audio/user_follow_up.mp3

Note: Save the example audios in the same location as the Python script.

Code Breakdown: SpeechGPT Implementation

  1. Imports and Model Loading: The code imports PyTorch, torchaudio, and Hugging Face transformers to work with the SpeechGPT model. We load a pretrained model and processor that can handle both speech input and output in a conversational context.
  2. Conversation Function: The speech_conversation function serves as the core component, handling the entire speech-to-speech conversation flow. It takes an audio path and optional conversation history as inputs.
  3. Audio Preprocessing: The function loads the audio file using torchaudio, ensures it's at the required 16kHz sample rate (resampling if necessary), and converts stereo to mono if needed. These preprocessing steps ensure the audio meets the model's input requirements.
  4. Input Processing: The processor converts the raw audio waveform into the feature representations expected by SpeechGPT. Importantly, it includes the conversation history parameter, which allows the model to maintain context across multiple turns.
  5. Response Generation: The model generates a response based on the speech input and conversation context. The generation parameters control the quality and length of the response: 
    • max_length: Limits the response length
    • num_beams: Uses beam search with 5 beams for better quality responses
    • early_stopping: Terminates generation when all beams reach an end token
    • conversation_history: Provides context from previous exchanges
  6. Speech Synthesis: Unlike traditional models that would require a separate TTS system, SpeechGPT can directly generate speech output from its internal representations. The generate_speech method converts the text response into audio, maintaining speaker characteristics if provided.
  7. Conversation State Management: The function tracks conversation history by storing each exchange (input speech, output text, output speech) in a structured format. This history is passed to subsequent calls, enabling the model to reference previous information.
  8. Example Usage: The code demonstrates a two-turn conversation, showing how to: 
    • Start a new conversation (empty history)
    • Process the first user query
    • Maintain conversation context
    • Handle a follow-up question while preserving context

Technical Insights on SpeechGPT Architecture

SpeechGPT represents a significant advancement in speech-language models by integrating several key architectural innovations:

  • End-to-end speech-to-speech framework: Unlike traditional pipeline approaches that separate ASR, language understanding, and TTS components, SpeechGPT unifies these capabilities in a single model, reducing latency and error propagation.
  • Joint speech-text representation: The model learns a shared embedding space for both speech and text, allowing for seamless transitions between modalities without information loss. This joint representation enables the model to maintain the emotional and prosodic elements of speech alongside semantic content.
  • Conversation-aware transformer: SpeechGPT extends the standard transformer architecture with additional mechanisms to track conversation state and maintain coherence across multiple turns. This includes specialized attention layers that can reference previous exchanges.
  • Prosody modeling: The speech generation component preserves natural intonation, rhythm, and emphasis patterns by incorporating prosodic features into the generation process. This results in more human-like speech output compared to traditional TTS systems.

Key Advantages Over Traditional Speech Systems

  • Contextual understanding: SpeechGPT maintains conversation state across multiple turns, allowing it to handle follow-up questions, resolve references, and build on previous exchanges without requiring users to restate context.
  • Seamless modality transitions: The unified architecture eliminates perceptible delays between speech understanding and response generation, creating more natural conversational flow.
  • Expressive speech generation: By leveraging its language understanding capabilities, SpeechGPT can apply appropriate prosody and intonation that matches the semantic and emotional content of responses.
  • Reduced latency: The end-to-end design eliminates the computational overhead of separate ASR, NLU, and TTS systems, enabling faster response times in interactive applications.

Practical Applications

SpeechGPT's unified speech-language capabilities make it particularly well-suited for:

  • Virtual assistants: Creating more natural and contextually aware voice interfaces for smart devices and applications.
  • Accessibility tools: Developing conversation systems for users with visual impairments or those who prefer speech interfaces.
  • Language learning: Building interactive tutors that can engage in spoken dialogue while maintaining context across a learning session.
  • Customer service: Powering voice bots that can handle complex, multi-turn conversations with natural speech patterns.

The integration of speech and language understanding in a single model represents a significant step toward more human-like AI communication systems that can engage in natural conversation across modalities.

Code Example: Extracting Speech Features with wav2vec2 (Hugging Face)

from transformers import Wav2Vec2Processor, Wav2Vec2Model, Wav2Vec2ForCTC
import torch
import soundfile as sf
import librosa
import matplotlib.pyplot as plt
import numpy as np
import IPython.display as ipd

# Function to load and preprocess audio
def load_audio(file_path, target_sr=16000):
    """
    Load audio file and resample if necessary
    """
    # Load audio using librosa (handles various formats better)
    try:
        audio, sample_rate = librosa.load(file_path, sr=None)
        # Resample if needed
        if sample_rate != target_sr:
            print(f"Resampling from {sample_rate}Hz to {target_sr}Hz")
            audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=target_sr)
            sample_rate = target_sr
        return audio, sample_rate
    except Exception as e:
        print(f"Error loading audio: {e}")
        return None, None

# Load pretrained speech model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# Also load ASR model for transcription demo
asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load audio file
speech, rate = load_audio("speech_sample_w.mp3")
if speech is not None:
    # Display audio waveform
    plt.figure(figsize=(10, 4))
    plt.plot(speech)
    plt.title("Audio Waveform")
    plt.xlabel("Time (samples)")
    plt.ylabel("Amplitude")
    plt.show()
    
    # Display audio for listening
    ipd.display(ipd.Audio(speech, rate=rate))
    
    # Process audio for feature extraction
    inputs = processor(speech, sampling_rate=rate, return_tensors="pt", padding=True)
    
    # Extract embeddings
    with torch.no_grad():
        # Get the hidden states (embeddings)
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state
        
        # Also get the transcription from ASR model
        logits = asr_model(**inputs).logits
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = processor.batch_decode(predicted_ids)[0]
    
    print("Audio transcription:", transcription)
    print("Shape of embeddings:", embeddings.shape)  # [batch, time, hidden_dim]
    
    # Visualize embeddings
    # Take mean across time dimension to get a single vector per feature
    mean_embeddings = embeddings.mean(dim=1).squeeze().numpy()
    
    plt.figure(figsize=(12, 6))
    plt.imshow(mean_embeddings.reshape(1, -1), aspect='auto', cmap='viridis')
    plt.colorbar()
    plt.title("Speech Embeddings Visualization")
    plt.xlabel("Feature Dimensions")
    plt.ylabel("Sample")
    plt.show()
    
    # Demonstrate feature extraction for downstream tasks
    # Example: Extract global speech representation (average pooling)
    global_speech_vector = embeddings.mean(dim=1)
    print("Global speech vector shape:", global_speech_vector.shape)  # [batch, hidden_dim]
    
    # Example: Extract frame-level features for a specific segment (middle 1 second)
    middle_frame = embeddings.shape[1] // 2
    segment_features = embeddings[0, middle_frame-25:middle_frame+25, :]  # ~1 second at 50Hz frame rate
    print("Segment features shape:", segment_features.shape)  # [frames, hidden_dim]
else:
    print("Failed to load audio file. Please check the path and file format.")

Download the audio sample here: https://files.cuantum.tech/audio/speech_sample_w.mp3

Note: Save the example audio in the same location as the Python script.

Comprehensive Code Breakdown: Speech Feature Extraction with Wav2Vec2

  • 1. Imports and Setup
    • We import the necessary libraries: Transformers for the Wav2Vec2 models, PyTorch for tensor operations, soundfile/librosa for audio processing, and visualization tools.
    • We include both the base Wav2Vec2Model (for embeddings) and Wav2Vec2ForCTC (for transcription) to demonstrate multiple use cases.
  • 2. Audio Loading and Preprocessing
    • The load_audio function handles various audio formats and automatically resamples to 16kHz if necessary (Wav2Vec2's expected sample rate).
    • Using librosa instead of soundfile provides better support for various audio formats and error handling.
  • 3. Model Initialization
    • We load the pretrained Wav2Vec2 processor and model from Hugging Face's model hub.
    • The processor handles tokenization of audio data into the format expected by the model.
    • We also load the ASR variant of the model to demonstrate speech recognition capabilities.
  • 4. Visualization
    • We plot the audio waveform to provide visual insight into the signal being processed.
    • We use IPython's audio display capabilities to allow for listening to the audio directly in notebooks.
  • 5. Feature Extraction
    • The processor converts the raw audio into the input format required by the model.
    • With torch.no_grad(), we ensure no gradients are computed during inference, saving memory.
    • We extract the last_hidden_state which contains the contextualized audio embeddings.
  • 6. Transcription
    • Using the ASR model variant, we convert the same audio input into text.
    • This demonstrates how the same audio features can be used for multiple downstream tasks.
  • 7. Embedding Visualization and Analysis
    • We visualize the embeddings using a heatmap to give insight into the feature patterns.
    • We demonstrate two common ways to use the embeddings:
      • Global representation: averaging across time to get a single vector representing the entire utterance (useful for speaker identification, emotion recognition, etc.)
      • Frame-level features: extracting time-aligned segments for fine-grained analysis (useful for alignment, pronunciation assessment, etc.)
  • 8. Error Handling
    • The code includes basic error handling to gracefully deal with issues like missing files or unsupported formats.

Technical Insights: Why This Approach Matters

  • Wav2Vec2 is a self-supervised model trained on massive amounts of unlabeled speech data, allowing it to learn robust speech representations without requiring transcriptions.
  • The extracted embeddings capture phonetic content, speaker characteristics, emotional tone, and acoustic environment information in a unified representation.
  • These embeddings serve as excellent features for downstream tasks like speech recognition, speaker identification, and emotion classification (a short speaker-similarity sketch follows this list).
  • The contextual nature of the embeddings (each frame is influenced by surrounding audio) makes them more powerful than traditional acoustic features like MFCCs.
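
As a quick illustration of the speaker-identification idea above, the sketch below compares two utterances by the cosine similarity of their mean-pooled wav2vec2 embeddings. The file names are placeholders, and a real speaker-verification system would train a scoring model on top of these vectors; this only shows how the global vectors from the previous example can be reused.

import torch
import torch.nn.functional as F
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2Model

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

def utterance_embedding(path: str) -> torch.Tensor:
    """Load audio at 16 kHz and return a mean-pooled wav2vec2 embedding."""
    audio, _ = librosa.load(path, sr=16000)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # [1, time, hidden_dim]
    return hidden.mean(dim=1).squeeze(0)            # [hidden_dim]

# Placeholder file names: substitute two recordings of your own
emb_a = utterance_embedding("speaker_a.wav")
emb_b = utterance_embedding("speaker_b.wav")

similarity = F.cosine_similarity(emb_a, emb_b, dim=0).item()
print(f"Cosine similarity between utterances: {similarity:.3f}")
# Higher values suggest acoustically similar utterances (same speaker, similar
# recording conditions); treat this as a rough heuristic, not a verified decision.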

5.2.3 GPT-5 Realtime: Low-Latency Voice Interaction

While Whisper demonstrated the ability to transcribe speech with high accuracy (speech → text) and models like SpeechLM and SpeechGPT extended this by integrating spoken inputs into large language models, GPT-5 Realtime represents the next leap forward: a model that can listen and respond in natural speech almost instantly. This breakthrough addresses the fundamental limitation of earlier systems - the noticeable delay between input and response that made interactions feel mechanical rather than natural.

This is not merely speech recognition paired with text generation and then a separate text-to-speech system bolted on top. Earlier approaches typically followed a pipeline architecture where each component operated independently, creating bottlenecks and inconsistencies. Instead, GPT-5 Realtime is natively multimodal, trained to process audio as a first-class input and to produce audio as a first-class output. This integrated approach means the model understands the prosody, emotion, and nuances in spoken language directly, without information loss from intermediate text representations.

The result is a conversational agent capable of fluid, human-like dialogue, with response latency comparable to a natural pause in conversation (typically a few hundred milliseconds at most), making it suitable for real-world conversations, tutoring, and customer service. This low latency is achieved through specialized architectures that process audio streams incrementally rather than waiting for complete utterances, along with predictive mechanisms that anticipate likely responses. The end-to-end optimization eliminates the cumulative delays inherent in pipeline approaches, creating interactions that feel remarkably human in their timing and rhythm.

Architecture and Capabilities

GPT-5 Realtime integrates multiple components into one coherent system, creating a seamless conversational experience:

  • Speech-in: Users can send raw audio (16-bit PCM WAV, 24 kHz mono is a safe default; a short preparation sketch appears just after this list). The model transcribes and interprets speech in real time, converting acoustic signals into semantic understanding. Unlike traditional speech recognition systems that merely transcribe words, GPT-5 Realtime captures nuances, emotions, and contextual cues from the audio input, preserving the richness of human communication.
  • Speech-out: The model responds with synthetic but natural-sounding speech, streamed back as low-latency audio frames. Different voices and speaking styles can be selected to match user preferences or specific use cases. The generated speech maintains appropriate prosody, emphasis, and intonation patterns that make the interaction feel genuinely human-like rather than robotic.
  • Full multimodality: In addition to audio, GPT-5 Realtime sessions can also accept text and image inputs mid-conversation, allowing for hybrid interactions (e.g., "Look at this chart and tell me about it" while speaking). This flexibility enables seamless transitions between modalities, supporting more natural workflows where users might want to show visual information while continuing to speak, similar to how humans communicate in meetings or educational settings.
  • Low latency: Because the model is optimized for conversational flow, response latency is comparable to a human pause in speech — generally under 300 ms. This is achieved through specialized streaming architectures and predictive processing that begins generating responses before the user has finished speaking. The near-instantaneous turnaround creates a conversational rhythm that feels natural and engaging, eliminating the awkward pauses common in earlier AI systems.
  • Telephony integration: GPT-5 Realtime sessions can be connected to SIP (Session Initiation Protocol), enabling the model to act as a phone-based agent. This integration allows the model to handle inbound and outbound calls over standard telephone networks, making advanced AI accessible through the most ubiquitous communication technology worldwide, without requiring specialized equipment or applications.

Together, these features push AI systems beyond one-way transcription or delayed response, toward live conversational intelligence.
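
Because the Realtime examples below assume audio in that 24 kHz mono PCM16 format, here is a minimal sketch (using librosa and soundfile, with placeholder file names) for converting an arbitrary recording into it before sending.

import librosa
import soundfile as sf

def to_realtime_wav(src_path: str, dst_path: str, target_sr: int = 24000) -> str:
    """Convert any audio file librosa can read into a 24 kHz mono PCM16 WAV."""
    # mono=True downmixes stereo; sr=target_sr resamples in the same step
    audio, _ = librosa.load(src_path, sr=target_sr, mono=True)
    sf.write(dst_path, audio, samplerate=target_sr, subtype="PCM_16")
    return dst_path

# Placeholder paths: replace with your own recording
to_realtime_wav("raw_recording.mp3", "user_prompt_spoken.wav")
print("Wrote 24 kHz mono PCM16 WAV ready for the Realtime examples below.")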

Practical Example

For consistency with our multimodal focus, we’ll use a short audio file (user_prompt_spoken.wav) where the user asks:

“Can you explain the advantages of GPT-5 as a multimodal model?”

When sent to GPT-5 Realtime, the model will:

  1. Transcribe the spoken question.
  2. Reason about the content.
  3. Generate speech that explains the advantages of GPT-5’s multimodality.

The round-trip feels like a natural dialogue with a knowledgeable assistant.

Code Example: Realtime Voice with GPT-5

The following Python script shows how to connect to the Realtime API using WebSockets, send a short WAV file as input, and save the assistant’s spoken reply as a new WAV file.

"""
Realtime Voice with GPT-5 (WebSocket API)
- Sends a short WAV file (user_prompt_spoken.wav) to GPT-5 Realtime
- Receives streamed audio back and saves it to assistant_reply.wav
Requirements: pip install websockets soundfile numpy
"""

import os
import json
import base64
import asyncio
import websockets
import numpy as np
import soundfile as sf

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-5-realtime"

INPUT_WAV = "user_prompt_spoken.wav"   # spoken question
OUTPUT_WAV = "assistant_reply.wav"     # assistant’s voice reply

def read_wav_as_base64(path: str) -> str:
    """Read WAV file and return base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

async def main():
    if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
        raise SystemExit("Set OPENAI_API_KEY env variable.")

    # Load spoken user prompt
    user_audio_b64 = read_wav_as_base64(INPUT_WAV)

    # Connect to Realtime WebSocket
    async with websockets.connect(
        REALTIME_URL,
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
        max_size=20 * 1024 * 1024,
    ) as ws:
        print("Connected to GPT-5 Realtime.")

        # 1) Configure session (input/output formats, voice)
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "input_audio_format": "wav",
                "output_audio_format": "wav",
                "voice": "alloy",
                "instructions": (
                    "You are a helpful voice assistant. "
                    "Answer the user’s question clearly and concisely."
                )
            }
        }))

        # 2) Append user audio to input buffer
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": user_audio_b64
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # 3) Ask model to create a response
        await ws.send(json.dumps({"type": "response.create", "response": {}}))

        print("Waiting for GPT-5 Realtime reply...")

        # 4) Collect audio frames
        audio_bytes = bytearray()
        sample_rate = 24000  # expected sample rate

        async for msg in ws:
            evt = json.loads(msg)

            if evt.get("type") == "response.output_text.delta":
                print(evt.get("delta"), end="", flush=True)

            elif evt.get("type") == "response.output_audio.delta":
                audio_bytes.extend(base64.b64decode(evt["delta"]))

            elif evt.get("type") == "response.completed":
                print("\n[Response completed]")
                break

        # 5) Save assistant’s reply as a WAV file
        pcm16 = np.frombuffer(bytes(audio_bytes), dtype=np.int16)
        sf.write(OUTPUT_WAV, pcm16, samplerate=sample_rate, subtype="PCM_16")
        print(f"[Saved] {OUTPUT_WAV}")

if __name__ == "__main__":
    asyncio.run(main())

Download the audio sample here: https://files.cuantum.tech/audio/user_prompt_spoken.wav

Note: Save the example audio in the same location as the Python script.

Code breakdown:

  1. Session Setup
    • The client connects to the Realtime WebSocket and sends a session.update message specifying:
      • Input modality: audio (WAV).
      • Output modality: audio (WAV).
      • Selected voice (e.g., "alloy").
    • This defines the rules of the conversation.
  2. Input Buffering
    • Audio files (or live microphone frames) are base64-encoded and appended to an input buffer.
    • A commit message signals the end of input.
  3. Response Creation
    • A response.create message tells GPT-5 to process the buffer and generate a reply.
  4. Streaming Output
    • The server streams back two types of deltas:
      • response.output_text.delta (optional live transcript).
      • response.output_audio.delta (audio chunks).
    • Audio chunks are collected into a byte array until response.completed is received.
  5. Saving the File
    • The reply is written as a standard 24 kHz PCM16 WAV file, playable in any media player (a short playback sketch follows).
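
If you want to hear the saved reply without leaving Python, a minimal playback sketch with sounddevice might look like this (it assumes assistant_reply.wav was produced by the script above):

import soundfile as sf
import sounddevice as sd

# Load the reply saved by the Realtime example (24 kHz mono PCM16 WAV)
audio, sample_rate = sf.read("assistant_reply.wav", dtype="int16")

sd.play(audio, samplerate=sample_rate)  # start playback (non-blocking)
sd.wait()                               # block until playback finishes
print("Playback finished.")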

Applications and Implications

GPT-5 Realtime demonstrates how far multimodal LLMs have evolved:

  • Conversational Agents: Natural, low-latency assistants that can answer customer queries or provide educational tutoring over phone or web.
  • Accessibility: Voice-based interfaces for users who cannot easily type or read text.
  • Hybrid Interactions: Combine voice with images and text mid-conversation, enabling richer multi-turn exchanges.
  • Telephony Integration: Deploy AI agents that can handle SIP phone calls, routing, and form-filling.

Example Code: Live Microphone Capture with GPT-5 Realtime (speech-in → speech-out)

What it does:

  • Records ~3 seconds from your default mic
  • Streams it to GPT-5 Realtime over WebSocket
  • Saves the model’s spoken reply as assistant_reply.wav
  • Prints a live text transcript (if provided by the server)

Requirements

pip install websockets sounddevice soundfile numpy
  • OS mic permissions: allow terminal/IDE access to the microphone (macOS: System Settings → Privacy & Security → Microphone; Windows: Privacy → Microphone).
"""
Live Mic → GPT-5 Realtime → Spoken Reply
Records ~3 seconds of audio from your default microphone, streams it to GPT-5 Realtime,
and saves the assistant's spoken response to assistant_reply.wav.

Requirements:
  pip install websockets sounddevice soundfile numpy
Set your key:
  export OPENAI_API_KEY="sk-..."   (macOS/Linux)
  setx OPENAI_API_KEY "sk-..."     (Windows, new terminal)

This example uses WAV (PCM16 @ 24 kHz); if you prefer MP3 input, convert it to WAV first (for example with ffmpeg).
"""

import os
import json
import base64
import asyncio
import websockets
import numpy as np
import sounddevice as sd
import soundfile as sf

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-5-realtime"

# Recording settings (safe defaults for Realtime)
SAMPLE_RATE = 24000          # 24 kHz mono PCM16
CHANNELS = 1
DURATION_SECONDS = 3.0       # keep short for quick tests
OUTPUT_WAV = "assistant_reply.wav"

SYSTEM_INSTRUCTIONS = (
    "You are a helpful voice assistant. Transcribe the user if needed, "
    "then answer clearly in one or two sentences."
)

def record_from_mic(seconds: float = DURATION_SECONDS, sr: int = SAMPLE_RATE) -> bytes:
    """Record mono PCM16 audio from the default microphone and return raw bytes."""
    print(f"🎙️  Recording {seconds:.1f}s from microphone...")
    audio = sd.rec(int(sr * seconds), samplerate=sr, channels=CHANNELS, dtype="int16")
    sd.wait()
    print("✅ Done.")
    # audio is int16 numpy array; convert to raw bytes
    return audio.tobytes()

def b64encode_pcm16_wav(pcm_bytes: bytes, sr: int = SAMPLE_RATE) -> str:
    """
    Wrap raw PCM16 bytes into a WAV file in memory and return base64 string.
    Using soundfile to write to bytes buffer for simplicity.
    """
    import io
    buf = io.BytesIO()
    # convert bytes -> int16 array so soundfile can write it
    arr = np.frombuffer(pcm_bytes, dtype=np.int16)
    sf.write(buf, arr, sr, subtype="PCM_16", format="WAV")
    return base64.b64encode(buf.getvalue()).decode("utf-8")

async def main():
    if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
        raise SystemExit("Set OPENAI_API_KEY environment variable first.")

    # 1) Capture short mic audio & base64-encode as WAV
    pcm = record_from_mic()
    user_audio_b64 = b64encode_pcm16_wav(pcm)

    # 2) Connect to Realtime WS
    async with websockets.connect(
        REALTIME_URL,
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
        max_size=20 * 1024 * 1024,  # allow large frames
    ) as ws:
        print("🔌 Connected to GPT-5 Realtime.")

        # 3) Configure session: audio in/out (WAV), pick a voice
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "input_audio_format": "wav",
                "output_audio_format": "wav",
                "voice": "alloy",
                "instructions": SYSTEM_INSTRUCTIONS
            }
        }))

        # 4) Send mic audio (can be multiple appends for streaming mic)
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": user_audio_b64
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Optional: add a brief text nudge
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "instructions": "Please respond concisely in speech."
            }
        }))

        print("🕑 Waiting for streamed reply...\n")

        # 5) Receive streamed audio/text deltas
        audio_bytes = bytearray()
        sample_rate = SAMPLE_RATE  # server commonly uses 24k; update if session reports different

        async for msg in ws:
            evt = json.loads(msg)

            # Live transcript (optional)
            if evt.get("type") == "response.output_text.delta":
                print(evt.get("delta"), end="", flush=True)

            # Audio chunks (base64-encoded PCM16 WAV frames)
            elif evt.get("type") == "response.output_audio.delta":
                audio_bytes.extend(base64.b64decode(evt["delta"]))

            elif evt.get("type") == "response.completed":
                print("\n\n✅ Response completed.")
                break

        # 6) Save assistant reply to WAV
        if audio_bytes:
            # raw bytes may already be WAV, but normalizing here is robust:
            # interpret as PCM16 stream and write as standard WAV
            pcm16 = np.frombuffer(bytes(audio_bytes), dtype=np.int16)
            sf.write(OUTPUT_WAV, pcm16, samplerate=sample_rate, subtype="PCM_16")
            print(f"💾 Saved assistant reply → {OUTPUT_WAV}")
        else:
            print("⚠️ No audio received from server.")

if __name__ == "__main__":
    asyncio.run(main())

Here's a code breakdown:

Required Libraries

The script uses several Python libraries:

  • websockets: For WebSocket communication with the GPT-5 Realtime API
  • sounddevice: To record audio from the microphone
  • soundfile: For handling WAV file operations
  • numpy: For audio data manipulation
  • Standard libraries: os, json, base64, asyncio, io

Key Components

1. Configuration Settings

The script defines several important constants:

  • OPENAI_API_KEY: Authentication key for OpenAI's API
  • REALTIME_URL: WebSocket endpoint for GPT-5 Realtime
  • Recording parameters: Sample rate (24kHz), channels (mono), recording duration (3 seconds)
  • SYSTEM_INSTRUCTIONS: Prompts GPT-5 to act as a voice assistant

2. Audio Recording Function

The record_from_mic() function:

  • Uses sounddevice to capture audio at specified sample rate and duration
  • Records in mono at 16-bit PCM format
  • Returns raw audio bytes

3. WAV Encoding Function

The b64encode_pcm16_wav() function:

  • Takes raw PCM16 audio bytes
  • Wraps them in a WAV container using soundfile
  • Returns the base64-encoded string of the WAV file

4. Main Async Function

The main() async function orchestrates the entire process:

API Key Validation

  • Checks if the OpenAI API key is properly set

Audio Recording and Encoding

  • Records audio from the microphone
  • Encodes it as a base64 WAV string

WebSocket Connection

  • Establishes a secure WebSocket connection to GPT-5 Realtime
  • Sets proper headers including API key and beta flag

Session Configuration

  • Sends a session.update message to configure:
    • Input/output modalities (text and audio)
    • Audio format (WAV for both input and output)
    • Voice selection ("alloy")
    • System instructions for the assistant

Input Handling

  • Appends the recorded audio to the input buffer
  • Commits the buffer to signal completion of input
  • Optionally adds text instructions to shape the response

Response Processing

  • Collects streamed response data in real-time:
    • Text deltas (transcription of response)
    • Audio deltas (spoken audio chunks)
  • Monitors for completion signal

Output Saving

  • Converts collected audio bytes back to PCM16 format
  • Writes to a WAV file (assistant_reply.wav)

Flow of Execution

The script follows this sequence:

  1. Validate environment setup and API key
  2. Record short audio clip from microphone
  3. Connect to GPT-5 Realtime WebSocket API
  4. Configure session parameters (audio formats, voice)
  5. Send recorded audio and commit the input
  6. Request model to process audio and generate a response
  7. Receive and display text transcript while collecting audio chunks
  8. Save the complete audio response as a WAV file

Error Handling

The code includes basic error handling:

  • Checks for missing API key
  • Verifies if audio was received from the server

Technical Notes

  • Uses 24kHz mono PCM16 format, which is optimal for speech processing
  • Supports WebSocket protocol for real-time streaming
  • Uses asyncio for asynchronous operations
  • Implements proper WebSocket connection lifecycle management

Mid-Session Multimodality: Combining Audio and Images

One of GPT-5 Realtime’s most powerful abilities is to handle multiple modalities within a single ongoing conversation. Unlike earlier systems that processed text, images, or audio in isolation, Realtime can fluidly combine them as they arrive. This enables natural scenarios where a user begins by speaking a question and then adds an image for further clarification or analysis — all in the same session without restarting the dialogue.

For example, imagine a student asking aloud “Can you explain the advantages of GPT-5 as a multimodal model?” and then immediately showing a chart of data. GPT-5 Realtime can integrate both inputs, producing a spoken response that addresses the original audio question and references insights from the chart. This kind of dynamic, mid-session multimodality illustrates how the model moves beyond static question–answer patterns and toward fluid, real-time collaboration with human users.

Example: Mid-Session Multimodality (Audio question → Append Image → Spoken reply)

What it does

  1. Sends a short spoken question (WAV) to GPT-5 Realtime.
  2. Appends a chart image in the same session.
  3. Requests a spoken answer that references both the audio question and the image.
  4. Saves the reply as assistant_multimodal_reply.wav and prints any streamed text.

Requirements

pip install websockets soundfile numpy pillow
  • Put your audio prompt file (e.g., user_prompt_spoken.wav) and an image (e.g., chart.png) in the same folder.
  • Or adjust the paths below.
"""
Multimodal Mid-Session with GPT-5 Realtime
- Step 1: Send a spoken question (WAV) to GPT-5 Realtime.
- Step 2: Append an image (PNG) in the same session.
- Step 3: Ask for a spoken reply that references BOTH inputs.
- Saves the model’s voice reply to assistant_multimodal_reply.wav.

Requirements:
  pip install websockets soundfile numpy pillow
Set your key:
  export OPENAI_API_KEY="sk-..."   (macOS/Linux)
  setx OPENAI_API_KEY "sk-..."     (Windows, new terminal)
"""

import os
import io
import json
import base64
import asyncio
import websockets
import numpy as np
import soundfile as sf
from PIL import Image

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-5-realtime"

# Input files (adjust as needed)
INPUT_WAV = "user_prompt_spoken.wav"  # spoken question, e.g., “Can you explain the advantages of GPT-5 as a multimodal model?”
INPUT_IMG = "chart.png"               # a chart image to reference mid-session
OUTPUT_WAV = "assistant_multimodal_reply.wav"

# Session behavior
VOICE_NAME = "alloy"
SYSTEM_INSTRUCTIONS = (
    "You are a helpful voice assistant. Consider ALL inputs in this session. "
    "First, interpret the user's spoken question. Then, when an image is provided, "
    "analyze it and integrate both sources in your final spoken answer. "
    "Be concise and precise."
)

def read_wav_as_base64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def read_png_as_base64(path: str) -> str:
    # Ensure we produce a clean PNG bytes payload (also validates file)
    with Image.open(path) as im:
        im = im.convert("RGBA") if im.mode not in ("RGB", "RGBA") else im
        buf = io.BytesIO()
        im.save(buf, format="PNG")
        return base64.b64encode(buf.getvalue()).decode("utf-8")

async def main():
    if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
        raise SystemExit("Set OPENAI_API_KEY environment variable first.")

    # Load inputs
    user_audio_b64 = read_wav_as_base64(INPUT_WAV)
    image_png_b64 = read_png_as_base64(INPUT_IMG)

    # Connect to Realtime WebSocket
    async with websockets.connect(
        REALTIME_URL,
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
        max_size=20 * 1024 * 1024,
    ) as ws:
        print("🔌 Connected to GPT-5 Realtime.")

        # 1) Configure session: we'll use audio in/out, and also allow image as input
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio", "image"],
                "input_audio_format": "wav",
                "output_audio_format": "wav",
                "voice": VOICE_NAME,
                "instructions": SYSTEM_INSTRUCTIONS
            }
        }))

        # 2) Append the user's spoken question (audio buffer)
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": user_audio_b64
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # 3) Append the image mid-session
        #    We send the PNG as base64 along with its MIME. (You can also send a URL if supported.)
        await ws.send(json.dumps({
            "type": "input_image.append",
            "image": image_png_b64,
            "mime_type": "image/png",
            # Optionally, add a hint for the model about why you're sending the image:
            "metadata": {
                "purpose": "chart_analysis",
                "caption": "A line chart showing a synthetic trend over time."
            }
        }))
        await ws.send(json.dumps({"type": "input_image.commit"}))

        # 4) Ask for a response that references BOTH the spoken question and the image
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "instructions": (
                    "Please answer in speech. "
                    "Explain the advantages of GPT-5 as a multimodal model, "
                    "and also summarize the main trend you observe in the provided chart. "
                    "Be concise (15–25 seconds)."
                )
            }
        }))

        print("🕑 Waiting for streamed reply...\n")

        # 5) Receive streamed text & audio
        audio_bytes = bytearray()
        sample_rate = 24000  # common server rate; adjust if your session reports differently

        async for msg in ws:
            evt = json.loads(msg)

            # Optional: live transcript/notes
            if evt.get("type") == "response.output_text.delta":
                print(evt.get("delta"), end="", flush=True)

            # Audio deltas (base64-encoded PCM16)
            elif evt.get("type") == "response.output_audio.delta":
                audio_bytes.extend(base64.b64decode(evt["delta"]))

            elif evt.get("type") == "response.completed":
                print("\n\n✅ Response completed.")
                break

        # 6) Save the final spoken reply
        if audio_bytes:
            pcm16 = np.frombuffer(bytes(audio_bytes), dtype=np.int16)
            sf.write(OUTPUT_WAV, pcm16, samplerate=sample_rate, subtype="PCM_16")
            print(f"💾 Saved assistant reply → {OUTPUT_WAV}")
        else:
            print("⚠️ No audio received from server.")

if __name__ == "__main__":
    asyncio.run(main())

Download the audio sample here: https://files.cuantum.tech/audio/user_prompt_spoken.wav

Download the chart image sample here: https://files.cuantum.tech/images/chart.png

Note: Save the example audio and chart image in the same location as the Python script.

Here's a code breakdown:

Key Components

  1. Imports and Setup: The script uses several Python libraries:
    • Standard libraries: os, io, json, base64, asyncio
    • websockets: For WebSocket communication with the GPT-5 Realtime API
    • numpy: For audio data manipulation
    • soundfile: For handling WAV file operations
    • PIL (Pillow): For image processing
  2. Configuration: The script defines important constants:
    • OPENAI_API_KEY: Retrieved from environment variables
    • REALTIME_URL: WebSocket endpoint for the GPT-5 Realtime API
    • Input/output file paths: Locations of input audio (WAV), input image (PNG), and output audio
    • VOICE_NAME: Selects "alloy" as the voice for the assistant's reply
    • SYSTEM_INSTRUCTIONS: Defines the assistant's behavior
  3. Helper Functions: Two utility functions for file handling:
    • read_wav_as_base64(): Reads a WAV file and converts it to base64 encoding
    • read_png_as_base64(): Reads a PNG image, ensures it's in the correct format, and converts it to base64
  4. Main Asynchronous Function: The core of the script with these main steps:
    • Input Validation: Checks if the API key is properly set
    • File Loading: Loads and encodes the audio and image files
    • WebSocket Connection: Establishes a connection to GPT-5 Realtime with proper headers
    • Session Configuration: Sets up a session with text, audio, and image modalities
    • Audio Input: Sends the spoken question (WAV) and commits the audio buffer
    • Image Input: Appends the chart image mid-session with metadata about its purpose
    • Response Request: Requests a spoken reply that addresses both inputs
    • Response Processing: Collects streamed text and audio chunks from the server
    • Output Saving: Converts received audio bytes to PCM16 format and saves as WAV

WebSocket Communication Flow

The script follows a specific protocol for communication with the GPT-5 Realtime API:

  1. Sends a session.update message to configure modalities and behavior
  2. Sends the audio data using input_audio_buffer.append and commits it
  3. Adds the image using input_image.append with metadata and commits it
  4. Creates a response request with specific instructions
  5. Processes incoming events in real-time:
    • Text deltas (transcription)
    • Audio deltas (spoken reply chunks)
    • Completion signal

Error Handling

The script includes basic error checking:

  • Validates the API key
  • Checks if audio was received from the server

Key Technical Aspects

The implementation showcases several important concepts:

  • Asynchronous programming with asyncio for non-blocking I/O
  • Base64 encoding for binary data transmission over WebSockets
  • Real-time streaming of both text and audio responses
  • Mid-session multimodality by combining different input types in one conversation
  • Proper WebSocket lifecycle management

This code example demonstrates the power of GPT-5 Realtime's ability to handle multiple modalities within a single ongoing conversation, allowing for more natural and fluid interactions.

5.2.4 Why Audio Integration Matters

Accessibility: Automatic transcription for the hearing impaired. This technology enables real-time conversion of spoken content into text, making digital media, meetings, and educational resources accessible to deaf and hard-of-hearing individuals. Modern transcription systems can work in real-time with high accuracy, providing captions for live events, lectures, and conversations, removing barriers to participation in many aspects of daily life and professional settings.

By integrating audio processing with language models, these systems can accurately capture nuances, different accents, and even distinguish between multiple speakers. This integration enables more contextual understanding, allowing the transcription to include important non-verbal audio cues, proper punctuation, and speaker identification. Advanced systems can also adapt to specialized terminology, regional dialects, and challenging acoustic environments, making information more accessible across diverse settings from medical appointments to entertainment media.

Education: Real-time translation and captions in classrooms. This application transforms how international students engage with lectures by providing immediate translations of spoken content. It also helps all students by generating accurate captions for recorded lectures, making review more efficient and allowing learners to search through spoken content based on keywords or concepts.

Advanced multimodal systems can detect lecture context and technical terminology, accurately translating specialized vocabulary while maintaining academic integrity. These systems can distinguish between different speakers in classroom discussions, properly attributing questions and responses in the transcription.

Furthermore, these technologies enable asynchronous learning by creating searchable archives of lectures that students can navigate by concept rather than timestamp. For students with learning differences such as ADHD or dyslexia, the synchronized visual and auditory information improves comprehension and retention.

The integration of AI with educational content also allows for personalized learning paths, where the system can identify concepts that individual students struggle with based on their engagement patterns and provide targeted supplementary material. This multimodal approach bridges accessibility gaps while enhancing the learning experience for all students.

Assistants: Voice-driven chatbots, smart speakers, and AI tutors. These systems create natural conversation flows by understanding spoken queries and generating contextually appropriate spoken responses. Advanced multimodal assistants can maintain conversational context over extended interactions, understand varying speech patterns, and respond with appropriate intonation and emphasis that matches the content being delivered.

Cross-lingual communication: Breaking down barriers with speech-to-speech translation. This technology enables conversations between people who speak different languages by capturing speech in one language, understanding its meaning, and generating natural-sounding speech in another language. Modern systems preserve speaker characteristics like tone, pace, and emotion, making the exchange feel more personal and authentic.

These systems represent a significant advancement over traditional translation tools by offering real-time communication without requiring text interfaces. The process involves three sophisticated steps: speech recognition to convert spoken words into text, machine translation to convert that text into another language, and text-to-speech synthesis to deliver the translation in a natural voice.
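
A cascaded version of this three-step pipeline can be sketched with off-the-shelf components: Whisper's translate task covers recognition and translation into English, and an offline TTS engine such as pyttsx3 stands in for the synthesis step. The input file name is a placeholder, and production systems would replace each stage with streaming, higher-quality models.

# Minimal cascaded speech-to-speech translation sketch:
#   foreign-language audio -> English text (Whisper) -> spoken English (pyttsx3)
# pip install openai-whisper pyttsx3
import whisper
import pyttsx3

# Steps 1 + 2: Whisper's "translate" task transcribes and translates into English
model = whisper.load_model("base")
result = model.transcribe("spanish_speech.mp3", task="translate")  # placeholder file
english_text = result["text"]
print("Translated text:", english_text)

# Step 3: text-to-speech synthesis of the English translation
engine = pyttsx3.init()
engine.say(english_text)
engine.runAndWait()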

The latest neural translation models understand cultural nuances and idioms that literal translations often miss. For example, when a Japanese speaker uses honorifics that don't exist in English, the system can adapt the output to convey appropriate respect through tone and word choice rather than direct translation.

Additionally, these technologies can adapt to various contexts - from business negotiations where precision is critical to casual conversations where fluidity matters more. Some advanced systems even maintain consistent voice profiles across languages, allowing a Spanish speaker's unique vocal characteristics to be present in the English translation, creating a more seamless and personalized communication experience.

Unlike older systems where speech recognition and language models were separate components chained together with potential information loss at each step, modern multimodal approaches fuse them into unified architectures that process acoustic and linguistic information simultaneously. This integration creates AI that listens and responds more naturally, understanding context across modalities and handling the ambiguities inherent in human communication.

5.2 Audio & Speech Integration (Whisper, SpeechLM)

Language is not only written but spoken. The ability to listen, transcribe, and respond to speech is essential if AI is to become a seamless assistant in daily life. Recent advances in speech recognition and speech-language modeling have made it possible to integrate audio directly into large-scale language systems. This integration represents a significant leap forward in AI capabilities, as it bridges the gap between written and spoken communication.

Speech is our most natural form of communication, and by enabling AI to process audio inputs, we create more intuitive interfaces that don't require users to type or read. This is particularly important for accessibility, allowing those with limited mobility, vision impairments, or literacy challenges to interact with technology. Furthermore, speech carries additional information through tone, pace, and emphasis that text alone cannot convey, providing richer context for AI systems to understand human intent.

Let's explore two key directions in speech-enabled AI:

Whisper – OpenAI's robust speech-to-text system. This open-source model represents a breakthrough in transcription technology with its ability to handle diverse accents, background noise, and technical vocabulary.

Unlike previous speech recognition systems that struggled with real-world audio conditions, Whisper demonstrates remarkable accuracy even with challenging inputs such as podcast conversations, lecture recordings, or phone calls.

SpeechLM / SpeechGPT – models that extend transformers to directly handle audio-text tasks. These advanced systems go beyond simple transcription by maintaining the connection between acoustic features and semantic meaning.

Rather than treating speech-to-text as a separate preprocessing step, they incorporate audio understanding directly into the language modeling process, enabling more nuanced responses that consider not just what was said, but how it was said.

5.2.1 Whisper: Universal Speech Recognition


Key features:

  • Trained on 680,000 hours of multilingual audio from the web, including a wide variety of accents, dialects, and background conditions. This massive and diverse training dataset enables Whisper to handle real-world audio that previous systems struggled with. The dataset's scale provides broad coverage across linguistic variations, regional accents, speaking styles, and acoustic environments, giving Whisper an unprecedented ability to understand speech in virtually any context. This extensive training directly translates to Whisper's ability to transcribe speech from speakers with accents or dialects traditionally underrepresented in AI training data.
  • Handles noisy, real-world audio (e.g., phone calls, lectures, podcasts, street recordings) with remarkable resilience. Unlike earlier models that performed well only in studio-quality conditions, Whisper maintains accuracy even with background noise, overlapping speakers, or varying microphone quality. This robustness stems from its exposure to diverse acoustic environments during training, allowing it to filter out irrelevant sounds and focus on the speech signal. Whether processing a recording from a busy café, a conference room with echoing acoustics, or an outdoor interview with wind interference, Whisper can extract the spoken content with surprising accuracy.
  • Supports transcription, translation, and language identification across 99 languages. This multilingual capability allows it to automatically detect the spoken language and process content from global sources without requiring manual language selection. Whisper can seamlessly transcribe content in languages ranging from widely-spoken ones like English, Spanish, and Mandarin to less common languages like Swahili, Lithuanian, and Nepali. This language versatility makes it an invaluable tool for global communication, international research, and cross-cultural content creation. Even more impressively, Whisper can identify when speakers switch between languages mid-conversation, a phenomenon known as code-switching. (A short language-detection sketch follows this list.)
  • Features zero-shot learning capabilities, meaning it can perform tasks it wasn't explicitly fine-tuned for, adapting to new scenarios without additional training. This remarkable ability allows Whisper to generalize its knowledge to unfamiliar contexts, speakers, and acoustic environments. For example, without specific fine-tuning, it can transcribe technical jargon in fields like medicine or engineering, understand regional dialects it hasn't explicitly seen before, or adapt to novel audio recording conditions. This zero-shot capability is particularly valuable in practical applications where the diversity of real-world speech would otherwise require countless specialized models for different scenarios.
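
To make the language-identification capability concrete, the sketch below runs Whisper's built-in detect_language method on the first 30 seconds of a clip (the file name is a placeholder):

import whisper

model = whisper.load_model("base")

# Load and trim/pad audio to the 30-second window Whisper's encoder expects
audio = whisper.load_audio("multilingual_clip.mp3")  # placeholder file
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and move it to the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect_language returns probabilities over all supported languages
_, probs = model.detect_language(mel)
top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:3]
for lang, p in top:
    print(f"{lang}: {p:.2%}")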

At its core, Whisper combines a log-Mel spectrogram encoder with a decoder similar to GPT, allowing it to map raw audio to natural language text. The encoder transforms audio waveforms into spectrograms—visual representations of sound frequencies over time—which capture the acoustic patterns in speech. This process begins by converting the raw audio signal into a spectrogram using the Short-Time Fourier Transform (STFT), which breaks down the audio into its frequency components.

These components are then mapped to the Mel scale, which approximates how humans perceive sound frequencies, with greater sensitivity to lower frequencies than higher ones. The resulting log-Mel spectrogram provides a compact representation of the audio that emphasizes the most perceptually relevant features.
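
The log-Mel front end described above can be reproduced in a few lines with librosa. The parameter values shown (16 kHz input, 80 Mel bins, 25 ms windows with a 10 ms hop) match the commonly cited Whisper configuration, but treat them as illustrative rather than a guaranteed match to the model's internal implementation.

import numpy as np
import librosa

# Illustrative parameters close to Whisper's front end (assumed, not extracted from the model)
SAMPLE_RATE = 16000
N_FFT = 400        # 25 ms window at 16 kHz
HOP_LENGTH = 160   # 10 ms hop, i.e. 100 frames per second
N_MELS = 80

audio, _ = librosa.load("speech_sample.mp3", sr=SAMPLE_RATE)  # file used in the example below

# STFT, then Mel filterbank, then log compression
mel = librosa.feature.melspectrogram(
    y=audio, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
)
log_mel = np.log10(np.maximum(mel, 1e-10))

print("Log-Mel spectrogram shape:", log_mel.shape)  # (n_mels, n_frames)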

These spectrograms are then processed through a transformer encoder that extracts meaningful features. The transformer architecture, with its self-attention mechanisms, allows the model to focus on different parts of the spectrogram simultaneously, capturing both local phonetic details and broader acoustic patterns. This is crucial for handling variations in speech like different accents, speaking rates, and background noise.

The GPT-style decoder then converts these features into text, treating transcription as a sequence prediction task similar to language modeling. This decoder works autoregressively, generating each word or token based on both the encoded audio features and the previously generated text. This approach enables Whisper to maintain contextual coherence throughout the transcription, correctly interpreting ambiguous sounds based on their surrounding context, and producing natural-sounding text that accurately reflects the original speech.

Example: Transcribing Audio with Whisper

# Comprehensive implementation of Whisper for audio transcription

# Install required libraries
# pip install git+https://github.com/openai/whisper.git
# pip install librosa matplotlib numpy

import whisper
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import time
import torch
from pathlib import Path

def visualize_audio(audio_path):
    """Visualize the audio waveform and spectrogram"""
    y, sr = librosa.load(audio_path)
    
    # Create a figure with two subplots
    plt.figure(figsize=(12, 8))
    
    # Plot waveform
    plt.subplot(2, 1, 1)
    librosa.display.waveshow(y, sr=sr)
    plt.title('Waveform')
    
    # Plot spectrogram
    plt.subplot(2, 1, 2)
    D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='log')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Log-frequency power spectrogram')
    
    plt.tight_layout()
    plt.show()

def transcribe_audio(audio_path, model_size="base", language=None, verbose=True):
    """
    Transcribe audio using OpenAI's Whisper model
    
    Parameters:
    - audio_path: Path to the audio file
    - model_size: Size of the Whisper model to use (tiny, base, small, medium, large)
    - language: Language code (e.g., "en" for English) or None for auto-detection
    - verbose: Whether to print progress information
    
    Returns:
    - Dictionary containing transcription results
    """
    start_time = time.time()
    
    if verbose:
        print(f"Loading Whisper model: {model_size}")
    
    # Load pre-trained Whisper model
    model = whisper.load_model(model_size)
    
    model_load_time = time.time()
    if verbose:
        print(f"Model loaded in {model_load_time - start_time:.2f} seconds")
        print(f"Transcribing: {audio_path}")
    
    # Set transcription options
    options = {}
    if language:
        options["language"] = language
    
    # Transcribe the audio file
    result = model.transcribe(audio_path, **options)
    
    end_time = time.time()
    if verbose:
        print(f"Transcription completed in {end_time - model_load_time:.2f} seconds")
        print(f"Detected language: {result['language']} (confidence: {result.get('language_probability', 0):.2f})")
        print(f"Total processing time: {end_time - start_time:.2f} seconds")
    
    return result

def save_transcription(result, output_file=None):
    """Save transcription results to a text file"""
    if output_file is None:
        output_file = "transcription_output.txt"
    
    with open(output_file, "w", encoding="utf-8") as f:
        # Write the full transcription
        f.write("FULL TRANSCRIPTION:\n")
        f.write(result["text"])
        f.write("\n\n")
        
        # Write segment-by-segment with timestamps
        f.write("SEGMENTS WITH TIMESTAMPS:\n")
        for segment in result["segments"]:
            start = segment["start"]
            end = segment["end"]
            text = segment["text"]
            f.write(f"[{start:.2f}s - {end:.2f}s] {text}\n")
    
    return output_file

def batch_transcribe(directory, extension=".mp3", output_dir=None):
    """Transcribe all audio files with the given extension in a directory"""
    if output_dir is None:
        output_dir = Path("transcription_results")
    
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)
    
    directory = Path(directory)
    audio_files = list(directory.glob(f"*{extension}"))
    
    print(f"Found {len(audio_files)} {extension} files in {directory}")
    
    for audio_file in audio_files:
        print(f"\nProcessing: {audio_file.name}")
        result = transcribe_audio(str(audio_file))
        
        output_file = output_dir / f"{audio_file.stem}_transcription.txt"
        save_transcription(result, output_file)
        print(f"Saved transcription to: {output_file}")

# Example usage
if __name__ == "__main__":
    # Set the path to your audio file
    audio_path = "speech_sample.mp3"
    
    # Check if CUDA is available for GPU acceleration
    cuda_available = torch.cuda.is_available()
    print(f"CUDA available: {cuda_available}")
    
    # Visualize the audio (optional)
    # visualize_audio(audio_path)
    
    # Transcribe the audio
    result = transcribe_audio(audio_path, model_size="base")
    
    # Print the transcription
    print("\nTRANSCRIPTION:")
    print(result["text"])
    
    # Save the transcription to a file
    output_file = save_transcription(result)
    print(f"\nSaved transcription to: {output_file}")
    
    # Example of batch processing
    # batch_transcribe("audio_folder", extension=".wav")

Download the speech sample here: https://files.cuantum.tech/audio/speech_sample.mp3

Note: Save the example audio in the same location as the Python script.

Breaking Down the Whisper Implementation:

1. Setup and Dependencies

The code begins by installing the necessary libraries: Whisper (directly from GitHub), librosa (for audio processing and visualization), matplotlib (for visualization), and numpy (for numerical operations). These libraries provide the foundation for audio processing and transcription.

2. Audio Visualization Function

The visualize_audio() function uses librosa to create two important visualizations:

  • A waveform display showing amplitude over time, which represents how the audio signal varies
  • A log-frequency spectrogram showing how energy is distributed across different frequencies over time, which helps analyze speech characteristics

These visualizations can help users understand the audio characteristics before transcription.

3. Core Transcription Function

The transcribe_audio() function is the heart of the implementation:

  • It accepts parameters for audio path, model size, language, and verbosity level
  • It loads the specified Whisper model (from tiny to large, with larger models being more accurate but slower)
  • It tracks processing time to provide performance metrics
  • It supports automatic language detection or allows specifying a language code
  • It returns a comprehensive result object containing the transcription and metadata

4. Results Processing

The save_transcription() function processes the Whisper results into user-friendly formats:

  • It saves the complete transcription text
  • It also extracts and formats individual segments with their timestamps, which is crucial for aligning transcription with audio timing
  • This enables applications like subtitle generation or time-synchronized content analysis (a short SRT sketch follows this list)
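
As a quick illustration of the subtitle use case, the sketch below converts the timestamped segments returned by Whisper into the SRT subtitle format. It assumes a result dictionary with the same "segments" structure produced by transcribe_audio() above; the save_srt helper name is purely illustrative.

def format_srt_timestamp(seconds):
    """Convert seconds (float) to the SRT timestamp format HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def save_srt(result, output_file="transcription_output.srt"):
    """Write Whisper segments as an SRT subtitle file."""
    with open(output_file, "w", encoding="utf-8") as f:
        for i, segment in enumerate(result["segments"], start=1):
            start = format_srt_timestamp(segment["start"])
            end = format_srt_timestamp(segment["end"])
            f.write(f"{i}\n{start} --> {end}\n{segment['text'].strip()}\n\n")
    return output_file

Calling save_srt(result) alongside save_transcription(result) would produce a subtitle file that most video players can load directly.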

5. Batch Processing Capability

The batch_transcribe() function extends the utility to handle multiple audio files:

  • It processes all audio files with a specified extension in a directory
  • It organizes outputs into a dedicated directory structure
  • This is valuable for transcribing podcasts, interview series, or lecture collections

6. Example Usage

The main execution block demonstrates how to use these functions in practice:

  • It checks for GPU acceleration via CUDA, which can significantly improve performance for larger models (see the loading sketch after this list)
  • It offers options for audio visualization (commented out by default)
  • It performs transcription and displays the results
  • It saves the output to a file for future reference
  • It includes a commented example of batch processing
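
If you want to act on the CUDA check rather than only print it, whisper.load_model accepts a device argument. The snippet below is a minimal sketch showing how the model could be pinned explicitly to the GPU when one is available, falling back to CPU otherwise; the variable names are illustrative.

import torch
import whisper

# Pick the device explicitly instead of relying on the library default
device = "cuda" if torch.cuda.is_available() else "cpu"

# Larger models benefit the most from GPU acceleration
model = whisper.load_model("base", device=device)
result = model.transcribe("speech_sample.mp3")
print(result["text"])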

Advanced Features:

This implementation goes beyond basic transcription by including:

  • Performance timing to measure processing efficiency
  • Language detection reporting
  • Segment-level transcription with timestamps
  • Hardware acceleration detection
  • Audio analysis capabilities
  • Batch processing for multiple files

This example implementation provides a complete workflow for audio transcription, from preprocessing through visualization, transcription, and results management, making it suitable for both individual use cases and larger-scale applications.

Example: Advanced implementation of Whisper for real-time transcription with visualization

import whisper
import numpy as np
import pyaudio
import threading
import time
import queue
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from collections import deque
import torch
import os
from datetime import datetime

class WhisperRealtimeTranscriber:
    def __init__(self, model_size="base", language="en", energy_threshold=0.01,
                 record_timeout=2, phrase_timeout=3, max_sentences=10):
        """
        Initialize the real-time transcriber with Whisper
        
        Parameters:
        - model_size: Size of Whisper model ("tiny", "base", "small", "medium", "large")
        - language: Language code or None for auto-detection
        - energy_threshold: Minimum RMS energy (float32 samples in [-1, 1]) to treat a chunk as speech
        - record_timeout: Maximum seconds of buffered speech before a chunk is sent for transcription
        - phrase_timeout: Time in seconds of silence to consider a phrase complete
        - max_sentences: Maximum number of sentences to display in history
        """
        self.model_name = model_size
        self.language = language
        self.energy_threshold = energy_threshold
        self.record_timeout = record_timeout
        self.phrase_timeout = phrase_timeout
        self.max_sentences = max_sentences
        
        # Check for GPU
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {self.device}")
        
        # Load Whisper model
        print(f"Loading Whisper {model_size} model...")
        self.model = whisper.load_model(model_size).to(self.device)
        print("Model loaded!")
        
        # Initialize audio processing
        self.audio_queue = queue.Queue()
        self.result_queue = queue.Queue()
        self.audio_data = np.zeros(0, dtype=np.float32)
        
        # For visualization
        self.audio_buffer = deque(maxlen=4000)  # last 4000 samples (~0.25 seconds at 16kHz) for the waveform plot
        self.waveform_data = np.zeros(4000)
        self.spectrogram_data = np.zeros((201, 80))  # placeholder: [time frames, 80 mel bins]
        self.transcript_history = []
        self.recording = False
        self.terminated = False
        
        # Audio parameters
        self.sample_rate = 16000
        self.audio_format = pyaudio.paFloat32
        self.channels = 1
        self.chunk = 1024
        
        # Setup PyAudio
        self.p = pyaudio.PyAudio()
        
    def _get_audio_input_stream(self):
        """Create and return an input audio stream"""
        stream = self.p.open(
            format=self.audio_format,
            channels=self.channels,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk
        )
        return stream
    
    def _audio_capture_thread(self):
        """Thread function for capturing audio"""
        stream = self._get_audio_input_stream()
        last_sample = bytes()
        phrase_time = None
        
        print("Listening for audio...")
        
        try:
            while not self.terminated:
                # Get new audio chunk
                current_sample = stream.read(self.chunk, exception_on_overflow=False)
                
                # Convert to numpy array
                data = np.frombuffer(current_sample, dtype=np.float32)
                
                # Update audio buffer for visualization
                self.audio_buffer.extend(data)
                self.waveform_data = np.array(list(self.audio_buffer))
                
                # Calculate audio energy
                energy = np.sqrt(np.mean(data**2))
                
                # Detect if audio is speech
                if energy > self.energy_threshold:
                    self.recording = True
                    
                    # Reset phrase timeout
                    phrase_time = None
                    
                    # Add audio to processing queue
                    self.audio_data = np.append(self.audio_data, data)
                
                # Handle phrase timeout
                elif self.recording:
                    if phrase_time is None:
                        phrase_time = time.time()
                    
                    # If enough silence, process the audio phrase
                    if time.time() - phrase_time > self.phrase_timeout:
                        if len(self.audio_data) > 0:
                            self.audio_queue.put(self.audio_data.copy())
                            self.audio_data = np.zeros(0, dtype=np.float32)
                        
                        self.recording = False
                        phrase_time = None
                
                # Process fixed chunks of audio regardless of speech detection
                if len(self.audio_data) > self.sample_rate * self.record_timeout:
                    self.audio_queue.put(self.audio_data.copy())
                    self.audio_data = self.audio_data[int(self.sample_rate * self.record_timeout):]
                
                time.sleep(0.01)
                
        finally:
            stream.stop_stream()
            stream.close()
    
    def _transcription_thread(self):
        """Thread function for processing audio with Whisper"""
        while not self.terminated:
            try:
                # Get audio data from queue
                if self.audio_queue.empty():
                    time.sleep(0.1)
                    continue
                
                audio_data = self.audio_queue.get()
                
                # Skip processing very short audio clips
                if len(audio_data) < 0.5 * self.sample_rate:
                    continue
                
                # Process audio with Whisper
                start_time = time.time()
                
                # Whisper's transcribe() accepts a float32 NumPy array directly,
                # so no manual tensor conversion is needed here.
                
                # Generate Mel spectrogram for visualization (shape: [80 mel bins, time frames])
                mel = whisper.log_mel_spectrogram(audio_data)
                self.spectrogram_data = mel.T.cpu().numpy()  # transpose to [time frames, 80 mel bins]
                
                # Transcribe with Whisper
                options = {"language": self.language} if self.language else {}
                result = self.model.transcribe(audio_data, **options)
                
                # Get transcription result
                text = result["text"].strip()
                elapsed = time.time() - start_time
                
                # Skip empty results
                if len(text) == 0:
                    continue
                
                # Add timestamp and transcription to history
                timestamp = datetime.now().strftime("%H:%M:%S")
                entry = f"[{timestamp}] {text}"
                self.transcript_history.append(entry)
                
                # Keep only most recent entries
                if len(self.transcript_history) > self.max_sentences:
                    self.transcript_history = self.transcript_history[-self.max_sentences:]
                
                # Print result
                print(f"Transcribed ({elapsed:.2f}s): {text}")
                
            except Exception as e:
                print(f"Error in transcription thread: {e}")
                
    def _update_visualization(self, frame):
        """Update function for matplotlib animation"""
        # Clear previous plots
        plt.clf()
        
        # Plot audio waveform
        plt.subplot(3, 1, 1)
        plt.plot(self.waveform_data)
        plt.title("Audio Waveform")
        plt.ylim([-0.5, 0.5])
        
        # Plot status
        if self.recording:
            plt.gca().set_facecolor((0.9, 0.9, 1))
            plt.title("Audio Waveform - RECORDING")
        
        # Plot Mel spectrogram
        plt.subplot(3, 1, 2)
        plt.imshow(self.spectrogram_data, aspect='auto', origin='lower')
        plt.title("Mel Spectrogram")
        plt.tight_layout()
        
        # Show transcript history
        plt.subplot(3, 1, 3)
        plt.axis('off')
        history_text = "\n".join(self.transcript_history)
        plt.text(0.05, 0.95, history_text, 
                 verticalalignment='top', wrap=True, fontsize=9,
                 bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.2))
        plt.title("Transcript History")
        
        # Adjust layout
        plt.subplots_adjust(hspace=0.5)
        
    def start(self, visualize=True):
        """Start the real-time transcription system"""
        # Start audio capture thread
        audio_thread = threading.Thread(target=self._audio_capture_thread)
        audio_thread.daemon = True
        audio_thread.start()
        
        # Start transcription thread
        transcription_thread = threading.Thread(target=self._transcription_thread)
        transcription_thread.daemon = True
        transcription_thread.start()
        
        try:
            if visualize:
                # Set up visualization
                plt.figure(figsize=(10, 8))
                ani = FuncAnimation(plt.gcf(), self._update_visualization, interval=100)
                plt.show()
            else:
                # Just keep the main thread alive
                while True:
                    time.sleep(1)
        except KeyboardInterrupt:
            print("Stopping...")
        finally:
            self.terminated = True
            self.p.terminate()

    def save_transcript(self, filename=None):
        """Save the transcript history to a file"""
        if filename is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"transcript_{timestamp}.txt"
        
        with open(filename, "w", encoding="utf-8") as f:
            for entry in self.transcript_history:
                f.write(f"{entry}\n")
        
        print(f"Transcript saved to {filename}")

# Example usage
if __name__ == "__main__":
    # Create and start the transcriber
    transcriber = WhisperRealtimeTranscriber(
        model_size="base",
        language="en",
        energy_threshold=0.01,
        record_timeout=2,
        phrase_timeout=1
    )
    
    try:
        transcriber.start(visualize=True)
    except KeyboardInterrupt:
        pass
    finally:
        transcriber.save_transcript()

Note: To use this example code, you'll need to speak into your microphone during program execution.

Breaking Down the Real-Time Whisper Implementation:

1. Overall Architecture

This advanced implementation creates a real-time speech transcription system using Whisper. Unlike the previous example that processes existing files, this version:

  • Captures live audio input from a microphone
  • Processes audio in chunks as it arrives
  • Provides real-time visualization of the audio signal and transcription
  • Runs Whisper inference continuously on a separate thread

2. Class Structure and Initialization

The WhisperRealtimeTranscriber class encapsulates the entire system:

  • Manages multiple threads for audio capture and processing
  • Maintains queues for communication between threads
  • Configures parameters like energy thresholds for speech detection
  • Initializes visualization components including waveform and spectrogram displays
  • Sets up the Whisper model with GPU acceleration when available

3. Audio Capture System

The _audio_capture_thread method handles continuous audio input:

  • Uses PyAudio to access the microphone stream
  • Implements energy-based voice activity detection to identify speech
  • Manages "phrases" by detecting pauses between speech segments
  • Updates a circular buffer for visualization purposes
  • Queues detected speech for transcription processing

4. Whisper Transcription Engine

The _transcription_thread implements the core speech-to-text functionality:

  • Retrieves audio segments from the queue when available
  • Filters out audio clips that are too short
  • Generates mel spectrograms for both transcription and visualization
  • Runs the Whisper model inference to convert speech to text
  • Maintains a transcript history with timestamps
  • Measures and reports processing time for performance monitoring

5. Real-Time Visualization

The _update_visualization method creates an interactive dashboard:

  • Displays the audio waveform with recording status indicator
  • Shows the mel spectrogram representation used by Whisper
  • Provides a scrolling transcript history panel
  • Updates dynamically using Matplotlib's animation functionality

6. User Interface and Control Flow

The start method orchestrates the system operation:

  • Launches audio capture and transcription threads
  • Sets up the visualization if enabled
  • Handles clean shutdown on user interruption

7. Practical Applications

This implementation offers several advantages over the previous example:

  • Live transcription: Process speech as it happens rather than from files
  • Continuous operation: Run indefinitely for real-time applications
  • Visual feedback: See both the audio signal and the corresponding transcription
  • Speech detection: Automatically identify when someone is speaking
  • Performance monitoring: Track processing times to optimize for real-time use

8. Use Cases for Real-Time Whisper

This implementation is particularly useful for:

  • Live captioning for presentations or meetings
  • Real-time transcription for accessibility purposes
  • Interactive voice-controlled applications
  • Speech analytics and monitoring systems
  • Educational tools showing the relationship between speech and its transcription

9. Technical Considerations

The implementation addresses several challenges:

  • Balancing latency vs. accuracy through parameter tuning (see the configuration sketch after this list)
  • Managing computational resources with threading
  • Providing visual feedback without affecting performance
  • Detecting speech vs. silence for efficient processing
  • Formatting and storing transcription results
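
As a concrete example of the latency/accuracy trade-off, the sketch below shows two illustrative configurations of the WhisperRealtimeTranscriber class defined above: one tuned for responsiveness and one tuned for accuracy. The specific values are starting points to experiment with, not recommendations.

# Low-latency configuration: favors fast feedback over transcription quality
fast_transcriber = WhisperRealtimeTranscriber(
    model_size="tiny",      # smallest model, fastest inference
    language="en",
    energy_threshold=0.01,  # RMS threshold on float32 samples
    record_timeout=1,       # flush buffered audio for transcription every ~1 second
    phrase_timeout=0.5,     # treat short pauses as phrase boundaries
)

# Higher-accuracy configuration: tolerates more delay for better output
accurate_transcriber = WhisperRealtimeTranscriber(
    model_size="small",     # better accuracy, slower inference
    language="en",
    energy_threshold=0.01,
    record_timeout=4,       # give the model longer audio context
    phrase_timeout=1.5,     # wait longer before closing a phrase
)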

This real-time implementation represents a significant enhancement over batch processing, enabling interactive applications where immediate transcription is required.

5.2.2 SpeechLM and SpeechGPT: Language Models that Listen

While Whisper excels at ASR (Automatic Speech Recognition), models like SpeechLM and SpeechGPT go a step further: they integrate speech and text into a single transformer framework. This represents a fundamental shift from traditional approaches where speech processing and text understanding were handled by completely separate systems.

This integration is revolutionary because it allows these models to process both modalities simultaneously rather than treating them as separate processing pipelines. By unifying speech and text in the same architecture, these models can leverage contextual information across modalities, resulting in more coherent and contextually appropriate responses. The direct connection between acoustic patterns and semantic meaning enables these models to capture nuances like tone, emphasis, and rhythm that might be lost in a pipeline approach.

To understand the significance of this advancement, consider how traditional speech systems work: first, an ASR component converts audio to text transcripts, then a separate natural language processing (NLP) system analyzes the transcript. Each transition between systems creates an opportunity for information loss. Important acoustic cues such as speaker emotion, sarcastic intonation, or emphasis on specific words are typically stripped away during the initial transcription step.

SpeechLM and SpeechGPT, in contrast, maintain a continuous representation of the speech signal throughout the entire processing chain. This approach preserves crucial paralinguistic information—the non-verbal aspects of communication that often carry significant meaning. For instance, the same phrase spoken with different intonation patterns might convey completely different intentions, from sincere agreement to sarcastic dismissal. By keeping the acoustic signal and its linguistic interpretation linked throughout processing, these models can detect such subtleties.

The technical architecture enabling this integration typically involves specialized encoder modules that process raw audio waveforms or spectrograms into dense vector representations. These speech embeddings are then projected into the same latent space as text embeddings, allowing the transformer's attention mechanisms to establish connections between corresponding elements in both modalities. This cross-modal attention is the key innovation that enables these models to "listen" in a more human-like way.
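
To make the idea of a shared latent space concrete, the following minimal PyTorch sketch projects speech features into the same dimensionality as text embeddings and lets text tokens attend over the audio frames with standard cross-attention. It is a toy illustration of the mechanism described above, not the actual SpeechLM or SpeechGPT architecture; all dimensions and module names are invented for the example.

import torch
import torch.nn as nn

class ToyCrossModalBlock(nn.Module):
    """Toy block in which text tokens attend over projected speech frames."""
    def __init__(self, speech_dim=768, text_dim=512, num_heads=8):
        super().__init__()
        # Project speech features into the text embedding space
        self.speech_proj = nn.Linear(speech_dim, text_dim)
        # Cross-attention: queries come from text, keys/values from speech
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_emb, speech_feats):
        speech_emb = self.speech_proj(speech_feats)             # [batch, audio frames, text_dim]
        attended, _ = self.cross_attn(text_emb, speech_emb, speech_emb)
        return self.norm(text_emb + attended)                   # residual connection + layer norm

# Random tensors stand in for real text embeddings and speech features
block = ToyCrossModalBlock()
text_emb = torch.randn(2, 20, 512)       # 2 sequences of 20 text tokens
speech_feats = torch.randn(2, 100, 768)  # 2 clips of 100 audio frames
fused = block(text_emb, speech_feats)
print(fused.shape)  # torch.Size([2, 20, 512])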

Unlike traditional systems where speech is first converted to text and then processed by a language model (creating potential information loss at each step), these unified models maintain the richness of the original speech signal throughout processing. This preserves important paralinguistic features such as emotion, speaker identity, and conversational dynamics that are crucial for truly understanding spoken language in context.

SpeechLM (Microsoft):

Pretrained on paired audio–text data, allowing it to develop rich representations that capture both acoustic and linguistic information. This dual-modality training approach enables the model to understand not just what words are being said, but also how they're being said, including tone, emphasis, and speaker characteristics. The model processes raw audio waveforms alongside corresponding transcripts, learning to associate specific acoustic patterns with their semantic meanings. For example, it can distinguish between a question and a statement based on rising or falling intonation, even when the words are identical.

Learns to align acoustic features with linguistic tokens through innovative cross-modal attention mechanisms that map speech patterns to their textual representations. This alignment process creates a shared semantic space where speech and text can interact seamlessly, enabling more accurate interpretation of spoken language. These mechanisms work by establishing bidirectional connections between audio segments and corresponding text tokens, allowing information to flow freely across modalities. When processing a sentence, the model can simultaneously attend to both the acoustic signal and the linguistic structure, creating a unified representation that preserves both aspects of communication.

Supports tasks like speech-to-text, spoken translation, and speech understanding, with superior performance compared to pipeline approaches due to its end-to-end training methodology. By training all components together, SpeechLM avoids error propagation issues common in traditional pipeline systems where mistakes in early stages cascade through the system. In conventional approaches, if the ASR component misrecognizes a word, all downstream components (like translation or understanding) inherit that error. SpeechLM's unified approach allows later processing stages to potentially compensate for earlier uncertainties by leveraging broader contextual information and cross-modal cues, similar to how humans can understand slightly mispronounced words in context.

Utilizes self-supervised learning techniques to maximize learning from limited paired data, enabling robust performance even with limited annotations. These techniques include masked language modeling adapted for speech inputs, contrastive learning between speech and text representations, and consistency regularization across modalities. During training, the model might be presented with an audio segment where certain portions are masked out, requiring it to predict the missing acoustic information based on surrounding context and any available text. Similarly, it learns to minimize the distance between representations of the same content expressed in different modalities, helping to align the speech and text embedding spaces. This approach allows SpeechLM to leverage large quantities of unpaired speech or text data alongside smaller amounts of parallel data.

Incorporates advanced contextual understanding that allows it to better handle ambiguous speech, speaker variations, and noisy environments compared to traditional ASR systems. By maintaining a rich contextual representation throughout processing, SpeechLM can disambiguate homophones (words that sound alike but have different meanings) based on broader semantic context, adapt to different accents and speaking styles by recognizing patterns across larger segments of speech, and filter out background noise by distinguishing relevant speech patterns from irrelevant acoustic information. The model's attention mechanisms can focus on the most informative parts of the signal while de-emphasizing distracting elements, similar to how humans can follow a conversation in a crowded room—often called the "cocktail party effect."

SpeechGPT:

Extends LLMs to work directly with speech as input/output, eliminating the need for separate ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) systems in conversational applications. Traditional conversational AI systems typically require a pipeline approach where speech is first converted to text, processed by a language model, and then converted back to speech for the response. This multi-stage process introduces latency at each conversion point and often loses important acoustic information along the way.

SpeechGPT, however, integrates these components into a unified architecture that processes speech signals end-to-end. This direct integration enables smoother conversational flow, as the model processes speech signals directly without converting to intermediate text representations, reducing latency and preserving acoustic nuances that might be lost in traditional pipeline approaches. By maintaining the integrity of the original speech signal throughout processing, SpeechGPT can detect subtle variations in tone, rhythm, and emphasis that carry important communicative information beyond the literal words being spoken.

Can transcribe, understand, and even generate spoken dialogue in a unified framework, maintaining conversational context across multiple turns. Unlike traditional systems that process each utterance independently, SpeechGPT maintains a continuous memory of the conversation, allowing it to reference previous statements and generate contextually appropriate responses that acknowledge shared history between speakers.

This contextual awareness means the model can track topics across multiple exchanges, resolve ambiguous references, and respond appropriately to follow-up questions without requiring users to restate information. For example, if a user asks about the weather today and then follows up with "What about tomorrow?", SpeechGPT can understand that the second question is still about weather without explicit specification. This ability to maintain conversational state mirrors human dialogue patterns where context is implicitly understood and carried forward, creating more natural and efficient interactions.

Useful for conversational agents that naturally handle both modalities, creating more human-like interactions without modal switching delays. This seamless integration between speech and text processing mimics human communication patterns where we naturally shift between listening and speaking without conscious mode switching, enabling more fluid and natural dialogues with AI systems. In practice, this means users can speak directly to the system and receive spoken responses without perceiving any translation happening behind the scenes.

For applications like virtual assistants, customer service bots, or educational tutors, this creates a significantly more natural user experience that reduces cognitive load on users. The elimination of perceptible modal transitions also increases accessibility for users who may struggle with text interfaces, such as those with visual impairments, reading difficulties, or situations where looking at a screen is impractical (like while driving).

Demonstrates improved prosody and intonation in generated speech by leveraging the semantic understanding capabilities of the underlying LLM. By comprehending the meaning, emotion, and pragmatic intent behind responses, SpeechGPT can apply appropriate stress patterns, rhythm variations, and tonal shifts that convey not just what is being said, but how it should be said to effectively communicate meaning. This represents a significant advance over traditional TTS systems that often produce flat, monotonous speech that lacks the natural variations human speakers use to express meaning.

For instance, when expressing excitement, the system can increase pitch and speed; when conveying serious information, it can adopt a more measured pace with appropriate emphasis on key points. These prosodic features are crucial for effective communication, as they help listeners interpret the speaker's intentions, distinguish between questions and statements, identify important information, and understand emotional context. The ability to generate appropriately expressive speech makes interactions feel more natural and helps ensure that the intended meaning is accurately conveyed to users.

How They Work:

  1. Convert audio into speech embeddings using a feature extractor (like wav2vec2), which captures phonetic, prosodic, and speaker information from raw waveforms into dense vector representations. This process transforms complex audio signals into numerical matrices that preserve crucial linguistic features including pronunciation patterns, speech rhythm, emotional tone, and individual voice characteristics. The resulting embeddings create a mathematical representation of speech that models can process efficiently while maintaining the rich acoustic properties of the original audio.
  2. Align embeddings with text tokens in the transformer through cross-attention mechanisms, creating a joint representation space where acoustic and linguistic features can interact freely. These mechanisms allow the model to establish connections between corresponding elements in both modalities, mapping specific acoustic patterns to their textual counterparts. This alignment process creates bidirectional pathways that enable information to flow between speech and text representations, facilitating tasks like spoken language understanding where both the content and delivery of speech matter.
  3. Train on tasks that require both listening and understanding, such as answering questions about spoken content or following verbal instructions, to develop robust multimodal comprehension abilities. This training approach forces the model to process auditory and textual information simultaneously, extracting meaning from both channels and integrating them into a unified semantic representation. By presenting the model with increasingly complex spoken language understanding challenges, it learns to recognize not just what words are being said, but also how context, emphasis, and tone modify their meaning.
  4. Utilize specialized loss functions that encourage semantic consistency between speech and text representations, ensuring that information is preserved across modality boundaries. These loss functions compare the model's internal representations of the same content expressed in different modalities and penalize inconsistencies, driving the model to develop aligned feature spaces. By minimizing the distance between representations of equivalent content across modalities, these functions help the model build a cohesive understanding regardless of whether information arrives as text or speech. (See the sketch after this list.)
  5. Employ curriculum learning strategies that gradually increase task complexity, starting with simple speech recognition before progressing to more complex understanding and generation tasks. This staged approach begins with basic transcription to establish fundamental audio-text mappings, then advances to more sophisticated tasks like identifying speaker intent, recognizing emotion, and generating contextually appropriate responses to spoken queries. The progressive difficulty helps the model develop a hierarchy of speech understanding capabilities, from low-level acoustic processing to high-level semantic interpretation.
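
The sketch below illustrates step 4 in miniature: a symmetric contrastive loss that pulls pooled speech and text representations of the same utterance together while pushing mismatched pairs apart. It is a generic InfoNCE-style objective written for illustration, not the exact loss used by SpeechLM or SpeechGPT.

import torch
import torch.nn.functional as F

def speech_text_consistency_loss(speech_vecs, text_vecs, temperature=0.07):
    """
    Symmetric contrastive loss between paired speech and text embeddings.
    speech_vecs, text_vecs: [batch, dim]; row i of each describes the same utterance.
    """
    speech_vecs = F.normalize(speech_vecs, dim=-1)
    text_vecs = F.normalize(text_vecs, dim=-1)
    logits = speech_vecs @ text_vecs.t() / temperature   # [batch, batch] similarity matrix
    targets = torch.arange(speech_vecs.size(0))          # matching pairs lie on the diagonal
    loss_s2t = F.cross_entropy(logits, targets)          # speech -> text direction
    loss_t2s = F.cross_entropy(logits.t(), targets)      # text -> speech direction
    return (loss_s2t + loss_t2s) / 2

# Toy usage with random stand-in embeddings
speech_vecs = torch.randn(8, 512)
text_vecs = torch.randn(8, 512)
print(speech_text_consistency_loss(speech_vecs, text_vecs))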

Code Example: SpeechLM Implementation

from transformers import AutoProcessor, SpeechLMForSpeechToText
import torch
import soundfile as sf
import librosa

# Load pretrained SpeechLM model and processor
model_id = "microsoft/speechlm-large-960h"
processor = AutoProcessor.from_pretrained(model_id)
model = SpeechLMForSpeechToText.from_pretrained(model_id)

# Load and preprocess audio file
audio_file = "speechlm-example.mp3"
speech, sample_rate = sf.read(audio_file)

# Resample if necessary
if sample_rate != 16000:
    speech = librosa.resample(speech, orig_sr=sample_rate, target_sr=16000)
    sample_rate = 16000

# Prepare inputs
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

# Generate transcription
with torch.no_grad():
    generated_ids = model.generate(
        inputs["input_features"],
        num_beams=5,
        max_length=100
    )

# Decode the output tokens
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

# For speech understanding tasks, we can also get embeddings
with torch.no_grad():
    outputs = model(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        decoder_input_ids=processor.get_decoder_prompt_ids(task="asr")
    )
    
    # Get speech embeddings from the encoder
    speech_embeddings = outputs.encoder_last_hidden_state
    print(f"Speech embeddings shape: {speech_embeddings.shape}")
    
    # These embeddings can be used for downstream tasks like
    # speaker identification, emotion recognition, or semantic analysis

Download the speech sample here: https://files.cuantum.tech/audio/speechlm-example.mp3

Note: Save the example audio in the same location as the Python script.

Code Breakdown: SpeechLM Implementation

This SpeechLM code example demonstrates how to use Microsoft's SpeechLM model for speech transcription and understanding. Let's examine each component:

  1. Imports and Model Loading: The code imports necessary libraries and loads the pretrained SpeechLM model and processor from Hugging Face. SpeechLM is a speech-language model that can process raw audio waveforms and perform tasks like transcription and understanding.
  2. Audio Processing: The audio file is loaded using soundfile and potentially resampled to 16kHz (the standard sampling rate expected by most speech models). This preprocessing ensures the audio input matches the format expected by the model regardless of the source recording conditions.
  3. Input Preparation: The processor converts the raw audio waveform into the model's expected input format. This includes extracting acoustic features (similar to spectrograms) and preparing attention masks to handle variable-length inputs. These features capture the phonetic and prosodic information from the speech signal.
  4. Transcription Generation: The model.generate() method performs beam search decoding to convert the audio features into text. This process uses the model's encoder-decoder architecture to map speech representations to text tokens. The num_beams parameter controls how many alternative hypotheses the model considers during decoding, while max_length limits the output length.
  5. Decoding: The processor.batch_decode() function converts the generated token IDs back into human-readable text, removing any special tokens (like padding or end-of-sequence markers) that are used internally by the model but aren't part of the actual transcription.
  6. Speech Embeddings Extraction: Beyond simple transcription, the code demonstrates how to access the model's internal representations of speech. The encoder_last_hidden_state contains rich contextual embeddings that capture both acoustic and linguistic properties of the speech. These embeddings preserve paralinguistic features (tone, emphasis, emotion) that might be lost in text transcription.

Technical Insights on SpeechLM's Architecture

SpeechLM represents a significant advancement in speech processing for several reasons:

  • Unified encoder-decoder architecture: Unlike pipeline approaches that separate ASR and language understanding, SpeechLM processes the entire speech-to-meaning pathway in a single model, reducing error propagation between components.
  • Contextual understanding: The transformer architecture allows the model to capture long-range dependencies in speech, helping it understand content based on the broader context rather than just isolated segments.
  • Cross-modal pretraining: SpeechLM is pretrained on paired speech-text data, allowing it to develop aligned representations between acoustic and linguistic features. This alignment enables more accurate transcription and understanding of spoken language.
  • Speech embeddings: The model's encoder produces contextualized speech embeddings that preserve both linguistic content and paralinguistic features (like speaker identity, emotion, and emphasis). These rich representations can be used for downstream tasks beyond basic transcription.

Practical Applications

The speech embeddings extracted in the example could be used for the following tasks (a brief classification sketch follows this list):

  • Speaker recognition: Identifying who is speaking based on voice characteristics preserved in the embeddings.
  • Emotion detection: Analyzing the emotional tone of speech from acoustic patterns.
  • Intent classification: Determining what the speaker wants to accomplish (ask a question, make a request, etc.).
  • Speech translation: Converting speech in one language to text in another by connecting the speech embeddings to a translation model.
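
As a sketch of how such downstream tasks could be built, the snippet below mean-pools the encoder output (like the speech_embeddings tensor extracted in the example above) and feeds it to a small linear classifier, for instance over emotion labels. The classifier head, label set, and random stand-in tensor are invented for illustration; a real head would need to be trained on labeled data.

import torch
import torch.nn as nn

# Stand-in for encoder output with shape [batch, time, hidden_dim]
speech_embeddings = torch.randn(1, 120, 768)

# Mean-pool over time to get one vector per utterance
utterance_vec = speech_embeddings.mean(dim=1)   # [batch, hidden_dim]

# Hypothetical 4-way emotion classifier head (untrained here)
num_emotions = 4  # e.g., neutral, happy, sad, angry
classifier = nn.Linear(utterance_vec.size(-1), num_emotions)

with torch.no_grad():
    logits = classifier(utterance_vec)
    probs = torch.softmax(logits, dim=-1)
print("Predicted emotion distribution:", probs)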

SpeechLM represents an important step toward truly integrated speech-language models that process spoken language in a more human-like way, maintaining the rich acoustic information that gives speech its nuanced meaning beyond just the words being said.

Code Example: SpeechGPT Implementation

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Load pretrained SpeechGPT model and processor
model_id = "microsoft/speech_gpt2_oaitr"  # Note: This is an example model ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)

# Function to handle conversational speech input and output
def speech_conversation(audio_path, conversation_history=None):
    # Load audio file
    waveform, sample_rate = torchaudio.load(audio_path)
    
    # Resample if necessary (SpeechGPT typically expects 16kHz audio)
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)
        sample_rate = 16000
    
    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)
    
    # Process audio input
    inputs = processor(
        audio=waveform.squeeze().numpy(),
        sampling_rate=sample_rate,
        return_tensors="pt",
        conversation_history=conversation_history
    )
    
    # Generate response
    with torch.no_grad():
        output = model.generate(
            input_features=inputs["input_features"],
            attention_mask=inputs.get("attention_mask"),
            max_length=100,
            num_beams=5,
            early_stopping=True,
            conversation_history=inputs.get("conversation_history")
        )
    
    # Process the output
    transcription = processor.decode(output[0], skip_special_tokens=True)
    
    # Optional: Convert response to speech
    speech_output = model.generate_speech(
        output,
        speaker_embeddings=inputs.get("speaker_embeddings")
    )
    
    # Save the generated speech
    torchaudio.save(
        "response.wav", 
        speech_output.squeeze().unsqueeze(0), 
        16000
    )
    
    # Update conversation history
    new_conversation_history = {
        "input_speech": waveform.squeeze().numpy(),
        "output_text": transcription,
        "output_speech": speech_output.squeeze().numpy()
    }
    
    if conversation_history:
        conversation_history.append(new_conversation_history)
    else:
        conversation_history = [new_conversation_history]
    
    return transcription, speech_output, conversation_history

# Example usage
if __name__ == "__main__":
    # Start a new conversation
    conversation_history = None
    
    # First interaction
    user_query = "user_question_.mp3"  # Path to audio file with user's question
    response_text, response_audio, conversation_history = speech_conversation(
        user_query, conversation_history
    )
    
    print(f"User (transcribed): {response_text}")
    
    # Second interaction (with conversation history for context)
    follow_up_query = "user_follow_up.mp3"  # Path to follow-up question audio
    response_text2, response_audio2, conversation_history = speech_conversation(
        follow_up_query, conversation_history
    )
    
    print(f"User follow-up (transcribed): {response_text2}")

Download the user question audio sample here: https://files.cuantum.tech/audio/user_question_.mp3

Download the user follow up audio sample here: https://files.cuantum.tech/audio/user_follow_up.mp3

Note: Save the example audios in the same location as the Python script.

Code Breakdown: SpeechGPT Implementation

  1. Imports and Model Loading: The code imports PyTorch, torchaudio, and Hugging Face transformers to work with the SpeechGPT model. We load a pretrained model and processor that can handle both speech input and output in a conversational context.
  2. Conversation Function: The speech_conversation function serves as the core component, handling the entire speech-to-speech conversation flow. It takes an audio path and optional conversation history as inputs.
  3. Audio Preprocessing: The function loads the audio file using torchaudio, ensures it's at the required 16kHz sample rate (resampling if necessary), and converts stereo to mono if needed. These preprocessing steps ensure the audio meets the model's input requirements.
  4. Input Processing: The processor converts the raw audio waveform into the feature representations expected by SpeechGPT. Importantly, it includes the conversation history parameter, which allows the model to maintain context across multiple turns.
  5. Response Generation: The model generates a response based on the speech input and conversation context. The generation parameters control the quality and length of the response: 
    • max_length: Limits the response length
    • num_beams: Uses beam search with 5 beams for better quality responses
    • early_stopping: Terminates generation when all beams reach an end token
    • conversation_history: Provides context from previous exchanges
  6. Speech Synthesis: Unlike traditional models that would require a separate TTS system, SpeechGPT can directly generate speech output from its internal representations. The generate_speech method converts the text response into audio, maintaining speaker characteristics if provided.
  7. Conversation State Management: The function tracks conversation history by storing each exchange (input speech, output text, output speech) in a structured format. This history is passed to subsequent calls, enabling the model to reference previous information.
  8. Example Usage: The code demonstrates a two-turn conversation, showing how to: 
    • Start a new conversation (empty history)
    • Process the first user query
    • Maintain conversation context
    • Handle a follow-up question while preserving context

Technical Insights on SpeechGPT Architecture

SpeechGPT represents a significant advancement in speech-language models by integrating several key architectural innovations:

  • End-to-end speech-to-speech framework: Unlike traditional pipeline approaches that separate ASR, language understanding, and TTS components, SpeechGPT unifies these capabilities in a single model, reducing latency and error propagation.
  • Joint speech-text representation: The model learns a shared embedding space for both speech and text, allowing for seamless transitions between modalities without information loss. This joint representation enables the model to maintain the emotional and prosodic elements of speech alongside semantic content.
  • Conversation-aware transformer: SpeechGPT extends the standard transformer architecture with additional mechanisms to track conversation state and maintain coherence across multiple turns. This includes specialized attention layers that can reference previous exchanges.
  • Prosody modeling: The speech generation component preserves natural intonation, rhythm, and emphasis patterns by incorporating prosodic features into the generation process. This results in more human-like speech output compared to traditional TTS systems.

Key Advantages Over Traditional Speech Systems

  • Contextual understanding: SpeechGPT maintains conversation state across multiple turns, allowing it to handle follow-up questions, resolve references, and build on previous exchanges without requiring users to restate context.
  • Seamless modality transitions: The unified architecture eliminates perceptible delays between speech understanding and response generation, creating more natural conversational flow.
  • Expressive speech generation: By leveraging its language understanding capabilities, SpeechGPT can apply appropriate prosody and intonation that matches the semantic and emotional content of responses.
  • Reduced latency: The end-to-end design eliminates the computational overhead of separate ASR, NLU, and TTS systems, enabling faster response times in interactive applications.

Practical Applications

SpeechGPT's unified speech-language capabilities make it particularly well-suited for:

  • Virtual assistants: Creating more natural and contextually aware voice interfaces for smart devices and applications.
  • Accessibility tools: Developing conversation systems for users with visual impairments or those who prefer speech interfaces.
  • Language learning: Building interactive tutors that can engage in spoken dialogue while maintaining context across a learning session.
  • Customer service: Powering voice bots that can handle complex, multi-turn conversations with natural speech patterns.

The integration of speech and language understanding in a single model represents a significant step toward more human-like AI communication systems that can engage in natural conversation across modalities.

Code Example: Extracting Speech Features with wav2vec2 (Hugging Face)

from transformers import Wav2Vec2Processor, Wav2Vec2Model, Wav2Vec2ForCTC
import torch
import soundfile as sf
import librosa
import matplotlib.pyplot as plt
import numpy as np
import IPython.display as ipd

# Function to load and preprocess audio
def load_audio(file_path, target_sr=16000):
    """
    Load audio file and resample if necessary
    """
    # Load audio using librosa (handles various formats better)
    try:
        audio, sample_rate = librosa.load(file_path, sr=None)
        # Resample if needed
        if sample_rate != target_sr:
            print(f"Resampling from {sample_rate}Hz to {target_sr}Hz")
            audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=target_sr)
            sample_rate = target_sr
        return audio, sample_rate
    except Exception as e:
        print(f"Error loading audio: {e}")
        return None, None

# Load pretrained speech model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# Also load ASR model for transcription demo
asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load audio file
speech, rate = load_audio("speech_sample_w.mp3")
if speech is not None:
    # Display audio waveform
    plt.figure(figsize=(10, 4))
    plt.plot(speech)
    plt.title("Audio Waveform")
    plt.xlabel("Time (samples)")
    plt.ylabel("Amplitude")
    plt.show()
    
    # Display audio for listening
    ipd.display(ipd.Audio(speech, rate=rate))
    
    # Process audio for feature extraction
    inputs = processor(speech, sampling_rate=rate, return_tensors="pt", padding=True)
    
    # Extract embeddings
    with torch.no_grad():
        # Get the hidden states (embeddings)
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state
        
        # Also get the transcription from ASR model
        logits = asr_model(**inputs).logits
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = processor.batch_decode(predicted_ids)[0]
    
    print("Audio transcription:", transcription)
    print("Shape of embeddings:", embeddings.shape)  # [batch, time, hidden_dim]
    
    # Visualize embeddings
    # Take mean across time dimension to get a single vector per feature
    mean_embeddings = embeddings.mean(dim=1).squeeze().numpy()
    
    plt.figure(figsize=(12, 6))
    plt.imshow(mean_embeddings.reshape(1, -1), aspect='auto', cmap='viridis')
    plt.colorbar()
    plt.title("Speech Embeddings Visualization")
    plt.xlabel("Feature Dimensions")
    plt.ylabel("Sample")
    plt.show()
    
    # Demonstrate feature extraction for downstream tasks
    # Example: Extract global speech representation (average pooling)
    global_speech_vector = embeddings.mean(dim=1)
    print("Global speech vector shape:", global_speech_vector.shape)  # [batch, hidden_dim]
    
    # Example: Extract frame-level features for a specific segment (middle 1 second)
    middle_frame = embeddings.shape[1] // 2
    segment_features = embeddings[0, middle_frame-25:middle_frame+25, :]  # ~1 second at 50Hz frame rate
    print("Segment features shape:", segment_features.shape)  # [frames, hidden_dim]
else:
    print("Failed to load audio file. Please check the path and file format.")

Download the audio sample here: https://files.cuantum.tech/audio/speech_sample_w.mp3

Note: Save the example audio in the same location as the Python script.

Comprehensive Code Breakdown: Speech Feature Extraction with Wav2Vec2

  • 1. Imports and Setup
    • We import the necessary libraries: Transformers for the Wav2Vec2 models, PyTorch for tensor operations, soundfile/librosa for audio processing, and visualization tools.
    • We include both the base Wav2Vec2Model (for embeddings) and Wav2Vec2ForCTC (for transcription) to demonstrate multiple use cases.
  • 2. Audio Loading and Preprocessing
    • The load_audio function handles various audio formats and automatically resamples to 16kHz if necessary (Wav2Vec2's expected sample rate).
    • Using librosa instead of soundfile provides better support for various audio formats and error handling.
  • 3. Model Initialization
    • We load the pretrained Wav2Vec2 processor and model from Hugging Face's model hub.
    • The processor handles tokenization of audio data into the format expected by the model.
    • We also load the ASR variant of the model to demonstrate speech recognition capabilities.
  • 4. Visualization
    • We plot the audio waveform to provide visual insight into the signal being processed.
    • We use IPython's audio display capabilities to allow for listening to the audio directly in notebooks.
  • 5. Feature Extraction
    • The processor converts the raw audio into the input format required by the model.
    • With torch.no_grad(), we ensure no gradients are computed during inference, saving memory.
    • We extract the last_hidden_state which contains the contextualized audio embeddings.
  • 6. Transcription
    • Using the ASR model variant, we convert the same audio input into text.
    • This demonstrates how the same audio features can be used for multiple downstream tasks.
  • 7. Embedding Visualization and Analysis
    • We visualize the embeddings using a heatmap to give insight into the feature patterns.
    • We demonstrate two common ways to use the embeddings:
      • Global representation: averaging across time to get a single vector representing the entire utterance (useful for speaker identification, emotion recognition, etc.)
      • Frame-level features: extracting time-aligned segments for fine-grained analysis (useful for alignment, pronunciation assessment, etc.)
  • 8. Error Handling
    • The code includes basic error handling to gracefully deal with issues like missing files or unsupported formats.

Technical Insights: Why This Approach Matters

  • Wav2Vec2 is a self-supervised model trained on massive amounts of unlabeled speech data, allowing it to learn robust speech representations without requiring transcriptions.
  • The extracted embeddings capture phonetic content, speaker characteristics, emotional tone, and acoustic environment information in a unified representation.
  • These embeddings serve as excellent features for downstream tasks like speech recognition, speaker identification, and emotion classification (a brief speaker-similarity sketch follows this list).
  • The contextual nature of the embeddings (each frame is influenced by surrounding audio) makes them more powerful than traditional acoustic features like MFCCs.
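
To give one concrete example of the speaker-identification use case, the sketch below compares the mean-pooled wav2vec2 vectors of two recordings with cosine similarity. In a real system you would compare against enrolled speaker embeddings and calibrate a decision threshold on held-out data; the file names and the 0.75 threshold here are purely illustrative.

import torch
import torch.nn.functional as F
from transformers import Wav2Vec2Processor, Wav2Vec2Model
import librosa

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

def utterance_vector(path):
    """Return a single mean-pooled wav2vec2 vector for an audio file."""
    audio, _ = librosa.load(path, sr=16000)  # load and resample to 16kHz
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # [1, time, hidden_dim]
    return hidden.mean(dim=1)                        # [1, hidden_dim]

# Hypothetical file names: two clips that may or may not share a speaker
vec_a = utterance_vector("clip_speaker_a.wav")
vec_b = utterance_vector("clip_speaker_b.wav")

similarity = F.cosine_similarity(vec_a, vec_b).item()
print(f"Cosine similarity: {similarity:.3f}")
print("Likely same speaker" if similarity > 0.75 else "Likely different speakers")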

5.2.3 GPT-5 Realtime: Low-Latency Voice Interaction

While Whisper demonstrated the ability to transcribe speech with high accuracy (speech → text) and models like SpeechLM and SpeechGPT extended this by integrating spoken inputs into large language models, GPT-5 Realtime represents the next leap forward: a model that can listen and respond in natural speech almost instantly. This breakthrough addresses the fundamental limitation of earlier systems - the noticeable delay between input and response that made interactions feel mechanical rather than natural.

This is not merely speech recognition paired with text generation and then a separate text-to-speech system bolted on top. Earlier approaches typically followed a pipeline architecture where each component operated independently, creating bottlenecks and inconsistencies. Instead, GPT-5 Realtime is natively multimodal, trained to process audio as a first-class input and to produce audio as a first-class output. This integrated approach means the model understands the prosody, emotion, and nuances in spoken language directly, without information loss from intermediate text representations.

The result is a conversational agent capable of fluid, human-like dialogue with latency comparable to a natural pause in conversation, making it suitable for real-world conversations, tutoring, and customer service. This low latency is achieved through specialized architectures that process audio streams incrementally rather than waiting for complete utterances, along with predictive mechanisms that anticipate likely responses. The end-to-end optimization eliminates the cumulative delays inherent in pipeline approaches, creating interactions that feel remarkably human in their timing and rhythm.

Architecture and Capabilities

GPT-5 Realtime integrates multiple components into one coherent system, creating a seamless conversational experience:

  • Speech-in: Users can send raw audio (16-bit PCM WAV, 24 kHz mono is a safe default). The model transcribes and interprets speech in real time, converting acoustic signals into semantic understanding. Unlike traditional speech recognition systems that merely transcribe words, GPT-5 Realtime captures nuances, emotions, and contextual cues from the audio input, preserving the richness of human communication. (A short sketch for preparing audio in this format appears below.)
  • Speech-out: The model responds with synthetic but natural-sounding speech, streamed back as low-latency audio frames. Different voices and speaking styles can be selected to match user preferences or specific use cases. The generated speech maintains appropriate prosody, emphasis, and intonation patterns that make the interaction feel genuinely human-like rather than robotic.
  • Full multimodality: In addition to audio, GPT-5 Realtime sessions can also accept text and image inputs mid-conversation, allowing for hybrid interactions (e.g., "Look at this chart and tell me about it" while speaking). This flexibility enables seamless transitions between modalities, supporting more natural workflows where users might want to show visual information while continuing to speak, similar to how humans communicate in meetings or educational settings.
  • Low latency: Because the model is optimized for conversational flow, response latency is comparable to a human pause in speech — generally under 300 ms. This is achieved through specialized streaming architectures and predictive processing that begins generating responses before the user has finished speaking. The near-instantaneous turnaround creates a conversational rhythm that feels natural and engaging, eliminating the awkward pauses common in earlier AI systems.
  • Telephony integration: GPT-5 Realtime sessions can be connected to SIP (Session Initiation Protocol), enabling the model to act as a phone-based agent. This integration allows the model to handle inbound and outbound calls over standard telephone networks, making advanced AI accessible through the most ubiquitous communication technology worldwide, without requiring specialized equipment or applications.

Together, these features push AI systems beyond one-way transcription or delayed response, toward live conversational intelligence.
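
If a source recording is not already 16-bit PCM, 24 kHz mono, it can be converted before being sent to the session. The sketch below uses librosa and soundfile for the conversion; the input file name is a placeholder.

import librosa
import soundfile as sf

def prepare_realtime_audio(input_path, output_path, target_sr=24000):
    """Convert an arbitrary audio file to 16-bit PCM, 24 kHz mono WAV."""
    # mono=True downmixes stereo; sr=target_sr resamples in one step
    audio, _ = librosa.load(input_path, sr=target_sr, mono=True)
    sf.write(output_path, audio, samplerate=target_sr, subtype="PCM_16")
    return output_path

# Placeholder input name; the output matches the file used in the example below
prepare_realtime_audio("raw_recording.m4a", "user_prompt_spoken.wav")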

Practical Example

For consistency with our multimodal focus, we’ll use a short audio file (user_prompt_spoken.wav) where the user asks:

“Can you explain the advantages of GPT-5 as a multimodal model?”

When sent to GPT-5 Realtime, the model will:

  1. Transcribe the spoken question.
  2. Reason about the content.
  3. Generate speech that explains the advantages of GPT-5’s multimodality.

The round-trip feels like a natural dialogue with a knowledgeable assistant.

Code Example: Realtime Voice with GPT-5

The following Python script shows how to connect to the Realtime API using WebSockets, send a short WAV file as input, and save the assistant’s spoken reply as a new WAV file.

"""
Realtime Voice with GPT-5 (WebSocket API)
- Sends a short WAV file (user_prompt_spoken.wav) to GPT-5 Realtime
- Receives streamed audio back and saves it to assistant_reply.wav
Requirements: pip install websockets soundfile numpy
"""

import os
import json
import base64
import asyncio
import websockets
import numpy as np
import soundfile as sf

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-5-realtime"

INPUT_WAV = "user_prompt_spoken.wav"   # spoken question
OUTPUT_WAV = "assistant_reply.wav"     # assistant’s voice reply

def read_wav_as_base64(path: str) -> str:
    """Read WAV file and return base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

async def main():
    if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
        raise SystemExit("Set OPENAI_API_KEY env variable.")

    # Load spoken user prompt
    user_audio_b64 = read_wav_as_base64(INPUT_WAV)

    # Connect to Realtime WebSocket
    async with websockets.connect(
        REALTIME_URL,
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
        max_size=20 * 1024 * 1024,
    ) as ws:
        print("Connected to GPT-5 Realtime.")

        # 1) Configure session (input/output formats, voice)
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "input_audio_format": "wav",
                "output_audio_format": "wav",
                "voice": "alloy",
                "instructions": (
                    "You are a helpful voice assistant. "
                    "Answer the user’s question clearly and concisely."
                )
            }
        }))

        # 2) Append user audio to input buffer
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": user_audio_b64
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # 3) Ask model to create a response
        await ws.send(json.dumps({"type": "response.create", "response": {}}))

        print("Waiting for GPT-5 Realtime reply...")

        # 4) Collect audio frames
        audio_bytes = bytearray()
        sample_rate = 24000  # expected sample rate

        async for msg in ws:
            evt = json.loads(msg)

            if evt.get("type") == "response.output_text.delta":
                print(evt.get("delta"), end="", flush=True)

            elif evt.get("type") == "response.output_audio.delta":
                audio_bytes.extend(base64.b64decode(evt["delta"]))

            elif evt.get("type") == "response.completed":
                print("\n[Response completed]")
                break

        # 5) Save assistant’s reply as a WAV file
        pcm16 = np.frombuffer(bytes(audio_bytes), dtype=np.int16)
        sf.write(OUTPUT_WAV, pcm16, samplerate=sample_rate, subtype="PCM_16")
        print(f"[Saved] {OUTPUT_WAV}")

if __name__ == "__main__":
    asyncio.run(main())

Download the audio sample here: https://files.cuantum.tech/audio/user_prompt_spoken.wav

Note: Save the example audio in the same location as the Python script.

Code breakdown:

  1. Session Setup
    • The client connects to the Realtime WebSocket and sends a session.update message specifying:
      • Input modality: audio (WAV).
      • Output modality: audio (WAV).
      • Selected voice (e.g., "alloy").
    • This defines the rules of the conversation.
  2. Input Buffering
    • Audio files (or live microphone frames) are base64-encoded and appended to an input buffer.
    • A commit message signals the end of input.
  3. Response Creation
    • A response.create message tells GPT-5 to process the buffer and generate a reply.
  4. Streaming Output
    • The server streams back two types of deltas:
      • response.output_text.delta (optional live transcript).
      • response.output_audio.delta (audio chunks).
    • Audio chunks are collected into a byte array until response.completed is received.
  5. Saving the File
    • The reply is written as a standard 24 kHz PCM16 WAV file, playable in any media player.
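
To make that final step concrete, here is a minimal sketch, assuming the concatenated, base64-decoded audio deltas are available as raw 24 kHz mono PCM16 bytes. It wraps them in a WAV container using only Python's standard-library wave module (the script above uses soundfile instead; the format parameters are the same, and the function name is only illustrative).

# Sketch: wrap raw PCM16 bytes (24 kHz, mono) in a standard WAV container.
# pcm_bytes is assumed to be the collected, base64-decoded audio deltas.
import wave

def save_pcm16_as_wav(pcm_bytes: bytes, path: str = "assistant_reply.wav",
                      sample_rate: int = 24000) -> None:
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)         # mono
        wf.setsampwidth(2)         # 16-bit samples = 2 bytes per frame
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)

# Example usage (audio_bytes collected from the stream as in the script above):
# save_pcm16_as_wav(bytes(audio_bytes))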

Applications and Implications

GPT-5 Realtime demonstrates how far multimodal LLMs have evolved:

  • Conversational Agents: Natural, low-latency assistants that can answer customer queries or provide educational tutoring over phone or web.
  • Accessibility: Voice-based interfaces for users who cannot easily type or read text.
  • Hybrid Interactions: Combine voice with images and text mid-conversation, enabling richer multi-turn exchanges.
  • Telephony Integration: Deploy AI agents that can handle SIP phone calls, routing, and form-filling.

Example Code: Live Microphone Capture with GPT-5 Realtime (speech-in → speech-out)

What it does:

  • Records ~3 seconds from your default mic
  • Streams it to GPT-5 Realtime over WebSocket
  • Saves the model’s spoken reply as assistant_reply.wav
  • Prints a live text transcript (if provided by the server)

Requirements

pip install websockets sounddevice soundfile numpy
  • OS mic permissions: allow terminal/IDE access to the microphone (macOS: System Settings → Privacy & Security → Microphone; Windows: Privacy → Microphone).
"""
Live Mic → GPT-5 Realtime → Spoken Reply
Records ~3 seconds of audio from your default microphone, streams it to GPT-5 Realtime,
and saves the assistant's spoken response to assistant_reply.wav.

Requirements:
  pip install websockets sounddevice soundfile numpy
Set your key:
  export OPENAI_API_KEY="sk-..."   (macOS/Linux)
  setx OPENAI_API_KEY "sk-..."     (Windows, new terminal)

If you prefer MP3 I/O, see the note in your book; this example uses WAV (PCM16 @ 24 kHz).
"""

import os
import json
import base64
import asyncio
import websockets
import numpy as np
import sounddevice as sd
import soundfile as sf

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-5-realtime"

# Recording settings (safe defaults for Realtime)
SAMPLE_RATE = 24000          # 24 kHz mono PCM16
CHANNELS = 1
DURATION_SECONDS = 3.0       # keep short for quick tests
OUTPUT_WAV = "assistant_reply.wav"

SYSTEM_INSTRUCTIONS = (
    "You are a helpful voice assistant. Transcribe the user if needed, "
    "then answer clearly in one or two sentences."
)

def record_from_mic(seconds: float = DURATION_SECONDS, sr: int = SAMPLE_RATE) -> bytes:
    """Record mono PCM16 audio from the default microphone and return raw bytes."""
    print(f"🎙️  Recording {seconds:.1f}s from microphone...")
    audio = sd.rec(int(sr * seconds), samplerate=sr, channels=CHANNELS, dtype="int16")
    sd.wait()
    print("✅ Done.")
    # audio is int16 numpy array; convert to raw bytes
    return audio.tobytes()

def b64encode_pcm16_wav(pcm_bytes: bytes, sr: int = SAMPLE_RATE) -> str:
    """
    Wrap raw PCM16 bytes into a WAV file in memory and return base64 string.
    Using soundfile to write to bytes buffer for simplicity.
    """
    import io
    buf = io.BytesIO()
    # convert bytes -> int16 array so soundfile can write it
    arr = np.frombuffer(pcm_bytes, dtype=np.int16)
    sf.write(buf, arr, sr, subtype="PCM_16", format="WAV")
    return base64.b64encode(buf.getvalue()).decode("utf-8")

async def main():
    if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
        raise SystemExit("Set OPENAI_API_KEY environment variable first.")

    # 1) Capture short mic audio & base64-encode as WAV
    pcm = record_from_mic()
    user_audio_b64 = b64encode_pcm16_wav(pcm)

    # 2) Connect to Realtime WS
    async with websockets.connect(
        REALTIME_URL,
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
        max_size=20 * 1024 * 1024,  # allow large frames
    ) as ws:
        print("🔌 Connected to GPT-5 Realtime.")

        # 3) Configure session: audio in/out (WAV), pick a voice
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "input_audio_format": "wav",
                "output_audio_format": "wav",
                "voice": "alloy",
                "instructions": SYSTEM_INSTRUCTIONS
            }
        }))

        # 4) Send mic audio (can be multiple appends for streaming mic)
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": user_audio_b64
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Optional: add a brief text nudge
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "instructions": "Please respond concisely in speech."
            }
        }))

        print("🕑 Waiting for streamed reply...\n")

        # 5) Receive streamed audio/text deltas
        audio_bytes = bytearray()
        sample_rate = SAMPLE_RATE  # server commonly uses 24k; update if session reports different

        async for msg in ws:
            evt = json.loads(msg)

            # Live transcript (optional)
            if evt.get("type") == "response.output_text.delta":
                print(evt.get("delta"), end="", flush=True)

            # Audio chunks (base64-encoded PCM16 WAV frames)
            elif evt.get("type") == "response.output_audio.delta":
                audio_bytes.extend(base64.b64decode(evt["delta"]))

            elif evt.get("type") == "response.completed":
                print("\n\n✅ Response completed.")
                break

        # 6) Save assistant reply to WAV
        if audio_bytes:
            # raw bytes may already be WAV, but normalizing here is robust:
            # interpret as PCM16 stream and write as standard WAV
            pcm16 = np.frombuffer(bytes(audio_bytes), dtype=np.int16)
            sf.write(OUTPUT_WAV, pcm16, samplerate=sample_rate, subtype="PCM_16")
            print(f"💾 Saved assistant reply → {OUTPUT_WAV}")
        else:
            print("⚠️ No audio received from server.")

if __name__ == "__main__":
    asyncio.run(main())

Here's a code breakdown:

Required Libraries

The script uses several Python libraries:

  • websockets: For WebSocket communication with the GPT-5 Realtime API
  • sounddevice: To record audio from the microphone
  • soundfile: For handling WAV file operations
  • numpy: For audio data manipulation
  • Standard libraries: os, json, base64, asyncio, io

Key Components

1. Configuration Settings

The script defines several important constants:

  • OPENAI_API_KEY: Authentication key for OpenAI's API
  • REALTIME_URL: WebSocket endpoint for GPT-5 Realtime
  • Recording parameters: Sample rate (24kHz), channels (mono), recording duration (3 seconds)
  • SYSTEM_INSTRUCTIONS: Prompts GPT-5 to act as a voice assistant

2. Audio Recording Function

The record_from_mic() function:

  • Uses sounddevice to capture audio at specified sample rate and duration
  • Records in mono at 16-bit PCM format
  • Returns raw audio bytes

3. WAV Encoding Function

The b64encode_pcm16_wav() function:

  • Takes raw PCM16 audio bytes
  • Wraps them in a WAV container using soundfile
  • Returns the base64-encoded string of the WAV file

4. Main Async Function

The main() async function orchestrates the entire process:

API Key Validation

  • Checks if the OpenAI API key is properly set

Audio Recording and Encoding

  • Records audio from the microphone
  • Encodes it as a base64 WAV string

WebSocket Connection

  • Establishes a secure WebSocket connection to GPT-5 Realtime
  • Sets proper headers including API key and beta flag

Session Configuration

  • Sends a session.update message to configure:
    • Input/output modalities (text and audio)
    • Audio format (WAV for both input and output)
    • Voice selection ("alloy")
    • System instructions for the assistant

Input Handling

  • Appends the recorded audio to the input buffer
  • Commits the buffer to signal completion of input
  • Optionally adds text instructions to shape the response

Response Processing

  • Collects streamed response data in real-time:
    • Text deltas (transcription of response)
    • Audio deltas (spoken audio chunks)
  • Monitors for completion signal

Output Saving

  • Converts collected audio bytes back to PCM16 format
  • Writes to a WAV file (assistant_reply.wav)

Flow of Execution

The script follows this sequence:

  1. Validate environment setup and API key
  2. Record short audio clip from microphone
  3. Connect to GPT-5 Realtime WebSocket API
  4. Configure session parameters (audio formats, voice)
  5. Send recorded audio and commit the input
  6. Request model to process audio and generate a response
  7. Receive and display text transcript while collecting audio chunks
  8. Save the complete audio response as a WAV file

Error Handling

The code includes basic error handling:

  • Checks for missing API key
  • Verifies if audio was received from the server

Technical Notes

  • Uses 24kHz mono PCM16 format, which is optimal for speech processing
  • Supports WebSocket protocol for real-time streaming
  • Uses asyncio for asynchronous operations
  • Implements proper WebSocket connection lifecycle management

Mid-Session Multimodality: Combining Audio and Images

One of GPT-5 Realtime’s most powerful abilities is to handle multiple modalities within a single ongoing conversation. Unlike earlier systems that processed text, images, or audio in isolation, Realtime can fluidly combine them as they arrive. This enables natural scenarios where a user begins by speaking a question and then adds an image for further clarification or analysis — all in the same session without restarting the dialogue.

For example, imagine a student asking aloud “Can you explain the advantages of GPT-5 as a multimodal model?” and then immediately showing a chart of data. GPT-5 Realtime can integrate both inputs, producing a spoken response that addresses the original audio question and references insights from the chart. This kind of dynamic, mid-session multimodality illustrates how the model moves beyond static question–answer patterns and toward fluid, real-time collaboration with human users.

Example: Mid-Session Multimodality (Audio question → Append Image → Spoken reply)

What it does

  1. Sends a short spoken question (WAV) to GPT-5 Realtime.
  2. Appends a chart image in the same session.
  3. Requests a spoken answer that references both the audio question and the image.
  4. Saves the reply as assistant_multimodal_reply.wav and prints any streamed text.

Requirements

pip install websockets soundfile numpy pillow
  • Put your audio prompt file (e.g., user_prompt_spoken.wav) and an image (e.g., chart.png) in the same folder.
  • Or adjust the paths below.
"""
Multimodal Mid-Session with GPT-5 Realtime
- Step 1: Send a spoken question (WAV) to GPT-5 Realtime.
- Step 2: Append an image (PNG) in the same session.
- Step 3: Ask for a spoken reply that references BOTH inputs.
- Saves the model’s voice reply to assistant_multimodal_reply.wav.

Requirements:
  pip install websockets soundfile numpy pillow
Set your key:
  export OPENAI_API_KEY="sk-..."   (macOS/Linux)
  setx OPENAI_API_KEY "sk-..."     (Windows, new terminal)
"""

import os
import io
import json
import base64
import asyncio
import websockets
import numpy as np
import soundfile as sf
from PIL import Image

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-5-realtime"

# Input files (adjust as needed)
INPUT_WAV = "user_prompt_spoken.wav"  # spoken question, e.g., “Can you explain the advantages of GPT-5 as a multimodal model?”
INPUT_IMG = "chart.png"               # a chart image to reference mid-session
OUTPUT_WAV = "assistant_multimodal_reply.wav"

# Session behavior
VOICE_NAME = "alloy"
SYSTEM_INSTRUCTIONS = (
    "You are a helpful voice assistant. Consider ALL inputs in this session. "
    "First, interpret the user's spoken question. Then, when an image is provided, "
    "analyze it and integrate both sources in your final spoken answer. "
    "Be concise and precise."
)

def read_wav_as_base64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def read_png_as_base64(path: str) -> str:
    # Ensure we produce a clean PNG bytes payload (also validates file)
    with Image.open(path) as im:
        im = im.convert("RGBA") if im.mode not in ("RGB", "RGBA") else im
        buf = io.BytesIO()
        im.save(buf, format="PNG")
        return base64.b64encode(buf.getvalue()).decode("utf-8")

async def main():
    if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
        raise SystemExit("Set OPENAI_API_KEY environment variable first.")

    # Load inputs
    user_audio_b64 = read_wav_as_base64(INPUT_WAV)
    image_png_b64 = read_png_as_base64(INPUT_IMG)

    # Connect to Realtime WebSocket
    async with websockets.connect(
        REALTIME_URL,
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
        max_size=20 * 1024 * 1024,
    ) as ws:
        print("🔌 Connected to GPT-5 Realtime.")

        # 1) Configure session: we'll use audio in/out, and also allow image as input
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio", "image"],
                "input_audio_format": "wav",
                "output_audio_format": "wav",
                "voice": VOICE_NAME,
                "instructions": SYSTEM_INSTRUCTIONS
            }
        }))

        # 2) Append the user's spoken question (audio buffer)
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": user_audio_b64
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # 3) Append the image mid-session
        #    We send the PNG as base64 along with its MIME. (You can also send a URL if supported.)
        await ws.send(json.dumps({
            "type": "input_image.append",
            "image": image_png_b64,
            "mime_type": "image/png",
            # Optionally, add a hint for the model about why you're sending the image:
            "metadata": {
                "purpose": "chart_analysis",
                "caption": "A line chart showing a synthetic trend over time."
            }
        }))
        await ws.send(json.dumps({"type": "input_image.commit"}))

        # 4) Ask for a response that references BOTH the spoken question and the image
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "instructions": (
                    "Please answer in speech. "
                    "Explain the advantages of GPT-5 as a multimodal model, "
                    "and also summarize the main trend you observe in the provided chart. "
                    "Be concise (15–25 seconds)."
                )
            }
        }))

        print("🕑 Waiting for streamed reply...\n")

        # 5) Receive streamed text & audio
        audio_bytes = bytearray()
        sample_rate = 24000  # common server rate; adjust if your session reports differently

        async for msg in ws:
            evt = json.loads(msg)

            # Optional: live transcript/notes
            if evt.get("type") == "response.output_text.delta":
                print(evt.get("delta"), end="", flush=True)

            # Audio deltas (base64-encoded PCM16)
            elif evt.get("type") == "response.output_audio.delta":
                audio_bytes.extend(base64.b64decode(evt["delta"]))

            elif evt.get("type") == "response.completed":
                print("\n\n✅ Response completed.")
                break

        # 6) Save the final spoken reply
        if audio_bytes:
            pcm16 = np.frombuffer(bytes(audio_bytes), dtype=np.int16)
            sf.write(OUTPUT_WAV, pcm16, samplerate=sample_rate, subtype="PCM_16")
            print(f"💾 Saved assistant reply → {OUTPUT_WAV}")
        else:
            print("⚠️ No audio received from server.")

if __name__ == "__main__":
    asyncio.run(main())

Download the audio sample here: https://files.cuantum.tech/audio/user_prompt_spoken.wav

Download the chart image sample here: https://files.cuantum.tech/images/chart.png

Note: Save the example audio and chart image in the same location as the Python script.

Here's a code breakdown:

Key Components

  1. Imports and Setup: The script uses several Python libraries:
    • Standard libraries: os, io, json, base64, asyncio
    • websockets: For WebSocket communication with the GPT-5 Realtime API
    • numpy: For audio data manipulation
    • soundfile: For handling WAV file operations
    • PIL (Pillow): For image processing
  2. Configuration: The script defines important constants:
    • OPENAI_API_KEY: Retrieved from environment variables
    • REALTIME_URL: WebSocket endpoint for the GPT-5 Realtime API
    • Input/output file paths: Locations of input audio (WAV), input image (PNG), and output audio
    • VOICE_NAME: Selects "alloy" as the voice for the assistant's reply
    • SYSTEM_INSTRUCTIONS: Defines the assistant's behavior
  3. Helper Functions: Two utility functions for file handling:
    • read_wav_as_base64(): Reads a WAV file and converts it to base64 encoding
    • read_png_as_base64(): Reads a PNG image, ensures it's in the correct format, and converts it to base64
  4. Main Asynchronous Function: The core of the script with these main steps:
    • Input Validation: Checks if the API key is properly set
    • File Loading: Loads and encodes the audio and image files
    • WebSocket Connection: Establishes a connection to GPT-5 Realtime with proper headers
    • Session Configuration: Sets up a session with text, audio, and image modalities
    • Audio Input: Sends the spoken question (WAV) and commits the audio buffer
    • Image Input: Appends the chart image mid-session with metadata about its purpose
    • Response Request: Requests a spoken reply that addresses both inputs
    • Response Processing: Collects streamed text and audio chunks from the server
    • Output Saving: Converts received audio bytes to PCM16 format and saves as WAV

WebSocket Communication Flow

The script follows a specific protocol for communication with the GPT-5 Realtime API:

  1. Sends a session.update message to configure modalities and behavior
  2. Sends the audio data using input_audio_buffer.append and commits it
  3. Adds the image using input_image.append with metadata and commits it
  4. Creates a response request with specific instructions
  5. Processes incoming events in real-time:
    • Text deltas (transcription)
    • Audio deltas (spoken reply chunks)
    • Completion signal

Error Handling

The script includes basic error checking:

  • Validates the API key
  • Checks if audio was received from the server

Key Technical Aspects

The implementation showcases several important concepts:

  • Asynchronous programming with asyncio for non-blocking I/O
  • Base64 encoding for binary data transmission over WebSockets
  • Real-time streaming of both text and audio responses
  • Mid-session multimodality by combining different input types in one conversation
  • Proper WebSocket lifecycle management

This code example demonstrates the power of GPT-5 Realtime's ability to handle multiple modalities within a single ongoing conversation, allowing for more natural and fluid interactions.

5.2.4 Why Audio Integration Matters

Accessibility: Automatic transcription for the hearing impaired. This technology enables real-time conversion of spoken content into text, making digital media, meetings, and educational resources accessible to deaf and hard-of-hearing individuals. Modern transcription systems can work in real-time with high accuracy, providing captions for live events, lectures, and conversations, removing barriers to participation in many aspects of daily life and professional settings.

By integrating audio processing with language models, these systems can accurately capture nuances, different accents, and even distinguish between multiple speakers. This integration enables more contextual understanding, allowing the transcription to include important non-verbal audio cues, proper punctuation, and speaker identification. Advanced systems can also adapt to specialized terminology, regional dialects, and challenging acoustic environments, making information more accessible across diverse settings from medical appointments to entertainment media.

Education: Real-time translation and captions in classrooms. This application transforms how international students engage with lectures by providing immediate translations of spoken content. It also helps all students by generating accurate captions for recorded lectures, making review more efficient and allowing learners to search through spoken content based on keywords or concepts.

Advanced multimodal systems can detect lecture context and technical terminology, accurately translating specialized vocabulary while maintaining academic integrity. These systems can distinguish between different speakers in classroom discussions, properly attributing questions and responses in the transcription.

Furthermore, these technologies enable asynchronous learning by creating searchable archives of lectures that students can navigate by concept rather than timestamp. For students with learning differences such as ADHD or dyslexia, the synchronized visual and auditory information improves comprehension and retention.

The integration of AI with educational content also allows for personalized learning paths, where the system can identify concepts that individual students struggle with based on their engagement patterns and provide targeted supplementary material. This multimodal approach bridges accessibility gaps while enhancing the learning experience for all students.

Assistants: Voice-driven chatbots, smart speakers, and AI tutors. These systems create natural conversation flows by understanding spoken queries and generating contextually appropriate spoken responses. Advanced multimodal assistants can maintain conversational context over extended interactions, understand varying speech patterns, and respond with appropriate intonation and emphasis that matches the content being delivered.

Cross-lingual communication: Breaking down barriers with speech-to-speech translation. This technology enables conversations between people who speak different languages by capturing speech in one language, understanding its meaning, and generating natural-sounding speech in another language. Modern systems preserve speaker characteristics like tone, pace, and emotion, making the exchange feel more personal and authentic.

These systems represent a significant advancement over traditional translation tools by offering real-time communication without requiring text interfaces. The process involves three sophisticated steps: speech recognition to convert spoken words into text, machine translation to convert that text into another language, and text-to-speech synthesis to deliver the translation in a natural voice.
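
As an illustration of that three-step cascade, the sketch below uses Whisper's built-in translate task for the first two steps (speech recognition plus translation into English) and the pyttsx3 library as a simple offline stand-in for text-to-speech. The file name foreign_speech.mp3 is a placeholder, and a production system would use a neural TTS voice and preserve speaker characteristics rather than a generic synthesized one.

# Sketch of a cascaded speech-to-speech translation pipeline:
# recognition + translation (Whisper, task="translate"), then TTS (pyttsx3).
# pip install -U openai-whisper pyttsx3
import whisper
import pyttsx3

model = whisper.load_model("base")

# Steps 1 and 2: transcribe the source speech and translate it into English text.
# "foreign_speech.mp3" is a placeholder for any non-English recording.
result = model.transcribe("foreign_speech.mp3", task="translate")
english_text = result["text"]
print("Translated text:", english_text)

# Step 3: synthesize the translated text as spoken audio.
tts = pyttsx3.init()
tts.save_to_file(english_text, "translated_speech.wav")
tts.runAndWait()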

The latest neural translation models understand cultural nuances and idioms that literal translations often miss. For example, when a Japanese speaker uses honorifics that don't exist in English, the system can adapt the output to convey appropriate respect through tone and word choice rather than direct translation.

Additionally, these technologies can adapt to various contexts - from business negotiations where precision is critical to casual conversations where fluidity matters more. Some advanced systems even maintain consistent voice profiles across languages, allowing a Spanish speaker's unique vocal characteristics to be present in the English translation, creating a more seamless and personalized communication experience.

Unlike older systems where speech recognition and language models were separate components chained together with potential information loss at each step, modern multimodal approaches fuse them into unified architectures that process acoustic and linguistic information simultaneously. This integration creates AI that listens and responds more naturally, understanding context across modalities and handling the ambiguities inherent in human communication.


5.2.1 Whisper: Universal Speech Recognition


Key features:

  • Trained on 680,000 hours of multilingual audio from the web, including a wide variety of accents, dialects, and background conditions. This massive and diverse training dataset enables Whisper to handle real-world audio that previous systems struggled with. The dataset's scale provides broad coverage across linguistic variations, regional accents, speaking styles, and acoustic environments, giving Whisper an unprecedented ability to understand speech in virtually any context. This extensive training directly translates to Whisper's ability to transcribe speech from speakers with accents or dialects traditionally underrepresented in AI training data.
  • Handles noisy, real-world audio (e.g., phone calls, lectures, podcasts, street recordings) with remarkable resilience. Unlike earlier models that performed well only in studio-quality conditions, Whisper maintains accuracy even with background noise, overlapping speakers, or varying microphone quality. This robustness stems from its exposure to diverse acoustic environments during training, allowing it to filter out irrelevant sounds and focus on the speech signal. Whether processing a recording from a busy café, a conference room with echoing acoustics, or an outdoor interview with wind interference, Whisper can extract the spoken content with surprising accuracy.
  • Supports transcription, translation, and language identification across 99 languages. This multilingual capability allows it to automatically detect the spoken language and process content from global sources without requiring manual language selection. Whisper can seamlessly transcribe content in languages ranging from widely-spoken ones like English, Spanish, and Mandarin to less common languages like Swahili, Lithuanian, and Nepali. This language versatility makes it an invaluable tool for global communication, international research, and cross-cultural content creation. Even more impressively, Whisper can identify when speakers switch between languages mid-conversation, a phenomenon known as code-switching.
  • Features zero-shot learning capabilities, meaning it can perform tasks it wasn't explicitly fine-tuned for, adapting to new scenarios without additional training. This remarkable ability allows Whisper to generalize its knowledge to unfamiliar contexts, speakers, and acoustic environments. For example, without specific fine-tuning, it can transcribe technical jargon in fields like medicine or engineering, understand regional dialects it hasn't explicitly seen before, or adapt to novel audio recording conditions. This zero-shot capability is particularly valuable in practical applications where the diversity of real-world speech would otherwise require countless specialized models for different scenarios.
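
A brief sketch of how these capabilities map onto Whisper's Python API: the same model handles transcription, translation into English, and language identification. The file name meeting_clip.mp3 is a placeholder, and the language is auto-detected when none is specified.

# Sketch: transcription, translation, and language identification with one model.
# pip install -U openai-whisper
import whisper

model = whisper.load_model("base")

# Transcription in the original language (language auto-detected)
result = model.transcribe("meeting_clip.mp3")
print("Detected language:", result["language"])
print("Transcript:", result["text"])

# Translation of the same audio directly into English
translated = model.transcribe("meeting_clip.mp3", task="translate")
print("English translation:", translated["text"])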

At its core, Whisper combines a log-Mel spectrogram encoder with a decoder similar to GPT, allowing it to map raw audio to natural language text. The encoder transforms audio waveforms into spectrograms—visual representations of sound frequencies over time—which capture the acoustic patterns in speech. This process begins by converting the raw audio signal into a spectrogram using the Short-Time Fourier Transform (STFT), which breaks down the audio into its frequency components.

These components are then mapped to the Mel scale, which approximates how humans perceive sound frequencies, with greater sensitivity to lower frequencies than higher ones. The resulting log-Mel spectrogram provides a compact representation of the audio that emphasizes the most perceptually relevant features.

These spectrograms are then processed through a transformer encoder that extracts meaningful features. The transformer architecture, with its self-attention mechanisms, allows the model to focus on different parts of the spectrogram simultaneously, capturing both local phonetic details and broader acoustic patterns. This is crucial for handling variations in speech like different accents, speaking rates, and background noise.

The GPT-style decoder then converts these features into text, treating transcription as a sequence prediction task similar to language modeling. This decoder works autoregressively, generating each word or token based on both the encoded audio features and the previously generated text. This approach enables Whisper to maintain contextual coherence throughout the transcription, correctly interpreting ambiguous sounds based on their surrounding context, and producing natural-sounding text that accurately reflects the original speech.
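
To make this encoder/decoder pipeline concrete, the short sketch below uses Whisper's lower-level helpers, following the usage shown in the openai-whisper README: it loads and pads the audio, computes the log-Mel spectrogram the encoder consumes, identifies the language, and then runs the GPT-style decoder explicitly. It reuses the speech_sample.mp3 file from the example that follows.

# Sketch of Whisper's internal pipeline: waveform -> log-Mel spectrogram ->
# encoder/decoder -> text, using the lower-level helpers from openai-whisper.
import whisper

model = whisper.load_model("base")

# Load audio and pad/trim it to the 30-second window Whisper expects
audio = whisper.load_audio("speech_sample.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and move it to the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Language identification from the encoded audio
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Run the decoder autoregressively over the encoded features
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print("Transcription:", result.text)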

Example: Transcribing Audio with Whisper

# Comprehensive implementation of Whisper for audio transcription

# Install required libraries
# pip install git+https://github.com/openai/whisper.git
# pip install librosa matplotlib numpy

import whisper
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import time
import torch
from pathlib import Path

def visualize_audio(audio_path):
    """Visualize the audio waveform and spectrogram"""
    y, sr = librosa.load(audio_path)
    
    # Create a figure with two subplots
    plt.figure(figsize=(12, 8))
    
    # Plot waveform
    plt.subplot(2, 1, 1)
    librosa.display.waveshow(y, sr=sr)
    plt.title('Waveform')
    
    # Plot spectrogram
    plt.subplot(2, 1, 2)
    D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='log')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Log-frequency power spectrogram')
    
    plt.tight_layout()
    plt.show()

def transcribe_audio(audio_path, model_size="base", language=None, verbose=True):
    """
    Transcribe audio using OpenAI's Whisper model
    
    Parameters:
    - audio_path: Path to the audio file
    - model_size: Size of the Whisper model to use (tiny, base, small, medium, large)
    - language: Language code (e.g., "en" for English) or None for auto-detection
    - verbose: Whether to print progress information
    
    Returns:
    - Dictionary containing transcription results
    """
    start_time = time.time()
    
    if verbose:
        print(f"Loading Whisper model: {model_size}")
    
    # Load pre-trained Whisper model
    model = whisper.load_model(model_size)
    
    model_load_time = time.time()
    if verbose:
        print(f"Model loaded in {model_load_time - start_time:.2f} seconds")
        print(f"Transcribing: {audio_path}")
    
    # Set transcription options
    options = {}
    if language:
        options["language"] = language
    
    # Transcribe the audio file
    result = model.transcribe(audio_path, **options)
    
    end_time = time.time()
    if verbose:
        print(f"Transcription completed in {end_time - model_load_time:.2f} seconds")
        print(f"Detected language: {result['language']} (confidence: {result.get('language_probability', 0):.2f})")
        print(f"Total processing time: {end_time - start_time:.2f} seconds")
    
    return result

def save_transcription(result, output_file=None):
    """Save transcription results to a text file"""
    if output_file is None:
        output_file = "transcription_output.txt"
    
    with open(output_file, "w", encoding="utf-8") as f:
        # Write the full transcription
        f.write("FULL TRANSCRIPTION:\n")
        f.write(result["text"])
        f.write("\n\n")
        
        # Write segment-by-segment with timestamps
        f.write("SEGMENTS WITH TIMESTAMPS:\n")
        for segment in result["segments"]:
            start = segment["start"]
            end = segment["end"]
            text = segment["text"]
            f.write(f"[{start:.2f}s - {end:.2f}s] {text}\n")
    
    return output_file

def batch_transcribe(directory, extension=".mp3", output_dir=None):
    """Transcribe all audio files with the given extension in a directory"""
    if output_dir is None:
        output_dir = Path("transcription_results")
    
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)
    
    directory = Path(directory)
    audio_files = list(directory.glob(f"*{extension}"))
    
    print(f"Found {len(audio_files)} {extension} files in {directory}")
    
    for audio_file in audio_files:
        print(f"\nProcessing: {audio_file.name}")
        result = transcribe_audio(str(audio_file))
        
        output_file = output_dir / f"{audio_file.stem}_transcription.txt"
        save_transcription(result, output_file)
        print(f"Saved transcription to: {output_file}")

# Example usage
if __name__ == "__main__":
    # Set the path to your audio file
    audio_path = "speech_sample.mp3"
    
    # Check if CUDA is available for GPU acceleration
    cuda_available = torch.cuda.is_available()
    print(f"CUDA available: {cuda_available}")
    
    # Visualize the audio (optional)
    # visualize_audio(audio_path)
    
    # Transcribe the audio
    result = transcribe_audio(audio_path, model_size="base")
    
    # Print the transcription
    print("\nTRANSCRIPTION:")
    print(result["text"])
    
    # Save the transcription to a file
    output_file = save_transcription(result)
    print(f"\nSaved transcription to: {output_file}")
    
    # Example of batch processing
    # batch_transcribe("audio_folder", extension=".wav")

Download the speech sample here: https://files.cuantum.tech/audio/speech_sample.mp3

Note: Save the example audio in the same location as the Python script.

Breaking Down the Whisper Implementation:

1. Setup and Dependencies

The code begins by installing the necessary libraries: Whisper (directly from GitHub), librosa (for audio processing and visualization), matplotlib (for visualization), and numpy (for numerical operations). These libraries provide the foundation for audio processing and transcription.

2. Audio Visualization Function

The visualize_audio() function uses librosa to create two important visualizations:

  • A waveform display showing amplitude over time, which represents how the audio signal varies
  • A log-frequency spectrogram showing how energy is distributed across different frequencies over time, which helps analyze speech characteristics

These visualizations can help users understand the audio characteristics before transcription.

3. Core Transcription Function

The transcribe_audio() function is the heart of the implementation:

  • It accepts parameters for audio path, model size, language, and verbosity level
  • It loads the specified Whisper model (from tiny to large, with larger models being more accurate but slower)
  • It tracks processing time to provide performance metrics
  • It supports automatic language detection or allows specifying a language code
  • It returns a comprehensive result object containing the transcription and metadata

4. Results Processing

The save_transcription() function processes the Whisper results into user-friendly formats:

  • It saves the complete transcription text
  • It also extracts and formats individual segments with their timestamps, which is crucial for aligning transcription with audio timing
  • This enables applications like subtitle generation or time-synchronized content analysis

5. Batch Processing Capability

The batch_transcribe() function extends the utility to handle multiple audio files:

  • It processes all audio files with a specified extension in a directory
  • It organizes outputs into a dedicated directory structure
  • This is valuable for transcribing podcasts, interview series, or lecture collections

6. Example Usage

The main execution block demonstrates how to use these functions in practice:

  • It checks for GPU acceleration via CUDA, which can significantly improve performance for larger models
  • It offers options for audio visualization (commented out by default)
  • It performs transcription and displays the results
  • It saves the output to a file for future reference
  • It includes a commented example of batch processing

Advanced Features:

This implementation goes beyond basic transcription by including:

  • Performance timing to measure processing efficiency
  • Language detection reporting
  • Segment-level transcription with timestamps
  • Hardware acceleration detection
  • Audio analysis capabilities
  • Batch processing for multiple files

This example implementation provides a complete workflow for audio transcription, from preprocessing through visualization, transcription, and results management, making it suitable for both individual use cases and larger-scale applications.

Example: Advanced implementation of Whisper for real-time transcription with visualization

import whisper
import numpy as np
import pyaudio
import threading
import time
import queue
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from collections import deque
import torch
import os
from datetime import datetime

class WhisperRealtimeTranscriber:
    def __init__(self, model_size="base", language="en", energy_threshold=1000, 
                 record_timeout=2, phrase_timeout=3, max_sentences=10):
        """
        Initialize the real-time transcriber with Whisper
        
        Parameters:
        - model_size: Size of Whisper model ("tiny", "base", "small", "medium", "large")
        - language: Language code or None for auto-detection
        - energy_threshold: Minimum RMS energy of the float32 samples to treat as speech
        - record_timeout: Time in seconds to recheck if audio is speech
        - phrase_timeout: Time in seconds of silence to consider a phrase complete
        - max_sentences: Maximum number of sentences to display in history
        """
        self.model_name = model_size
        self.language = language
        self.energy_threshold = energy_threshold
        self.record_timeout = record_timeout
        self.phrase_timeout = phrase_timeout
        self.max_sentences = max_sentences
        
        # Check for GPU
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {self.device}")
        
        # Load Whisper model
        print(f"Loading Whisper {model_size} model...")
        self.model = whisper.load_model(model_size).to(self.device)
        print("Model loaded!")
        
        # Initialize audio processing
        self.audio_queue = queue.Queue()
        self.result_queue = queue.Queue()
        self.audio_data = np.zeros(0, dtype=np.float32)
        
        # For visualization
        self.audio_buffer = deque(maxlen=4 * 16000)  # ~4 seconds of samples at 16 kHz
        self.waveform_data = np.zeros(4 * 16000)
        self.spectrogram_data = np.zeros((201, 80))  # Mel spectrogram shape
        self.transcript_history = []
        self.recording = False
        self.terminated = False
        
        # Audio parameters
        self.sample_rate = 16000
        self.audio_format = pyaudio.paFloat32
        self.channels = 1
        self.chunk = 1024
        
        # Setup PyAudio
        self.p = pyaudio.PyAudio()
        
    def _get_audio_input_stream(self):
        """Create and return an input audio stream"""
        stream = self.p.open(
            format=self.audio_format,
            channels=self.channels,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk
        )
        return stream
    
    def _audio_capture_thread(self):
        """Thread function for capturing audio"""
        stream = self._get_audio_input_stream()
        last_sample = bytes()
        phrase_time = None
        
        print("Listening for audio...")
        
        try:
            while not self.terminated:
                # Get new audio chunk
                current_sample = stream.read(self.chunk, exception_on_overflow=False)
                
                # Convert to numpy array
                data = np.frombuffer(current_sample, dtype=np.float32)
                
                # Update audio buffer for visualization
                self.audio_buffer.extend(data)
                self.waveform_data = np.array(list(self.audio_buffer))
                
                # Calculate audio energy
                energy = np.sqrt(np.mean(data**2))
                
                # Detect if audio is speech
                if energy > self.energy_threshold:
                    self.recording = True
                    
                    # Reset phrase timeout
                    phrase_time = None
                    
                    # Add audio to processing queue
                    self.audio_data = np.append(self.audio_data, data)
                
                # Handle phrase timeout
                elif self.recording:
                    if phrase_time is None:
                        phrase_time = time.time()
                    
                    # If enough silence, process the audio phrase
                    if time.time() - phrase_time > self.phrase_timeout:
                        if len(self.audio_data) > 0:
                            self.audio_queue.put(self.audio_data.copy())
                            self.audio_data = np.zeros(0, dtype=np.float32)
                        
                        self.recording = False
                        phrase_time = None
                
                # Process fixed chunks of audio regardless of speech detection
                if len(self.audio_data) > self.sample_rate * self.record_timeout:
                    self.audio_queue.put(self.audio_data.copy())
                    self.audio_data = self.audio_data[int(self.sample_rate * self.record_timeout):]
                
                time.sleep(0.01)
                
        finally:
            stream.stop_stream()
            stream.close()
    
    def _transcription_thread(self):
        """Thread function for processing audio with Whisper"""
        while not self.terminated:
            try:
                # Get audio data from queue
                if self.audio_queue.empty():
                    time.sleep(0.1)
                    continue
                
                audio_data = self.audio_queue.get()
                
                # Skip processing very short audio clips
                if len(audio_data) < 0.5 * self.sample_rate:
                    continue
                
                # Process audio with Whisper
                start_time = time.time()
                
                # Generate a log-Mel spectrogram for visualization
                # (shape: n_mels x n_frames); transpose so frames run along one axis
                mel = whisper.log_mel_spectrogram(audio_data)
                self.spectrogram_data = mel.T.numpy()
                
                # Transcribe with Whisper
                options = {"language": self.language} if self.language else {}
                result = self.model.transcribe(audio_data, **options)
                
                # Get transcription result
                text = result["text"].strip()
                elapsed = time.time() - start_time
                
                # Skip empty results
                if len(text) == 0:
                    continue
                
                # Add timestamp and transcription to history
                timestamp = datetime.now().strftime("%H:%M:%S")
                entry = f"[{timestamp}] {text}"
                self.transcript_history.append(entry)
                
                # Keep only most recent entries
                if len(self.transcript_history) > self.max_sentences:
                    self.transcript_history = self.transcript_history[-self.max_sentences:]
                
                # Print result
                print(f"Transcribed ({elapsed:.2f}s): {text}")
                
            except Exception as e:
                print(f"Error in transcription thread: {e}")
                
    def _update_visualization(self, frame):
        """Update function for matplotlib animation"""
        # Clear previous plots
        plt.clf()
        
        # Plot audio waveform
        plt.subplot(3, 1, 1)
        plt.plot(self.waveform_data)
        plt.title("Audio Waveform")
        plt.ylim([-0.5, 0.5])
        
        # Plot status
        if self.recording:
            plt.gca().set_facecolor((0.9, 0.9, 1))
            plt.title("Audio Waveform - RECORDING")
        
        # Plot Mel spectrogram
        plt.subplot(3, 1, 2)
        plt.imshow(self.spectrogram_data, aspect='auto', origin='lower')
        plt.title("Mel Spectrogram")
        plt.tight_layout()
        
        # Show transcript history
        plt.subplot(3, 1, 3)
        plt.axis('off')
        history_text = "\n".join(self.transcript_history)
        plt.text(0.05, 0.95, history_text, 
                 verticalalignment='top', wrap=True, fontsize=9,
                 bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.2))
        plt.title("Transcript History")
        
        # Adjust layout
        plt.subplots_adjust(hspace=0.5)
        
    def start(self, visualize=True):
        """Start the real-time transcription system"""
        # Start audio capture thread
        audio_thread = threading.Thread(target=self._audio_capture_thread)
        audio_thread.daemon = True
        audio_thread.start()
        
        # Start transcription thread
        transcription_thread = threading.Thread(target=self._transcription_thread)
        transcription_thread.daemon = True
        transcription_thread.start()
        
        try:
            if visualize:
                # Set up visualization
                plt.figure(figsize=(10, 8))
                ani = FuncAnimation(plt.gcf(), self._update_visualization, interval=100)
                plt.show()
            else:
                # Just keep the main thread alive
                while True:
                    time.sleep(1)
        except KeyboardInterrupt:
            print("Stopping...")
        finally:
            self.terminated = True
            self.p.terminate()

    def save_transcript(self, filename=None):
        """Save the transcript history to a file"""
        if filename is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"transcript_{timestamp}.txt"
        
        with open(filename, "w", encoding="utf-8") as f:
            for entry in self.transcript_history:
                f.write(f"{entry}\n")
        
        print(f"Transcript saved to {filename}")

# Example usage
if __name__ == "__main__":
    # Create and start the transcriber
    transcriber = WhisperRealtimeTranscriber(
        model_size="base",
        language="en",
        energy_threshold=0.01,
        record_timeout=2,
        phrase_timeout=1
    )
    
    try:
        transcriber.start(visualize=True)
    except KeyboardInterrupt:
        pass
    finally:
        transcriber.save_transcript()

Note: To use this example code, you'll need to speak into your microphone during program execution.

Breaking Down the Real-Time Whisper Implementation:

1. Overall Architecture

This advanced implementation creates a real-time speech transcription system using Whisper. Unlike the previous example that processes existing files, this version:

  • Captures live audio input from a microphone
  • Processes audio in chunks as it arrives
  • Provides real-time visualization of the audio signal and transcription
  • Runs Whisper inference continuously on a separate thread

2. Class Structure and Initialization

The WhisperRealtimeTranscriber class encapsulates the entire system:

  • Manages multiple threads for audio capture and processing
  • Maintains queues for communication between threads
  • Configures parameters like energy thresholds for speech detection
  • Initializes visualization components including waveform and spectrogram displays
  • Sets up the Whisper model with GPU acceleration when available

3. Audio Capture System

The _audio_capture_thread method handles continuous audio input:

  • Uses PyAudio to access the microphone stream
  • Implements energy-based voice activity detection to identify speech
  • Manages "phrases" by detecting pauses between speech segments
  • Updates a circular buffer for visualization purposes
  • Queues detected speech for transcription processing

4. Whisper Transcription Engine

The _transcription_thread implements the core speech-to-text functionality:

  • Retrieves audio segments from the queue when available
  • Filters out audio clips that are too short
  • Generates mel spectrograms for both transcription and visualization
  • Runs the Whisper model inference to convert speech to text
  • Maintains a transcript history with timestamps
  • Measures and reports processing time for performance monitoring

5. Real-Time Visualization

The _update_visualization method creates an interactive dashboard:

  • Displays the audio waveform with recording status indicator
  • Shows the mel spectrogram representation used by Whisper
  • Provides a scrolling transcript history panel
  • Updates dynamically using Matplotlib's animation functionality

6. User Interface and Control Flow

The start method orchestrates the system operation:

  • Launches audio capture and transcription threads
  • Sets up the visualization if enabled
  • Handles clean shutdown on user interruption

7. Practical Applications

This implementation offers several advantages over the previous example:

  • Live transcription: Process speech as it happens rather than from files
  • Continuous operation: Run indefinitely for real-time applications
  • Visual feedback: See both the audio signal and the corresponding transcription
  • Speech detection: Automatically identify when someone is speaking
  • Performance monitoring: Track processing times to optimize for real-time use

8. Use Cases for Real-Time Whisper

This implementation is particularly useful for:

  • Live captioning for presentations or meetings
  • Real-time transcription for accessibility purposes
  • Interactive voice-controlled applications
  • Speech analytics and monitoring systems
  • Educational tools showing the relationship between speech and its transcription

9. Technical Considerations

The implementation addresses several challenges:

  • Balancing latency vs. accuracy through parameter tuning
  • Managing computational resources with threading
  • Providing visual feedback without affecting performance
  • Detecting speech vs. silence for efficient processing
  • Formatting and storing transcription results

This real-time implementation represents a significant enhancement over batch processing, enabling interactive applications where immediate transcription is required.

5.2.2 SpeechLM and SpeechGPT: Language Models that Listen

While Whisper excels at ASR (Automatic Speech Recognition), models like SpeechLM and SpeechGPT go a step further: they integrate speech and text into a single transformer framework. This represents a fundamental shift from traditional approaches where speech processing and text understanding were handled by completely separate systems.

This integration is revolutionary because it allows these models to process both modalities simultaneously rather than treating them as separate processing pipelines. By unifying speech and text in the same architecture, these models can leverage contextual information across modalities, resulting in more coherent and contextually appropriate responses. The direct connection between acoustic patterns and semantic meaning enables these models to capture nuances like tone, emphasis, and rhythm that might be lost in a pipeline approach.

To understand the significance of this advancement, consider how traditional speech systems work: first, an ASR component converts audio to text transcripts, then a separate natural language processing (NLP) system analyzes the transcript. Each transition between systems creates an opportunity for information loss. Important acoustic features like speaker emotion, sarcasm detection, or emphasis on specific words are typically stripped away during the initial transcription step.

SpeechLM and SpeechGPT, in contrast, maintain a continuous representation of the speech signal throughout the entire processing chain. This approach preserves crucial paralinguistic information—the non-verbal aspects of communication that often carry significant meaning. For instance, the same phrase spoken with different intonation patterns might convey completely different intentions, from sincere agreement to sarcastic dismissal. By keeping the acoustic signal and its linguistic interpretation linked throughout processing, these models can detect such subtleties.

The technical architecture enabling this integration typically involves specialized encoder modules that process raw audio waveforms or spectrograms into dense vector representations. These speech embeddings are then projected into the same latent space as text embeddings, allowing the transformer's attention mechanisms to establish connections between corresponding elements in both modalities. This cross-modal attention is the key innovation that enables these models to "listen" in a more human-like way.
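
As a rough illustration of that projection step, the sketch below maps speech-encoder outputs into the text embedding dimension so that a shared transformer can attend over both modalities. All shapes and module names here are hypothetical and not taken from any specific SpeechLM or SpeechGPT release:

import torch
import torch.nn as nn

class SpeechToTextProjector(nn.Module):
    """Hypothetical adapter that maps speech-encoder features into the text embedding space."""
    def __init__(self, speech_dim: int = 768, text_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(speech_dim, text_dim)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: [batch, speech_frames, speech_dim] from an audio encoder such as wav2vec2
        return self.proj(speech_feats)  # [batch, speech_frames, text_dim]

# Toy usage: prepend projected speech frames to text token embeddings
speech_feats = torch.randn(1, 200, 768)   # placeholder encoder output (~4 seconds at ~50 frames/s)
text_embeds = torch.randn(1, 12, 1024)    # placeholder text token embeddings
joint_sequence = torch.cat([SpeechToTextProjector()(speech_feats), text_embeds], dim=1)
print(joint_sequence.shape)  # torch.Size([1, 212, 1024]) -- one sequence the transformer attends over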

Unlike traditional systems where speech is first converted to text and then processed by a language model (creating potential information loss at each step), these unified models maintain the richness of the original speech signal throughout processing. This preserves important paralinguistic features such as emotion, speaker identity, and conversational dynamics that are crucial for truly understanding spoken language in context.

SpeechLM (Microsoft):

Pretrained on paired audio–text data, allowing it to develop rich representations that capture both acoustic and linguistic information. This dual-modality training approach enables the model to understand not just what words are being said, but also how they're being said, including tone, emphasis, and speaker characteristics. The model processes raw audio waveforms alongside corresponding transcripts, learning to associate specific acoustic patterns with their semantic meanings. For example, it can distinguish between a question and a statement based on rising or falling intonation, even when the words are identical.

Learns to align acoustic features with linguistic tokens through innovative cross-modal attention mechanisms that map speech patterns to their textual representations. This alignment process creates a shared semantic space where speech and text can interact seamlessly, enabling more accurate interpretation of spoken language. These mechanisms work by establishing bidirectional connections between audio segments and corresponding text tokens, allowing information to flow freely across modalities. When processing a sentence, the model can simultaneously attend to both the acoustic signal and the linguistic structure, creating a unified representation that preserves both aspects of communication.

Supports tasks like speech-to-text, spoken translation, and speech understanding, with superior performance compared to pipeline approaches due to its end-to-end training methodology. By training all components together, SpeechLM avoids error propagation issues common in traditional pipeline systems where mistakes in early stages cascade through the system. In conventional approaches, if the ASR component misrecognizes a word, all downstream components (like translation or understanding) inherit that error. SpeechLM's unified approach allows later processing stages to potentially compensate for earlier uncertainties by leveraging broader contextual information and cross-modal cues, similar to how humans can understand slightly mispronounced words in context.

Utilizes self-supervised learning techniques to maximize learning from limited paired data, enabling robust performance even with limited annotations. These techniques include masked language modeling adapted for speech inputs, contrastive learning between speech and text representations, and consistency regularization across modalities. During training, the model might be presented with an audio segment where certain portions are masked out, requiring it to predict the missing acoustic information based on surrounding context and any available text. Similarly, it learns to minimize the distance between representations of the same content expressed in different modalities, helping to align the speech and text embedding spaces. This approach allows SpeechLM to leverage large quantities of unpaired speech or text data alongside smaller amounts of parallel data.

Incorporates advanced contextual understanding that allows it to better handle ambiguous speech, speaker variations, and noisy environments compared to traditional ASR systems. By maintaining a rich contextual representation throughout processing, SpeechLM can disambiguate homophones (words that sound alike but have different meanings) based on broader semantic context, adapt to different accents and speaking styles by recognizing patterns across larger segments of speech, and filter out background noise by distinguishing relevant speech patterns from irrelevant acoustic information. The model's attention mechanisms can focus on the most informative parts of the signal while de-emphasizing distracting elements, similar to how humans can follow a conversation in a crowded room—often called the "cocktail party effect."
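
The contrastive learning between speech and text representations mentioned above can be sketched as an InfoNCE-style objective over pooled embeddings. This is a generic recipe for aligning the two embedding spaces, not SpeechLM's exact training loss:

import torch
import torch.nn.functional as F

def speech_text_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Pull matched speech/text pairs together and push mismatched pairs apart.

    speech_emb, text_emb: [batch, dim] pooled embeddings for the same batch of utterances.
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.T / temperature   # [batch, batch] cosine similarities
    targets = torch.arange(speech_emb.size(0))       # matched pairs sit on the diagonal
    # Symmetric loss: speech-to-text and text-to-speech directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy usage with random embeddings
print(speech_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)).item())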

SpeechGPT:

Extends LLMs to work directly with speech as input/output, eliminating the need for separate ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) systems in conversational applications. Traditional conversational AI systems typically require a pipeline approach where speech is first converted to text, processed by a language model, and then converted back to speech for the response. This multi-stage process introduces latency at each conversion point and often loses important acoustic information along the way.

SpeechGPT, however, integrates these components into a unified architecture that processes speech signals end-to-end. This direct integration enables smoother conversational flow, as the model processes speech signals directly without converting to intermediate text representations, reducing latency and preserving acoustic nuances that might be lost in traditional pipeline approaches. By maintaining the integrity of the original speech signal throughout processing, SpeechGPT can detect subtle variations in tone, rhythm, and emphasis that carry important communicative information beyond the literal words being spoken.

Can transcribe, understand, and even generate spoken dialogue in a unified framework, maintaining conversational context across multiple turns. Unlike traditional systems that process each utterance independently, SpeechGPT maintains a continuous memory of the conversation, allowing it to reference previous statements and generate contextually appropriate responses that acknowledge shared history between speakers.

This contextual awareness means the model can track topics across multiple exchanges, resolve ambiguous references, and respond appropriately to follow-up questions without requiring users to restate information. For example, if a user asks about the weather today and then follows up with "What about tomorrow?", SpeechGPT can understand that the second question is still about weather without explicit specification. This ability to maintain conversational state mirrors human dialogue patterns where context is implicitly understood and carried forward, creating more natural and efficient interactions.

Useful for conversational agents that naturally handle both modalities, creating more human-like interactions without modal switching delays. This seamless integration between speech and text processing mimics human communication patterns where we naturally shift between listening and speaking without conscious mode switching, enabling more fluid and natural dialogues with AI systems. In practice, this means users can speak directly to the system and receive spoken responses without perceiving any translation happening behind the scenes.

For applications like virtual assistants, customer service bots, or educational tutors, this creates a significantly more natural user experience that reduces cognitive load on users. The elimination of perceptible modal transitions also increases accessibility for users who may struggle with text interfaces, such as those with visual impairments, reading difficulties, or situations where looking at a screen is impractical (like while driving).

Demonstrates improved prosody and intonation in generated speech by leveraging the semantic understanding capabilities of the underlying LLM. By comprehending the meaning, emotion, and pragmatic intent behind responses, SpeechGPT can apply appropriate stress patterns, rhythm variations, and tonal shifts that convey not just what is being said, but how it should be said to effectively communicate meaning. This represents a significant advance over traditional TTS systems that often produce flat, monotonous speech that lacks the natural variations human speakers use to express meaning.

For instance, when expressing excitement, the system can increase pitch and speed; when conveying serious information, it can adopt a more measured pace with appropriate emphasis on key points. These prosodic features are crucial for effective communication, as they help listeners interpret the speaker's intentions, distinguish between questions and statements, identify important information, and understand emotional context. The ability to generate appropriately expressive speech makes interactions feel more natural and helps ensure that the intended meaning is accurately conveyed to users.

How They Work:

  1. Convert audio into speech embeddings using a feature extractor (like wav2vec2), which captures phonetic, prosodic, and speaker information from raw waveforms into dense vector representations. This process transforms complex audio signals into numerical matrices that preserve crucial linguistic features including pronunciation patterns, speech rhythm, emotional tone, and individual voice characteristics. The resulting embeddings create a mathematical representation of speech that models can process efficiently while maintaining the rich acoustic properties of the original audio.
  2. Align embeddings with text tokens in the transformer through cross-attention mechanisms, creating a joint representation space where acoustic and linguistic features can interact freely. These mechanisms allow the model to establish connections between corresponding elements in both modalities, mapping specific acoustic patterns to their textual counterparts. This alignment process creates bidirectional pathways that enable information to flow between speech and text representations, facilitating tasks like spoken language understanding where both the content and delivery of speech matter.
  3. Train on tasks that require both listening and understanding, such as answering questions about spoken content or following verbal instructions, to develop robust multimodal comprehension abilities. This training approach forces the model to process auditory and textual information simultaneously, extracting meaning from both channels and integrating them into a unified semantic representation. By presenting the model with increasingly complex spoken language understanding challenges, it learns to recognize not just what words are being said, but also how context, emphasis, and tone modify their meaning.
  4. Utilize specialized loss functions that encourage semantic consistency between speech and text representations, ensuring that information is preserved across modality boundaries. These loss functions compare the model's internal representations of the same content expressed in different modalities and penalize inconsistencies, driving the model to develop aligned feature spaces. By minimizing the distance between representations of equivalent content across modalities, these functions help the model build a cohesive understanding regardless of whether information arrives as text or speech.
  5. Employ curriculum learning strategies that gradually increase task complexity, starting with simple speech recognition before progressing to more complex understanding and generation tasks. This staged approach begins with basic transcription to establish fundamental audio-text mappings, then advances to more sophisticated tasks like identifying speaker intent, recognizing emotion, and generating contextually appropriate responses to spoken queries. The progressive difficulty helps the model develop a hierarchy of speech understanding capabilities, from low-level acoustic processing to high-level semantic interpretation.

Code Example: SpeechLM Implementation

The snippet below sketches this workflow through a Hugging Face–style interface. Note that the model ID (microsoft/speechlm-large-960h) and the SpeechLMForSpeechToText class are illustrative: SpeechLM is not currently exposed as a dedicated transformers class, so treat the code as a pattern for this family of models rather than a drop-in recipe.

from transformers import AutoProcessor, SpeechLMForSpeechToText
import torch
import soundfile as sf
import librosa

# Load pretrained SpeechLM model and processor
model_id = "microsoft/speechlm-large-960h"
processor = AutoProcessor.from_pretrained(model_id)
model = SpeechLMForSpeechToText.from_pretrained(model_id)

# Load and preprocess audio file
audio_file = "speechlm-example.mp3"
speech, sample_rate = sf.read(audio_file)

# Resample if necessary
if sample_rate != 16000:
    speech = librosa.resample(speech, orig_sr=sample_rate, target_sr=16000)
    sample_rate = 16000

# Prepare inputs
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

# Generate transcription
with torch.no_grad():
    generated_ids = model.generate(
        inputs["input_features"],
        num_beams=5,
        max_length=100
    )

# Decode the output tokens
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

# For speech understanding tasks, we can also get embeddings
with torch.no_grad():
    outputs = model(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        decoder_input_ids=processor.get_decoder_prompt_ids(task="asr")
    )
    
    # Get speech embeddings from the encoder
    speech_embeddings = outputs.encoder_last_hidden_state
    print(f"Speech embeddings shape: {speech_embeddings.shape}")
    
    # These embeddings can be used for downstream tasks like
    # speaker identification, emotion recognition, or semantic analysis

Download the speech sample here: https://files.cuantum.tech/audio/speechlm-example.mp3

Note: Save the example audio in the same location as the Python script.

Code Breakdown: SpeechLM Implementation

This SpeechLM code example demonstrates how to use Microsoft's SpeechLM model for speech transcription and understanding. Let's examine each component:

  1. Imports and Model Loading: The code imports necessary libraries and loads the pretrained SpeechLM model and processor from Hugging Face. SpeechLM is a speech-language model that can process raw audio waveforms and perform tasks like transcription and understanding.
  2. Audio Processing: The audio file is loaded using soundfile and potentially resampled to 16kHz (the standard sampling rate expected by most speech models). This preprocessing ensures the audio input matches the format expected by the model regardless of the source recording conditions.
  3. Input Preparation: The processor converts the raw audio waveform into the model's expected input format. This includes extracting acoustic features (similar to spectrograms) and preparing attention masks to handle variable-length inputs. These features capture the phonetic and prosodic information from the speech signal.
  4. Transcription Generation: The model.generate() method performs beam search decoding to convert the audio features into text. This process uses the model's encoder-decoder architecture to map speech representations to text tokens. The num_beams parameter controls how many alternative hypotheses the model considers during decoding, while max_length limits the output length.
  5. Decoding: The processor.batch_decode() function converts the generated token IDs back into human-readable text, removing any special tokens (like padding or end-of-sequence markers) that are used internally by the model but aren't part of the actual transcription.
  6. Speech Embeddings Extraction: Beyond simple transcription, the code demonstrates how to access the model's internal representations of speech. The encoder_last_hidden_state contains rich contextual embeddings that capture both acoustic and linguistic properties of the speech. These embeddings preserve paralinguistic features (tone, emphasis, emotion) that might be lost in text transcription.

Technical Insights on SpeechLM's Architecture

SpeechLM represents a significant advancement in speech processing for several reasons:

  • Unified encoder-decoder architecture: Unlike pipeline approaches that separate ASR and language understanding, SpeechLM processes the entire speech-to-meaning pathway in a single model, reducing error propagation between components.
  • Contextual understanding: The transformer architecture allows the model to capture long-range dependencies in speech, helping it understand content based on the broader context rather than just isolated segments.
  • Cross-modal pretraining: SpeechLM is pretrained on paired speech-text data, allowing it to develop aligned representations between acoustic and linguistic features. This alignment enables more accurate transcription and understanding of spoken language.
  • Speech embeddings: The model's encoder produces contextualized speech embeddings that preserve both linguistic content and paralinguistic features (like speaker identity, emotion, and emphasis). These rich representations can be used for downstream tasks beyond basic transcription.

Practical Applications

The speech embeddings extracted in the example could be used for:

  • Speaker recognition: Identifying who is speaking based on voice characteristics preserved in the embeddings.
  • Emotion detection: Analyzing the emotional tone of speech from acoustic patterns.
  • Intent classification: Determining what the speaker wants to accomplish (ask a question, make a request, etc.).
  • Speech translation: Converting speech in one language to text in another by connecting the speech embeddings to a translation model.

SpeechLM represents an important step toward truly integrated speech-language models that process spoken language in a more human-like way, maintaining the rich acoustic information that gives speech its nuanced meaning beyond just the words being said.
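
To make "downstream tasks" concrete, the sketch below attaches a small classifier head to the pooled speech embeddings extracted above, e.g., for emotion recognition. The head, label set, and hidden size are hypothetical placeholders:

import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # illustrative label set

class EmotionHead(nn.Module):
    """Tiny classifier over a pooled speech embedding."""
    def __init__(self, hidden_dim: int = 768, num_classes: int = len(EMOTIONS)):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, speech_embeddings: torch.Tensor) -> torch.Tensor:
        pooled = speech_embeddings.mean(dim=1)   # [batch, hidden_dim] average over time
        return self.classifier(pooled)           # [batch, num_classes] logits

# Dummy embeddings stand in for outputs.encoder_last_hidden_state from the example above
dummy_embeddings = torch.randn(1, 120, 768)
logits = EmotionHead()(dummy_embeddings)
print("Predicted emotion:", EMOTIONS[logits.argmax(dim=-1).item()])

In a real system the head would of course be trained on labeled data; the point is simply that the encoder embeddings, not just the transcription, are the reusable artifact.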

Code Example: SpeechGPT Implementation

As with the previous example, the code below is a sketch of the intended conversational speech-in/speech-out workflow rather than a verified API. The model ID is a placeholder (as noted in the comment), and details such as the conversation_history argument and the generate_speech method illustrate the concept; a released SpeechGPT checkpoint may expose a different interface.

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Load pretrained SpeechGPT model and processor
model_id = "microsoft/speech_gpt2_oaitr"  # Note: This is an example model ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)

# Function to handle conversational speech input and output
def speech_conversation(audio_path, conversation_history=None):
    # Load audio file
    waveform, sample_rate = torchaudio.load(audio_path)
    
    # Resample if necessary (SpeechGPT typically expects 16kHz audio)
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)
        sample_rate = 16000
    
    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)
    
    # Process audio input
    inputs = processor(
        audio=waveform.squeeze().numpy(),
        sampling_rate=sample_rate,
        return_tensors="pt",
        conversation_history=conversation_history
    )
    
    # Generate response
    with torch.no_grad():
        output = model.generate(
            input_features=inputs["input_features"],
            attention_mask=inputs.get("attention_mask"),
            max_length=100,
            num_beams=5,
            early_stopping=True,
            conversation_history=inputs.get("conversation_history")
        )
    
    # Process the output
    transcription = processor.decode(output[0], skip_special_tokens=True)
    
    # Optional: Convert response to speech
    speech_output = model.generate_speech(
        output,
        speaker_embeddings=inputs.get("speaker_embeddings")
    )
    
    # Save the generated speech
    torchaudio.save(
        "response.wav", 
        speech_output.squeeze().unsqueeze(0), 
        16000
    )
    
    # Update conversation history
    new_conversation_history = {
        "input_speech": waveform.squeeze().numpy(),
        "output_text": transcription,
        "output_speech": speech_output.squeeze().numpy()
    }
    
    if conversation_history:
        conversation_history.append(new_conversation_history)
    else:
        conversation_history = [new_conversation_history]
    
    return transcription, speech_output, conversation_history

# Example usage
if __name__ == "__main__":
    # Start a new conversation
    conversation_history = None
    
    # First interaction
    user_query = "user_question_.mp3"  # Path to audio file with user's question
    response_text, response_audio, conversation_history = speech_conversation(
        user_query, conversation_history
    )
    
    print(f"User (transcribed): {response_text}")
    
    # Second interaction (with conversation history for context)
    follow_up_query = "user_follow_up.mp3"  # Path to follow-up question audio
    response_text2, response_audio2, conversation_history = speech_conversation(
        follow_up_query, conversation_history
    )
    
    print(f"User follow-up (transcribed): {response_text2}")

Download the user question audio sample here: https://files.cuantum.tech/audio/user_question_.mp3

Download the user follow up audio sample here: https://files.cuantum.tech/audio/user_follow_up.mp3

Note: Save the example audios in the same location as the Python script.

Code Breakdown: SpeechGPT Implementation

  1. Imports and Model Loading: The code imports PyTorch, torchaudio, and Hugging Face transformers to work with the SpeechGPT model. We load a pretrained model and processor that can handle both speech input and output in a conversational context.
  2. Conversation Function: The speech_conversation function serves as the core component, handling the entire speech-to-speech conversation flow. It takes an audio path and optional conversation history as inputs.
  3. Audio Preprocessing: The function loads the audio file using torchaudio, ensures it's at the required 16kHz sample rate (resampling if necessary), and converts stereo to mono if needed. These preprocessing steps ensure the audio meets the model's input requirements.
  4. Input Processing: The processor converts the raw audio waveform into the feature representations expected by SpeechGPT. Importantly, it includes the conversation history parameter, which allows the model to maintain context across multiple turns.
  5. Response Generation: The model generates a response based on the speech input and conversation context. The generation parameters control the quality and length of the response: 
    • max_length: Limits the response length
    • num_beams: Uses beam search with 5 beams for better quality responses
    • early_stopping: Terminates generation when all beams reach an end token
    • conversation_history: Provides context from previous exchanges
  6. Speech Synthesis: Unlike traditional models that would require a separate TTS system, SpeechGPT can directly generate speech output from its internal representations. The generate_speech method converts the text response into audio, maintaining speaker characteristics if provided.
  7. Conversation State Management: The function tracks conversation history by storing each exchange (input speech, output text, output speech) in a structured format. This history is passed to subsequent calls, enabling the model to reference previous information.
  8. Example Usage: The code demonstrates a two-turn conversation, showing how to: 
    • Start a new conversation (empty history)
    • Process the first user query
    • Maintain conversation context
    • Handle a follow-up question while preserving context

Technical Insights on SpeechGPT Architecture

SpeechGPT represents a significant advancement in speech-language models by integrating several key architectural innovations:

  • End-to-end speech-to-speech framework: Unlike traditional pipeline approaches that separate ASR, language understanding, and TTS components, SpeechGPT unifies these capabilities in a single model, reducing latency and error propagation.
  • Joint speech-text representation: The model learns a shared embedding space for both speech and text, allowing for seamless transitions between modalities without information loss. This joint representation enables the model to maintain the emotional and prosodic elements of speech alongside semantic content.
  • Conversation-aware transformer: SpeechGPT extends the standard transformer architecture with additional mechanisms to track conversation state and maintain coherence across multiple turns. This includes specialized attention layers that can reference previous exchanges.
  • Prosody modeling: The speech generation component preserves natural intonation, rhythm, and emphasis patterns by incorporating prosodic features into the generation process. This results in more human-like speech output compared to traditional TTS systems.

Key Advantages Over Traditional Speech Systems

  • Contextual understanding: SpeechGPT maintains conversation state across multiple turns, allowing it to handle follow-up questions, resolve references, and build on previous exchanges without requiring users to restate context.
  • Seamless modality transitions: The unified architecture eliminates perceptible delays between speech understanding and response generation, creating more natural conversational flow.
  • Expressive speech generation: By leveraging its language understanding capabilities, SpeechGPT can apply appropriate prosody and intonation that matches the semantic and emotional content of responses.
  • Reduced latency: The end-to-end design eliminates the computational overhead of separate ASR, NLU, and TTS systems, enabling faster response times in interactive applications.

Practical Applications

SpeechGPT's unified speech-language capabilities make it particularly well-suited for:

  • Virtual assistants: Creating more natural and contextually aware voice interfaces for smart devices and applications.
  • Accessibility tools: Developing conversation systems for users with visual impairments or those who prefer speech interfaces.
  • Language learning: Building interactive tutors that can engage in spoken dialogue while maintaining context across a learning session.
  • Customer service: Powering voice bots that can handle complex, multi-turn conversations with natural speech patterns.

The integration of speech and language understanding in a single model represents a significant step toward more human-like AI communication systems that can engage in natural conversation across modalities.

Code Example: Extracting Speech Features with wav2vec2 (Hugging Face)

from transformers import Wav2Vec2Processor, Wav2Vec2Model, Wav2Vec2ForCTC
import torch
import soundfile as sf
import librosa
import matplotlib.pyplot as plt
import numpy as np
import IPython.display as ipd

# Function to load and preprocess audio
def load_audio(file_path, target_sr=16000):
    """
    Load audio file and resample if necessary
    """
    # Load audio using librosa (handles various formats better)
    try:
        audio, sample_rate = librosa.load(file_path, sr=None)
        # Resample if needed
        if sample_rate != target_sr:
            print(f"Resampling from {sample_rate}Hz to {target_sr}Hz")
            audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=target_sr)
            sample_rate = target_sr
        return audio, sample_rate
    except Exception as e:
        print(f"Error loading audio: {e}")
        return None, None

# Load pretrained speech model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# Also load ASR model for transcription demo
asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load audio file
speech, rate = load_audio("speech_sample_w.mp3")
if speech is not None:
    # Display audio waveform
    plt.figure(figsize=(10, 4))
    plt.plot(speech)
    plt.title("Audio Waveform")
    plt.xlabel("Time (samples)")
    plt.ylabel("Amplitude")
    plt.show()
    
    # Display audio for listening
    ipd.display(ipd.Audio(speech, rate=rate))
    
    # Process audio for feature extraction
    inputs = processor(speech, sampling_rate=rate, return_tensors="pt", padding=True)
    
    # Extract embeddings
    with torch.no_grad():
        # Get the hidden states (embeddings)
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state
        
        # Also get the transcription from ASR model
        logits = asr_model(**inputs).logits
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = processor.batch_decode(predicted_ids)[0]
    
    print("Audio transcription:", transcription)
    print("Shape of embeddings:", embeddings.shape)  # [batch, time, hidden_dim]
    
    # Visualize embeddings
    # Take mean across time dimension to get a single vector per feature
    mean_embeddings = embeddings.mean(dim=1).squeeze().numpy()
    
    plt.figure(figsize=(12, 6))
    plt.imshow(mean_embeddings.reshape(1, -1), aspect='auto', cmap='viridis')
    plt.colorbar()
    plt.title("Speech Embeddings Visualization")
    plt.xlabel("Feature Dimensions")
    plt.ylabel("Sample")
    plt.show()
    
    # Demonstrate feature extraction for downstream tasks
    # Example: Extract global speech representation (average pooling)
    global_speech_vector = embeddings.mean(dim=1)
    print("Global speech vector shape:", global_speech_vector.shape)  # [batch, hidden_dim]
    
    # Example: Extract frame-level features for a specific segment (middle 1 second)
    middle_frame = embeddings.shape[1] // 2
    segment_features = embeddings[0, middle_frame-25:middle_frame+25, :]  # ~1 second at 50Hz frame rate
    print("Segment features shape:", segment_features.shape)  # [frames, hidden_dim]
else:
    print("Failed to load audio file. Please check the path and file format.")

Download the audio sample here: https://files.cuantum.tech/audio/speech_sample_w.mp3

Note: Save the example audio in the same location as the Python script.

Comprehensive Code Breakdown: Speech Feature Extraction with Wav2Vec2

  • 1. Imports and Setup
    • We import the necessary libraries: Transformers for the Wav2Vec2 models, PyTorch for tensor operations, soundfile/librosa for audio processing, and visualization tools.
    • We include both the base Wav2Vec2Model (for embeddings) and Wav2Vec2ForCTC (for transcription) to demonstrate multiple use cases.
  • 2. Audio Loading and Preprocessing
    • The load_audio function handles various audio formats and automatically resamples to 16kHz if necessary (Wav2Vec2's expected sample rate).
    • Using librosa instead of soundfile provides better support for various audio formats and error handling.
  • 3. Model Initialization
    • We load the pretrained Wav2Vec2 processor and model from Hugging Face's model hub.
    • The processor handles tokenization of audio data into the format expected by the model.
    • We also load the ASR variant of the model to demonstrate speech recognition capabilities.
  • 4. Visualization
    • We plot the audio waveform to provide visual insight into the signal being processed.
    • We use IPython's audio display capabilities to allow for listening to the audio directly in notebooks.
  • 5. Feature Extraction
    • The processor converts the raw audio into the input format required by the model.
    • With torch.no_grad(), we ensure no gradients are computed during inference, saving memory.
    • We extract the last_hidden_state which contains the contextualized audio embeddings.
  • 6. Transcription
    • Using the ASR model variant, we convert the same audio input into text.
    • This demonstrates how the same audio features can be used for multiple downstream tasks.
  • 7. Embedding Visualization and Analysis
    • We visualize the embeddings using a heatmap to give insight into the feature patterns.
    • We demonstrate two common ways to use the embeddings:
    • Global representation: averaging across time to get a single vector representing the entire utterance (useful for speaker identification, emotion recognition, etc.)
    • Frame-level features: extracting time-aligned segments for fine-grained analysis (useful for alignment, pronunciation assessment, etc.)
  • 8. Error Handling
    • The code includes basic error handling to gracefully deal with issues like missing files or unsupported formats.

Technical Insights: Why This Approach Matters

  • Wav2Vec2 is a self-supervised model trained on massive amounts of unlabeled speech data, allowing it to learn robust speech representations without requiring transcriptions.
  • The extracted embeddings capture phonetic content, speaker characteristics, emotional tone, and acoustic environment information in a unified representation.
  • These embeddings serve as excellent features for downstream tasks like speech recognition, speaker identification, and emotion classification.
  • The contextual nature of the embeddings (each frame is influenced by surrounding audio) makes them more powerful than traditional acoustic features like MFCCs.
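
For contrast, here is how traditional MFCC features would be computed for the same clip using librosa. Each MFCC frame is derived only from its local ~25 ms window, with no influence from surrounding audio, which is exactly the limitation the contextual Wav2Vec2 embeddings address:

import librosa

speech, rate = librosa.load("speech_sample_w.mp3", sr=16000)

# 13 coefficients per ~25 ms frame (n_fft=400 at 16 kHz), hopped every ~10 ms
mfccs = librosa.feature.mfcc(y=speech, sr=rate, n_mfcc=13, n_fft=400, hop_length=160)
print("MFCC shape:", mfccs.shape)  # [n_mfcc, frames]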

5.2.3 GPT-5 Realtime: Low-Latency Voice Interaction

While Whisper demonstrated the ability to transcribe speech with high accuracy (speech → text) and models like SpeechLM and SpeechGPT extended this by integrating spoken inputs into large language models, GPT-5 Realtime represents the next leap forward: a model that can listen and respond in natural speech almost instantly. This breakthrough addresses the fundamental limitation of earlier systems - the noticeable delay between input and response that made interactions feel mechanical rather than natural.

This is not merely speech recognition paired with text generation and then a separate text-to-speech system bolted on top. Earlier approaches typically followed a pipeline architecture where each component operated independently, creating bottlenecks and inconsistencies. Instead, GPT-5 Realtime is natively multimodal, trained to process audio as a first-class input and to produce audio as a first-class output. This integrated approach means the model understands the prosody, emotion, and nuances in spoken language directly, without information loss from intermediate text representations.

The result is a conversational agent capable of fluid, human-like dialogue with response latency comparable to a natural conversational pause (typically a few hundred milliseconds, as detailed below), making it suitable for real-world conversations, tutoring, and customer service. This low latency is achieved through specialized architectures that process audio streams incrementally rather than waiting for complete utterances, along with predictive mechanisms that anticipate likely responses. The end-to-end optimization eliminates the cumulative delays inherent in pipeline approaches, creating interactions that feel remarkably human in their timing and rhythm.

Architecture and Capabilities

GPT-5 Realtime integrates multiple components into one coherent system, creating a seamless conversational experience:

  • Speech-in: Users can send raw audio (16-bit PCM WAV, 24 kHz mono is a safe default). The model transcribes and interprets speech in real time, converting acoustic signals into semantic understanding. Unlike traditional speech recognition systems that merely transcribe words, GPT-5 Realtime captures nuances, emotions, and contextual cues from the audio input, preserving the richness of human communication. (A short sketch for converting recordings to this input format appears after this list.)
  • Speech-out: The model responds with synthetic but natural-sounding speech, streamed back as low-latency audio frames. Different voices and speaking styles can be selected to match user preferences or specific use cases. The generated speech maintains appropriate prosody, emphasis, and intonation patterns that make the interaction feel genuinely human-like rather than robotic.
  • Full multimodality: In addition to audio, GPT-5 Realtime sessions can also accept text and image inputs mid-conversation, allowing for hybrid interactions (e.g., "Look at this chart and tell me about it" while speaking). This flexibility enables seamless transitions between modalities, supporting more natural workflows where users might want to show visual information while continuing to speak, similar to how humans communicate in meetings or educational settings.
  • Low latency: Because the model is optimized for conversational flow, response latency is comparable to a human pause in speech — generally under 300 ms. This is achieved through specialized streaming architectures and predictive processing that begins generating responses before the user has finished speaking. The near-instantaneous turnaround creates a conversational rhythm that feels natural and engaging, eliminating the awkward pauses common in earlier AI systems.
  • Telephony integration: GPT-5 Realtime sessions can be connected to SIP (Session Initiation Protocol), enabling the model to act as a phone-based agent. This integration allows the model to handle inbound and outbound calls over standard telephone networks, making advanced AI accessible through the most ubiquitous communication technology worldwide, without requiring specialized equipment or applications.

Together, these features push AI systems beyond one-way transcription or delayed response, toward live conversational intelligence.
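
Since the examples in this section assume the speech-in format above (16-bit PCM WAV, 24 kHz mono), here is a minimal conversion sketch using librosa and soundfile; the source filename is just a placeholder:

import librosa
import soundfile as sf

def to_realtime_wav(src_path: str, dst_path: str = "user_prompt_spoken.wav") -> str:
    """Convert an arbitrary audio file to 24 kHz mono 16-bit PCM WAV."""
    audio, _ = librosa.load(src_path, sr=24000, mono=True)   # resample and downmix
    sf.write(dst_path, audio, samplerate=24000, subtype="PCM_16")
    return dst_path

# Example: to_realtime_wav("my_recording.m4a")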

Practical Example

For consistency with our multimodal focus, we’ll use a short audio file (user_prompt_spoken.wav) where the user asks:

“Can you explain the advantages of GPT-5 as a multimodal model?”

When sent to GPT-5 Realtime, the model will:

  1. Transcribe the spoken question.
  2. Reason about the content.
  3. Generate speech that explains the advantages of GPT-5’s multimodality.

The round-trip feels like a natural dialogue with a knowledgeable assistant.

Code Example: Realtime Voice with GPT-5

The following Python script shows how to connect to the Realtime API using WebSockets, send a short WAV file as input, and save the assistant’s spoken reply as a new WAV file.

"""
Realtime Voice with GPT-5 (WebSocket API)
- Sends a short WAV file (user_prompt_spoken.wav) to GPT-5 Realtime
- Receives streamed audio back and saves it to assistant_reply.wav
Requirements: pip install websockets soundfile numpy
"""

import os
import json
import base64
import asyncio
import websockets
import numpy as np
import soundfile as sf

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-5-realtime"

INPUT_WAV = "user_prompt_spoken.wav"   # spoken question
OUTPUT_WAV = "assistant_reply.wav"     # assistant’s voice reply

def read_wav_as_base64(path: str) -> str:
    """Read WAV file and return base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

async def main():
    if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
        raise SystemExit("Set OPENAI_API_KEY env variable.")

    # Load spoken user prompt
    user_audio_b64 = read_wav_as_base64(INPUT_WAV)

    # Connect to Realtime WebSocket
    async with websockets.connect(
        REALTIME_URL,
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
        max_size=20 * 1024 * 1024,
    ) as ws:
        print("Connected to GPT-5 Realtime.")

        # 1) Configure session (input/output formats, voice)
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "input_audio_format": "wav",
                "output_audio_format": "wav",
                "voice": "alloy",
                "instructions": (
                    "You are a helpful voice assistant. "
                    "Answer the user’s question clearly and concisely."
                )
            }
        }))

        # 2) Append user audio to input buffer
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": user_audio_b64
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # 3) Ask model to create a response
        await ws.send(json.dumps({"type": "response.create", "response": {}}))

        print("Waiting for GPT-5 Realtime reply...")

        # 4) Collect audio frames
        audio_bytes = bytearray()
        sample_rate = 24000  # expected sample rate

        async for msg in ws:
            evt = json.loads(msg)

            if evt.get("type") == "response.output_text.delta":
                print(evt.get("delta"), end="", flush=True)

            elif evt.get("type") == "response.output_audio.delta":
                audio_bytes.extend(base64.b64decode(evt["delta"]))

            elif evt.get("type") == "response.completed":
                print("\n[Response completed]")
                break

        # 5) Save assistant’s reply as a WAV file
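        # Note: the chunks are treated here as raw PCM16; if the server wraps them as WAV,
        # the header bytes add a brief click at the start (the live-mic example below normalizes the same way)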
        pcm16 = np.frombuffer(bytes(audio_bytes), dtype=np.int16)
        sf.write(OUTPUT_WAV, pcm16, samplerate=sample_rate, subtype="PCM_16")
        print(f"[Saved] {OUTPUT_WAV}")

if __name__ == "__main__":
    asyncio.run(main())

Download the audio sample here: https://files.cuantum.tech/audio/user_prompt_spoken.wav

Note: Save the example audio in the same location as the Python script.

Code breakdown:

  1. Session Setup
    • The client connects to the Realtime WebSocket and sends a session.update message specifying:
      • Input modality: audio (WAV).
      • Output modality: audio (WAV).
      • Selected voice (e.g., "alloy").
    • This defines the rules of the conversation.
  2. Input Buffering
    • Audio files (or live microphone frames) are base64-encoded and appended to an input buffer.
    • A commit message signals the end of input.
  3. Response Creation
    • A response.create message tells GPT-5 to process the buffer and generate a reply.
  4. Streaming Output
    • The server streams back two types of deltas:
      • response.output_text.delta (optional live transcript).
      • response.output_audio.delta (audio chunks).
    • Audio chunks are collected into a byte array until response.completed is received.
  5. Saving the File
    • The reply is written as a standard 24 kHz PCM16 WAV file, playable in any media player.
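
To quickly verify the saved reply, a short playback sketch using sounddevice (pip install sounddevice; the same library handles microphone capture in the next example) is enough:

import soundfile as sf
import sounddevice as sd

audio, sr = sf.read("assistant_reply.wav", dtype="int16")
sd.play(audio, sr)
sd.wait()  # block until playback finishes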

Applications and Implications

GPT-5 Realtime demonstrates how far multimodal LLMs have evolved:

  • Conversational Agents: Natural, low-latency assistants that can answer customer queries or provide educational tutoring over phone or web.
  • Accessibility: Voice-based interfaces for users who cannot easily type or read text.
  • Hybrid Interactions: Combine voice with images and text mid-conversation, enabling richer multi-turn exchanges.
  • Telephony Integration: Deploy AI agents that can handle SIP phone calls, routing, and form-filling.

Example Code: Live Microphone Capture with GPT-5 Realtime (speech-in → speech-out)

What it does:

  • Records ~3 seconds from your default mic
  • Streams it to GPT-5 Realtime over WebSocket
  • Saves the model’s spoken reply as assistant_reply.wav
  • Prints a live text transcript (if provided by the server)

Requirements

pip install websockets sounddevice soundfile numpy
  • OS mic permissions: allow terminal/IDE access to the microphone (macOS: System Settings → Privacy & Security → Microphone; Windows: Privacy → Microphone).
"""
Live Mic → GPT-5 Realtime → Spoken Reply
Records ~3 seconds of audio from your default microphone, streams it to GPT-5 Realtime,
and saves the assistant's spoken response to assistant_reply.wav.

Requirements:
  pip install websockets sounddevice soundfile numpy
Set your key:
  export OPENAI_API_KEY="sk-..."   (macOS/Linux)
  setx OPENAI_API_KEY "sk-..."     (Windows, new terminal)

If you prefer MP3 I/O, see the note in your book; this example uses WAV (PCM16 @ 24 kHz).
"""

import os
import json
import base64
import asyncio
import websockets
import numpy as np
import sounddevice as sd
import soundfile as sf

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-5-realtime"

# Recording settings (safe defaults for Realtime)
SAMPLE_RATE = 24000          # 24 kHz mono PCM16
CHANNELS = 1
DURATION_SECONDS = 3.0       # keep short for quick tests
OUTPUT_WAV = "assistant_reply.wav"

SYSTEM_INSTRUCTIONS = (
    "You are a helpful voice assistant. Transcribe the user if needed, "
    "then answer clearly in one or two sentences."
)

def record_from_mic(seconds: float = DURATION_SECONDS, sr: int = SAMPLE_RATE) -> bytes:
    """Record mono PCM16 audio from the default microphone and return raw bytes."""
    print(f"🎙️  Recording {seconds:.1f}s from microphone...")
    audio = sd.rec(int(sr * seconds), samplerate=sr, channels=CHANNELS, dtype="int16")
    sd.wait()
    print("✅ Done.")
    # audio is int16 numpy array; convert to raw bytes
    return audio.tobytes()

def b64encode_pcm16_wav(pcm_bytes: bytes, sr: int = SAMPLE_RATE) -> str:
    """
    Wrap raw PCM16 bytes into a WAV file in memory and return base64 string.
    Using soundfile to write to bytes buffer for simplicity.
    """
    import io
    buf = io.BytesIO()
    # convert bytes -> int16 array so soundfile can write it
    arr = np.frombuffer(pcm_bytes, dtype=np.int16)
    sf.write(buf, arr, sr, subtype="PCM_16", format="WAV")
    return base64.b64encode(buf.getvalue()).decode("utf-8")

async def main():
    if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
        raise SystemExit("Set OPENAI_API_KEY environment variable first.")

    # 1) Capture short mic audio & base64-encode as WAV
    pcm = record_from_mic()
    user_audio_b64 = b64encode_pcm16_wav(pcm)

    # 2) Connect to Realtime WS
    async with websockets.connect(
        REALTIME_URL,
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
        max_size=20 * 1024 * 1024,  # allow large frames
    ) as ws:
        print("🔌 Connected to GPT-5 Realtime.")

        # 3) Configure session: audio in/out (WAV), pick a voice
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "input_audio_format": "wav",
                "output_audio_format": "wav",
                "voice": "alloy",
                "instructions": SYSTEM_INSTRUCTIONS
            }
        }))

        # 4) Send mic audio (can be multiple appends for streaming mic)
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": user_audio_b64
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Optional: add a brief text nudge
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "instructions": "Please respond concisely in speech."
            }
        }))

        print("🕑 Waiting for streamed reply...\n")

        # 5) Receive streamed audio/text deltas
        audio_bytes = bytearray()
        sample_rate = SAMPLE_RATE  # server commonly uses 24k; update if session reports different

        async for msg in ws:
            evt = json.loads(msg)

            # Live transcript (optional)
            if evt.get("type") == "response.output_text.delta":
                print(evt.get("delta"), end="", flush=True)

            # Audio chunks (base64-encoded PCM16 WAV frames)
            elif evt.get("type") == "response.output_audio.delta":
                audio_bytes.extend(base64.b64decode(evt["delta"]))

            elif evt.get("type") == "response.completed":
                print("\n\n✅ Response completed.")
                break

        # 6) Save assistant reply to WAV
        if audio_bytes:
            # raw bytes may already be WAV, but normalizing here is robust:
            # interpret as PCM16 stream and write as standard WAV
            pcm16 = np.frombuffer(bytes(audio_bytes), dtype=np.int16)
            sf.write(OUTPUT_WAV, pcm16, samplerate=sample_rate, subtype="PCM_16")
            print(f"💾 Saved assistant reply → {OUTPUT_WAV}")
        else:
            print("⚠️ No audio received from server.")

if __name__ == "__main__":
    asyncio.run(main())

Here's a code breakdown:

Required Libraries

The script uses several Python libraries:

  • websockets: For WebSocket communication with the GPT-5 Realtime API
  • sounddevice: To record audio from the microphone
  • soundfile: For handling WAV file operations
  • numpy: For audio data manipulation
  • Standard libraries: os, json, base64, asyncio, io

Key Components

1. Configuration Settings

The script defines several important constants:

  • OPENAI_API_KEY: Authentication key for OpenAI's API
  • REALTIME_URL: WebSocket endpoint for GPT-5 Realtime
  • Recording parameters: Sample rate (24kHz), channels (mono), recording duration (3 seconds)
  • SYSTEM_INSTRUCTIONS: Prompts GPT-5 to act as a voice assistant

2. Audio Recording Function

The record_from_mic() function:

  • Uses sounddevice to capture audio at specified sample rate and duration
  • Records in mono at 16-bit PCM format
  • Returns raw audio bytes

3. WAV Encoding Function

The b64encode_pcm16_wav() function:

  • Takes raw PCM16 audio bytes
  • Wraps them in a WAV container using soundfile
  • Returns the base64-encoded string of the WAV file

4. Main Async Function

The main() async function orchestrates the entire process:

API Key Validation

  • Checks if the OpenAI API key is properly set

Audio Recording and Encoding

  • Records audio from the microphone
  • Encodes it as a base64 WAV string

WebSocket Connection

  • Establishes a secure WebSocket connection to GPT-5 Realtime
  • Sets proper headers including API key and beta flag

Session Configuration

  • Sends a session.update message to configure:
  • Input/output modalities (text and audio)
  • Audio format (WAV for both input and output)
  • Voice selection ("alloy")
  • System instructions for the assistant

Input Handling

  • Appends the recorded audio to the input buffer
  • Commits the buffer to signal completion of input
  • Optionally adds text instructions to shape the response

Response Processing

  • Collects streamed response data in real-time:
  • Text deltas (transcription of response)
  • Audio deltas (spoken audio chunks)
  • Monitors for completion signal

Output Saving

  • Converts collected audio bytes back to PCM16 format
  • Writes to a WAV file (assistant_reply.wav)

Flow of Execution

The script follows this sequence:

  1. Validate environment setup and API key
  2. Record short audio clip from microphone
  3. Connect to GPT-5 Realtime WebSocket API
  4. Configure session parameters (audio formats, voice)
  5. Send recorded audio and commit the input
  6. Request model to process audio and generate a response
  7. Receive and display text transcript while collecting audio chunks
  8. Save the complete audio response as a WAV file

Error Handling

The code includes basic error handling:

  • Checks for missing API key
  • Verifies if audio was received from the server

Technical Notes

  • Uses 24 kHz mono PCM16 audio, a format well suited to speech models
  • Streams data over the WebSocket protocol in real time
  • Uses asyncio for asynchronous operations
  • Implements proper WebSocket connection lifecycle management
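
The input buffer also accepts multiple appends, which is how continuous microphone streaming would work in practice. A minimal sketch of chunked appends, assuming ws is the already-configured session from the script above and pcm holds the recorded int16 bytes (whether raw PCM16 or WAV-wrapped chunks are expected depends on the session's input_audio_format, so treat this as a sketch rather than a verified recipe):

import json
import base64

async def stream_audio_chunks(ws, pcm: bytes, sample_rate: int = 24000, chunk_seconds: float = 0.5):
    """Append audio to the Realtime input buffer in small chunks instead of one large payload."""
    chunk_bytes = int(sample_rate * chunk_seconds) * 2   # bytes per chunk (mono int16)
    for i in range(0, len(pcm), chunk_bytes):
        chunk_b64 = base64.b64encode(pcm[i:i + chunk_bytes]).decode("utf-8")
        await ws.send(json.dumps({"type": "input_audio_buffer.append", "audio": chunk_b64}))
    await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))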

Mid-Session Multimodality: Combining Audio and Images

One of GPT-5 Realtime’s most powerful abilities is to handle multiple modalities within a single ongoing conversation. Unlike earlier systems that processed text, images, or audio in isolation, Realtime can fluidly combine them as they arrive. This enables natural scenarios where a user begins by speaking a question and then adds an image for further clarification or analysis — all in the same session without restarting the dialogue.

For example, imagine a student asking aloud “Can you explain the advantages of GPT-5 as a multimodal model?” and then immediately showing a chart of data. GPT-5 Realtime can integrate both inputs, producing a spoken response that addresses the original audio question and references insights from the chart. This kind of dynamic, mid-session multimodality illustrates how the model moves beyond static question–answer patterns and toward fluid, real-time collaboration with human users.

Example: Mid-Session Multimodality (Audio question → Append Image → Spoken reply)

What it does

  1. Sends a short spoken question (WAV) to GPT-5 Realtime.
  2. Appends a chart image in the same session.
  3. Requests a spoken answer that references both the audio question and the image.
  4. Saves the reply as assistant_multimodal_reply.wav and prints any streamed text.

Requirements

pip install websockets soundfile numpy pillow
  • Put your audio prompt file (e.g., user_prompt_spoken.wav) and an image (e.g., chart.png) in the same folder, or adjust the paths below.
"""
Multimodal Mid-Session with GPT-5 Realtime
- Step 1: Send a spoken question (WAV) to GPT-5 Realtime.
- Step 2: Append an image (PNG) in the same session.
- Step 3: Ask for a spoken reply that references BOTH inputs.
- Saves the model’s voice reply to assistant_multimodal_reply.wav.

Requirements:
  pip install websockets soundfile numpy pillow
Set your key:
  export OPENAI_API_KEY="sk-..."   (macOS/Linux)
  setx OPENAI_API_KEY "sk-..."     (Windows, new terminal)
"""

import os
import io
import json
import base64
import asyncio
import websockets
import numpy as np
import soundfile as sf
from PIL import Image

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-5-realtime"

# Input files (adjust as needed)
INPUT_WAV = "user_prompt_spoken.wav"  # spoken question, e.g., “Can you explain the advantages of GPT-5 as a multimodal model?”
INPUT_IMG = "chart.png"               # a chart image to reference mid-session
OUTPUT_WAV = "assistant_multimodal_reply.wav"

# Session behavior
VOICE_NAME = "alloy"
SYSTEM_INSTRUCTIONS = (
    "You are a helpful voice assistant. Consider ALL inputs in this session. "
    "First, interpret the user's spoken question. Then, when an image is provided, "
    "analyze it and integrate both sources in your final spoken answer. "
    "Be concise and precise."
)

def read_wav_as_base64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def read_png_as_base64(path: str) -> str:
    # Ensure we produce a clean PNG bytes payload (also validates file)
    with Image.open(path) as im:
        im = im.convert("RGBA") if im.mode not in ("RGB", "RGBA") else im
        buf = io.BytesIO()
        im.save(buf, format="PNG")
        return base64.b64encode(buf.getvalue()).decode("utf-8")

async def main():
    if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
        raise SystemExit("Set OPENAI_API_KEY environment variable first.")

    # Load inputs
    user_audio_b64 = read_wav_as_base64(INPUT_WAV)
    image_png_b64 = read_png_as_base64(INPUT_IMG)

    # Connect to Realtime WebSocket
    async with websockets.connect(
        REALTIME_URL,
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
        max_size=20 * 1024 * 1024,
    ) as ws:
        print("🔌 Connected to GPT-5 Realtime.")

        # 1) Configure session: we'll use audio in/out, and also allow image as input
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio", "image"],
                "input_audio_format": "wav",
                "output_audio_format": "wav",
                "voice": VOICE_NAME,
                "instructions": SYSTEM_INSTRUCTIONS
            }
        }))

        # 2) Append the user's spoken question (audio buffer)
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": user_audio_b64
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # 3) Append the image mid-session
        #    We send the PNG as base64 along with its MIME. (You can also send a URL if supported.)
        await ws.send(json.dumps({
            "type": "input_image.append",
            "image": image_png_b64,
            "mime_type": "image/png",
            # Optionally, add a hint for the model about why you're sending the image:
            "metadata": {
                "purpose": "chart_analysis",
                "caption": "A line chart showing a synthetic trend over time."
            }
        }))
        await ws.send(json.dumps({"type": "input_image.commit"}))

        # 4) Ask for a response that references BOTH the spoken question and the image
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "instructions": (
                    "Please answer in speech. "
                    "Explain the advantages of GPT-5 as a multimodal model, "
                    "and also summarize the main trend you observe in the provided chart. "
                    "Be concise (15–25 seconds)."
                )
            }
        }))

        print("🕑 Waiting for streamed reply...\n")

        # 5) Receive streamed text & audio
        audio_bytes = bytearray()
        sample_rate = 24000  # common server rate; adjust if your session reports differently

        async for msg in ws:
            evt = json.loads(msg)

            # Optional: live transcript/notes
            if evt.get("type") == "response.output_text.delta":
                print(evt.get("delta"), end="", flush=True)

            # Audio deltas (base64-encoded PCM16)
            elif evt.get("type") == "response.output_audio.delta":
                audio_bytes.extend(base64.b64decode(evt["delta"]))

            elif evt.get("type") == "response.completed":
                print("\n\n✅ Response completed.")
                break

        # 6) Save the final spoken reply
        if audio_bytes:
            pcm16 = np.frombuffer(bytes(audio_bytes), dtype=np.int16)
            sf.write(OUTPUT_WAV, pcm16, samplerate=sample_rate, subtype="PCM_16")
            print(f"💾 Saved assistant reply → {OUTPUT_WAV}")
        else:
            print("⚠️ No audio received from server.")

if __name__ == "__main__":
    asyncio.run(main())

Download the audio sample here: https://files.cuantum.tech/audio/user_prompt_spoken.wav

Download the chart image sample here: https://files.cuantum.tech/images/chart.png

Note: Save the example audio and chart image in the same location as the Python script.

Here's a code breakdown:

Key Components

  1. Imports and Setup: The script uses several Python libraries:
    • Standard libraries: os, io, json, base64, asyncio
    • websockets: For WebSocket communication with the GPT-5 Realtime API
    • numpy: For audio data manipulation
    • soundfile: For handling WAV file operations
    • PIL (Pillow): For image processing
  2. Configuration: The script defines important constants:
    • OPENAI_API_KEY: Retrieved from environment variables
    • REALTIME_URL: WebSocket endpoint for the GPT-5 Realtime API
    • Input/output file paths: Locations of input audio (WAV), input image (PNG), and output audio
    • VOICE_NAME: Selects "alloy" as the voice for the assistant's reply
    • SYSTEM_INSTRUCTIONS: Defines the assistant's behavior
  3. Helper Functions: Two utility functions for file handling:
    • read_wav_as_base64(): Reads a WAV file and converts it to base64 encoding
    • read_png_as_base64(): Reads a PNG image, ensures it's in the correct format, and converts it to base64
  4. Main Asynchronous Function: The core of the script with these main steps:
    • Input Validation: Checks if the API key is properly set
    • File Loading: Loads and encodes the audio and image files
    • WebSocket Connection: Establishes a connection to GPT-5 Realtime with proper headers
    • Session Configuration: Sets up a session with text, audio, and image modalities
    • Audio Input: Sends the spoken question (WAV) and commits the audio buffer
    • Image Input: Appends the chart image mid-session with metadata about its purpose
    • Response Request: Requests a spoken reply that addresses both inputs
    • Response Processing: Collects streamed text and audio chunks from the server
    • Output Saving: Converts received audio bytes to PCM16 format and saves as WAV

WebSocket Communication Flow

The script follows a specific protocol for communication with the GPT-5 Realtime API:

  1. Sends a session.update message to configure modalities and behavior
  2. Sends the audio data using input_audio_buffer.append and commits it
  3. Adds the image using input_image.append with metadata and commits it
  4. Creates a response request with specific instructions
  5. Processes incoming events in real-time:
    • Text deltas (transcription)
    • Audio deltas (spoken reply chunks)
    • Completion signal

Error Handling

The script includes basic error checking:

  • Validates the API key
  • Checks if audio was received from the server

Key Technical Aspects

The implementation showcases several important concepts:

  • Asynchronous programming with asyncio for non-blocking I/O
  • Base64 encoding for binary data transmission over WebSockets
  • Real-time streaming of both text and audio responses
  • Mid-session multimodality by combining different input types in one conversation
  • Proper WebSocket lifecycle management

This code example demonstrates the power of GPT-5 Realtime's ability to handle multiple modalities within a single ongoing conversation, allowing for more natural and fluid interactions.

5.2.4 Why Audio Integration Matters

Accessibility: Automatic transcription for the hearing impaired. This technology enables real-time conversion of spoken content into text, making digital media, meetings, and educational resources accessible to deaf and hard-of-hearing individuals. Modern transcription systems can work in real-time with high accuracy, providing captions for live events, lectures, and conversations, removing barriers to participation in many aspects of daily life and professional settings.

By integrating audio processing with language models, these systems can accurately capture nuances, different accents, and even distinguish between multiple speakers. This integration enables more contextual understanding, allowing the transcription to include important non-verbal audio cues, proper punctuation, and speaker identification. Advanced systems can also adapt to specialized terminology, regional dialects, and challenging acoustic environments, making information more accessible across diverse settings from medical appointments to entertainment media.

Education: Real-time translation and captions in classrooms. This application transforms how international students engage with lectures by providing immediate translations of spoken content. It also helps all students by generating accurate captions for recorded lectures, making review more efficient and allowing learners to search through spoken content based on keywords or concepts.

Advanced multimodal systems can detect lecture context and technical terminology, accurately translating specialized vocabulary while maintaining academic integrity. These systems can distinguish between different speakers in classroom discussions, properly attributing questions and responses in the transcription.

Furthermore, these technologies enable asynchronous learning by creating searchable archives of lectures that students can navigate by concept rather than timestamp. For students with learning differences such as ADHD or dyslexia, the synchronized visual and auditory information improves comprehension and retention.

The integration of AI with educational content also allows for personalized learning paths, where the system can identify concepts that individual students struggle with based on their engagement patterns and provide targeted supplementary material. This multimodal approach bridges accessibility gaps while enhancing the learning experience for all students.

Assistants: Voice-driven chatbots, smart speakers, and AI tutors. These systems create natural conversation flows by understanding spoken queries and generating contextually appropriate spoken responses. Advanced multimodal assistants can maintain conversational context over extended interactions, understand varying speech patterns, and respond with appropriate intonation and emphasis that matches the content being delivered.

Cross-lingual communication: Breaking down barriers with speech-to-speech translation. This technology enables conversations between people who speak different languages by capturing speech in one language, understanding its meaning, and generating natural-sounding speech in another language. Modern systems preserve speaker characteristics like tone, pace, and emotion, making the exchange feel more personal and authentic.

These systems represent a significant advancement over traditional translation tools by offering real-time communication without requiring text interfaces. The process involves three sophisticated steps: speech recognition to convert spoken words into text, machine translation to convert that text into another language, and text-to-speech synthesis to deliver the translation in a natural voice.
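
To make the contrast with unified models concrete, the following toy sketch chains those three stages, assuming Whisper for recognition and translation into English and the offline pyttsx3 engine for synthesis. The library choices and the file name are illustrative assumptions, not a production pipeline.

import whisper
import pyttsx3

def speech_to_speech_en(audio_path: str):
    model = whisper.load_model("base")
    # Stages 1-2: recognize the source speech and translate it into English text
    result = model.transcribe(audio_path, task="translate")
    english_text = result["text"]
    print("Translated text:", english_text)

    # Stage 3: synthesize the translated text as speech
    engine = pyttsx3.init()
    engine.say(english_text)
    engine.runAndWait()

# speech_to_speech_en("spanish_clip.wav")  # placeholder file name

A unified speech-to-speech model replaces this chain with a single end-to-end system, which is exactly where information loss between stages disappears.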

The latest neural translation models understand cultural nuances and idioms that literal translations often miss. For example, when a Japanese speaker uses honorifics that don't exist in English, the system can adapt the output to convey appropriate respect through tone and word choice rather than direct translation.

Additionally, these technologies can adapt to various contexts - from business negotiations where precision is critical to casual conversations where fluidity matters more. Some advanced systems even maintain consistent voice profiles across languages, allowing a Spanish speaker's unique vocal characteristics to be present in the English translation, creating a more seamless and personalized communication experience.

Unlike older systems where speech recognition and language models were separate components chained together with potential information loss at each step, modern multimodal approaches fuse them into unified architectures that process acoustic and linguistic information simultaneously. This integration creates AI that listens and responds more naturally, understanding context across modalities and handling the ambiguities inherent in human communication.


Key features of Whisper:

  • Trained on 680,000 hours of multilingual audio from the web, including a wide variety of accents, dialects, and background conditions. This massive and diverse training dataset enables Whisper to handle real-world audio that previous systems struggled with. The dataset's scale provides broad coverage across linguistic variations, regional accents, speaking styles, and acoustic environments, giving Whisper an unprecedented ability to understand speech in virtually any context. This extensive training directly translates to Whisper's ability to transcribe speech from speakers with accents or dialects traditionally underrepresented in AI training data.
  • Handles noisy, real-world audio (e.g., phone calls, lectures, podcasts, street recordings) with remarkable resilience. Unlike earlier models that performed well only in studio-quality conditions, Whisper maintains accuracy even with background noise, overlapping speakers, or varying microphone quality. This robustness stems from its exposure to diverse acoustic environments during training, allowing it to filter out irrelevant sounds and focus on the speech signal. Whether processing a recording from a busy café, a conference room with echoing acoustics, or an outdoor interview with wind interference, Whisper can extract the spoken content with surprising accuracy.
  • Supports transcription, translation, and language identification across 99 languages. This multilingual capability allows it to automatically detect the spoken language and process content from global sources without requiring manual language selection. Whisper can seamlessly transcribe content in languages ranging from widely-spoken ones like English, Spanish, and Mandarin to less common languages like Swahili, Lithuanian, and Nepali. This language versatility makes it an invaluable tool for global communication, international research, and cross-cultural content creation. Even more impressively, Whisper can identify when speakers switch between languages mid-conversation, a phenomenon known as code-switching.
  • Features zero-shot learning capabilities, meaning it can perform tasks it wasn't explicitly fine-tuned for, adapting to new scenarios without additional training. This remarkable ability allows Whisper to generalize its knowledge to unfamiliar contexts, speakers, and acoustic environments. For example, without specific fine-tuning, it can transcribe technical jargon in fields like medicine or engineering, understand regional dialects it hasn't explicitly seen before, or adapt to novel audio recording conditions. This zero-shot capability is particularly valuable in practical applications where the diversity of real-world speech would otherwise require countless specialized models for different scenarios.

At its core, Whisper combines a log-Mel spectrogram encoder with a decoder similar to GPT, allowing it to map raw audio to natural language text. The encoder transforms audio waveforms into spectrograms—visual representations of sound frequencies over time—which capture the acoustic patterns in speech. This process begins by converting the raw audio signal into a spectrogram using the Short-Time Fourier Transform (STFT), which breaks down the audio into its frequency components.

These components are then mapped to the Mel scale, which approximates how humans perceive sound frequencies, with greater sensitivity to lower frequencies than higher ones. The resulting log-Mel spectrogram provides a compact representation of the audio that emphasizes the most perceptually relevant features.

These spectrograms are then processed through a transformer encoder that extracts meaningful features. The transformer architecture, with its self-attention mechanisms, allows the model to focus on different parts of the spectrogram simultaneously, capturing both local phonetic details and broader acoustic patterns. This is crucial for handling variations in speech like different accents, speaking rates, and background noise.

The GPT-style decoder then converts these features into text, treating transcription as a sequence prediction task similar to language modeling. This decoder works autoregressively, generating each word or token based on both the encoded audio features and the previously generated text. This approach enables Whisper to maintain contextual coherence throughout the transcription, correctly interpreting ambiguous sounds based on their surrounding context, and producing natural-sounding text that accurately reflects the original speech.
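
The openai-whisper package exposes these same building blocks directly, so the pipeline just described can be traced in a few lines. This is a minimal sketch; the model size and the file name clip.wav are placeholders.

import whisper

model = whisper.load_model("base")

# Load audio and pad/trim it to the 30-second window Whisper expects
audio = whisper.load_audio("clip.wav")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram that the encoder consumes
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Identify the spoken language from the encoder features
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# The GPT-style decoder then autoregressively emits text tokens
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)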

Example: Transcribing Audio with Whisper

# Comprehensive implementation of Whisper for audio transcription

# Install required libraries
# pip install git+https://github.com/openai/whisper.git
# pip install librosa matplotlib numpy

import whisper
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import time
import torch
from pathlib import Path

def visualize_audio(audio_path):
    """Visualize the audio waveform and spectrogram"""
    y, sr = librosa.load(audio_path)
    
    # Create a figure with two subplots
    plt.figure(figsize=(12, 8))
    
    # Plot waveform
    plt.subplot(2, 1, 1)
    librosa.display.waveshow(y, sr=sr)
    plt.title('Waveform')
    
    # Plot spectrogram
    plt.subplot(2, 1, 2)
    D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='log')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Log-frequency power spectrogram')
    
    plt.tight_layout()
    plt.show()

def transcribe_audio(audio_path, model_size="base", language=None, verbose=True):
    """
    Transcribe audio using OpenAI's Whisper model
    
    Parameters:
    - audio_path: Path to the audio file
    - model_size: Size of the Whisper model to use (tiny, base, small, medium, large)
    - language: Language code (e.g., "en" for English) or None for auto-detection
    - verbose: Whether to print progress information
    
    Returns:
    - Dictionary containing transcription results
    """
    start_time = time.time()
    
    if verbose:
        print(f"Loading Whisper model: {model_size}")
    
    # Load pre-trained Whisper model
    model = whisper.load_model(model_size)
    
    model_load_time = time.time()
    if verbose:
        print(f"Model loaded in {model_load_time - start_time:.2f} seconds")
        print(f"Transcribing: {audio_path}")
    
    # Set transcription options
    options = {}
    if language:
        options["language"] = language
    
    # Transcribe the audio file
    result = model.transcribe(audio_path, **options)
    
    end_time = time.time()
    if verbose:
        print(f"Transcription completed in {end_time - model_load_time:.2f} seconds")
        print(f"Detected language: {result['language']}")
        print(f"Total processing time: {end_time - start_time:.2f} seconds")
    
    return result

def save_transcription(result, output_file=None):
    """Save transcription results to a text file"""
    if output_file is None:
        output_file = "transcription_output.txt"
    
    with open(output_file, "w", encoding="utf-8") as f:
        # Write the full transcription
        f.write("FULL TRANSCRIPTION:\n")
        f.write(result["text"])
        f.write("\n\n")
        
        # Write segment-by-segment with timestamps
        f.write("SEGMENTS WITH TIMESTAMPS:\n")
        for segment in result["segments"]:
            start = segment["start"]
            end = segment["end"]
            text = segment["text"]
            f.write(f"[{start:.2f}s - {end:.2f}s] {text}\n")
    
    return output_file

def batch_transcribe(directory, extension=".mp3", output_dir=None):
    """Transcribe all audio files with the given extension in a directory"""
    if output_dir is None:
        output_dir = Path("transcription_results")
    
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)
    
    directory = Path(directory)
    audio_files = list(directory.glob(f"*{extension}"))
    
    print(f"Found {len(audio_files)} {extension} files in {directory}")
    
    for audio_file in audio_files:
        print(f"\nProcessing: {audio_file.name}")
        result = transcribe_audio(str(audio_file))
        
        output_file = output_dir / f"{audio_file.stem}_transcription.txt"
        save_transcription(result, output_file)
        print(f"Saved transcription to: {output_file}")

# Example usage
if __name__ == "__main__":
    # Set the path to your audio file
    audio_path = "speech_sample.mp3"
    
    # Check if CUDA is available for GPU acceleration
    cuda_available = torch.cuda.is_available()
    print(f"CUDA available: {cuda_available}")
    
    # Visualize the audio (optional)
    # visualize_audio(audio_path)
    
    # Transcribe the audio
    result = transcribe_audio(audio_path, model_size="base")
    
    # Print the transcription
    print("\nTRANSCRIPTION:")
    print(result["text"])
    
    # Save the transcription to a file
    output_file = save_transcription(result)
    print(f"\nSaved transcription to: {output_file}")
    
    # Example of batch processing
    # batch_transcribe("audio_folder", extension=".wav")

Download the speech sample here: https://files.cuantum.tech/audio/speech_sample.mp3

Note: Save the example audio in the same location as the Python script.

Breaking Down the Whisper Implementation:

1. Setup and Dependencies

The code begins by installing the necessary libraries: Whisper (directly from GitHub), librosa (for audio processing and visualization), matplotlib (for visualization), and numpy (for numerical operations). These libraries provide the foundation for audio processing and transcription.

2. Audio Visualization Function

The visualize_audio() function uses librosa to create two important visualizations:

  • A waveform display showing amplitude over time, which represents how the audio signal varies
  • A log-frequency spectrogram showing how energy is distributed across different frequencies over time, which helps analyze speech characteristics

These visualizations can help users understand the audio characteristics before transcription.

3. Core Transcription Function

The transcribe_audio() function is the heart of the implementation:

  • It accepts parameters for audio path, model size, language, and verbosity level
  • It loads the specified Whisper model (from tiny to large, with larger models being more accurate but slower)
  • It tracks processing time to provide performance metrics
  • It supports automatic language detection or allows specifying a language code
  • It returns a comprehensive result object containing the transcription and metadata

4. Results Processing

The save_transcription() function processes the Whisper results into user-friendly formats:

  • It saves the complete transcription text
  • It also extracts and formats individual segments with their timestamps, which is crucial for aligning transcription with audio timing
  • This enables applications like subtitle generation or time-synchronized content analysis
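
For example, a small hypothetical helper (not part of the script above) can turn those segments into a SubRip (.srt) subtitle file:

def segments_to_srt(segments, output_file="subtitles.srt"):
    """Write Whisper segments as a minimal SubRip (.srt) subtitle file."""
    def fmt(t):
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int(round((t - int(t)) * 1000))
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    with open(output_file, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            f.write(f"{i}\n{fmt(seg['start'])} --> {fmt(seg['end'])}\n{seg['text'].strip()}\n\n")
    return output_file

# Usage: segments_to_srt(result["segments"])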

5. Batch Processing Capability

The batch_transcribe() function extends the utility to handle multiple audio files:

  • It processes all audio files with a specified extension in a directory
  • It organizes outputs into a dedicated directory structure
  • This is valuable for transcribing podcasts, interview series, or lecture collections

6. Example Usage

The main execution block demonstrates how to use these functions in practice:

  • It checks for GPU acceleration via CUDA, which can significantly improve performance for larger models
  • It offers options for audio visualization (commented out by default)
  • It performs transcription and displays the results
  • It saves the output to a file for future reference
  • It includes a commented example of batch processing

Advanced Features:

This implementation goes beyond basic transcription by including:

  • Performance timing to measure processing efficiency
  • Language detection reporting
  • Segment-level transcription with timestamps
  • Hardware acceleration detection
  • Audio analysis capabilities
  • Batch processing for multiple files

This example implementation provides a complete workflow for audio transcription, from preprocessing through visualization, transcription, and results management, making it suitable for both individual use cases and larger-scale applications.

Example: Advanced implementation of Whisper for real-time transcription with visualization

import whisper
import numpy as np
import pyaudio
import threading
import time
import queue
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from collections import deque
import torch
import os
from datetime import datetime

class WhisperRealtimeTranscriber:
    def __init__(self, model_size="base", language="en", energy_threshold=0.01,
                 record_timeout=2, phrase_timeout=3, max_sentences=10):
        """
        Initialize the real-time transcriber with Whisper
        
        Parameters:
        - model_size: Size of Whisper model ("tiny", "base", "small", "medium", "large")
        - language: Language code or None for auto-detection
        - energy_threshold: Minimum RMS energy (on float32 samples in [-1, 1]) to treat audio as speech
        - record_timeout: Time in seconds to recheck if audio is speech
        - phrase_timeout: Time in seconds of silence to consider a phrase complete
        - max_sentences: Maximum number of sentences to display in history
        """
        self.model_name = model_size
        self.language = language
        self.energy_threshold = energy_threshold
        self.record_timeout = record_timeout
        self.phrase_timeout = phrase_timeout
        self.max_sentences = max_sentences
        
        # Check for GPU
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {self.device}")
        
        # Load Whisper model
        print(f"Loading Whisper {model_size} model...")
        self.model = whisper.load_model(model_size).to(self.device)
        print("Model loaded!")
        
        # Initialize audio processing
        self.audio_queue = queue.Queue()
        self.result_queue = queue.Queue()
        self.audio_data = np.zeros(0, dtype=np.float32)
        
        # For visualization
        self.audio_buffer = deque(maxlen=4 * 16000)  # ~4 seconds at 16kHz
        self.waveform_data = np.zeros(4 * 16000)
        self.spectrogram_data = np.zeros((201, 80))  # Mel spectrogram shape
        self.transcript_history = []
        self.recording = False
        self.terminated = False
        
        # Audio parameters
        self.sample_rate = 16000
        self.audio_format = pyaudio.paFloat32
        self.channels = 1
        self.chunk = 1024
        
        # Setup PyAudio
        self.p = pyaudio.PyAudio()
        
    def _get_audio_input_stream(self):
        """Create and return an input audio stream"""
        stream = self.p.open(
            format=self.audio_format,
            channels=self.channels,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk
        )
        return stream
    
    def _audio_capture_thread(self):
        """Thread function for capturing audio"""
        stream = self._get_audio_input_stream()
        last_sample = bytes()
        phrase_time = None
        
        print("Listening for audio...")
        
        try:
            while not self.terminated:
                # Get new audio chunk
                current_sample = stream.read(self.chunk, exception_on_overflow=False)
                
                # Convert to numpy array
                data = np.frombuffer(current_sample, dtype=np.float32)
                
                # Update audio buffer for visualization
                self.audio_buffer.extend(data)
                self.waveform_data = np.array(list(self.audio_buffer))
                
                # Calculate audio energy
                energy = np.sqrt(np.mean(data**2))
                
                # Detect if audio is speech
                if energy > self.energy_threshold:
                    self.recording = True
                    
                    # Reset phrase timeout
                    phrase_time = None
                    
                    # Add audio to processing queue
                    self.audio_data = np.append(self.audio_data, data)
                
                # Handle phrase timeout
                elif self.recording:
                    if phrase_time is None:
                        phrase_time = time.time()
                    
                    # If enough silence, process the audio phrase
                    if time.time() - phrase_time > self.phrase_timeout:
                        if len(self.audio_data) > 0:
                            self.audio_queue.put(self.audio_data.copy())
                            self.audio_data = np.zeros(0, dtype=np.float32)
                        
                        self.recording = False
                        phrase_time = None
                
                # Process fixed chunks of audio regardless of speech detection
                if len(self.audio_data) > self.sample_rate * self.record_timeout:
                    self.audio_queue.put(self.audio_data.copy())
                    self.audio_data = self.audio_data[int(self.sample_rate * self.record_timeout):]
                
                time.sleep(0.01)
                
        finally:
            stream.stop_stream()
            stream.close()
    
    def _transcription_thread(self):
        """Thread function for processing audio with Whisper"""
        while not self.terminated:
            try:
                # Get audio data from queue
                if self.audio_queue.empty():
                    time.sleep(0.1)
                    continue
                
                audio_data = self.audio_queue.get()
                
                # Skip processing very short audio clips
                if len(audio_data) < 0.5 * self.sample_rate:
                    continue
                
                # Process audio with Whisper
                start_time = time.time()
                
                # Generate Mel spectrogram for visualization
                # (whisper.log_mel_spectrogram returns an 80 x n_frames tensor)
                mel = whisper.log_mel_spectrogram(audio_data)
                self.spectrogram_data = mel.T.numpy()  # Transpose for display
                
                # Transcribe with Whisper
                options = {"language": self.language} if self.language else {}
                result = self.model.transcribe(audio_data, **options)
                
                # Get transcription result
                text = result["text"].strip()
                elapsed = time.time() - start_time
                
                # Skip empty results
                if len(text) == 0:
                    continue
                
                # Add timestamp and transcription to history
                timestamp = datetime.now().strftime("%H:%M:%S")
                entry = f"[{timestamp}] {text}"
                self.transcript_history.append(entry)
                
                # Keep only most recent entries
                if len(self.transcript_history) > self.max_sentences:
                    self.transcript_history = self.transcript_history[-self.max_sentences:]
                
                # Print result
                print(f"Transcribed ({elapsed:.2f}s): {text}")
                
            except Exception as e:
                print(f"Error in transcription thread: {e}")
                
    def _update_visualization(self, frame):
        """Update function for matplotlib animation"""
        # Clear previous plots
        plt.clf()
        
        # Plot audio waveform
        plt.subplot(3, 1, 1)
        plt.plot(self.waveform_data)
        plt.title("Audio Waveform")
        plt.ylim([-0.5, 0.5])
        
        # Plot status
        if self.recording:
            plt.gca().set_facecolor((0.9, 0.9, 1))
            plt.title("Audio Waveform - RECORDING")
        
        # Plot Mel spectrogram
        plt.subplot(3, 1, 2)
        plt.imshow(self.spectrogram_data, aspect='auto', origin='lower')
        plt.title("Mel Spectrogram")
        plt.tight_layout()
        
        # Show transcript history
        plt.subplot(3, 1, 3)
        plt.axis('off')
        history_text = "\n".join(self.transcript_history)
        plt.text(0.05, 0.95, history_text, 
                 verticalalignment='top', wrap=True, fontsize=9,
                 bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.2))
        plt.title("Transcript History")
        
        # Adjust layout
        plt.subplots_adjust(hspace=0.5)
        
    def start(self, visualize=True):
        """Start the real-time transcription system"""
        # Start audio capture thread
        audio_thread = threading.Thread(target=self._audio_capture_thread)
        audio_thread.daemon = True
        audio_thread.start()
        
        # Start transcription thread
        transcription_thread = threading.Thread(target=self._transcription_thread)
        transcription_thread.daemon = True
        transcription_thread.start()
        
        try:
            if visualize:
                # Set up visualization
                plt.figure(figsize=(10, 8))
                ani = FuncAnimation(plt.gcf(), self._update_visualization, interval=100)
                plt.show()
            else:
                # Just keep the main thread alive
                while True:
                    time.sleep(1)
        except KeyboardInterrupt:
            print("Stopping...")
        finally:
            self.terminated = True
            self.p.terminate()

    def save_transcript(self, filename=None):
        """Save the transcript history to a file"""
        if filename is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"transcript_{timestamp}.txt"
        
        with open(filename, "w", encoding="utf-8") as f:
            for entry in self.transcript_history:
                f.write(f"{entry}\n")
        
        print(f"Transcript saved to {filename}")

# Example usage
if __name__ == "__main__":
    # Create and start the transcriber
    transcriber = WhisperRealtimeTranscriber(
        model_size="base",
        language="en",
        energy_threshold=0.01,
        record_timeout=2,
        phrase_timeout=1
    )
    
    try:
        transcriber.start(visualize=True)
    except KeyboardInterrupt:
        pass
    finally:
        transcriber.save_transcript()

Note: To use this example code, you'll need to speak into your microphone during program execution.

Breaking Down the Real-Time Whisper Implementation:

1. Overall Architecture

This advanced implementation creates a real-time speech transcription system using Whisper. Unlike the previous example that processes existing files, this version:

  • Captures live audio input from a microphone
  • Processes audio in chunks as it arrives
  • Provides real-time visualization of the audio signal and transcription
  • Runs Whisper inference continuously on a separate thread

2. Class Structure and Initialization

The WhisperRealtimeTranscriber class encapsulates the entire system:

  • Manages multiple threads for audio capture and processing
  • Maintains queues for communication between threads
  • Configures parameters like energy thresholds for speech detection
  • Initializes visualization components including waveform and spectrogram displays
  • Sets up the Whisper model with GPU acceleration when available

3. Audio Capture System

The _audio_capture_thread method handles continuous audio input:

  • Uses PyAudio to access the microphone stream
  • Implements energy-based voice activity detection to identify speech
  • Manages "phrases" by detecting pauses between speech segments
  • Updates a circular buffer for visualization purposes
  • Queues detected speech for transcription processing

4. Whisper Transcription Engine

The _transcription_thread implements the core speech-to-text functionality:

  • Retrieves audio segments from the queue when available
  • Filters out audio clips that are too short
  • Generates mel spectrograms for both transcription and visualization
  • Runs the Whisper model inference to convert speech to text
  • Maintains a transcript history with timestamps
  • Measures and reports processing time for performance monitoring

5. Real-Time Visualization

The _update_visualization method creates an interactive dashboard:

  • Displays the audio waveform with recording status indicator
  • Shows the mel spectrogram representation used by Whisper
  • Provides a scrolling transcript history panel
  • Updates dynamically using Matplotlib's animation functionality

6. User Interface and Control Flow

The start method orchestrates the system operation:

  • Launches audio capture and transcription threads
  • Sets up the visualization if enabled
  • Handles clean shutdown on user interruption

7. Practical Applications

This implementation offers several advantages over the previous example:

  • Live transcription: Process speech as it happens rather than from files
  • Continuous operation: Run indefinitely for real-time applications
  • Visual feedback: See both the audio signal and the corresponding transcription
  • Speech detection: Automatically identify when someone is speaking
  • Performance monitoring: Track processing times to optimize for real-time use

8. Use Cases for Real-Time Whisper

This implementation is particularly useful for:

  • Live captioning for presentations or meetings
  • Real-time transcription for accessibility purposes
  • Interactive voice-controlled applications
  • Speech analytics and monitoring systems
  • Educational tools showing the relationship between speech and its transcription

9. Technical Considerations

The implementation addresses several challenges:

  • Balancing latency vs. accuracy through parameter tuning
  • Managing computational resources with threading
  • Providing visual feedback without affecting performance
  • Detecting speech vs. silence for efficient processing
  • Formatting and storing transcription results

This real-time implementation represents a significant enhancement over batch processing, enabling interactive applications where immediate transcription is required.

5.2.2 SpeechLM and SpeechGPT: Language Models that Listen

While Whisper excels at ASR (Automatic Speech Recognition), models like SpeechLM and SpeechGPT go a step further: they integrate speech and text into a single transformer framework. This represents a fundamental shift from traditional approaches where speech processing and text understanding were handled by completely separate systems.

This integration is revolutionary because it allows these models to process both modalities simultaneously rather than treating them as separate processing pipelines. By unifying speech and text in the same architecture, these models can leverage contextual information across modalities, resulting in more coherent and contextually appropriate responses. The direct connection between acoustic patterns and semantic meaning enables these models to capture nuances like tone, emphasis, and rhythm that might be lost in a pipeline approach.

To understand the significance of this advancement, consider how traditional speech systems work: first, an ASR component converts audio to text transcripts, then a separate natural language processing (NLP) system analyzes the transcript. Each transition between systems creates an opportunity for information loss. Important acoustic features like speaker emotion, sarcasm detection, or emphasis on specific words are typically stripped away during the initial transcription step.

SpeechLM and SpeechGPT, in contrast, maintain a continuous representation of the speech signal throughout the entire processing chain. This approach preserves crucial paralinguistic information—the non-verbal aspects of communication that often carry significant meaning. For instance, the same phrase spoken with different intonation patterns might convey completely different intentions, from sincere agreement to sarcastic dismissal. By keeping the acoustic signal and its linguistic interpretation linked throughout processing, these models can detect such subtleties.

The technical architecture enabling this integration typically involves specialized encoder modules that process raw audio waveforms or spectrograms into dense vector representations. These speech embeddings are then projected into the same latent space as text embeddings, allowing the transformer's attention mechanisms to establish connections between corresponding elements in both modalities. This cross-modal attention is the key innovation that enables these models to "listen" in a more human-like way.

Unlike traditional systems where speech is first converted to text and then processed by a language model (creating potential information loss at each step), these unified models maintain the richness of the original speech signal throughout processing. This preserves important paralinguistic features such as emotion, speaker identity, and conversational dynamics that are crucial for truly understanding spoken language in context.
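
The following PyTorch sketch illustrates this cross-modal attention pattern in miniature. It is a conceptual toy, not the actual SpeechLM or SpeechGPT architecture; the dimensions and module names are assumptions.

import torch
import torch.nn as nn

class SpeechTextFusion(nn.Module):
    """Project speech features into the text embedding space and let text tokens attend over them."""
    def __init__(self, speech_dim=512, text_dim=768, n_heads=8):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, text_dim)  # map audio features into the text space
        self.cross_attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)

    def forward(self, speech_feats, text_embeds):
        # speech_feats: (batch, audio_frames, speech_dim) from an audio encoder (e.g., wav2vec2)
        # text_embeds:  (batch, text_tokens, text_dim) from the language model's embedding layer
        speech_in_text_space = self.speech_proj(speech_feats)
        # Text tokens query the projected audio frames (cross-modal attention)
        fused, _ = self.cross_attn(query=text_embeds,
                                   key=speech_in_text_space,
                                   value=speech_in_text_space)
        return fused

# Toy usage with random tensors
fusion = SpeechTextFusion()
audio = torch.randn(1, 200, 512)   # ~200 audio frames
text = torch.randn(1, 16, 768)     # 16 text tokens
print(fusion(audio, text).shape)   # torch.Size([1, 16, 768])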

SpeechLM (Microsoft):

Pretrained on paired audio–text data, allowing it to develop rich representations that capture both acoustic and linguistic information. This dual-modality training approach enables the model to understand not just what words are being said, but also how they're being said, including tone, emphasis, and speaker characteristics. The model processes raw audio waveforms alongside corresponding transcripts, learning to associate specific acoustic patterns with their semantic meanings. For example, it can distinguish between a question and a statement based on rising or falling intonation, even when the words are identical.

Learns to align acoustic features with linguistic tokens through innovative cross-modal attention mechanisms that map speech patterns to their textual representations. This alignment process creates a shared semantic space where speech and text can interact seamlessly, enabling more accurate interpretation of spoken language. These mechanisms work by establishing bidirectional connections between audio segments and corresponding text tokens, allowing information to flow freely across modalities. When processing a sentence, the model can simultaneously attend to both the acoustic signal and the linguistic structure, creating a unified representation that preserves both aspects of communication.

Supports tasks like speech-to-text, spoken translation, and speech understanding, with superior performance compared to pipeline approaches due to its end-to-end training methodology. By training all components together, SpeechLM avoids error propagation issues common in traditional pipeline systems where mistakes in early stages cascade through the system. In conventional approaches, if the ASR component misrecognizes a word, all downstream components (like translation or understanding) inherit that error. SpeechLM's unified approach allows later processing stages to potentially compensate for earlier uncertainties by leveraging broader contextual information and cross-modal cues, similar to how humans can understand slightly mispronounced words in context.

Utilizes self-supervised learning techniques to maximize learning from limited paired data, enabling robust performance even with limited annotations. These techniques include masked language modeling adapted for speech inputs, contrastive learning between speech and text representations, and consistency regularization across modalities. During training, the model might be presented with an audio segment where certain portions are masked out, requiring it to predict the missing acoustic information based on surrounding context and any available text. Similarly, it learns to minimize the distance between representations of the same content expressed in different modalities, helping to align the speech and text embedding spaces. This approach allows SpeechLM to leverage large quantities of unpaired speech or text data alongside smaller amounts of parallel data.
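
As a rough illustration of such a contrastive alignment objective (the shapes, batch convention, and temperature value are assumptions, not SpeechLM's actual training code):

import torch
import torch.nn.functional as F

def speech_text_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss that pulls paired speech/text embeddings together."""
    # speech_emb, text_emb: (batch, dim), one speech/text pair per row
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.T / temperature      # similarity of every speech/text pairing
    targets = torch.arange(speech_emb.size(0))          # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2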

Incorporates advanced contextual understanding that allows it to better handle ambiguous speech, speaker variations, and noisy environments compared to traditional ASR systems. By maintaining a rich contextual representation throughout processing, SpeechLM can disambiguate homophones (words that sound alike but have different meanings) based on broader semantic context, adapt to different accents and speaking styles by recognizing patterns across larger segments of speech, and filter out background noise by distinguishing relevant speech patterns from irrelevant acoustic information. The model's attention mechanisms can focus on the most informative parts of the signal while de-emphasizing distracting elements, similar to how humans can follow a conversation in a crowded room—often called the "cocktail party effect."

SpeechGPT:

Extends LLMs to work directly with speech as input/output, eliminating the need for separate ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) systems in conversational applications. Traditional conversational AI systems typically require a pipeline approach where speech is first converted to text, processed by a language model, and then converted back to speech for the response. This multi-stage process introduces latency at each conversion point and often loses important acoustic information along the way.

SpeechGPT, however, integrates these components into a unified architecture that processes speech signals end-to-end. This direct integration enables smoother conversational flow, as the model processes speech signals directly without converting to intermediate text representations, reducing latency and preserving acoustic nuances that might be lost in traditional pipeline approaches. By maintaining the integrity of the original speech signal throughout processing, SpeechGPT can detect subtle variations in tone, rhythm, and emphasis that carry important communicative information beyond the literal words being spoken.

Can transcribe, understand, and even generate spoken dialogue in a unified framework, maintaining conversational context across multiple turns. Unlike traditional systems that process each utterance independently, SpeechGPT maintains a continuous memory of the conversation, allowing it to reference previous statements and generate contextually appropriate responses that acknowledge shared history between speakers.

This contextual awareness means the model can track topics across multiple exchanges, resolve ambiguous references, and respond appropriately to follow-up questions without requiring users to restate information. For example, if a user asks about the weather today and then follows up with "What about tomorrow?", SpeechGPT can understand that the second question is still about weather without explicit specification. This ability to maintain conversational state mirrors human dialogue patterns where context is implicitly understood and carried forward, creating more natural and efficient interactions.

Useful for conversational agents that naturally handle both modalities, creating more human-like interactions without modal switching delays. This seamless integration between speech and text processing mimics human communication patterns where we naturally shift between listening and speaking without conscious mode switching, enabling more fluid and natural dialogues with AI systems. In practice, this means users can speak directly to the system and receive spoken responses without perceiving any translation happening behind the scenes.

For applications like virtual assistants, customer service bots, or educational tutors, this creates a significantly more natural user experience that reduces cognitive load on users. The elimination of perceptible modal transitions also increases accessibility for users who may struggle with text interfaces, such as those with visual impairments, reading difficulties, or situations where looking at a screen is impractical (like while driving).

Demonstrates improved prosody and intonation in generated speech by leveraging the semantic understanding capabilities of the underlying LLM. By comprehending the meaning, emotion, and pragmatic intent behind responses, SpeechGPT can apply appropriate stress patterns, rhythm variations, and tonal shifts that convey not just what is being said, but how it should be said to effectively communicate meaning. This represents a significant advance over traditional TTS systems that often produce flat, monotonous speech that lacks the natural variations human speakers use to express meaning.

For instance, when expressing excitement, the system can increase pitch and speed; when conveying serious information, it can adopt a more measured pace with appropriate emphasis on key points. These prosodic features are crucial for effective communication, as they help listeners interpret the speaker's intentions, distinguish between questions and statements, identify important information, and understand emotional context. The ability to generate appropriately expressive speech makes interactions feel more natural and helps ensure that the intended meaning is accurately conveyed to users.

How They Work:

  1. Convert audio into speech embeddings using a feature extractor (like wav2vec2), which captures phonetic, prosodic, and speaker information from raw waveforms into dense vector representations. This process transforms complex audio signals into numerical matrices that preserve crucial linguistic features including pronunciation patterns, speech rhythm, emotional tone, and individual voice characteristics. The resulting embeddings create a mathematical representation of speech that models can process efficiently while maintaining the rich acoustic properties of the original audio.
  2. Align embeddings with text tokens in the transformer through cross-attention mechanisms, creating a joint representation space where acoustic and linguistic features can interact freely. These mechanisms allow the model to establish connections between corresponding elements in both modalities, mapping specific acoustic patterns to their textual counterparts. This alignment process creates bidirectional pathways that enable information to flow between speech and text representations, facilitating tasks like spoken language understanding where both the content and delivery of speech matter.
  3. Train on tasks that require both listening and understanding, such as answering questions about spoken content or following verbal instructions, to develop robust multimodal comprehension abilities. This training approach forces the model to process auditory and textual information simultaneously, extracting meaning from both channels and integrating them into a unified semantic representation. By presenting the model with increasingly complex spoken language understanding challenges, it learns to recognize not just what words are being said, but also how context, emphasis, and tone modify their meaning.
  4. Utilize specialized loss functions that encourage semantic consistency between speech and text representations, ensuring that information is preserved across modality boundaries (a minimal sketch of such a loss appears after this list). These loss functions compare the model's internal representations of the same content expressed in different modalities and penalize inconsistencies, driving the model to develop aligned feature spaces. By minimizing the distance between representations of equivalent content across modalities, these functions help the model build a cohesive understanding regardless of whether information arrives as text or speech.
  5. Employ curriculum learning strategies that gradually increase task complexity, starting with simple speech recognition before progressing to more complex understanding and generation tasks. This staged approach begins with basic transcription to establish fundamental audio-text mappings, then advances to more sophisticated tasks like identifying speaker intent, recognizing emotion, and generating contextually appropriate responses to spoken queries. The progressive difficulty helps the model develop a hierarchy of speech understanding capabilities, from low-level acoustic processing to high-level semantic interpretation.
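
To make step 4 concrete, here is a minimal, hypothetical sketch of a cross-modal consistency loss. It assumes we already have paired utterance-level embeddings from an audio encoder and a text encoder (random tensors stand in for them below), and it uses an InfoNCE-style contrastive objective; it is not the exact loss used by SpeechLM or SpeechGPT.

import torch
import torch.nn.functional as F

def consistency_loss(speech_vecs, text_vecs, temperature=0.07):
    """Contrastive loss that pulls matching speech/text pairs together
    and pushes mismatched pairs apart (InfoNCE-style sketch)."""
    speech = F.normalize(speech_vecs, dim=-1)
    text = F.normalize(text_vecs, dim=-1)
    logits = speech @ text.t() / temperature      # pairwise cosine similarities
    targets = torch.arange(speech.size(0))        # i-th speech matches i-th text
    # Symmetric: speech-to-text and text-to-speech directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: random embeddings stand in for real encoder outputs
speech_vecs = torch.randn(8, 512)
text_vecs = torch.randn(8, 512)
print(f"Consistency loss: {consistency_loss(speech_vecs, text_vecs).item():.4f}")

Minimizing a quantity like this during training pushes the speech and text embeddings of the same utterance toward each other, which is one common way to realize the semantic consistency described above.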

Code Example: SpeechLM Implementation

from transformers import AutoProcessor, SpeechLMForSpeechToText
import torch
import soundfile as sf
import librosa

# Load pretrained SpeechLM model and processor
# (model ID and class name below are illustrative; substitute an available speech-to-text checkpoint)
model_id = "microsoft/speechlm-large-960h"
processor = AutoProcessor.from_pretrained(model_id)
model = SpeechLMForSpeechToText.from_pretrained(model_id)

# Load and preprocess audio file
audio_file = "speechlm-example.mp3"
speech, sample_rate = sf.read(audio_file)

# Resample if necessary
if sample_rate != 16000:
    speech = librosa.resample(speech, orig_sr=sample_rate, target_sr=16000)
    sample_rate = 16000

# Prepare inputs
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

# Generate transcription
with torch.no_grad():
    generated_ids = model.generate(
        inputs["input_features"],
        num_beams=5,
        max_length=100
    )

# Decode the output tokens
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

# For speech understanding tasks, we can also get embeddings
with torch.no_grad():
    outputs = model(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        decoder_input_ids=processor.get_decoder_prompt_ids(task="asr")
    )
    
    # Get speech embeddings from the encoder
    speech_embeddings = outputs.encoder_last_hidden_state
    print(f"Speech embeddings shape: {speech_embeddings.shape}")
    
    # These embeddings can be used for downstream tasks like
    # speaker identification, emotion recognition, or semantic analysis

Download the speech sample here: https://files.cuantum.tech/audio/speechlm-example.mp3

Note: Save the example audio in the same location as the Python script.

Code Breakdown: SpeechLM Implementation

This SpeechLM code example demonstrates how to use Microsoft's SpeechLM model for speech transcription and understanding. Let's examine each component:

  1. Imports and Model Loading: The code imports necessary libraries and loads the pretrained SpeechLM model and processor from Hugging Face. SpeechLM is a speech-language model that can process raw audio waveforms and perform tasks like transcription and understanding.
  2. Audio Processing: The audio file is loaded using soundfile and potentially resampled to 16kHz (the standard sampling rate expected by most speech models). This preprocessing ensures the audio input matches the format expected by the model regardless of the source recording conditions.
  3. Input Preparation: The processor converts the raw audio waveform into the model's expected input format. This includes extracting acoustic features (similar to spectrograms) and preparing attention masks to handle variable-length inputs. These features capture the phonetic and prosodic information from the speech signal.
  4. Transcription Generation: The model.generate() method performs beam search decoding to convert the audio features into text. This process uses the model's encoder-decoder architecture to map speech representations to text tokens. The num_beams parameter controls how many alternative hypotheses the model considers during decoding, while max_length limits the output length.
  5. Decoding: The processor.batch_decode() function converts the generated token IDs back into human-readable text, removing any special tokens (like padding or end-of-sequence markers) that are used internally by the model but aren't part of the actual transcription.
  6. Speech Embeddings Extraction: Beyond simple transcription, the code demonstrates how to access the model's internal representations of speech. The encoder_last_hidden_state contains rich contextual embeddings that capture both acoustic and linguistic properties of the speech. These embeddings preserve paralinguistic features (tone, emphasis, emotion) that might be lost in text transcription.

Technical Insights on SpeechLM's Architecture

SpeechLM represents a significant advancement in speech processing for several reasons:

  • Unified encoder-decoder architecture: Unlike pipeline approaches that separate ASR and language understanding, SpeechLM processes the entire speech-to-meaning pathway in a single model, reducing error propagation between components.
  • Contextual understanding: The transformer architecture allows the model to capture long-range dependencies in speech, helping it understand content based on the broader context rather than just isolated segments.
  • Cross-modal pretraining: SpeechLM is pretrained on paired speech-text data, allowing it to develop aligned representations between acoustic and linguistic features. This alignment enables more accurate transcription and understanding of spoken language.
  • Speech embeddings: The model's encoder produces contextualized speech embeddings that preserve both linguistic content and paralinguistic features (like speaker identity, emotion, and emphasis). These rich representations can be used for downstream tasks beyond basic transcription.

Practical Applications

The speech embeddings extracted in the example could be used for several downstream tasks (a minimal classifier sketch follows this list):

  • Speaker recognition: Identifying who is speaking based on voice characteristics preserved in the embeddings.
  • Emotion detection: Analyzing the emotional tone of speech from acoustic patterns.
  • Intent classification: Determining what the speaker wants to accomplish (ask a question, make a request, etc.).
  • Speech translation: Converting speech in one language to text in another by connecting the speech embeddings to a translation model.
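
As a hedged illustration of such downstream use, the sketch below attaches a simple linear classification head to mean-pooled speech embeddings. The hidden size (768) and the four-class label set are assumptions for the example, and a random tensor stands in for the encoder output captured earlier; in practice you would train the head on labeled clips.

import torch
import torch.nn as nn

HIDDEN_DIM = 768    # assumed encoder hidden size (typical for base-sized models)
NUM_CLASSES = 4     # e.g., four illustrative emotion categories

class LinearProbe(nn.Module):
    """Minimal classification head over mean-pooled speech embeddings."""
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, embeddings):           # embeddings: [batch, time, hidden]
        pooled = embeddings.mean(dim=1)      # utterance-level vector: [batch, hidden]
        return self.classifier(pooled)       # logits: [batch, num_classes]

probe = LinearProbe(HIDDEN_DIM, NUM_CLASSES)
dummy_embeddings = torch.randn(2, 120, HIDDEN_DIM)   # stand-in for encoder_last_hidden_state
logits = probe(dummy_embeddings)
print(logits.shape)   # torch.Size([2, 4])

Trained with a standard cross-entropy loss, the same pattern covers speaker, emotion, or intent classification; only the label set changes.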

SpeechLM represents an important step toward truly integrated speech-language models that process spoken language in a more human-like way, maintaining the rich acoustic information that gives speech its nuanced meaning beyond just the words being said.

Code Example: SpeechGPT Implementation

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Load pretrained SpeechGPT model and processor
model_id = "microsoft/speech_gpt2_oaitr"  # Note: This is an example model ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)

# Function to handle conversational speech input and output
def speech_conversation(audio_path, conversation_history=None):
    # Load audio file
    waveform, sample_rate = torchaudio.load(audio_path)
    
    # Resample if necessary (SpeechGPT typically expects 16kHz audio)
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)
        sample_rate = 16000
    
    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)
    
    # Process audio input
    inputs = processor(
        audio=waveform.squeeze().numpy(),
        sampling_rate=sample_rate,
        return_tensors="pt",
        conversation_history=conversation_history
    )
    
    # Generate response
    with torch.no_grad():
        output = model.generate(
            input_features=inputs["input_features"],
            attention_mask=inputs.get("attention_mask"),
            max_length=100,
            num_beams=5,
            early_stopping=True,
            conversation_history=inputs.get("conversation_history")
        )
    
    # Decode the model's generated reply into text
    transcription = processor.decode(output[0], skip_special_tokens=True)
    
    # Optional: Convert response to speech
    speech_output = model.generate_speech(
        output,
        speaker_embeddings=inputs.get("speaker_embeddings")
    )
    
    # Save the generated speech
    torchaudio.save(
        "response.wav", 
        speech_output.squeeze().unsqueeze(0), 
        16000
    )
    
    # Update conversation history
    new_conversation_history = {
        "input_speech": waveform.squeeze().numpy(),
        "output_text": transcription,
        "output_speech": speech_output.squeeze().numpy()
    }
    
    if conversation_history:
        conversation_history.append(new_conversation_history)
    else:
        conversation_history = [new_conversation_history]
    
    return transcription, speech_output, conversation_history

# Example usage
if __name__ == "__main__":
    # Start a new conversation
    conversation_history = None
    
    # First interaction
    user_query = "user_question_.mp3"  # Path to audio file with user's question
    response_text, response_audio, conversation_history = speech_conversation(
        user_query, conversation_history
    )
    
    print(f"User (transcribed): {response_text}")
    
    # Second interaction (with conversation history for context)
    follow_up_query = "user_follow_up.mp3"  # Path to follow-up question audio
    response_text2, response_audio2, conversation_history = speech_conversation(
        follow_up_query, conversation_history
    )
    
    print(f"User follow-up (transcribed): {response_text2}")

Download the user question audio sample here: https://files.cuantum.tech/audio/user_question_.mp3

Download the user follow-up audio sample here: https://files.cuantum.tech/audio/user_follow_up.mp3

Note: Save the example audios in the same location as the Python script.

Code Breakdown: SpeechGPT Implementation

  1. Imports and Model Loading: The code imports PyTorch, torchaudio, and Hugging Face transformers to work with the SpeechGPT model. We load a pretrained model and processor that can handle both speech input and output in a conversational context.
  2. Conversation Function: The speech_conversation function serves as the core component, handling the entire speech-to-speech conversation flow. It takes an audio path and optional conversation history as inputs.
  3. Audio Preprocessing: The function loads the audio file using torchaudio, ensures it's at the required 16kHz sample rate (resampling if necessary), and converts stereo to mono if needed. These preprocessing steps ensure the audio meets the model's input requirements.
  4. Input Processing: The processor converts the raw audio waveform into the feature representations expected by SpeechGPT. Importantly, it includes the conversation history parameter, which allows the model to maintain context across multiple turns.
  5. Response Generation: The model generates a response based on the speech input and conversation context. The generation parameters control the quality and length of the response: 
    • max_length: Limits the response length
    • num_beams: Uses beam search with 5 beams for better quality responses
    • early_stopping: Terminates generation when all beams reach an end token
    • conversation_history: Provides context from previous exchanges
  6. Speech Synthesis: Unlike traditional models that would require a separate TTS system, SpeechGPT can directly generate speech output from its internal representations. The generate_speech method converts the text response into audio, maintaining speaker characteristics if provided.
  7. Conversation State Management: The function tracks conversation history by storing each exchange (input speech, output text, output speech) in a structured format. This history is passed to subsequent calls, enabling the model to reference previous information.
  8. Example Usage: The code demonstrates a two-turn conversation, showing how to: 
    • Start a new conversation (empty history)
    • Process the first user query
    • Maintain conversation context
    • Handle a follow-up question while preserving context

Technical Insights on SpeechGPT Architecture

SpeechGPT represents a significant advancement in speech-language models by integrating several key architectural innovations:

  • End-to-end speech-to-speech framework: Unlike traditional pipeline approaches that separate ASR, language understanding, and TTS components, SpeechGPT unifies these capabilities in a single model, reducing latency and error propagation.
  • Joint speech-text representation: The model learns a shared embedding space for both speech and text, allowing for seamless transitions between modalities without information loss. This joint representation enables the model to maintain the emotional and prosodic elements of speech alongside semantic content.
  • Conversation-aware transformer: SpeechGPT extends the standard transformer architecture with additional mechanisms to track conversation state and maintain coherence across multiple turns. This includes specialized attention layers that can reference previous exchanges.
  • Prosody modeling: The speech generation component preserves natural intonation, rhythm, and emphasis patterns by incorporating prosodic features into the generation process. This results in more human-like speech output compared to traditional TTS systems.

Key Advantages Over Traditional Speech Systems

  • Contextual understanding: SpeechGPT maintains conversation state across multiple turns, allowing it to handle follow-up questions, resolve references, and build on previous exchanges without requiring users to restate context.
  • Seamless modality transitions: The unified architecture eliminates perceptible delays between speech understanding and response generation, creating more natural conversational flow.
  • Expressive speech generation: By leveraging its language understanding capabilities, SpeechGPT can apply appropriate prosody and intonation that matches the semantic and emotional content of responses.
  • Reduced latency: The end-to-end design eliminates the computational overhead of separate ASR, NLU, and TTS systems, enabling faster response times in interactive applications.

Practical Applications

SpeechGPT's unified speech-language capabilities make it particularly well-suited for:

  • Virtual assistants: Creating more natural and contextually aware voice interfaces for smart devices and applications.
  • Accessibility tools: Developing conversation systems for users with visual impairments or those who prefer speech interfaces.
  • Language learning: Building interactive tutors that can engage in spoken dialogue while maintaining context across a learning session.
  • Customer service: Powering voice bots that can handle complex, multi-turn conversations with natural speech patterns.

The integration of speech and language understanding in a single model represents a significant step toward more human-like AI communication systems that can engage in natural conversation across modalities.

Code Example: Extracting Speech Features with wav2vec2 (Hugging Face)

from transformers import Wav2Vec2Processor, Wav2Vec2Model, Wav2Vec2ForCTC
import torch
import soundfile as sf
import librosa
import matplotlib.pyplot as plt
import numpy as np
import IPython.display as ipd

# Function to load and preprocess audio
def load_audio(file_path, target_sr=16000):
    """
    Load audio file and resample if necessary
    """
    # Load audio using librosa (handles various formats better)
    try:
        audio, sample_rate = librosa.load(file_path, sr=None)
        # Resample if needed
        if sample_rate != target_sr:
            print(f"Resampling from {sample_rate}Hz to {target_sr}Hz")
            audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=target_sr)
            sample_rate = target_sr
        return audio, sample_rate
    except Exception as e:
        print(f"Error loading audio: {e}")
        return None, None

# Load pretrained speech model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# Also load ASR model for transcription demo
asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load audio file
speech, rate = load_audio("speech_sample_w.mp3")
if speech is not None:
    # Display audio waveform
    plt.figure(figsize=(10, 4))
    plt.plot(speech)
    plt.title("Audio Waveform")
    plt.xlabel("Time (samples)")
    plt.ylabel("Amplitude")
    plt.show()
    
    # Display audio for listening
    ipd.display(ipd.Audio(speech, rate=rate))
    
    # Process audio for feature extraction
    inputs = processor(speech, sampling_rate=rate, return_tensors="pt", padding=True)
    
    # Extract embeddings
    with torch.no_grad():
        # Get the hidden states (embeddings)
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state
        
        # Also get the transcription from ASR model
        logits = asr_model(**inputs).logits
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = processor.batch_decode(predicted_ids)[0]
    
    print("Audio transcription:", transcription)
    print("Shape of embeddings:", embeddings.shape)  # [batch, time, hidden_dim]
    
    # Visualize embeddings
    # Take mean across time dimension to get a single vector per feature
    mean_embeddings = embeddings.mean(dim=1).squeeze().numpy()
    
    plt.figure(figsize=(12, 6))
    plt.imshow(mean_embeddings.reshape(1, -1), aspect='auto', cmap='viridis')
    plt.colorbar()
    plt.title("Speech Embeddings Visualization")
    plt.xlabel("Feature Dimensions")
    plt.ylabel("Sample")
    plt.show()
    
    # Demonstrate feature extraction for downstream tasks
    # Example: Extract global speech representation (average pooling)
    global_speech_vector = embeddings.mean(dim=1)
    print("Global speech vector shape:", global_speech_vector.shape)  # [batch, hidden_dim]
    
    # Example: Extract frame-level features for a specific segment (middle 1 second)
    middle_frame = embeddings.shape[1] // 2
    segment_features = embeddings[0, middle_frame-25:middle_frame+25, :]  # ~1 second at 50Hz frame rate
    print("Segment features shape:", segment_features.shape)  # [frames, hidden_dim]
else:
    print("Failed to load audio file. Please check the path and file format.")

Download the audio sample here: https://files.cuantum.tech/audio/speech_sample_w.mp3

Note: Save the example audio in the same location as the Python script.

Comprehensive Code Breakdown: Speech Feature Extraction with Wav2Vec2

  • 1. Imports and Setup
    • We import the necessary libraries: Transformers for the Wav2Vec2 models, PyTorch for tensor operations, soundfile/librosa for audio processing, and visualization tools.
    • We include both the base Wav2Vec2Model (for embeddings) and Wav2Vec2ForCTC (for transcription) to demonstrate multiple use cases.
  • 2. Audio Loading and Preprocessing
    • The load_audio function handles various audio formats and automatically resamples to 16kHz if necessary (Wav2Vec2's expected sample rate).
    • Using librosa instead of soundfile provides better support for various audio formats and error handling.
  • 3. Model Initialization
    • We load the pretrained Wav2Vec2 processor and model from Hugging Face's model hub.
    • The processor handles tokenization of audio data into the format expected by the model.
    • We also load the ASR variant of the model to demonstrate speech recognition capabilities.
  • 4. Visualization
    • We plot the audio waveform to provide visual insight into the signal being processed.
    • We use IPython's audio display capabilities to allow for listening to the audio directly in notebooks.
  • 5. Feature Extraction
    • The processor converts the raw audio into the input format required by the model.
    • With torch.no_grad(), we ensure no gradients are computed during inference, saving memory.
    • We extract the last_hidden_state which contains the contextualized audio embeddings.
  • 6. Transcription
    • Using the ASR model variant, we convert the same audio input into text.
    • This demonstrates how the same audio features can be used for multiple downstream tasks.
  • 7. Embedding Visualization and Analysis
    • We visualize the embeddings using a heatmap to give insight into the feature patterns.
    • We demonstrate two common ways to use the embeddings:
      • Global representation: averaging across time to get a single vector representing the entire utterance (useful for speaker identification, emotion recognition, etc.)
      • Frame-level features: extracting time-aligned segments for fine-grained analysis (useful for alignment, pronunciation assessment, etc.)
  • 8. Error Handling
    • The code includes basic error handling to gracefully deal with issues like missing files or unsupported formats.

Technical Insights: Why This Approach Matters

  • Wav2Vec2 is a self-supervised model trained on massive amounts of unlabeled speech data, allowing it to learn robust speech representations without requiring transcriptions.
  • The extracted embeddings capture phonetic content, speaker characteristics, emotional tone, and acoustic environment information in a unified representation.
  • These embeddings serve as excellent features for downstream tasks like speech recognition, speaker identification, and emotion classification; a short similarity sketch follows this list.
  • The contextual nature of the embeddings (each frame is influenced by surrounding audio) makes them more powerful than traditional acoustic features like MFCCs.
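
As a quick, hedged illustration, the sketch below mean-pools wav2vec2 hidden states into one vector per utterance and compares two clips with cosine similarity. The file names clip_a.wav and clip_b.wav are placeholders; this is only a rough proxy for speaker or style similarity, not a trained speaker-verification system.

import torch
import torch.nn.functional as F
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2Model

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

def utterance_vector(path):
    """Mean-pool wav2vec2 hidden states into a single utterance-level vector."""
    audio, _ = librosa.load(path, sr=16000)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # [1, time, hidden]
    return hidden.mean(dim=1)                        # [1, hidden]

vec_a = utterance_vector("clip_a.wav")   # placeholder file names
vec_b = utterance_vector("clip_b.wav")
similarity = F.cosine_similarity(vec_a, vec_b).item()
print(f"Utterance similarity: {similarity:.3f}")

Clips from the same speaker or with similar acoustic character tend to score higher than unrelated clips, which is what makes these pooled vectors useful as off-the-shelf features.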

5.2.3 GPT-5 Realtime: Low-Latency Voice Interaction

While Whisper demonstrated the ability to transcribe speech with high accuracy (speech → text) and models like SpeechLM and SpeechGPT extended this by integrating spoken inputs into large language models, GPT-5 Realtime represents the next leap forward: a model that can listen and respond in natural speech almost instantly. This breakthrough addresses the fundamental limitation of earlier systems - the noticeable delay between input and response that made interactions feel mechanical rather than natural.

This is not merely speech recognition paired with text generation and then a separate text-to-speech system bolted on top. Earlier approaches typically followed a pipeline architecture where each component operated independently, creating bottlenecks and inconsistencies. Instead, GPT-5 Realtime is natively multimodal, trained to process audio as a first-class input and to produce audio as a first-class output. This integrated approach means the model understands the prosody, emotion, and nuances in spoken language directly, without information loss from intermediate text representations.

The result is a conversational agent capable of fluid, human-like dialogue, with response latency comparable to a natural pause in conversation, making it suitable for real-world conversations, tutoring, and customer service. This low latency is achieved through specialized architectures that process audio streams incrementally rather than waiting for complete utterances, along with predictive mechanisms that anticipate likely responses. The end-to-end optimization eliminates the cumulative delays inherent in pipeline approaches, creating interactions that feel remarkably human in their timing and rhythm.

Architecture and Capabilities

GPT-5 Realtime integrates multiple components into one coherent system, creating a seamless conversational experience:

  • Speech-in: Users can send raw audio (16-bit PCM WAV, 24 kHz mono is a safe default; a short conversion sketch to this format appears after this list). The model transcribes and interprets speech in real time, converting acoustic signals into semantic understanding. Unlike traditional speech recognition systems that merely transcribe words, GPT-5 Realtime captures nuances, emotions, and contextual cues from the audio input, preserving the richness of human communication.
  • Speech-out: The model responds with synthetic but natural-sounding speech, streamed back as low-latency audio frames. Different voices and speaking styles can be selected to match user preferences or specific use cases. The generated speech maintains appropriate prosody, emphasis, and intonation patterns that make the interaction feel genuinely human-like rather than robotic.
  • Full multimodality: In addition to audio, GPT-5 Realtime sessions can also accept text and image inputs mid-conversation, allowing for hybrid interactions (e.g., "Look at this chart and tell me about it" while speaking). This flexibility enables seamless transitions between modalities, supporting more natural workflows where users might want to show visual information while continuing to speak, similar to how humans communicate in meetings or educational settings.
  • Low latency: Because the model is optimized for conversational flow, response latency is comparable to a human pause in speech — generally under 300 ms. This is achieved through specialized streaming architectures and predictive processing that begins generating responses before the user has finished speaking. The near-instantaneous turnaround creates a conversational rhythm that feels natural and engaging, eliminating the awkward pauses common in earlier AI systems.
  • Telephony integration: GPT-5 Realtime sessions can be connected to SIP (Session Initiation Protocol), enabling the model to act as a phone-based agent. This integration allows the model to handle inbound and outbound calls over standard telephone networks, making advanced AI accessible through the most ubiquitous communication technology worldwide, without requiring specialized equipment or applications.

Together, these features push AI systems beyond one-way transcription or delayed response, toward live conversational intelligence.
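
Before moving to the full example, here is a small, hedged sketch for normalizing an arbitrary recording to the 24 kHz mono PCM16 WAV default mentioned above. The file names input.mp3 and input_24k_mono.wav are placeholders.

import librosa
import soundfile as sf

# Load any supported audio file, resampling to 24 kHz and downmixing to mono
audio, sr = librosa.load("input.mp3", sr=24000, mono=True)

# Write a 16-bit PCM WAV that matches the Realtime-friendly default
sf.write("input_24k_mono.wav", audio, samplerate=24000, subtype="PCM_16")
print("Wrote input_24k_mono.wav at 24 kHz, mono, PCM16")

Normalizing inputs this way up front avoids format surprises when appending audio to a Realtime session.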

Practical Example

For consistency with our multimodal focus, we’ll use a short audio file (user_prompt_spoken.wav) where the user asks:

“Can you explain the advantages of GPT-5 as a multimodal model?”

When sent to GPT-5 Realtime, the model will:

  1. Transcribe the spoken question.
  2. Reason about the content.
  3. Generate speech that explains the advantages of GPT-5’s multimodality.

The round-trip feels like a natural dialogue with a knowledgeable assistant.

Code Example: Realtime Voice with GPT-5

The following Python script shows how to connect to the Realtime API using WebSockets, send a short WAV file as input, and save the assistant’s spoken reply as a new WAV file.

"""
Realtime Voice with GPT-5 (WebSocket API)
- Sends a short WAV file (user_prompt_spoken.wav) to GPT-5 Realtime
- Receives streamed audio back and saves it to assistant_reply.wav
Requirements: pip install websockets soundfile numpy
"""

import os
import json
import base64
import asyncio
import websockets
import numpy as np
import soundfile as sf

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-5-realtime"

INPUT_WAV = "user_prompt_spoken.wav"   # spoken question
OUTPUT_WAV = "assistant_reply.wav"     # assistant’s voice reply

def read_wav_as_base64(path: str) -> str:
    """Read WAV file and return base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

async def main():
    if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
        raise SystemExit("Set OPENAI_API_KEY env variable.")

    # Load spoken user prompt
    user_audio_b64 = read_wav_as_base64(INPUT_WAV)

    # Connect to Realtime WebSocket
    async with websockets.connect(
        REALTIME_URL,
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
        max_size=20 * 1024 * 1024,
    ) as ws:
        print("Connected to GPT-5 Realtime.")

        # 1) Configure session (input/output formats, voice)
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "input_audio_format": "wav",
                "output_audio_format": "wav",
                "voice": "alloy",
                "instructions": (
                    "You are a helpful voice assistant. "
                    "Answer the user’s question clearly and concisely."
                )
            }
        }))

        # 2) Append user audio to input buffer
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": user_audio_b64
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # 3) Ask model to create a response
        await ws.send(json.dumps({"type": "response.create", "response": {}}))

        print("Waiting for GPT-5 Realtime reply...")

        # 4) Collect audio frames
        audio_bytes = bytearray()
        sample_rate = 24000  # expected sample rate

        async for msg in ws:
            evt = json.loads(msg)

            if evt.get("type") == "response.output_text.delta":
                print(evt.get("delta"), end="", flush=True)

            elif evt.get("type") == "response.output_audio.delta":
                audio_bytes.extend(base64.b64decode(evt["delta"]))

            elif evt.get("type") == "response.completed":
                print("\n[Response completed]")
                break

        # 5) Save assistant’s reply as a WAV file
        pcm16 = np.frombuffer(bytes(audio_bytes), dtype=np.int16)
        sf.write(OUTPUT_WAV, pcm16, samplerate=sample_rate, subtype="PCM_16")
        print(f"[Saved] {OUTPUT_WAV}")

if __name__ == "__main__":
    asyncio.run(main())

Download the audio sample here: https://files.cuantum.tech/audio/user_prompt_spoken.wav

Note: Save the example audio in the same location as the Python script.

Code breakdown:

  1. Session Setup
    • The client connects to the Realtime WebSocket and sends a session.update message specifying:
      • Input modality: audio (WAV).
      • Output modality: audio (WAV).
      • Selected voice (e.g., "alloy").
    • This defines the rules of the conversation.
  2. Input Buffering
    • Audio files (or live microphone frames) are base64-encoded and appended to an input buffer.
    • An input_audio_buffer.commit message signals the end of input.
  3. Response Creation
    • A response.create message tells GPT-5 to process the buffer and generate a reply.
  4. Streaming Output
    • The server streams back two types of deltas:
      • response.output_text.delta (optional live transcript).
      • response.output_audio.delta (audio chunks).
    • Audio chunks are collected into a byte array until response.completed is received.
  5. Saving the File
    • The reply is written as a standard 24 kHz PCM16 WAV file, playable in any media player.

Applications and Implications

GPT-5 Realtime demonstrates how far multimodal LLMs have evolved:

  • Conversational Agents: Natural, low-latency assistants that can answer customer queries or provide educational tutoring over phone or web.
  • Accessibility: Voice-based interfaces for users who cannot easily type or read text.
  • Hybrid Interactions: Combine voice with images and text mid-conversation, enabling richer multi-turn exchanges.
  • Telephony Integration: Deploy AI agents that can handle SIP phone calls, routing, and form-filling.

Example Code: Live Microphone Capture with GPT-5 Realtime (speech-in → speech-out)

What it does:

  • Records ~3 seconds from your default mic
  • Streams it to GPT-5 Realtime over WebSocket
  • Saves the model’s spoken reply as assistant_reply.wav
  • Prints a live text transcript (if provided by the server)

Requirements

pip install websockets sounddevice soundfile numpy
  • OS mic permissions: allow terminal/IDE access to the microphone (macOS: System Settings → Privacy & Security → Microphone; Windows: Privacy → Microphone).
"""
Live Mic → GPT-5 Realtime → Spoken Reply
Records ~3 seconds of audio from your default microphone, streams it to GPT-5 Realtime,
and saves the assistant's spoken response to assistant_reply.wav.

Requirements:
  pip install websockets sounddevice soundfile numpy
Set your key:
  export OPENAI_API_KEY="sk-..."   (macOS/Linux)
  setx OPENAI_API_KEY "sk-..."     (Windows, new terminal)

If you prefer MP3 I/O, convert the file to WAV first; this example uses WAV (PCM16 @ 24 kHz).
"""

import os
import json
import base64
import asyncio
import websockets
import numpy as np
import sounddevice as sd
import soundfile as sf

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-5-realtime"

# Recording settings (safe defaults for Realtime)
SAMPLE_RATE = 24000          # 24 kHz mono PCM16
CHANNELS = 1
DURATION_SECONDS = 3.0       # keep short for quick tests
OUTPUT_WAV = "assistant_reply.wav"

SYSTEM_INSTRUCTIONS = (
    "You are a helpful voice assistant. Transcribe the user if needed, "
    "then answer clearly in one or two sentences."
)

def record_from_mic(seconds: float = DURATION_SECONDS, sr: int = SAMPLE_RATE) -> bytes:
    """Record mono PCM16 audio from the default microphone and return raw bytes."""
    print(f"🎙️  Recording {seconds:.1f}s from microphone...")
    audio = sd.rec(int(sr * seconds), samplerate=sr, channels=CHANNELS, dtype="int16")
    sd.wait()
    print("✅ Done.")
    # audio is int16 numpy array; convert to raw bytes
    return audio.tobytes()

def b64encode_pcm16_wav(pcm_bytes: bytes, sr: int = SAMPLE_RATE) -> str:
    """
    Wrap raw PCM16 bytes into a WAV file in memory and return base64 string.
    Using soundfile to write to bytes buffer for simplicity.
    """
    import io
    buf = io.BytesIO()
    # convert bytes -> int16 array so soundfile can write it
    arr = np.frombuffer(pcm_bytes, dtype=np.int16)
    sf.write(buf, arr, sr, subtype="PCM_16", format="WAV")
    return base64.b64encode(buf.getvalue()).decode("utf-8")

async def main():
    if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
        raise SystemExit("Set OPENAI_API_KEY environment variable first.")

    # 1) Capture short mic audio & base64-encode as WAV
    pcm = record_from_mic()
    user_audio_b64 = b64encode_pcm16_wav(pcm)

    # 2) Connect to Realtime WS
    async with websockets.connect(
        REALTIME_URL,
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
        max_size=20 * 1024 * 1024,  # allow large frames
    ) as ws:
        print("🔌 Connected to GPT-5 Realtime.")

        # 3) Configure session: audio in/out (WAV), pick a voice
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "input_audio_format": "wav",
                "output_audio_format": "wav",
                "voice": "alloy",
                "instructions": SYSTEM_INSTRUCTIONS
            }
        }))

        # 4) Send mic audio (can be multiple appends for streaming mic)
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": user_audio_b64
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Optional: add a brief text nudge
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "instructions": "Please respond concisely in speech."
            }
        }))

        print("🕑 Waiting for streamed reply...\n")

        # 5) Receive streamed audio/text deltas
        audio_bytes = bytearray()
        sample_rate = SAMPLE_RATE  # server commonly uses 24k; update if session reports different

        async for msg in ws:
            evt = json.loads(msg)

            # Live transcript (optional)
            if evt.get("type") == "response.output_text.delta":
                print(evt.get("delta"), end="", flush=True)

            # Audio chunks (base64-encoded PCM16 WAV frames)
            elif evt.get("type") == "response.output_audio.delta":
                audio_bytes.extend(base64.b64decode(evt["delta"]))

            elif evt.get("type") == "response.completed":
                print("\n\n✅ Response completed.")
                break

        # 6) Save assistant reply to WAV
        if audio_bytes:
            # raw bytes may already be WAV, but normalizing here is robust:
            # interpret as PCM16 stream and write as standard WAV
            pcm16 = np.frombuffer(bytes(audio_bytes), dtype=np.int16)
            sf.write(OUTPUT_WAV, pcm16, samplerate=sample_rate, subtype="PCM_16")
            print(f"💾 Saved assistant reply → {OUTPUT_WAV}")
        else:
            print("⚠️ No audio received from server.")

if __name__ == "__main__":
    asyncio.run(main())

Here's a code breakdown:

Required Libraries

The script uses several Python libraries:

  • websockets: For WebSocket communication with the GPT-5 Realtime API
  • sounddevice: To record audio from the microphone
  • soundfile: For handling WAV file operations
  • numpy: For audio data manipulation
  • Standard libraries: os, json, base64, asyncio, io

Key Components

1. Configuration Settings

The script defines several important constants:

  • OPENAI_API_KEY: Authentication key for OpenAI's API
  • REALTIME_URL: WebSocket endpoint for GPT-5 Realtime
  • Recording parameters: Sample rate (24kHz), channels (mono), recording duration (3 seconds)
  • SYSTEM_INSTRUCTIONS: Prompts GPT-5 to act as a voice assistant

2. Audio Recording Function

The record_from_mic() function:

  • Uses sounddevice to capture audio at specified sample rate and duration
  • Records in mono at 16-bit PCM format
  • Returns raw audio bytes

3. WAV Encoding Function

The b64encode_pcm16_wav() function:

  • Takes raw PCM16 audio bytes
  • Wraps them in a WAV container using soundfile
  • Returns the base64-encoded string of the WAV file

4. Main Async Function

The main() async function orchestrates the entire process:

API Key Validation

  • Checks if the OpenAI API key is properly set

Audio Recording and Encoding

  • Records audio from the microphone
  • Encodes it as a base64 WAV string

WebSocket Connection

  • Establishes a secure WebSocket connection to GPT-5 Realtime
  • Sets proper headers including API key and beta flag

Session Configuration

  • Sends a session.update message to configure:
    • Input/output modalities (text and audio)
    • Audio format (WAV for both input and output)
    • Voice selection ("alloy")
    • System instructions for the assistant

Input Handling

  • Appends the recorded audio to the input buffer
  • Commits the buffer to signal completion of input
  • Optionally adds text instructions to shape the response

Response Processing

  • Collects streamed response data in real-time:
    • Text deltas (transcription of response)
    • Audio deltas (spoken audio chunks)
  • Monitors for completion signal

Output Saving

  • Converts collected audio bytes back to PCM16 format
  • Writes to a WAV file (assistant_reply.wav)

Flow of Execution

The script follows this sequence:

  1. Validate environment setup and API key
  2. Record short audio clip from microphone
  3. Connect to GPT-5 Realtime WebSocket API
  4. Configure session parameters (audio formats, voice)
  5. Send recorded audio and commit the input
  6. Request model to process audio and generate a response
  7. Receive and display text transcript while collecting audio chunks
  8. Save the complete audio response as a WAV file

Error Handling

The code includes basic error handling:

  • Checks for missing API key
  • Verifies if audio was received from the server

Technical Notes

  • Uses 24kHz mono PCM16 format, which is optimal for speech processing
  • Supports WebSocket protocol for real-time streaming
  • Uses asyncio for asynchronous operations
  • Implements proper WebSocket connection lifecycle management

Mid-Session Multimodality: Combining Audio and Images

One of GPT-5 Realtime’s most powerful abilities is to handle multiple modalities within a single ongoing conversation. Unlike earlier systems that processed text, images, or audio in isolation, Realtime can fluidly combine them as they arrive. This enables natural scenarios where a user begins by speaking a question and then adds an image for further clarification or analysis — all in the same session without restarting the dialogue.

For example, imagine a student asking aloud “Can you explain the advantages of GPT-5 as a multimodal model?” and then immediately showing a chart of data. GPT-5 Realtime can integrate both inputs, producing a spoken response that addresses the original audio question and references insights from the chart. This kind of dynamic, mid-session multimodality illustrates how the model moves beyond static question–answer patterns and toward fluid, real-time collaboration with human users.

Example: Mid-Session Multimodality (Audio question → Append Image → Spoken reply)

What it does

  1. Sends a short spoken question (WAV) to GPT-5 Realtime.
  2. Appends a chart image in the same session.
  3. Requests a spoken answer that references both the audio question and the image.
  4. Saves the reply as assistant_multimodal_reply.wav and prints any streamed text.

Requirements

pip install websockets soundfile numpy pillow
  • Put your audio prompt file (e.g., user_prompt_spoken.wav) and an image (e.g., chart.png) in the same folder.
  • Or adjust the paths below.
"""
Multimodal Mid-Session with GPT-5 Realtime
- Step 1: Send a spoken question (WAV) to GPT-5 Realtime.
- Step 2: Append an image (PNG) in the same session.
- Step 3: Ask for a spoken reply that references BOTH inputs.
- Saves the model’s voice reply to assistant_multimodal_reply.wav.

Requirements:
  pip install websockets soundfile numpy pillow
Set your key:
  export OPENAI_API_KEY="sk-..."   (macOS/Linux)
  setx OPENAI_API_KEY "sk-..."     (Windows, new terminal)
"""

import os
import io
import json
import base64
import asyncio
import websockets
import numpy as np
import soundfile as sf
from PIL import Image

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-5-realtime"

# Input files (adjust as needed)
INPUT_WAV = "user_prompt_spoken.wav"  # spoken question, e.g., “Can you explain the advantages of GPT-5 as a multimodal model?”
INPUT_IMG = "chart.png"               # a chart image to reference mid-session
OUTPUT_WAV = "assistant_multimodal_reply.wav"

# Session behavior
VOICE_NAME = "alloy"
SYSTEM_INSTRUCTIONS = (
    "You are a helpful voice assistant. Consider ALL inputs in this session. "
    "First, interpret the user's spoken question. Then, when an image is provided, "
    "analyze it and integrate both sources in your final spoken answer. "
    "Be concise and precise."
)

def read_wav_as_base64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def read_png_as_base64(path: str) -> str:
    # Ensure we produce a clean PNG bytes payload (also validates file)
    with Image.open(path) as im:
        im = im.convert("RGBA") if im.mode not in ("RGB", "RGBA") else im
        buf = io.BytesIO()
        im.save(buf, format="PNG")
        return base64.b64encode(buf.getvalue()).decode("utf-8")

async def main():
    if not OPENAI_API_KEY or OPENAI_API_KEY == "YOUR_OPENAI_API_KEY":
        raise SystemExit("Set OPENAI_API_KEY environment variable first.")

    # Load inputs
    user_audio_b64 = read_wav_as_base64(INPUT_WAV)
    image_png_b64 = read_png_as_base64(INPUT_IMG)

    # Connect to Realtime WebSocket
    async with websockets.connect(
        REALTIME_URL,
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
        max_size=20 * 1024 * 1024,
    ) as ws:
        print("🔌 Connected to GPT-5 Realtime.")

        # 1) Configure session: we'll use audio in/out, and also allow image as input
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio", "image"],
                "input_audio_format": "wav",
                "output_audio_format": "wav",
                "voice": VOICE_NAME,
                "instructions": SYSTEM_INSTRUCTIONS
            }
        }))

        # 2) Append the user's spoken question (audio buffer)
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": user_audio_b64
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # 3) Append the image mid-session
        #    We send the PNG as base64 along with its MIME. (You can also send a URL if supported.)
        await ws.send(json.dumps({
            "type": "input_image.append",
            "image": image_png_b64,
            "mime_type": "image/png",
            # Optionally, add a hint for the model about why you're sending the image:
            "metadata": {
                "purpose": "chart_analysis",
                "caption": "A line chart showing a synthetic trend over time."
            }
        }))
        await ws.send(json.dumps({"type": "input_image.commit"}))

        # 4) Ask for a response that references BOTH the spoken question and the image
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "instructions": (
                    "Please answer in speech. "
                    "Explain the advantages of GPT-5 as a multimodal model, "
                    "and also summarize the main trend you observe in the provided chart. "
                    "Be concise (15–25 seconds)."
                )
            }
        }))

        print("🕑 Waiting for streamed reply...\n")

        # 5) Receive streamed text & audio
        audio_bytes = bytearray()
        sample_rate = 24000  # common server rate; adjust if your session reports differently

        async for msg in ws:
            evt = json.loads(msg)

            # Optional: live transcript/notes
            if evt.get("type") == "response.output_text.delta":
                print(evt.get("delta"), end="", flush=True)

            # Audio deltas (base64-encoded PCM16)
            elif evt.get("type") == "response.output_audio.delta":
                audio_bytes.extend(base64.b64decode(evt["delta"]))

            elif evt.get("type") == "response.completed":
                print("\n\n✅ Response completed.")
                break

        # 6) Save the final spoken reply
        if audio_bytes:
            pcm16 = np.frombuffer(bytes(audio_bytes), dtype=np.int16)
            sf.write(OUTPUT_WAV, pcm16, samplerate=sample_rate, subtype="PCM_16")
            print(f"💾 Saved assistant reply → {OUTPUT_WAV}")
        else:
            print("⚠️ No audio received from server.")

if __name__ == "__main__":
    asyncio.run(main())

Download the audio sample here: https://files.cuantum.tech/audio/user_prompt_spoken.wav

Download the chart image sample here: https://files.cuantum.tech/images/chart.png

Note: Save the example audio and chart image in the same location as the Python script.

Here's a code breakdown:

Key Components

  1. Imports and Setup: The script uses several Python libraries:
    • Standard libraries: os, io, json, base64, asyncio
    • websockets: For WebSocket communication with the GPT-5 Realtime API
    • numpy: For audio data manipulation
    • soundfile: For handling WAV file operations
    • PIL (Pillow): For image processing
  2. Configuration: The script defines important constants:
    • OPENAI_API_KEY: Retrieved from environment variables
    • REALTIME_URL: WebSocket endpoint for the GPT-5 Realtime API
    • Input/output file paths: Locations of input audio (WAV), input image (PNG), and output audio
    • VOICE_NAME: Selects "alloy" as the voice for the assistant's reply
    • SYSTEM_INSTRUCTIONS: Defines the assistant's behavior
  3. Helper Functions: Two utility functions for file handling:
    • read_wav_as_base64(): Reads a WAV file and converts it to base64 encoding
    • read_png_as_base64(): Reads a PNG image, ensures it's in the correct format, and converts it to base64
  4. Main Asynchronous Function: The core of the script with these main steps:
    • Input Validation: Checks if the API key is properly set
    • File Loading: Loads and encodes the audio and image files
    • WebSocket Connection: Establishes a connection to GPT-5 Realtime with proper headers
    • Session Configuration: Sets up a session with text, audio, and image modalities
    • Audio Input: Sends the spoken question (WAV) and commits the audio buffer
    • Image Input: Appends the chart image mid-session with metadata about its purpose
    • Response Request: Requests a spoken reply that addresses both inputs
    • Response Processing: Collects streamed text and audio chunks from the server
    • Output Saving: Converts received audio bytes to PCM16 format and saves as WAV

WebSocket Communication Flow

The script follows a specific protocol for communication with the GPT-5 Realtime API:

  1. Sends a session.update message to configure modalities and behavior
  2. Sends the audio data using input_audio_buffer.append and commits it
  3. Adds the image using input_image.append with metadata and commits it
  4. Creates a response request with specific instructions
  5. Processes incoming events in real-time:
    • Text deltas (transcription)
    • Audio deltas (spoken reply chunks)
    • Completion signal

Error Handling

The script includes basic error checking:

  • Validates the API key
  • Checks if audio was received from the server

Key Technical Aspects

The implementation showcases several important concepts:

  • Asynchronous programming with asyncio for non-blocking I/O
  • Base64 encoding for binary data transmission over WebSockets
  • Real-time streaming of both text and audio responses
  • Mid-session multimodality by combining different input types in one conversation
  • Proper WebSocket lifecycle management

This code example demonstrates the power of GPT-5 Realtime's ability to handle multiple modalities within a single ongoing conversation, allowing for more natural and fluid interactions.

5.2.4 Why Audio Integration Matters

Accessibility: Automatic transcription for the hearing impaired. This technology enables real-time conversion of spoken content into text, making digital media, meetings, and educational resources accessible to deaf and hard-of-hearing individuals. Modern transcription systems can work in real-time with high accuracy, providing captions for live events, lectures, and conversations, removing barriers to participation in many aspects of daily life and professional settings.

By integrating audio processing with language models, these systems can accurately capture nuances, different accents, and even distinguish between multiple speakers. This integration enables more contextual understanding, allowing the transcription to include important non-verbal audio cues, proper punctuation, and speaker identification. Advanced systems can also adapt to specialized terminology, regional dialects, and challenging acoustic environments, making information more accessible across diverse settings from medical appointments to entertainment media.

Education: Real-time translation and captions in classrooms. This application transforms how international students engage with lectures by providing immediate translations of spoken content. It also helps all students by generating accurate captions for recorded lectures, making review more efficient and allowing learners to search through spoken content based on keywords or concepts.

Advanced multimodal systems can detect lecture context and technical terminology, accurately translating specialized vocabulary while maintaining academic integrity. These systems can distinguish between different speakers in classroom discussions, properly attributing questions and responses in the transcription.

Furthermore, these technologies enable asynchronous learning by creating searchable archives of lectures that students can navigate by concept rather than timestamp. For students with learning differences such as ADHD or dyslexia, the synchronized visual and auditory information improves comprehension and retention.

The integration of AI with educational content also allows for personalized learning paths, where the system can identify concepts that individual students struggle with based on their engagement patterns and provide targeted supplementary material. This multimodal approach bridges accessibility gaps while enhancing the learning experience for all students.

Assistants: Voice-driven chatbots, smart speakers, and AI tutors. These systems create natural conversation flows by understanding spoken queries and generating contextually appropriate spoken responses. Advanced multimodal assistants can maintain conversational context over extended interactions, understand varying speech patterns, and respond with appropriate intonation and emphasis that matches the content being delivered.

Cross-lingual communication: Breaking down barriers with speech-to-speech translation. This technology enables conversations between people who speak different languages by capturing speech in one language, understanding its meaning, and generating natural-sounding speech in another language. Modern systems preserve speaker characteristics like tone, pace, and emotion, making the exchange feel more personal and authentic.

These systems represent a significant advancement over traditional translation tools by offering real-time communication without requiring text interfaces. The process involves three sophisticated steps: speech recognition to convert spoken words into text, machine translation to convert that text into another language, and text-to-speech synthesis to deliver the translation in a natural voice.
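
As a hedged illustration of that three-step pipeline, the sketch below chains a Whisper checkpoint (openai/whisper-small, via the Hugging Face pipeline) for recognition plus translation into English with Microsoft's SpeechT5 for text-to-speech. The input file spanish_clip.wav is a placeholder, and the zero-valued speaker embedding is a simplification; a real x-vector (for example from the CMU Arctic speaker-embedding set) would produce a much more natural voice.

import torch
import librosa
import soundfile as sf
from transformers import (pipeline, SpeechT5Processor,
                          SpeechT5ForTextToSpeech, SpeechT5HifiGan)

# 1) Speech recognition + translation into English with Whisper
audio, _ = librosa.load("spanish_clip.wav", sr=16000)   # placeholder input clip
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
english_text = asr(audio, generate_kwargs={"task": "translate"})["text"]
print("Translated text:", english_text)

# 2) Text-to-speech with SpeechT5 (any TTS model could be substituted here)
tts_processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
tts_model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = tts_processor(text=english_text, return_tensors="pt")
speaker_embedding = torch.zeros((1, 512))   # placeholder x-vector (assumption)
with torch.no_grad():
    speech = tts_model.generate_speech(inputs["input_ids"], speaker_embedding,
                                       vocoder=vocoder)

# 3) Save the synthesized English speech (SpeechT5 outputs 16 kHz audio)
sf.write("translated_speech.wav", speech.numpy(), samplerate=16000)
print("Saved translated_speech.wav")

A production system would replace this offline pipeline with the tighter, end-to-end integration described in the surrounding text, but the sketch makes the three stages and their hand-offs explicit.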

The latest neural translation models understand cultural nuances and idioms that literal translations often miss. For example, when a Japanese speaker uses honorifics that don't exist in English, the system can adapt the output to convey appropriate respect through tone and word choice rather than direct translation.

Additionally, these technologies can adapt to various contexts - from business negotiations where precision is critical to casual conversations where fluidity matters more. Some advanced systems even maintain consistent voice profiles across languages, allowing a Spanish speaker's unique vocal characteristics to be present in the English translation, creating a more seamless and personalized communication experience.

Unlike older systems where speech recognition and language models were separate components chained together with potential information loss at each step, modern multimodal approaches fuse them into unified architectures that process acoustic and linguistic information simultaneously. This integration creates AI that listens and responds more naturally, understanding context across modalities and handling the ambiguities inherent in human communication.