OpenAI API Bible – Volume 1

Chapter 7: Memory and Multi-Turn Conversations

7.2 Thread Management and Context Windows

In multi-turn conversations, especially those that extend over multiple messages or sessions, managing the context effectively is crucial for maintaining meaningful dialogue. This becomes particularly important in AI applications where conversations can become complex and lengthy. OpenAI models operate within a fixed "context window"—a maximum number of tokens they can consider in one call (up to 128K tokens for some models). Think of this window as the model's working memory, similar to how humans can only keep a certain amount of information in their immediate thoughts.

When your conversation extends beyond this window limit, older messages may be truncated or removed from consideration, causing the model to lose important context. This can lead to several issues: the model might forget earlier parts of the conversation, repeat information, or fail to maintain consistency with previously established facts or preferences. For example, if a user references something mentioned earlier in the conversation that was truncated, the model won't have access to that information to provide an appropriate response.

Proper thread management is therefore essential—it ensures that your applications remain coherent and focused, even as conversations grow in length and complexity. This involves implementing strategies to preserve crucial context while efficiently managing the token limit, such as summarizing previous exchanges, maintaining relevant information, and intelligently deciding which parts of the conversation to keep or remove. Through effective thread management, you can create more natural, context-aware conversations that maintain their coherence and usefulness over extended interactions.

7.2.1 What Is a Context Window?

A context window is a fundamental concept in AI language models that defines the maximum amount of text the model can process and understand at once. This window acts like the model's working memory - similar to how humans can only hold a certain amount of information in their immediate thoughts. The size of this window is measured in tokens, which are the fundamental building blocks the model uses to process language.

To better understand tokens, think of them as the model's vocabulary pieces. While we humans naturally break language into words, AI models break text down differently. A single word might become multiple tokens - longer or uncommon words are often split into several sub-word pieces - while common words like "the" or "and" are usually single tokens. Punctuation marks also consume tokens. This tokenization helps the model process text more efficiently and recognize language patterns (the short snippet below shows how to inspect it yourself).
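If you want to see exactly how text is split, the tiktoken library exposes the same tokenizers the OpenAI models use. A minimal sketch (the sample sentence is arbitrary):

import tiktoken

# Load the encoding for the target model; fall back to cl100k_base if the name is unknown.
try:
    encoding = tiktoken.encoding_for_model("gpt-4o")
except KeyError:
    encoding = tiktoken.get_encoding("cl100k_base")

text = "Context windows limit how much a model can read at once."
token_ids = encoding.encode(text)

print(f"Token count: {len(token_ids)}")
# decode_single_token_bytes shows the text fragment behind each token id.
print([encoding.decode_single_token_bytes(t) for t in token_ids])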

When you have a conversation with an AI model, it handles your interaction in a fascinating way. Every piece of the conversation - whether it's the initial instructions (system messages), your questions (user messages), or the AI's responses (assistant messages) - gets woven together into one continuous sequence. This combined text then goes through the tokenization process, where it's broken down into tokens and analyzed as a complete unit.

However, there's a crucial limitation to be aware of: the context window has a fixed size. When your conversation grows too large and exceeds this token limit, something has to give - either the request is rejected, or older messages must be dropped from the beginning of the conversation (by your application or by a managed chat interface) to make room for new information, much like how your phone's messaging app might only show the most recent messages in a chat window. This is why a model can appear to forget something mentioned much earlier in a long conversation.

  • GPT-3.5-turbo: ~16K tokens (roughly 12,000 words, or about 48 pages of text). This context window supports detailed conversations and complex tasks, though longer interactions will still need active management.
  • GPT-4o and GPT-4o-mini: ~128K tokens (roughly 96,000 words, or about 384 pages of text). This much larger context window enables far longer conversations and more complex analyses, making these models suitable for extensive documentation review, long-form content creation, and detailed analytical tasks. A small guard like the one sketched below can warn you before a request approaches these limits.
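Before sending a long conversation, it helps to check its approximate size against the target model's window. Here is a minimal sketch; the MODEL_CONTEXT_LIMITS values and the fits_in_context helper are illustrative assumptions, not official constants or API features.

import tiktoken

# Approximate context-window sizes in tokens (illustrative values; check current model docs).
MODEL_CONTEXT_LIMITS = {
    "gpt-3.5-turbo": 16_000,
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
}

def fits_in_context(messages, model="gpt-4o", reserve_for_reply=1000):
    """Return True if a rough token count leaves room for the model's reply."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    # Rough estimate: content tokens plus a small per-message overhead.
    used = sum(len(encoding.encode(m["content"])) + 4 for m in messages)
    return used + reserve_for_reply <= MODEL_CONTEXT_LIMITS.get(model, 16_000)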

7.2.2 Thread Management Strategies

1. Segmentation into Logical Threads

When building AI applications that handle conversations, it's crucial to implement proper thread segmentation. This means organizing conversations into separate, independent message arrays (threads) based on different users, topics, or contexts. Think of it like having different chat rooms or conversation channels - each maintains its own history and context without interfering with others.

For example, if you have a customer service bot helping multiple customers simultaneously, each customer's conversation should be stored in a separate thread. This ensures that when Customer A asks about their order status, the bot won't accidentally reference Customer B's shipping details. Similarly, if your application handles different topics (like medical advice vs technical support), keeping separate threads prevents confusion and maintains contextual accuracy.

This segmentation approach offers several benefits:

  • Clear conversation boundaries
  • Better context management
  • Improved response accuracy
  • Easier debugging and monitoring
  • Enhanced privacy between different users or topics

The key is to implement a robust system for creating, managing, and storing these separate threads while ensuring each maintains its own complete conversation history.

# Example: Separate threads for different users.
threads = {
    "user_1": [{"role": "system", "content": "You are a friendly assistant."}],
    "user_2": [{"role": "system", "content": "You are an expert math tutor."}]
}

def append_message(user_id, role, content):
    threads[user_id].append({"role": role, "content": content})

This code demonstrates a basic implementation of conversation thread management. Let me break it down:

Data Structure:

  • The code uses a dictionary called 'threads' to store separate conversation threads for different users
  • Each user (user_1, user_2) has their own array of messages with a unique system message defining the assistant's role (friendly assistant vs math tutor)

Functionality:

  • The append_message() function allows adding new messages to a specific user's thread
  • It takes three parameters: user_id (to identify the thread), role (who's speaking), and content (the message)

This simple implementation ensures that conversations remain separate and contextually appropriate for each user, preventing cross-contamination of conversations between different users.

Here's a more comprehensive implementation:

# Example: Thread management with the OpenAI API
import asyncio
import time
from typing import Dict, List

from openai import AsyncOpenAI

class ThreadManager:
    def __init__(self, api_key: str):
        self.client = AsyncOpenAI(api_key=api_key)
        self.threads: Dict[str, List[Dict]] = {}

    def create_thread(self, user_id: str, system_role: str) -> None:
        """Initialize a new thread for a user with a system message."""
        self.threads[user_id] = [
            {"role": "system", "content": system_role}
        ]

    def append_message(self, user_id: str, role: str, content: str) -> None:
        """Add a message to the user's thread."""
        if user_id not in self.threads:
            self.create_thread(user_id, "You are a helpful assistant.")
        self.threads[user_id].append({
            "role": role,
            "content": content,
            "timestamp": time.time()  # Stored locally; stripped before API calls
        })

    async def get_response(self, user_id: str, temperature: float = 0.7) -> str:
        """Get an AI response using GPT-4o."""
        # The chat API accepts only role/content (plus a few optional keys),
        # so strip the local timestamp field before sending.
        api_messages = [
            {"role": m["role"], "content": m["content"]}
            for m in self.threads[user_id]
        ]
        try:
            response = await self.client.chat.completions.create(
                model="gpt-4o",
                messages=api_messages,
                temperature=temperature,
                max_tokens=1000
            )
            ai_response = response.choices[0].message.content
            self.append_message(user_id, "assistant", ai_response)
            return ai_response
        except Exception as e:
            return f"Error: {str(e)}"

# Usage example
async def main():
    thread_mgr = ThreadManager("your-api-key-here")

    # Create threads for different users
    thread_mgr.create_thread("user_1", "You are a friendly assistant.")
    thread_mgr.create_thread("user_2", "You are an expert math tutor.")

    # Simulate a conversation
    thread_mgr.append_message("user_1", "user", "Hello! How are you?")
    response = await thread_mgr.get_response("user_1")
    print(f"Assistant: {response}")

if __name__ == "__main__":
    asyncio.run(main())

This example shows an implementation of a thread management system for handling multiple conversations with an AI assistant. Here are the key components:

ThreadManager Class:

  • Manages separate conversation threads for different users
  • Stores threads in a dictionary with user IDs as keys

Main Functions:

  • create_thread(): Sets up new conversations with a system role (e.g., "friendly assistant" or "math tutor")
  • append_message(): Adds new messages to a user's conversation thread with timestamps
  • get_response(): Makes API calls to GPT-4o to get AI responses

Key Features:

  • Error handling for API calls
  • Asynchronous support for better performance
  • Automatic thread creation if none exists
  • Message timestamp tracking

The code helps prevent cross-contamination of conversations between different users by keeping each user's conversation separate and contextually appropriate.

Code Breakdown:

Class Structure

  • ThreadManager class handles all thread-related operations
  • Uses a dictionary to store threads with user IDs as keys
  • Initializes with OpenAI API key

Key Methods

  • create_thread(): Initializes new conversation threads
  • append_message(): Adds messages to existing threads
  • get_response(): Handles API calls to GPT-4o

Improvements over Basic Version

  • Proper error handling
  • Async support for better performance
  • Timestamp tracking for messages
  • Temperature control for response variation

Safety Features

  • Automatic thread creation if not exists
  • Try-except block for API calls
  • Type hints for better code maintainability

2. Token Counting and Trimming

Use a tokenizer like tiktoken to accurately count tokens in your message history. A tokenizer is essential because it breaks down text the same way the AI model does, ensuring accurate token counts. When your conversation history approaches the model's context window limit, you'll need to implement a trimming strategy.

The most effective approach is to remove older messages in complete chunks (full messages rather than partial ones) to maintain conversation coherence. This preserves the natural flow of the dialogue while ensuring you stay within token limits.

For example, you might remove the oldest user-assistant message pair first, keeping the system message and most recent interactions intact. This approach is preferable to removing individual tokens or partial messages, which could lead to confusing or incomplete context.

Code example:

import tiktoken
from openai import OpenAI

client = OpenAI()

def count_tokens(messages, model="gpt-4o"):
    """Count tokens for a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print(f"Warning: Model `{model}` not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    num_tokens = 0
    for message in messages:
        num_tokens += 4  # Format tax for each message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":  # Name tax
                num_tokens += 1
    num_tokens += 2  # Format tax for the entire message
    return num_tokens

def trim_history(messages, max_tokens=10000, model="gpt-4o"):
    """Trim conversation history to fit within the token limit."""
    # Re-count after each removal; count_tokens already includes the per-message
    # and per-conversation format taxes, so recomputing avoids double-counting them.
    while count_tokens(messages, model) > max_tokens and len(messages) > 1:
        # Drop the oldest non-system message; index 0 holds the system message.
        messages.pop(1)
    return messages

# Example usage:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "name": "Alice", "content": "Tell me something else."},
    {"role": "assistant", "content": "The Eiffel Tower is a famous landmark in Paris."},
]

token_count = count_tokens(messages)
print(f"Initial token count: {token_count}")

trimmed_messages = trim_history(messages, max_tokens=300)
trimmed_token_count = count_tokens(trimmed_messages)
print(f"Token count after trimming: {trimmed_token_count}")
print(f"Trimmed messages: {trimmed_messages}")

Code Breakdown:

1. Imports and Setup

  • Imports tiktoken for token counting
  • Imports OpenAI client for API interaction
  • Initializes OpenAI client instance

2. count_tokens Function

Purpose: Accurately counts tokens in a message list according to OpenAI's tokenization rules.

  • Parameters:
    • messages: List of message dictionaries
    • model: Target model name (default: "gpt-4o")
  • Key Features:
    • Uses model-specific tokenizer when available
    • Falls back to cl100k_base encoding if model not found
    • Accounts for message format tax (+4 tokens per message)
    • Adds name tax (+1 token) when present
    • Includes format tax for entire message (+2 tokens)

3. trim_history Function

Purpose: Trims conversation history to fit within token limit while preserving recent context.

  • Parameters:
    • messages: List of message dictionaries
    • max_tokens: Maximum allowed tokens (default: 10000)
    • model: Target model name (default: "gpt-4o")
  • Algorithm:
    • Removes the oldest non-system message (index 1) until the history fits the limit
    • Preserves the system message (index 0)
    • Re-counts the total after each removal, so format taxes are never double-counted

4. Example Implementation

  • Sample conversation:
    • System message defining assistant role
    • Two user messages (one with name)
    • Two assistant responses
  • Demonstration:
    • Counts initial tokens in conversation
    • Trims history to fit within 300 tokens
    • Shows token count before and after trimming
    • Displays final trimmed message list

The example usage shows how to apply these functions to a conversation history, counting the initial tokens and then trimming it to fit within a 300-token limit.
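The prose above suggested removing the oldest user-assistant pair as a unit, whereas trim_history drops one message at a time. Here is a minimal pair-wise variant; it is a sketch that reuses count_tokens from above and assumes messages[1] and messages[2] form the oldest exchange.

def trim_history_pairs(messages, max_tokens=10000, model="gpt-4o"):
    """Trim by dropping the oldest user-assistant pair, keeping the system message."""
    while count_tokens(messages, model) > max_tokens and len(messages) > 2:
        # messages[0] is the system message; messages[1] and messages[2] are assumed
        # to be the oldest user-assistant exchange.
        del messages[1:3]
    return messages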

3. Summarization of Older Context

For very long conversations that span many exchanges, implementing a summarization strategy becomes crucial for managing context effectively while staying within token limits. This approach involves condensing earlier parts of the conversation into a concise summary, which preserves the essential information while using significantly fewer tokens.

You can leverage the AI model's own capabilities by periodically calling it to analyze the conversation history and generate a meaningful summary. This summary can then replace a larger block of previous messages, maintaining the conversation's context while drastically reducing token usage.

The summarization process is particularly effective because it retains key discussion points and decisions while eliminating redundant or less relevant details, ensuring that future responses remain contextually appropriate without requiring the full conversation history.

Code example:

from openai import OpenAI
import tiktoken  # Import tiktoken

def summarize_context(history, api_key, model="gpt-3.5-turbo", max_summary_tokens=80):
    """Summarizes a conversation history.

    Args:
        history: A list of message dictionaries.
        api_key: Your OpenAI API key.
        model: The OpenAI model to use for summarization.
        max_summary_tokens: The maximum number of tokens for the summary.

    Returns:
        A list containing a single message with the summary, or the original
        history if summarization fails.
    """
    client = OpenAI(api_key=api_key)
    encoding = tiktoken.encoding_for_model(model)  # Tokenizer used to measure prompt length

    # Improved prompt with more specific instructions
    prompt = [
        {"role": "system", "content": "You are a helpful assistant that provides concise summaries of conversations."},
        {"role": "user", "content": (
            "Please provide a two-sentence summary of the following conversation, focusing on the key topics discussed and the main points:\n" +
            "\n".join([f'{m["role"]}: {m["content"]}' for m in history])
        )}
    ]

    prompt_token_count = len(encoding.encode(prompt[1]["content"]))  # Tokens in the summarization prompt
    if prompt_token_count > 4000:  # Warn if the prompt itself is very long
        print("Warning: prompt is longer than 4000 tokens. Consider trimming the history first.")

    try:
        response = client.chat.completions.create(
            model=model,
            messages=prompt,
            max_tokens=max_summary_tokens,
            temperature=0.2,  # Lower temperature for more focused summaries
        )
        summary = response.choices[0].message.content
        return [{"role": "system", "content": "Conversation summary: " + summary}]
    except Exception as e:
        print(f"Error during summarization: {str(e)}")
        return history  # Return original history if summarization fails

This code implements a conversation summarization function that helps manage long conversation histories. Here's a breakdown of its key components:

Function Overview:

The summarize_context function takes a conversation history and converts it into a concise two-sentence summary. Key parameters include:

  • history: The conversation messages to summarize
  • api_key: OpenAI API authentication
  • model: The AI model to use (defaults to GPT-3.5-turbo)
  • max_summary_tokens: Maximum length of the summary

Key Features:

  • Uses tiktoken for accurate token counting
  • Implements token limit checks (warns if over 4000 tokens)
  • Uses a low temperature (0.2) for more consistent summaries
  • Falls back to original history if summarization fails

Process:

  • Initializes OpenAI client and sets up token encoding
  • Creates a prompt with specific instructions for summarization
  • Formats the conversation history into a readable format
  • Makes API call to generate the summary
  • Returns the summary as a system message or original history if there's an error

This function is particularly useful when conversation histories become too long, as it helps maintain context while reducing token usage. The summarized version can then be prepended to more recent messages to maintain conversation continuity.
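To make that concrete, here is a minimal sketch of how the summary can replace the older portion of a thread before the next API call. The compact_history helper and its keep_recent parameter are illustrative, not part of the function above.

def compact_history(history, api_key, keep_recent=6):
    """Replace older messages with a summary, keeping the system message and recent turns."""
    if len(history) <= keep_recent + 1:
        return history  # Nothing old enough to be worth summarizing

    system_message = history[0]
    older = history[1:-keep_recent]
    recent = history[-keep_recent:]

    # summarize_context returns a one-element list containing a system-role summary message.
    summary = summarize_context(older, api_key)
    return [system_message] + summary + recent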

7.2.3 Putting It All Together

Now, let's explore how to seamlessly combine the trimming and summarization techniques we discussed into a unified, effective workflow. This integration is crucial for maintaining optimal conversation management, as it allows us to handle both immediate token constraints and long-term context preservation in a single, coordinated process.

The following code example demonstrates how these components work together to create a robust conversation management system that can handle everything from basic message processing to complex context maintenance.

from openai import OpenAI
from typing import List, Dict
import tiktoken
from datetime import datetime

class ConversationManager:
    def __init__(self, api_key: str, model: str = "gpt-3.5-turbo"):
        self.client = OpenAI(api_key=api_key)
        self.model = model
        self.threads: Dict[str, List[Dict]] = {}
        self.encoding = tiktoken.encoding_for_model(model)
        
    def append_message(self, user_id: str, role: str, content: str) -> None:
        """Add a new message to the user's conversation thread."""
        if user_id not in self.threads:
            self.threads[user_id] = [
                {"role": "system", "content": "You are a helpful assistant."}
            ]
        self.threads[user_id].append({"role": role, "content": content})

    def count_tokens(self, messages: List[Dict]) -> int:
        """Count tokens in the message list."""
        num_tokens = 0
        for message in messages:
            num_tokens += 4  # Message format tax
            for key, value in message.items():
                num_tokens += len(self.encoding.encode(str(value)))
                if key == "name":
                    num_tokens += 1  # Name field tax
        return num_tokens + 2  # Add final format tax

    def summarize_context(self, history: List[Dict], max_summary_tokens: int = 300) -> List[Dict]:
        """Generate a summary of the conversation history."""
        try:
            prompt = [
                {"role": "system", "content": "Summarize this conversation in two concise sentences:"},
                {"role": "user", "content": "\n".join([f"{m['role']}: {m['content']}" 
                                                     for m in history if m['role'] != "system"])}
            ]
            
            response = self.client.chat.completions.create(
                model=self.model,
                messages=prompt,
                max_tokens=max_summary_tokens,
                temperature=0.3
            )
            
            timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            return [{
                "role": "system",
                "content": f"Summary as of {timestamp}: {response.choices[0].message.content}"
            }]
        except Exception as e:
            print(f"Summarization failed: {str(e)}")
            return history

    def trim_history(self, messages: List[Dict], max_tokens: int) -> List[Dict]:
        """Trim conversation history to fit within token limit."""
        while self.count_tokens(messages) > max_tokens and len(messages) > 1:
            # Remove second message (preserve system message)
            messages.pop(1)
        return messages

    def manage_context_thread(self, user_id: str, new_message: str, 
                            max_tokens: int = 4000, 
                            summary_threshold: int = 3000) -> List[Dict]:
        """Manage conversation context with token limits and summarization."""
        # Add new message
        self.append_message(user_id, "user", new_message)
        history = self.threads[user_id]
        current_tokens = self.count_tokens(history)

        # Log for debugging
        print(f"Current token count: {current_tokens}")
        
        # If approaching token limit, summarize older context
        if current_tokens > summary_threshold:
            # Keep system message and last 3 message pairs (6 messages)
            recent_messages = history[-6:] if len(history) > 6 else history
            older_messages = history[1:-6] if len(history) > 6 else []
            
            if older_messages:
                summary = self.summarize_context(older_messages)
                history = [history[0]] + summary + recent_messages
                print("Context summarized")

        # Ensure we're under max tokens
        history = self.trim_history(history, max_tokens)
        self.threads[user_id] = history
        
        return history

# Example usage
if __name__ == "__main__":
    manager = ConversationManager("your-api-key-here")
    
    # Simulate a conversation
    user_id = "user123"
    messages = [
        "Hello! How can you help me today?",
        "I'd like to learn about machine learning.",
        "Can you explain neural networks?",
        "What's the difference between supervised and unsupervised learning?"
    ]
    
    # Process messages
    for msg in messages:
        history = manager.manage_context_thread(user_id, msg)
        print(f"\nMessage: {msg}")
        print(f"Thread length: {len(history)}")
        print(f"Token count: {manager.count_tokens(history)}")

Code Breakdown:

  1. Class Structure
  • ConversationManager class handles all aspects of conversation management
    • Initializes with OpenAI API key and model selection
    • Maintains conversation threads for multiple users
    • Uses tiktoken for accurate token counting
  2. Core Methods
  • append_message()
    • Adds new messages to user-specific conversation threads
    • Initializes new threads with a system message
  • count_tokens()
    • Accurately counts tokens including format taxes
    • Accounts for message structure and name fields
  3. Advanced Features
  • summarize_context()
    • Uses OpenAI API to generate concise summaries
    • Includes timestamps for context
    • Handles errors gracefully
  • trim_history()
    • Removes oldest messages while preserving system message
    • Ensures conversation stays within token limits
  4. Main Management Logic
  • manage_context_thread()
    • Implements three-phase context management:
      • Addition of new messages
      • Summarization of older context
      • Token limit enforcement
    • Uses separate thresholds for summarization and maximum tokens
  5. Usage Example
  • Demonstrates practical implementation with multiple messages
  • Includes token counting and thread length monitoring
  • Shows how to maintain conversation context over multiple exchanges

With this approach:

  1. New messages are added to the user's thread.
  2. The thread's token count is checked.
  3. If it exceeds the summarization threshold, the older part of the conversation is condensed into a summary while the last few messages are kept verbatim.
  4. Any remaining excess is trimmed to fit the hard token limit (a short usage sketch follows this list).
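Note that ConversationManager prepares and maintains the thread but does not itself request a reply from the model. A minimal sketch of how it could be paired with a chat completion call follows; the get_reply helper is an illustrative assumption, not part of the class.

def get_reply(manager: ConversationManager, user_id: str, user_message: str) -> str:
    """Manage the thread, then ask the model for a reply and store it."""
    history = manager.manage_context_thread(user_id, user_message)

    response = manager.client.chat.completions.create(
        model=manager.model,
        messages=[{"role": m["role"], "content": m["content"]} for m in history],
        max_tokens=500,
    )
    reply = response.choices[0].message.content
    manager.append_message(user_id, "assistant", reply)
    return reply

# Example (requires a valid API key):
# manager = ConversationManager("your-api-key-here")
# print(get_reply(manager, "user123", "Can you explain neural networks?"))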

Managing threads and context windows is a foundational aspect of building scalable, coherent multi-turn conversations in AI systems. Here's why it matters and how it works:

First, thread segmentation by user ensures that each conversation remains isolated and personal. When you maintain separate conversation threads for each user, you prevent cross-contamination of context between different conversations, allowing for more personalized and accurate responses.

Second, token counting and trimming serve as essential maintenance tools. By actively monitoring the token count of conversations, you can prevent hitting model context limits while preserving the most relevant information. This process involves carefully removing older messages while maintaining crucial context, similar to how human conversations naturally focus on recent and relevant information.

Third, context summarization acts as a memory compression technique. When conversations grow long, summarizing older context allows you to maintain the essential narrative while reducing token usage. This is similar to how humans maintain the gist of earlier conversations without remembering every detail.

The combination of these strategies results in an AI assistant that can:

  • Maintain consistent context across multiple conversation turns
  • Scale efficiently without degrading performance
  • Provide relevant responses based on both recent and historical context
  • Adapt to different conversation lengths and complexities

These capabilities ensure your AI assistant remains responsive, informed, and context-aware throughout extended dialogues, creating a more natural and effective conversation experience.

7.2 Thread Management and Context Windows

In multi-turn conversations, especially those that extend over multiple messages or sessions, managing the context effectively is crucial for maintaining meaningful dialogue. This becomes particularly important in AI applications where conversations can become complex and lengthy. OpenAI models operate within a fixed "context window"—a maximum number of tokens they can consider in one call (up to 128K tokens for some models). Think of this window as the model's working memory, similar to how humans can only keep a certain amount of information in their immediate thoughts.

When your conversation extends beyond this window limit, older messages may be truncated or removed from consideration, causing the model to lose important context. This can lead to several issues: the model might forget earlier parts of the conversation, repeat information, or fail to maintain consistency with previously established facts or preferences. For example, if a user references something mentioned earlier in the conversation that was truncated, the model won't have access to that information to provide an appropriate response.

Proper thread management is therefore essential—it ensures that your applications remain coherent and focused, even as conversations grow in length and complexity. This involves implementing strategies to preserve crucial context while efficiently managing the token limit, such as summarizing previous exchanges, maintaining relevant information, and intelligently deciding which parts of the conversation to keep or remove. Through effective thread management, you can create more natural, context-aware conversations that maintain their coherence and usefulness over extended interactions.

7.2.1 What Is a Context Window?

A context window is a fundamental concept in AI language models that defines the maximum amount of text the model can process and understand at once. This window acts like the model's working memory - similar to how humans can only hold a certain amount of information in their immediate thoughts. The size of this window is measured in tokens, which are the fundamental building blocks the model uses to process language.

To better understand tokens, think of them as the model's vocabulary pieces. While we humans naturally break language into words, AI models break text down differently. A single word might become multiple tokens - for example, "understanding" becomes "under" + "standing", while common words like "the" or "and" are usually single tokens. Even punctuation marks and spaces count as tokens. This tokenization helps the model process text more efficiently and understand language patterns better.

When you have a conversation with an AI model, it handles your interaction in a fascinating way. Every piece of the conversation - whether it's the initial instructions (system messages), your questions (user messages), or the AI's responses (assistant messages) - gets woven together into one continuous sequence. This combined text then goes through the tokenization process, where it's broken down into tokens and analyzed as a complete unit.

However, there's a crucial limitation to be aware of: the context window has a fixed size. When your conversation grows too large and exceeds this token limit, the model has to make room for new information. It does this by dropping older messages from the beginning of the conversation, much like how your phone's messaging app might only show the most recent messages in a chat window. This is why sometimes the model might not remember something mentioned much earlier in a long conversation.

  • GPT-3.5-turbo: ~16K tokens (approximately equivalent to 12,000 words or 48 pages of text). This substantial context window allows for detailed conversations and complex tasks, though it may need management in longer interactions.
  • GPT-4o and GPT-4o-mini: ~128K tokens (approximately equivalent to 96,000 words or 384 pages of text, varying based on your subscription plan). This significantly larger context window enables much longer conversations and more complex analyses, making it suitable for extensive documentation review, long-form content creation, and detailed analytical tasks.

7.2.2 Thread Management Strategies

1. Segmentation into Logical Threads

When building AI applications that handle conversations, it's crucial to implement proper thread segmentation. This means organizing conversations into separate, independent message arrays (threads) based on different users, topics, or contexts. Think of it like having different chat rooms or conversation channels - each maintains its own history and context without interfering with others.

For example, if you have a customer service bot helping multiple customers simultaneously, each customer's conversation should be stored in a separate thread. This ensures that when Customer A asks about their order status, the bot won't accidentally reference Customer B's shipping details. Similarly, if your application handles different topics (like medical advice vs technical support), keeping separate threads prevents confusion and maintains contextual accuracy.

This segmentation approach offers several benefits:

  • Clear conversation boundaries
  • Better context management
  • Improved response accuracy
  • Easier debugging and monitoring
  • Enhanced privacy between different users or topics

The key is to implement a robust system for creating, managing, and storing these separate threads while ensuring each maintains its own complete conversation history.

# Example: Separate threads for different users.
threads = {
    "user_1": [{"role": "system", "content": "You are a friendly assistant."}],
    "user_2": [{"role": "system", "content": "You are an expert math tutor."}]
}

def append_message(user_id, role, content):
    threads[user_id].append({"role": role, "content": content})

This code demonstrates a basic implementation of conversation thread management. Let me break it down:

Data Structure:

  • The code uses a dictionary called 'threads' to store separate conversation threads for different users
  • Each user (user_1, user_2) has their own array of messages with a unique system message defining the assistant's role (friendly assistant vs math tutor)

Functionality:

  • The append_message() function allows adding new messages to a specific user's thread
  • It takes three parameters: user_id (to identify the thread), role (who's speaking), and content (the message)

This simple implementation ensures that conversations remain separate and contextually appropriate for each user, preventing cross-contamination of conversations between different users.

Here's a comprehensive implementation

# Example: Thread management with OpenAI API
import openai
from typing import Dict, List
import time

class ThreadManager:
    def __init__(self, api_key: str):
        self.threads: Dict[str, List[Dict]] = {}
        openai.api_key = api_key
        
    def create_thread(self, user_id: str, system_role: str) -> None:
        """Initialize a new thread for a user with system message."""
        self.threads[user_id] = [
            {"role": "system", "content": system_role}
        ]
    
    def append_message(self, user_id: str, role: str, content: str) -> None:
        """Add a message to user's thread."""
        if user_id not in self.threads:
            self.create_thread(user_id, "You are a helpful assistant.")
        self.threads[user_id].append({
            "role": role,
            "content": content,
            "timestamp": time.time()
        })
    
    async def get_response(self, user_id: str, temperature: float = 0.7) -> str:
        """Get AI response using GPT-4o."""
        try:
            response = await openai.ChatCompletion.acreate(
                model="gpt-4o",
                messages=self.threads[user_id],
                temperature=temperature,
                max_tokens=1000
            )
            ai_response = response.choices[0].message.content
            self.append_message(user_id, "assistant", ai_response)
            return ai_response
        except Exception as e:
            return f"Error: {str(e)}"

# Usage example
async def main():
    thread_mgr = ThreadManager("your-api-key-here")
    
    # Create threads for different users
    thread_mgr.create_thread("user_1", "You are a friendly assistant.")
    thread_mgr.create_thread("user_2", "You are an expert math tutor.")
    
    # Simulate conversation
    thread_mgr.append_message("user_1", "user", "Hello! How are you?")
    response = await thread_mgr.get_response("user_1")
    print(f"Assistant: {response}")

This example shows an implementation of a thread management system for handling multiple conversations with an AI assistant. Here are the key components:

ThreadManager Class:

  • Manages separate conversation threads for different users
  • Stores threads in a dictionary with user IDs as keys

Main Functions:

  • create_thread(): Sets up new conversations with a system role (e.g., "friendly assistant" or "math tutor")
  • append_message(): Adds new messages to a user's conversation thread with timestamps
  • get_response(): Makes API calls to GPT-4o to get AI responses

Key Features:

  • Error handling for API calls
  • Asynchronous support for better performance
  • Automatic thread creation if none exists
  • Message timestamp tracking

The code helps prevent cross-contamination of conversations between different users by keeping each user's conversation separate and contextually appropriate.

Code Breakdown:

Class Structure

  • ThreadManager class handles all thread-related operations
  • Uses a dictionary to store threads with user IDs as keys
  • Initializes with OpenAI API key

Key Methods

  • create_thread(): Initializes new conversation threads
  • append_message(): Adds messages to existing threads
  • get_response(): Handles API calls to GPT-4o

Improvements over Basic Version

  • Proper error handling
  • Async support for better performance
  • Timestamp tracking for messages
  • Temperature control for response variation

Safety Features

  • Automatic thread creation if not exists
  • Try-except block for API calls
  • Type hints for better code maintainability

2. Token Counting and Trimming

Use a tokenizer like tiktoken to accurately count tokens in your message history. A tokenizer is essential because it breaks down text the same way the AI model does, ensuring accurate token counts. When your conversation history approaches the model's context window limit, you'll need to implement a trimming strategy.

The most effective approach is to remove older messages in complete chunks (full messages rather than partial ones) to maintain conversation coherence. This preserves the natural flow of the dialogue while ensuring you stay within token limits.

For example, you might remove the oldest user-assistant message pair first, keeping the system message and most recent interactions intact. This approach is preferable to removing individual tokens or partial messages, which could lead to confusing or incomplete context.

Code example:

import tiktoken
from openai import OpenAI

client = OpenAI()

def count_tokens(messages, model="gpt-4o"):
    """Count tokens for a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print(f"Warning: Model `{model}` not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    num_tokens = 0
    for message in messages:
        num_tokens += 4  # Format tax for each message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":  # Name tax
                num_tokens += 1
    num_tokens += 2  # Format tax for the entire message
    return num_tokens

def trim_history(messages, max_tokens=10000, model="gpt-4o"):
    """Trim conversation history to fit within token limit."""
    total_tokens = count_tokens(messages, model)
    while total_tokens > max_tokens and len(messages) > 1:
        # Calculate tokens of the message to be removed
        tokens_to_remove = count_tokens([messages[1]], model) + 4  # +4 for format tax
        if "name" in messages[1]:
            tokens_to_remove += 1  # Name tax
        messages.pop(1)
        total_tokens -= tokens_to_remove
    return messages

# Example usage:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "name": "Alice", "content": "Tell me something else."},
    {"role": "assistant", "content": "The Eiffel Tower is a famous landmark in Paris."},
]

token_count = count_tokens(messages)
print(f"Initial token count: {token_count}")

trimmed_messages = trim_history(messages, max_tokens=300)
trimmed_token_count = count_tokens(trimmed_messages)
print(f"Token count after trimming: {trimmed_token_count}")
print(f"Trimmed messages: {trimmed_messages}")

Code Breakdown:

1. Imports and Setup

  • Imports tiktoken for token counting
  • Imports OpenAI client for API interaction
  • Initializes OpenAI client instance

2. count_tokens Function

Purpose: Accurately counts tokens in a message list according to OpenAI's tokenization rules.

  • Parameters:
    • messages: List of message dictionaries
    • model: Target model name (default: "gpt-4o")
  • Key Features:
    • Uses model-specific tokenizer when available
    • Falls back to cl100k_base encoding if model not found
    • Accounts for message format tax (+4 tokens per message)
    • Adds name tax (+1 token) when present
    • Includes format tax for entire message (+2 tokens)

3. trim_history Function

Purpose: Trims conversation history to fit within token limit while preserving recent context.

  • Parameters:
    • messages: List of message dictionaries
    • max_tokens: Maximum allowed tokens (default: 10000)
    • model: Target model name (default: "gpt-4o")
  • Algorithm:
    • Counts total tokens in current history
    • Removes oldest messages (starting at index 1) until under limit
    • Preserves system message (index 0)
    • Accurately tracks token reduction including format taxes

4. Example Implementation

  • Sample conversation:
    • System message defining assistant role
    • Two user messages (one with name)
    • Two assistant responses
  • Demonstration:
    • Counts initial tokens in conversation
    • Trims history to fit within 300 tokens
    • Shows token count before and after trimming
    • Displays final trimmed message list

The example usage shows how to apply these functions to a conversation history, counting the initial tokens and then trimming it to fit within a 300-token limit.

3. Summarization of Older Context

For very long conversations that span many exchanges, implementing a summarization strategy becomes crucial for managing context effectively while staying within token limits. This approach involves condensing earlier parts of the conversation into a concise summary, which preserves the essential information while using significantly fewer tokens.

You can leverage the AI model's own capabilities by periodically calling it to analyze the conversation history and generate a meaningful summary. This summary can then replace a larger block of previous messages, maintaining the conversation's context while drastically reducing token usage.

The summarization process is particularly effective because it retains key discussion points and decisions while eliminating redundant or less relevant details, ensuring that future responses remain contextually appropriate without requiring the full conversation history.

Code example:

from openai import OpenAI
import tiktoken  # Import tiktoken

def summarize_context(history, api_key, model="gpt-3.5-turbo", max_summary_tokens=80):
    """Summarizes a conversation history.

    Args:
        history: A list of message dictionaries.
        api_key: Your OpenAI API key.
        model: The OpenAI model to use for summarization.
        max_summary_tokens: The maximum number of tokens for the summary.

    Returns:
        A list containing a single message with the summary, or the original
        history if summarization fails.
    """
    client = OpenAI(api_key=api_key)
    encoding = tiktoken.encoding_for_model(model) # Added encoding

    # Improved prompt with more specific instructions
    prompt = [
        {"role": "system", "content": "You are a helpful assistant that provides concise summaries of conversations."},
        {"role": "user", "content": (
            "Please provide a two-sentence summary of the following conversation, focusing on the key topics discussed and the main points:\n" +
            "\n".join([f'{m["role"]}: {m["content"]}' for m in history])
        )}
    ]

    prompt_token_count = len(encoding.encode(prompt[1]["content"])) # Count tokens
    if prompt_token_count > 4000: # Check if prompt is too long.
        print("Warning: Prompt is longer than 4000 tokens.  Consider Trimming History")

    try:
        response = client.chat.completions.create(
            model=model,
            messages=prompt,
            max_tokens=max_summary_tokens,
            temperature=0.2,  # Lower temperature for more focused summaries
        )
        summary = response.choices[0].message.content
        return [{"role": "system", "content": "Conversation summary: " + summary}]
    except Exception as e:
        print(f"Error during summarization: {str(e)}")
        return history  # Return original history if summarization fails

This code implements a conversation summarization function that helps manage long conversation histories. Here's a breakdown of its key components:

Function Overview:

The summarize_context function takes a conversation history and converts it into a concise two-sentence summary. Key parameters include:

  • history: The conversation messages to summarize
  • api_key: OpenAI API authentication
  • model: The AI model to use (defaults to GPT-3.5-turbo)
  • max_summary_tokens: Maximum length of the summary

Key Features:

  • Uses tiktoken for accurate token counting
  • Implements token limit checks (warns if over 4000 tokens)
  • Uses a low temperature (0.2) for more consistent summaries
  • Falls back to original history if summarization fails

Process:

  • Initializes OpenAI client and sets up token encoding
  • Creates a prompt with specific instructions for summarization
  • Formats the conversation history into a readable format
  • Makes API call to generate the summary
  • Returns the summary as a system message or original history if there's an error

This function is particularly useful when conversation histories become too long, as it helps maintain context while reducing token usage. The summarized version can then be prepended to more recent messages to maintain conversation continuity.

7.2.3 Putting It All Together

Now, let's explore how to seamlessly combine the trimming and summarization techniques we discussed into a unified, effective workflow. This integration is crucial for maintaining optimal conversation management, as it allows us to handle both immediate token constraints and long-term context preservation in a single, coordinated process.

The following code example demonstrates how these components work together to create a robust conversation management system that can handle everything from basic message processing to complex context maintenance.

from openai import OpenAI
from typing import List, Dict
import tiktoken
from datetime import datetime

class ConversationManager:
    def __init__(self, api_key: str, model: str = "gpt-3.5-turbo"):
        self.client = OpenAI(api_key=api_key)
        self.model = model
        self.threads: Dict[str, List[Dict]] = {}
        self.encoding = tiktoken.encoding_for_model(model)
        
    def append_message(self, user_id: str, role: str, content: str) -> None:
        """Add a new message to the user's conversation thread."""
        if user_id not in self.threads:
            self.threads[user_id] = [
                {"role": "system", "content": "You are a helpful assistant."}
            ]
        self.threads[user_id].append({"role": role, "content": content})

    def count_tokens(self, messages: List[Dict]) -> int:
        """Count tokens in the message list."""
        num_tokens = 0
        for message in messages:
            num_tokens += 4  # Message format tax
            for key, value in message.items():
                num_tokens += len(self.encoding.encode(str(value)))
                if key == "name":
                    num_tokens += 1  # Name field tax
        return num_tokens + 2  # Add final format tax

    def summarize_context(self, history: List[Dict], max_summary_tokens: int = 300) -> List[Dict]:
        """Generate a summary of the conversation history."""
        try:
            prompt = [
                {"role": "system", "content": "Summarize this conversation in two concise sentences:"},
                {"role": "user", "content": "\n".join([f"{m['role']}: {m['content']}" 
                                                     for m in history if m['role'] != "system"])}
            ]
            
            response = self.client.chat.completions.create(
                model=self.model,
                messages=prompt,
                max_tokens=max_summary_tokens,
                temperature=0.3
            )
            
            timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            return [{
                "role": "system",
                "content": f"Summary as of {timestamp}: {response.choices[0].message.content}"
            }]
        except Exception as e:
            print(f"Summarization failed: {str(e)}")
            return history

    def trim_history(self, messages: List[Dict], max_tokens: int) -> List[Dict]:
        """Trim conversation history to fit within token limit."""
        while self.count_tokens(messages) > max_tokens and len(messages) > 1:
            # Remove second message (preserve system message)
            messages.pop(1)
        return messages

    def manage_context_thread(self, user_id: str, new_message: str, 
                            max_tokens: int = 4000, 
                            summary_threshold: int = 3000) -> List[Dict]:
        """Manage conversation context with token limits and summarization."""
        # Add new message
        self.append_message(user_id, "user", new_message)
        history = self.threads[user_id]
        current_tokens = self.count_tokens(history)

        # Log for debugging
        print(f"Current token count: {current_tokens}")
        
        # If approaching token limit, summarize older context
        if current_tokens > summary_threshold:
            # Keep system message and last 3 message pairs (6 messages)
            recent_messages = history[-6:] if len(history) > 6 else history
            older_messages = history[1:-6] if len(history) > 6 else []
            
            if older_messages:
                summary = self.summarize_context(older_messages)
                history = [history[0]] + summary + recent_messages
                print("Context summarized")

        # Ensure we're under max tokens
        history = self.trim_history(history, max_tokens)
        self.threads[user_id] = history
        
        return history

# Example usage
if __name__ == "__main__":
    manager = ConversationManager("your-api-key-here")
    
    # Simulate a conversation
    user_id = "user123"
    messages = [
        "Hello! How can you help me today?",
        "I'd like to learn about machine learning.",
        "Can you explain neural networks?",
        "What's the difference between supervised and unsupervised learning?"
    ]
    
    # Process messages
    for msg in messages:
        history = manager.manage_context_thread(user_id, msg)
        print(f"\nMessage: {msg}")
        print(f"Thread length: {len(history)}")
        print(f"Token count: {manager.count_tokens(history)}")

Code Breakdown:

  1. Class Structure
  • ConversationManager class handles all aspects of conversation management
    • Initializes with OpenAI API key and model selection
    • Maintains conversation threads for multiple users
    • Uses tiktoken for accurate token counting
  1. Core Methods
  • append_message()
    • Adds new messages to user-specific conversation threads
    • Initializes new threads with a system message
  • count_tokens()
    • Accurately counts tokens including format taxes
    • Accounts for message structure and name fields
  1. Advanced Features
  • summarize_context()
    • Uses OpenAI API to generate concise summaries
    • Includes timestamps for context
    • Handles errors gracefully
  • trim_history()
    • Removes oldest messages while preserving system message
    • Ensures conversation stays within token limits
  1. Main Management Logic
  • manage_context_thread()
    • Implements three-phase context management:
      • Addition of new messages
      • Summarization of older context
      • Token limit enforcement
    • Uses separate thresholds for summarization and maximum tokens
  1. Usage Example
  • Demonstrates practical implementation with multiple messages
  • Includes token counting and thread length monitoring
  • Shows how to maintain conversation context over multiple exchanges

With this approach:

  1. New messages are added.
  2. History length is checked.
  3. If over the limit, you summarize the entire conversation, keep the summary, and the last few messages.
  4. Trim any remaining excess.

Managing threads and context windows is a foundational aspect of building scalable, coherent multi-turn conversations in AI systems. Here's why it matters and how it works:

First, thread segmentation by user ensures that each conversation remains isolated and personal. When you maintain separate conversation threads for each user, you prevent cross-contamination of context between different conversations, allowing for more personalized and accurate responses.

Second, token counting and trimming serve as essential maintenance tools. By actively monitoring the token count of conversations, you can prevent hitting model context limits while preserving the most relevant information. This process involves carefully removing older messages while maintaining crucial context, similar to how human conversations naturally focus on recent and relevant information.

Third, context summarization acts as a memory compression technique. When conversations grow long, summarizing older context allows you to maintain the essential narrative while reducing token usage. This is similar to how humans maintain the gist of earlier conversations without remembering every detail.

The combination of these strategies results in an AI assistant that can:

  • Maintain consistent context across multiple conversation turns
  • Scale efficiently without degrading performance
  • Provide relevant responses based on both recent and historical context
  • Adapt to different conversation lengths and complexities

These capabilities ensure your AI assistant remains responsive, informed, and context-aware throughout extended dialogues, creating a more natural and effective conversation experience.
