Chapter 3: Embeddings and Semantic Search
3.2 When to Use Embeddings
Embeddings change how applications process text. Traditional methods rely on exact matches or keyword search; embeddings instead convert words, phrases, or whole documents into high-dimensional numerical vectors whose geometry reflects meaning. Texts with similar meanings map to nearby vectors, so a program can compare texts by semantic similarity rather than by shared characters.
The following subsections cover the key scenarios where embeddings are particularly valuable and show how each one works in practice. Understanding these use cases helps developers and organizations decide where embeddings fit in their own applications.
3.2.1 Semantic search
Semantic search finds relevant information based on meaning rather than keywords alone. Unlike traditional keyword-based search, which matches exact words or phrases, semantic search compares the meaning of a query with the meaning of each document, so it can handle variations in wording, context, and user intent.
For example, a search for "natural language processing" would also return relevant results about "NLP," "computational linguistics," or "text analysis." When a user searches for "treating common cold symptoms," the system can also return results about "flu remedies," "reducing fever," and "cough medicine," even if the query's exact words never appear in them. Under the hood, each piece of text is transformed into a high-dimensional embedding vector, and the system calculates a similarity score (typically cosine similarity) between the query vector and each document vector. This approach produces search results that account for:
- Synonyms and related terms (like "car" and "automobile")
- Conceptual relationships (connecting "python" to both programming and snakes, depending on context)
- Multiple languages (finding relevant content written in a different language than the query, provided the embedding model was trained on multilingual text)
- Contextual variations (understanding that "apple" could refer to either the fruit or the technology company)
- Intent matching (recognizing that "how to fix a flat tire" and "tire repair instructions" are seeking the same information)
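To see why keyword matching struggles with the cases listed above, here is a minimal sketch in plain Python. The keyword_overlap function is a hypothetical helper written for this illustration (it is not part of any library); it scores two strings by the share of words they have in common.

```python
def keyword_overlap(query: str, document: str) -> float:
    """Jaccard overlap between the lowercase word sets of two strings."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words or not doc_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words | doc_words)


# Pairs taken from the list above: same meaning, little or no shared vocabulary.
print(keyword_overlap("car", "automobile"))  # 0.0
print(keyword_overlap("how to fix a flat tire", "tire repair instructions"))  # 0.125
```

Both pairs mean nearly the same thing, yet a purely lexical score treats them as unrelated or barely related. Embedding-based similarity is meant to close exactly this gap.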
Example:
The following code example demonstrates semantic search using OpenAI embeddings.
This script will:
- Define a small set of documents.
- Generate embeddings for these documents and a search query.
- Calculate the similarity between the query and each document.
- Rank the documents by relevance based on semantic similarity.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np # For cosine similarity calculation
import datetime
# --- Configuration ---
load_dotenv()
# Log when the example is run
print(f"Running Semantic Search example at: {datetime.datetime.now():%Y-%m-%d %H:%M:%S}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings example)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print(f"Generating embedding for: \"{text[:50]}...\"") # Print truncated text
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        print("Embedding generation successful.")
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{text[:50]}...': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{text[:50]}...': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings example)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        # print("Error: Cannot calculate similarity with None vectors.")
        return 0.0 # Return 0 if any vector is missing
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        # print("Warning: One or both vectors have zero magnitude.")
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return similarity
# --- Semantic Search Implementation ---
# 1. Define your document store (a list of text strings)
# In a real application, this could come from a database, files, etc.
document_store = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
    "Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy.",
    "Artificial intelligence research focuses on creating systems capable of performing tasks that typically require human intelligence.",
    "A recipe for classic French onion soup involves caramelizing onions and topping with bread and cheese.",
    "Machine learning, a subset of AI, involves algorithms that allow systems to learn from data.",
    "The Louvre Museum in Paris is the world's largest art museum and a historic monument.",
    "Natural Language Processing (NLP) enables computers to understand and process human language.",
    "Baking bread requires careful measurement of ingredients like flour, water, yeast, and salt."
]
print(f"\nDocument store contains {len(document_store)} documents.")
# 2. Generate embeddings for all documents in the store (pre-computation)
# In a real app, you'd store these embeddings alongside the documents.
print("\nGenerating embeddings for the document store...")
document_embeddings = []
for doc in document_store:
    embedding = get_embedding(client, doc)
    # Store the document text and its embedding together
    if embedding: # Only store if embedding was successful
        document_embeddings.append({"text": doc, "embedding": embedding})
    else:
        print(f"Skipping document due to embedding error: \"{doc[:50]}...\"")
print(f"\nSuccessfully generated embeddings for {len(document_embeddings)} documents.")
# 3. Define the user's search query
search_query = "What is AI?"
# search_query = "Things to see in Paris"
# search_query = "How does NLP work?"
# search_query = "Cooking instructions"
print(f"\nSearch Query: \"{search_query}\"")
# 4. Generate embedding for the search query
print("\nGenerating embedding for the search query...")
query_embedding = get_embedding(client, search_query)
# 5. Calculate similarity and rank documents
search_results = []
if query_embedding and document_embeddings:
    print("\nCalculating similarities...")
    for doc_data in document_embeddings:
        similarity = cosine_similarity(query_embedding, doc_data["embedding"])
        search_results.append({"text": doc_data["text"], "score": similarity})
    # Sort results by similarity score in descending order
    search_results.sort(key=lambda x: x["score"], reverse=True)
    # 6. Display results
    print("\n--- Semantic Search Results ---")
    print(f"Top results for query: \"{search_query}\"\n")
    if not search_results:
        print("No results found (or error calculating similarities).")
    else:
        # Display top N results (e.g., top 3)
        top_n = 3
        for i, result in enumerate(search_results[:top_n]):
            print(f"{i+1}. Score: {result['score']:.4f}")
            print(f"   Text: {result['text']}")
            print("-" * 10)
        if len(search_results) > top_n:
            print(f"(Showing top {top_n} of {len(search_results)} results)")
else:
    print("\nCould not perform search.")
    if not query_embedding:
        print("Reason: Failed to generate embedding for the search query.")
    if not document_embeddings:
        print("Reason: No document embeddings were successfully generated.")
Code Breakdown Explanation:
- Setup & Helpers: Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions from the previous example.
- Document Store: A simple Python list (document_store) holds the text content of the documents we want to search through. In a real application, this data would likely come from a database or file system.
- Document Embedding Generation:
  - The script iterates through each document in the document_store.
  - It calls get_embedding for each document to get its numerical representation.
  - It stores the original document text and its corresponding embedding vector together (e.g., in a list of dictionaries). This pre-computation step is crucial for efficiency in real systems: you generate document embeddings once and store them. Error handling ensures documents are skipped if embedding fails.
- Search Query: A sample search_query string is defined.
- Query Embedding Generation: The get_embedding function is called again, this time for the search_query.
- Similarity Calculation & Ranking:
  - It checks if both the query embedding and document embeddings were successfully generated.
  - It iterates through the stored document_embeddings.
  - For each document, it calculates the cosine_similarity between the query_embedding and the document's embedding.
  - The document text and its calculated similarity score are stored in a search_results list.
  - Finally, search_results.sort(...) arranges the list based on the score in descending order (highest similarity first).
- Display Results: The script prints the top N (e.g., 3) most relevant documents from the sorted list, showing their similarity score and text content.
This example clearly illustrates the core concept of semantic search: converting both documents and queries into embeddings and then using vector similarity (like cosine similarity) to find documents that are semantically related to the query, even if they don't share the exact keywords.
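The loop-based ranking above is easy to follow, but once document embeddings are pre-computed they can also be stacked into a single NumPy matrix and scored in one step. The following is a minimal sketch of that idea, not a replacement for the script above: top_k_by_cosine is a hypothetical helper name, and the 3-dimensional toy vectors stand in for real embeddings so the example runs without an API key.

```python
import numpy as np


def top_k_by_cosine(query_vec, doc_matrix, k=3):
    """Return (row index, score) pairs for the k rows most similar to query_vec."""
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    query_norm = np.linalg.norm(query_vec)
    if query_norm == 0:
        return []
    # Normalize once so that a dot product equals cosine similarity.
    doc_norms = np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    doc_unit = doc_matrix / np.where(doc_norms == 0, 1.0, doc_norms)
    scores = doc_unit @ (query_vec / query_norm)
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Toy vectors standing in for real embeddings (hypothetical values).
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([1.0, 0.0, 0.0])

for index, score in top_k_by_cosine(toy_query, toy_docs, k=2):
    print(index, round(score, 4))
```

With real data, doc_matrix would be built from the stored embedding vectors, and the returned row indices would map back to the document texts.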
3.2.2 Topic clustering
Topic clustering organizes large document collections by automatically grouping documents according to their semantic content. The system converts each document into a high-dimensional embedding vector that captures its meaning, then uses a clustering algorithm to group similar vectors together.
With this approach, a system can:
- Identify thematic patterns across thousands of documents without manual labeling - the system can automatically detect common topics and themes across vast document collections, saving countless hours of manual categorization work
- Group similar discussions, articles, or content pieces into intuitive categories - by understanding the semantic relationships between documents, the system can create meaningful groupings that reflect natural topic divisions, even when documents use different terminology to discuss the same concepts
- Discover emerging topics and trends within large document collections - as new content is added, the system can identify new thematic clusters forming, helping organizations stay ahead of developing trends in their field
- Create dynamic content hierarchies that adapt as new documents are added - unlike traditional static categorization systems, embedding-based clustering can automatically reorganize and refine category structures as the content collection grows and evolves
For example, a news organization could use topic clustering to automatically group thousands of articles into categories like "Technology", "Politics", or "Sports", even when these topics aren't explicitly tagged. Because the embeddings reflect the meaning and context of the content, not just keywords, the grouping can pick up subtle distinctions. An article about the economic impact of sports stadiums, for instance, sits close to both the "Sports" and "Business" groups (a hard-assignment algorithm such as K-Means will still place it in exactly one cluster), while articles about different programming languages all land in a "Technology" cluster despite using very different terminology.
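One way to express that a document relates to more than one topic is to compare its vector with every cluster center and keep all centers above a similarity threshold. The sketch below assumes cluster centroids are already available (for example, from a clustering run); nearby_clusters, the 0.6 threshold, and the 2-dimensional toy vectors are illustrative assumptions rather than part of any library.

```python
import numpy as np


def nearby_clusters(doc_vec, centroids, names, min_similarity=0.6):
    """List every cluster whose centroid has cosine similarity >= min_similarity."""
    doc_vec = np.asarray(doc_vec, dtype=float)
    matches = []
    for name, centroid in zip(names, np.asarray(centroids, dtype=float)):
        sim = float(doc_vec @ centroid / (np.linalg.norm(doc_vec) * np.linalg.norm(centroid)))
        if sim >= min_similarity:
            matches.append((name, round(sim, 3)))
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical 2-D centroids; real embeddings have hundreds or thousands of dimensions.
cluster_names = ["Sports", "Business", "Technology"]
cluster_centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
stadium_article = [0.7, 0.7]  # sits between the Sports and Business centroids

print(nearby_clusters(stadium_article, cluster_centroids, cluster_names))
```

The threshold controls how generous the multi-topic assignment is and would need tuning on real data.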
Example:
Below is a code example that demonstrates topic clustering using OpenAI embeddings and the K-Means algorithm from scikit-learn.
This code will:
- Define a list of sample documents covering different implicit topics.
- Generate embeddings for each document using OpenAI's API.
- Apply the K-Means clustering algorithm to group the embedding vectors.
- Display the documents belonging to each identified cluster.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np
from sklearn.cluster import KMeans # For clustering algorithm
import datetime
# --- Configuration ---
load_dotenv()
# Log when the example is run
print(f"Running Topic Clustering example at: {datetime.datetime.now():%Y-%m-%d %H:%M:%S}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    # Truncate text for printing if it's too long
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        # print("Embedding generation successful.") # Reduce verbosity
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Topic Clustering Implementation ---
# 1. Define your collection of documents
# These documents cover roughly 3 topics: AI/Tech, Travel/Geography, Food/Cooking
documents = [
    "Artificial intelligence research focuses on creating systems capable of performing tasks that typically require human intelligence.",
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
    "A recipe for classic French onion soup involves caramelizing onions and topping with bread and cheese.",
    "Machine learning, a subset of AI, involves algorithms that allow systems to learn from data.",
    "The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials.",
    "Natural Language Processing (NLP) enables computers to understand and process human language.",
    "Baking bread requires careful measurement of ingredients like flour, water, yeast, and salt.",
    "The Colosseum in Rome, Italy, is an oval amphitheatre in the centre of the city.",
    "Deep learning utilizes artificial neural networks with multiple layers to model complex patterns.",
    "Sushi is a traditional Japanese dish of prepared vinegared rice, usually with some sugar and salt, accompanying a variety of ingredients, such as seafood, often raw, and vegetables."
]
print(f"\nDocument collection contains {len(documents)} documents.")
# 2. Generate embeddings for all documents
print("\nGenerating embeddings for the document collection...")
embeddings = []
valid_documents = [] # Keep track of documents for which embedding was successful
for doc in documents:
    embedding = get_embedding(client, doc)
    if embedding:
        embeddings.append(embedding)
        valid_documents.append(doc) # Add corresponding document text
    else:
        print(f"Skipping document due to embedding error: \"{doc[:70]}...\"")
if not embeddings:
    print("\nError: No embeddings were generated. Cannot perform clustering.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(valid_documents)} documents.")
# Convert embeddings list to a NumPy array for scikit-learn
embedding_matrix = np.array(embeddings)
# 3. Apply Clustering Algorithm (K-Means)
# We need to choose the number of clusters (k). Let's assume we expect 3 topics.
# In real applications, determining the optimal 'k' often requires experimentation
# (e.g., using the elbow method or silhouette scores).
n_clusters = 3
print(f"\nApplying K-Means clustering with k={n_clusters}...")
try:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10) # n_init suppresses warning
    kmeans.fit(embedding_matrix)
    cluster_labels = kmeans.labels_
    print("Clustering complete.")
except Exception as e:
    print(f"An error occurred during clustering: {e}")
    exit()
# 4. Display Documents by Cluster
print(f"\n--- Documents Grouped by Topic Cluster (k={n_clusters}) ---")
# Create a dictionary to hold documents for each cluster
clustered_documents = {i: [] for i in range(n_clusters)}
# Assign each document (that had a valid embedding) to its cluster
for i, label in enumerate(cluster_labels):
    clustered_documents[label].append(valid_documents[i])
# Print the contents of each cluster
for cluster_id, docs_in_cluster in clustered_documents.items():
    print(f"\nCluster {cluster_id + 1}:")
    if not docs_in_cluster:
        print("  (No documents in this cluster)")
    else:
        for doc_text in docs_in_cluster:
            # Print truncated document text for readability
            print_text = doc_text[:100] + "..." if len(doc_text) > 100 else doc_text
            print(f"  - {print_text}")
    print("-" * 20)
print("\nNote: The quality of clustering depends on the data, the embedding model,")
print("and the chosen number of clusters (k). Cluster numbers are arbitrary.")
Code Breakdown Explanation:
- Setup & Helpers:
  - Includes standard imports plus KMeans from sklearn.cluster.
  - Initializes the OpenAI client.
  - Includes the get_embedding helper function (same as before).
- Document Collection: A list named documents holds the text content. The sample documents are chosen to represent a few distinct underlying topics (AI/Tech, Travel/Geography, Food/Cooking).
- Embedding Generation:
  - The script iterates through the documents.
  - It calls get_embedding for each document.
  - It stores the successful embeddings in the embeddings list and the corresponding document text in valid_documents. This ensures that the indices match later.
  - Error handling skips documents if embedding generation fails.
  - The list of embedding vectors is converted into a NumPy array (embedding_matrix), which is the standard input format for scikit-learn algorithms.
- Clustering (K-Means):
  - Choosing k: The number of clusters (n_clusters) is set (here, k=3, assuming we expect three topics based on the sample data). A comment highlights that finding the optimal k is often a separate task in real-world scenarios.
  - Initialization: A KMeans object is created. n_clusters specifies the desired number of groups. random_state ensures reproducibility. n_init=10 runs the algorithm multiple times with different starting centroids and chooses the best result (suppresses a future warning).
  - Fitting: kmeans.fit(embedding_matrix) performs the K-Means clustering algorithm on the document embeddings. It finds cluster centers and assigns each embedding vector to the nearest center.
  - Labels: kmeans.labels_ contains an array where each element indicates the cluster ID (0, 1, 2, etc.) assigned to the corresponding document embedding.
- Displaying Results:
  - A dictionary (clustered_documents) is created to organize the results, with keys representing cluster IDs.
  - The script iterates through the cluster_labels assigned by K-Means. For each document's index i, it finds its assigned label and appends the corresponding text from valid_documents[i] to the list for that cluster ID in the dictionary.
  - Finally, it loops through the clustered_documents dictionary and prints the text of the documents belonging to each cluster, clearly grouping them by the topic cluster identified by the algorithm.
This example demonstrates the power of embeddings for unsupervised topic discovery. By converting text to vectors, we can use mathematical algorithms like K-Means to group semantically similar documents without needing pre-defined labels.
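The script fixes k=3 in advance, and its comments mention silhouette scores as one way to choose k. A minimal sketch of that selection step follows. It uses scikit-learn's silhouette_score on synthetic vectors (three well-separated groups generated with NumPy) in place of real embeddings; pick_k_by_silhouette is a hypothetical helper name, and on real embedding data the scores are usually far less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_k_by_silhouette(vectors, candidate_ks, random_state=42):
    """Fit K-Means for each candidate k and return (best_k, {k: silhouette score})."""
    scores = {}
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, random_state=random_state, n_init=10).fit_predict(vectors)
        scores[k] = float(silhouette_score(vectors, labels))
    best_k = max(scores, key=scores.get)  # higher silhouette means tighter, better-separated clusters
    return best_k, scores


# Synthetic stand-in for an embedding matrix: three tight groups of 20 points in 8 dimensions.
rng = np.random.default_rng(0)
centers = 10.0 * np.eye(3, 8)
synthetic_vectors = np.vstack([c + rng.normal(scale=0.5, size=(20, 8)) for c in centers])

best_k, scores = pick_k_by_silhouette(synthetic_vectors, candidate_ks=range(2, 6))
print("silhouette by k:", {k: round(s, 3) for k, s in scores.items()})
print("best k:", best_k)
```

In the clustering script, embedding_matrix would take the place of synthetic_vectors, with candidate values of k kept below the number of documents.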
3.2.3 Recommendation systems
Recommendation systems suggest related items by understanding the deeper connections between different pieces of content. Embeddings let such systems provide personalized recommendations by analyzing the semantic relationships between items, and the embedding vectors can capture subtle patterns and similarities that are not obvious from explicit tags or categories.
Here's how recommendation systems leverage embeddings:
- Content-Based Filtering
- Systems analyze the actual content characteristics (like text descriptions, features, or attributes)
- Each item is converted into an embedding vector that represents its key features
- Similar items are found by measuring the distance between these vectors
- Collaborative Filtering Enhancement
- User behaviors and preferences are also converted into embeddings
- The system can identify patterns in user-item interactions
- This helps predict which items a user might like based on similar users' preferences
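One simple way to turn user behavior into an embedding is to average the vectors of the items a user liked and rank the remaining items against that profile. The sketch below illustrates only this idea; it is a simplification rather than a full collaborative filtering method, recommend_for_user is a hypothetical helper, and the 3-dimensional item vectors are made-up stand-ins for stored embeddings.

```python
import numpy as np


def recommend_for_user(liked_ids, item_vectors, top_n=2):
    """Rank items the user has not liked yet by cosine similarity to their average liked vector."""
    profile = np.mean([item_vectors[item_id] for item_id in liked_ids], axis=0)
    ranked = []
    for item_id, vec in item_vectors.items():
        if item_id in liked_ids:
            continue  # skip items the user already knows
        vec = np.asarray(vec, dtype=float)
        score = float(profile @ vec / (np.linalg.norm(profile) * np.linalg.norm(vec)))
        ranked.append((item_id, score))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Hypothetical 3-D item vectors; in practice these would be stored embeddings.
toy_item_vectors = {
    "scifi_a": [0.9, 0.1, 0.0],
    "scifi_b": [0.8, 0.2, 0.1],
    "scifi_c": [0.85, 0.05, 0.1],
    "comedy_a": [0.1, 0.9, 0.0],
    "nature_doc": [0.0, 0.1, 0.9],
}

print(recommend_for_user({"scifi_a", "scifi_b"}, toy_item_vectors))
```

A user who liked two science fiction items gets the third science fiction item ranked first, because its vector lies closest to the averaged profile.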
For example, a video streaming service can recommend shows not just based on genre tags, but also based on thematic elements, storytelling styles, and narrative patterns. To the extent that these qualities are described in the text being embedded (synopses, reviews, or transcripts), the embedding vectors can capture nuanced features like:
- Pacing and plot complexity
- Character development styles
- Emotional tone and atmosphere
- Visual and directorial techniques
Similarly, e-commerce platforms can suggest products by understanding the contextual similarities in product descriptions, user behavior, and item characteristics. This includes analyzing:
- Product descriptions and features
- User browsing and purchase patterns
- Price points and quality levels
- Brand relationships and market positioning
This semantic understanding can produce more relevant recommendations than traditional methods that rely solely on explicit categories or user ratings, because the system can surface connections that those approaches miss. The result is a more engaging and personalized user experience.
Example:
The following code example demonstrates how OpenAI embeddings can be used to build a simple content-based recommendation system.
This script will:
- Define a small catalog of items (e.g., movie descriptions).
- Generate embeddings for these items.
- Choose a target item.
- Find other items in the catalog that are semantically similar to the target item based on their embeddings.
- Present the most similar items as recommendations.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np # For cosine similarity calculation
import datetime
# --- Configuration ---
load_dotenv()
# Log when the example is run
print(f"Running Recommendation System example at: {datetime.datetime.now():%Y-%m-%d %H:%M:%S}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    # Truncate text for printing if it's too long
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        # print("Embedding generation successful.") # Reduce verbosity
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0 # Return 0 if any vector is missing
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return similarity
# --- Recommendation System Implementation ---
# 1. Define your item catalog (e.g., movie descriptions)
# In a real application, this would come from a database.
item_catalog = [
    {"id": "mov001", "title": "Space Odyssey: The Final Frontier", "description": "A visually stunning sci-fi epic exploring humanity's place in the universe, featuring complex themes and groundbreaking special effects."},
    {"id": "mov002", "title": "Galactic Wars: Attack of the Clones", "description": "An action-packed space opera with laser battles, alien creatures, and a classic good versus evil storyline."},
    {"id": "com001", "title": "Laugh Riot", "description": "A slapstick comedy about mistaken identities and hilarious mishaps during a weekend getaway."},
    {"id": "doc001", "title": "Wonders of the Deep", "description": "An awe-inspiring documentary showcasing the beauty and mystery of marine life in the world's oceans."},
    {"id": "mov003", "title": "Cyber City 2077", "description": "A gritty cyberpunk thriller set in a dystopian future, exploring themes of technology, consciousness, and rebellion."},
    {"id": "com002", "title": "The Office Party", "description": "A witty ensemble comedy centered around awkward interactions and office politics during an annual holiday celebration."},
    {"id": "doc002", "title": "Cosmic Journeys", "description": "A documentary exploring the vastness of space, black holes, distant galaxies, and the search for extraterrestrial life."},
    {"id": "mov004", "title": "Interstellar Echoes", "description": "A philosophical science fiction film about astronauts travelling through a wormhole in search of a new home for humanity."}
]
print(f"\nItem catalog contains {len(item_catalog)} items.")
# 2. Generate embeddings for all items in the catalog (pre-computation)
print("\nGenerating embeddings for the item catalog...")
item_embeddings_data = []
for item in item_catalog:
    # Combine title and description for a richer embedding
    text_to_embed = f"{item['title']}: {item['description']}"
    embedding = get_embedding(client, text_to_embed)
    if embedding:
        # Store item ID and its embedding
        item_embeddings_data.append({"id": item["id"], "embedding": embedding})
    else:
        print(f"Skipping item {item['id']} due to embedding error.")
if not item_embeddings_data:
    print("\nError: No embeddings were generated. Cannot provide recommendations.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(item_embeddings_data)} items.")
# 3. Select a target item for which to find recommendations
target_item_id = "mov001" # Let's find movies similar to "Space Odyssey"
print(f"\nFinding recommendations similar to item ID: {target_item_id}")
# Find the embedding for the target item
target_embedding = None
for item_data in item_embeddings_data:
    if item_data["id"] == target_item_id:
        target_embedding = item_data["embedding"]
        break
if target_embedding is None:
    print(f"Error: Could not find the embedding for the target item ID '{target_item_id}'.")
    exit()
# 4. Calculate similarity between the target item and all other items
recommendations = []
print("\nCalculating similarities...")
for item_data in item_embeddings_data:
    # Don't compare the item with itself
    if item_data["id"] == target_item_id:
        continue
    similarity = cosine_similarity(target_embedding, item_data["embedding"])
    recommendations.append({"id": item_data["id"], "score": similarity})
# 5. Sort potential recommendations by similarity score
recommendations.sort(key=lambda x: x["score"], reverse=True)
# 6. Display top N recommendations
print("\n--- Top Recommendations ---")
# Find the original title/description for the target item for context
target_item_info = next((item for item in item_catalog if item["id"] == target_item_id), None)
if target_item_info:
    print(f"Based on: \"{target_item_info['title']}\"\n")
if not recommendations:
    print("No recommendations found (or error calculating similarities).")
else:
    top_n = 3
    print(f"Top {top_n} most similar items:")
    for i, rec in enumerate(recommendations[:top_n]):
        # Find the full item details from the original catalog
        rec_details = next((item for item in item_catalog if item["id"] == rec["id"]), None)
        if rec_details:
            print(f"{i+1}. ID: {rec['id']}, Score: {rec['score']:.4f}")
            print(f"   Title: {rec_details['title']}")
            print(f"   Description: {rec_details['description'][:100]}...") # Truncate description
            print("-" * 10)
        else:
            print(f"{i+1}. ID: {rec['id']}, Score: {rec['score']:.4f} (Details not found)")
            print("-" * 10)
    if len(recommendations) > top_n:
        print(f"(Showing top {top_n} of {len(recommendations)} potential recommendations)")
Code Breakdown Explanation
This example demonstrates how to build a straightforward content-based recommendation system by combining OpenAI embeddings with cosine similarity calculations.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions previously defined.
- Item Catalog:
  - A list of dictionaries (item_catalog) represents the items available for recommendation (e.g., movies). Each item has an id, title, and description. In a real system, this would likely be loaded from a database.
- Item Embedding Generation:
  - The script iterates through each item in the item_catalog.
  - Content Combination: It combines the title and description into a single string (text_to_embed). This provides richer context to the embedding model than using just the title or description alone.
  - It calls get_embedding for this combined text.
  - It stores the item['id'] and its corresponding embedding vector together in the item_embeddings_data list. This pre-computation step is standard practice for recommendation systems.
- Target Item Selection:
  - A target_item_id variable is set to specify the item for which we want recommendations (e.g., find items similar to mov001).
  - The script retrieves the pre-computed embedding vector for this target_item_id from the item_embeddings_data list.
- Similarity Calculation:
  - It iterates through all the items with embeddings in item_embeddings_data.
  - Exclusion: It explicitly skips the comparison if the current item's ID matches the target_item_id (an item shouldn't recommend itself).
  - For every other item, it calculates the cosine_similarity between the target_embedding and the current item's embedding.
  - It stores the other item's id and its calculated similarity score in a recommendations list.
- Ranking Recommendations:
  - The recommendations list is sorted using recommendations.sort(...) based on the score field in descending order, placing the most similar items at the beginning of the list.
- Displaying Results:
  - The script prints the title of the target item for context.
  - It then iterates through the top N (e.g., 3) items in the sorted recommendations list.
  - For each recommended item ID, it looks up the full details (title, description) from the original item_catalog.
  - It prints the rank, ID, similarity score, title, and a truncated description for each recommended item.
This example effectively shows how embeddings capture semantic meaning, allowing the system to recommend items based on content similarity (e.g., recommending other philosophical sci-fi movies similar to "Space Odyssey") rather than just explicit tags or user history.
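Because item embeddings are pre-computed, it can help to store them so they are not regenerated on every run. The following is a minimal sketch of one way to do that with a JSON file. The helper names (save_embeddings, load_embeddings), the file name, and the toy vectors are assumptions made for illustration; a production system would more likely use a database or vector store.

```python
import json
import os
import tempfile


def save_embeddings(path, embeddings_data):
    """Write a list of {"id": ..., "embedding": [...]} records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(embeddings_data, f)


def load_embeddings(path):
    """Return the cached records, or an empty list if no cache file exists yet."""
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Toy 3-dimensional vectors stand in for real embeddings (hypothetical values)
toy_data = [
    {"id": "mov001", "embedding": [0.1, 0.2, 0.3]},
    {"id": "mov002", "embedding": [0.3, 0.2, 0.1]},
]

# Hypothetical cache location inside the system temp directory
cache_path = os.path.join(tempfile.gettempdir(), "item_embeddings_example.json")
save_embeddings(cache_path, toy_data)
restored = load_embeddings(cache_path)
print(f"Restored {len(restored)} cached embeddings")
```

With a cache like this, the script would only need to call get_embedding for items whose IDs are not already present in the loaded records.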
3.2.4 Context retrieval for AI assistants
Helping chatbots and AI systems find and use relevant information from large knowledge bases by converting both queries and stored knowledge into embeddings. This process involves several key steps:
First, the system converts all documents in its knowledge base into embedding vectors - numerical representations that capture the semantic meaning of the text. These embeddings are stored in a vector database for quick retrieval.
When an AI assistant receives a question, it converts that query into an embedding vector using the same process. This ensures that both the stored knowledge and the incoming questions are represented in the same mathematical space.
The system then performs a similarity search to find the most relevant information. This search compares the query embedding to all stored embeddings, typically using techniques like cosine similarity or nearest neighbor search. The beauty of this approach is that it can identify semantically similar content even when the exact wording differs significantly.
For example, a query about "laptop won't turn on" might match documentation about "computer power issues" because their embeddings capture the similar underlying meaning. A keyword-based search could miss this match, since the two phrases share no terms.
Once relevant information is identified, it can be used to generate more accurate, informed responses. This is particularly useful for domain-specific applications where the AI needs to access technical documentation, product information, or company policies. The system can handle complex queries by combining multiple pieces of relevant context, which helps keep responses both accurate and complete.
Example:
The following script demonstrates how an AI assistant can retrieve context using OpenAI embeddings. It searches a small knowledge base for the documents most relevant to a user's query, which can then be supplied as context for the assistant's response.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation
import datetime
# --- Configuration ---
load_dotenv()
# Get the current date; the location is a hard-coded example value
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Grapevine, Texas, United States"
print(f"Running Context Retrieval example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return similarity
# --- Context Retrieval Implementation ---
# 1. Define your Knowledge Base (list of text documents/chunks)
# This represents the information the AI assistant can draw upon.
knowledge_base = [
    {"id": "doc001", "source": "troubleshooting_guide.txt", "content": "If your laptop fails to power on, first check the power adapter connection. Ensure the cable is securely plugged into both the laptop and the wall outlet. Try a different outlet if possible."},
    {"id": "doc002", "source": "troubleshooting_guide.txt", "content": "A blinking power light often indicates a battery issue or a charging problem. Try removing the battery (if removable) and powering on with only the adapter connected."},
    {"id": "doc003", "source": "faq.html", "content": "To reset your password, go to the login page and click the 'Forgot Password' link. Follow the instructions sent to your registered email address."},
    {"id": "doc004", "source": "product_manual.pdf", "content": "The Model X laptop uses a USB-C port for charging. Ensure you are using the correct wattage power adapter (65W minimum recommended)."},
    {"id": "doc005", "source": "troubleshooting_guide.txt", "content": "No display output? Check if the laptop is making any sounds (fan spinning, beeps). Try connecting an external monitor to rule out a screen issue."},
    {"id": "doc006", "source": "support_articles/power_issues.md", "content": "Holding the power button down for 15-30 seconds can perform a hard reset, sometimes resolving power-on failures."},
    {"id": "doc007", "source": "faq.html", "content": "Software updates can be found in the 'System Settings' under the 'Updates' section. Ensure you are connected to the internet."}
]
print(f"\nKnowledge base contains {len(knowledge_base)} documents/chunks.")
# 2. Generate embeddings for the knowledge base (pre-computation)
print("\nGenerating embeddings for the knowledge base...")
kb_embeddings_data = []
for doc in knowledge_base:
    embedding = get_embedding(client, doc["content"])
    if embedding:
        # Store document ID and its embedding
        kb_embeddings_data.append({"id": doc["id"], "embedding": embedding})
    else:
        print(f"Skipping document {doc['id']} due to embedding error.")
if not kb_embeddings_data:
    print("\nError: No embeddings were generated for the knowledge base. Cannot retrieve context.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(kb_embeddings_data)} knowledge base documents.")
# 3. Define the user's query to the AI assistant
user_query = "My computer won't start up."
# user_query = "How do I update the system software?"
# user_query = "Screen is black when I press the power button."
print(f"\nUser Query: \"{user_query}\"")
# 4. Generate embedding for the user query
print("\nGenerating embedding for the user query...")
query_embedding = get_embedding(client, user_query)
# 5. Find relevant documents from the knowledge base using similarity search
retrieved_context = []
if query_embedding and kb_embeddings_data:
    print("\nCalculating similarities to find relevant context...")
    for doc_data in kb_embeddings_data:
        similarity = cosine_similarity(query_embedding, doc_data["embedding"])
        retrieved_context.append({"id": doc_data["id"], "score": similarity})
    # Sort context documents by similarity score in descending order
    retrieved_context.sort(key=lambda x: x["score"], reverse=True)
    # 6. Select Top N relevant documents to use as context
    top_n_context = 3
    print(f"\n--- Top {top_n_context} Relevant Context Documents Found ---")
    if not retrieved_context:
        print("No relevant context found (or error calculating similarities).")
    else:
        final_context_docs = []
        for i, context_item in enumerate(retrieved_context[:top_n_context]):
            # Find the full document details from the original knowledge base
            doc_details = next((doc for doc in knowledge_base if doc["id"] == context_item["id"]), None)
            if doc_details:
                print(f"{i+1}. ID: {context_item['id']}, Score: {context_item['score']:.4f}")
                print(f" Source: {doc_details['source']}")
                print(f" Content: {doc_details['content'][:150]}...")  # Truncate content
                print("-" * 10)
                final_context_docs.append(doc_details['content'])  # Store content for next step
            else:
                print(f"{i+1}. ID: {context_item['id']}, Score: {context_item['score']:.4f} (Details not found)")
                print("-" * 10)
        if len(retrieved_context) > top_n_context:
            print(f"(Showing top {top_n_context} of {len(retrieved_context)} potential context documents)")
        # --- Next Step (Conceptual - Not coded here) ---
        print("\n--- Next Step: Generating AI Assistant Response ---")
        print("The content from the relevant documents above would now be combined")
        print("with the original user query and sent to a model like GPT-4o")
        print("as context to generate an informed and accurate response.")
        print("Example prompt structure for GPT-4o:")
        print("```")
        print("System: You are a helpful AI assistant. Answer the user's question based ONLY on the provided context documents.")
        # Build the numbered list from however many documents were retrieved,
        # which avoids an IndexError when fewer than two documents are available
        context_preview = "\n".join(
            f"{n}. {text[:50]}..." for n, text in enumerate(final_context_docs, start=1)
        )
        print(f"User: Context Documents:\n{context_preview}\n\nQuestion: {user_query}\n\nAnswer:")
        print("```")
else:
    print("\nCould not retrieve context.")
    if not query_embedding:
        print("Reason: Failed to generate embedding for the user query.")
    if not kb_embeddings_data:
        print("Reason: No knowledge base embeddings were successfully generated.")
Code Breakdown Explanation
This example demonstrates the core mechanism behind context retrieval for AI assistants using embeddings – finding relevant information from a knowledge base to answer a user's query.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions.
- Knowledge Base Definition:
  - A list of dictionaries (knowledge_base) simulates the information store the AI assistant can access. Each dictionary represents a document or chunk of information and includes an id, source (optional metadata), and the actual text content.
- Knowledge Base Embedding Generation:
  - The script iterates through each doc in the knowledge_base.
  - It calls get_embedding on the doc["content"] to get its vector representation.
  - It stores the doc['id'] and its corresponding embedding vector together in kb_embeddings_data. This is the crucial pre-computation step: embeddings for the knowledge base are typically generated offline and stored (often in a specialized vector database) for fast retrieval.
- User Query:
  - A sample user_query string represents the question asked to the AI assistant.
- Query Embedding Generation:
  - The get_embedding function is called for the user_query to get its vector representation in the same embedding space as the knowledge base documents.
- Similarity Search (Context Retrieval):
  - It iterates through all the pre-computed embeddings in kb_embeddings_data.
  - For each knowledge base document, it calculates the cosine_similarity between the query_embedding and the document's embedding.
  - It stores the document's id and its similarity score relative to the query in a retrieved_context list.
- Ranking and Selection:
  - The retrieved_context list is sorted by score in descending order, bringing the most semantically relevant documents to the top.
  - The script selects the top N (e.g., 3) documents from this sorted list. These documents represent the most relevant context found in the knowledge base for the user's query.
- Displaying Retrieved Context:
  - The script prints the details (ID, score, source, content preview) of the top N context documents found.
- Conceptual Next Step (Crucial Explanation):
  - The final print statements explain the purpose of this retrieval process. The content of these final_context_docs would not be the final answer. Instead, they would be combined with the original user_query and passed as context to a large language model like GPT-4o in a subsequent API call.
  - An example prompt structure is shown, illustrating how the retrieved context grounds the AI assistant, enabling it to generate an informed response based on the relevant information found in the knowledge base, rather than relying solely on its general knowledge.
This example effectively demonstrates the retrieval part of Retrieval-Augmented Generation (RAG), showing how embeddings bridge the gap between a user's query and relevant information stored in a knowledge base, enabling more accurate and context-aware AI assistants.
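The script stops at retrieval and only prints a prompt outline. As a rough sketch of the generation step that would follow, the retrieved text can be assembled into a chat messages list. The helper name build_rag_messages is hypothetical, the prompt wording mirrors the outline printed by the script, and the final API call is left commented out because it needs a configured client.

```python
def build_rag_messages(user_query, context_docs):
    """Build a chat `messages` list that grounds the model in retrieved context."""
    # Number each retrieved document so the model can tell them apart
    numbered = "\n".join(
        f"{n}. {text}" for n, text in enumerate(context_docs, start=1)
    )
    system_prompt = (
        "You are a helpful AI assistant. Answer the user's question "
        "based ONLY on the provided context documents."
    )
    user_prompt = (
        f"Context Documents:\n{numbered}\n\n"
        f"Question: {user_query}\n\nAnswer:"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


# Two context strings taken from the sample knowledge base
messages = build_rag_messages(
    "My computer won't start up.",
    [
        "If your laptop fails to power on, first check the power adapter connection.",
        "Holding the power button down for 15-30 seconds can perform a hard reset.",
    ],
)
print(messages[1]["content"])

# With a configured client, the messages could then be sent like this:
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
# print(response.choices[0].message.content)
```

Keeping prompt assembly in its own function makes it easy to inspect exactly what context the model receives before any API call is made.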
3.2.5 Anomaly and similarity detection
Identifying unusual patterns or finding similar items in large datasets by comparing their semantic representations is a fundamental application of embedding technology. This powerful technique transforms raw data into mathematical vectors that capture the essence of their content, enabling sophisticated analysis at scale. Here's how these systems work and their key applications:
- Detect Anomalies
  - Flag unusual transactions or behaviors that deviate from normal patterns, for example detecting suspicious credit card purchases by comparing them against typical spending patterns
  - Identify potential security threats or fraud attempts, such as recognizing unusual login patterns or detecting fake accounts based on behavior analysis
  - Spot data quality issues or outliers in datasets, including incorrect data entries or unusual measurements that might indicate equipment malfunction
- Find Similarities
  - Group related documents, images, or data points based on semantic meaning, so that systems can cluster similar content even when the exact wording differs and large collections become easier to organize
  - Match similar customer inquiries or support tickets, helping customer service teams identify common issues and standardize responses to frequent problems
  - Identify duplicate or near-duplicate content, which helps content management systems maintain data quality and reduce redundancy
By converting data points into embedding vectors, systems can measure how "different" or "similar" items are to each other using mathematical distance calculations. Each item is mapped to a point in a high-dimensional space, where similar items sit closer together and dissimilar items sit farther apart. This representation makes it possible to flag unusual patterns or group related items automatically and at scale, supporting both anomaly detection and similarity matching in cases that are hard to cover with traditional rule-based systems.
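As a small illustration of the near-duplicate case, the sketch below flags pairs of items whose cosine similarity meets a threshold. The function name find_near_duplicates, the 0.95 threshold, and the 2-dimensional toy vectors are assumptions chosen for the example; a suitable threshold depends on the embedding model and the data, and every vector is assumed to be non-zero.

```python
import numpy as np


def find_near_duplicates(embeddings, threshold=0.95):
    """Return (i, j, similarity) for each pair at or above the threshold."""
    matrix = np.asarray(embeddings, dtype=float)
    # Scale each row to unit length so dot products equal cosine similarities
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit.T
    pairs = []
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):  # Visit each unordered pair once
            if sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
    return pairs


# Toy vectors: the first two point in almost the same direction
toy_vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
duplicates = find_near_duplicates(toy_vectors)
for i, j, score in duplicates:
    print(f"Items {i} and {j} look like near-duplicates (similarity {score:.4f})")
```

Here only the first two vectors are reported as a pair, because the third points in a very different direction.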
Example:
The following code example demonstrates similarity and anomaly detection using OpenAI embeddings.
This script will:
- Define a dataset of text items (e.g., descriptions of transactions or events).
- Generate embeddings for these items.
- Similarity Detection: Find items most similar to a given target item.
- Anomaly Detection: Identify items that are least similar (most anomalous) compared to the rest of the dataset using a simple average similarity approach.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation
import datetime
# --- Configuration ---
load_dotenv()
# Get the current date; the location is a hard-coded example value
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Houston, Texas, United States"
print(f"Running Similarity & Anomaly Detection example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        # Clamp the value to handle potential floating point inaccuracies slightly outside [-1, 1]
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return np.clip(similarity, -1.0, 1.0)
# --- Similarity and Anomaly Detection Implementation ---
# 1. Define your dataset (e.g., transaction descriptions, log entries)
# Includes mostly normal items and a couple of potentially anomalous ones.
dataset = [
    {"id": "txn001", "description": "Grocery purchase at Local Supermarket"},
    {"id": "txn002", "description": "Monthly subscription fee for streaming service"},
    {"id": "txn003", "description": "Dinner payment at Italian Restaurant"},
    {"id": "txn004", "description": "Online order for electronics from TechStore"},
    {"id": "txn005", "description": "Fuel purchase at Gas Station"},
    {"id": "txn006", "description": "Purchase of fresh produce and bread"},  # Similar to txn001
    {"id": "txn007", "description": "Payment for movie streaming subscription"},  # Similar to txn002
    {"id": "txn008", "description": "Unusual large wire transfer to overseas account"},  # Potential Anomaly 1
    {"id": "txn009", "description": "Purchase of rare antique collectible vase"},  # Potential Anomaly 2
    {"id": "txn010", "description": "Coffee purchase at Cafe Central"}
]
print(f"\nDataset contains {len(dataset)} items.")
# 2. Generate embeddings for all items in the dataset (pre-computation)
print("\nGenerating embeddings for the dataset...")
dataset_embeddings_data = []
for item in dataset:
    embedding = get_embedding(client, item["description"])
    if embedding:
        # Store item ID, description, and its embedding
        dataset_embeddings_data.append({
            "id": item["id"],
            "description": item["description"],
            "embedding": embedding
        })
    else:
        print(f"Skipping item {item['id']} due to embedding error.")
if not dataset_embeddings_data:
    print("\nError: No embeddings were generated. Cannot perform analysis.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(dataset_embeddings_data)} items.")
# --- Part A: Similarity Detection ---
print("\n--- Part A: Similarity Detection ---")
# Select a target item to find similar items for
target_item_id_similarity = "txn001"  # Find items similar to "Grocery purchase..."
print(f"Finding items similar to item ID: {target_item_id_similarity}")
# Find the target item's data
target_item_data = next((item for item in dataset_embeddings_data if item["id"] == target_item_id_similarity), None)
if target_item_data:
    target_embedding = target_item_data["embedding"]
    similar_items = []
    # Calculate similarity between the target and all other items
    for item_data in dataset_embeddings_data:
        if item_data["id"] == target_item_id_similarity:
            continue  # Skip self-comparison
        similarity = cosine_similarity(target_embedding, item_data["embedding"])
        similar_items.append({
            "id": item_data["id"],
            "description": item_data["description"],
            "score": similarity
        })
    # Sort by similarity score
    similar_items.sort(key=lambda x: x["score"], reverse=True)
    # Display top N similar items
    print(f"\nItems most similar to: \"{target_item_data['description']}\"")
    top_n_similar = 2
    for i, item in enumerate(similar_items[:top_n_similar]):
        print(f"{i+1}. ID: {item['id']}, Score: {item['score']:.4f}")
        print(f" Description: {item['description']}")
        print("-" * 10)
else:
    print(f"Error: Could not find data for target item ID '{target_item_id_similarity}'.")
# --- Part B: Anomaly Detection (Simple Approach) ---
print("\n--- Part B: Anomaly Detection (Low Average Similarity) ---")
# Calculate the average similarity of each item to all other items
item_avg_similarities = []
num_items = len(dataset_embeddings_data)
if num_items < 2:
    print("Need at least 2 items with embeddings to calculate average similarities.")
else:
    print("\nCalculating average similarities for anomaly detection...")
    for i in range(num_items):
        current_item = dataset_embeddings_data[i]
        total_similarity = 0
        # Compare current item to all others
        for j in range(num_items):
            if i == j:  # Don't compare item to itself
                continue
            other_item = dataset_embeddings_data[j]
            similarity = cosine_similarity(current_item["embedding"], other_item["embedding"])
            total_similarity += similarity
        # Calculate average similarity (avoid division by zero if only 1 item)
        average_similarity = total_similarity / (num_items - 1) if num_items > 1 else 0
        item_avg_similarities.append({
            "id": current_item["id"],
            "description": current_item["description"],
            "avg_score": average_similarity
        })
        print(f"Item ID {current_item['id']} - Avg Similarity: {average_similarity:.4f}")
    # Sort items by average similarity in ascending order (lowest first = most anomalous)
    item_avg_similarities.sort(key=lambda x: x["avg_score"])
    # Display top N potential anomalies (items least similar to others)
    print("\nPotential Anomalies (Lowest Average Similarity):")
    top_n_anomalies = 3
    for i, item in enumerate(item_avg_similarities[:top_n_anomalies]):
        print(f"{i+1}. ID: {item['id']}, Avg Score: {item['avg_score']:.4f}")
        print(f" Description: {item['description']}")
        print("-" * 10)
    print("\nNote: Low average similarity suggests an item is semantically")
    print("different from the majority of other items in this dataset.")
Code Breakdown Explanation
This example demonstrates using OpenAI embeddings for both finding similar items and detecting potential anomalies within a dataset based on semantic meaning.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions. The cosine_similarity function now includes np.clip to ensure the output is strictly within [-1, 1].
- Dataset Definition:
  - A list of dictionaries (dataset) simulates the data to be analyzed (e.g., transaction descriptions). Each item has an id and a text description. The sample data includes mostly common items and a few conceptually different ones intended as potential anomalies.
- Dataset Embedding Generation:
  - The script iterates through each item in the dataset.
  - It calls get_embedding on the item["description"].
  - It stores the item['id'], item['description'], and its corresponding embedding vector together in dataset_embeddings_data. This pre-computation is essential.
- Part A: Similarity Detection:
  - Target Selection: An item ID (target_item_id_similarity) is chosen to find similar items for.
  - Target Embedding Retrieval: The script finds the pre-computed embedding for the target item.
  - Comparison: It iterates through all other items in dataset_embeddings_data and calculates the cosine_similarity between the target item's embedding and each other item's embedding.
  - Ranking: The results (other item ID, description, similarity score) are stored and then sorted by score in descending order.
  - Display: The top N most similar items are printed.
- Part B: Anomaly Detection (Simple Average Similarity Approach):
  - Concept: This simple method identifies anomalies as items that have the lowest average semantic similarity to all other items in the dataset. An item that is very different conceptually from the rest will likely have low similarity scores when compared to most others.
  - Calculation:
    - The script iterates through each item (current_item) in dataset_embeddings_data.
    - For each current_item, it iterates through all other items in the dataset.
    - It calculates the cosine_similarity between the current_item and every other_item.
    - It sums these similarities and calculates the average similarity for the current_item.
  - Ranking: The items are stored along with their calculated avg_score and then sorted by this score in ascending order (lowest average similarity first).
  - Display: The top N items with the lowest average similarity scores are printed as potential anomalies. A note explains the interpretation.
This example showcases two powerful applications: finding related content (similarity) and identifying outliers (anomaly detection) by leveraging the semantic understanding captured within OpenAI embeddings.
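The nested loops in Part B are easy to follow, but they compute one similarity at a time. For larger datasets, the same average-similarity score can be obtained with a single matrix product. The following is a sketch of that idea, not part of the original script: the function name average_similarities and the toy 2-dimensional vectors are assumptions, and at least two rows are assumed.

```python
import numpy as np


def average_similarities(embeddings):
    """Return each row's mean cosine similarity to all other rows."""
    matrix = np.asarray(embeddings, dtype=float)
    # Normalize rows to unit length; zero vectors are left as zeros
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit.T          # Full pairwise cosine similarity matrix
    np.fill_diagonal(sims, 0.0)   # Exclude each item's similarity to itself
    return sims.sum(axis=1) / (len(matrix) - 1)


# Toy data: three vectors pointing in similar directions and one outlier
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
avg_scores = average_similarities(toy_vectors)
most_anomalous = int(np.argmin(avg_scores))
print(f"Lowest average similarity at index {most_anomalous}")
```

In this toy data the last vector has the lowest average similarity, which matches the intuition behind the loop-based version: the item least like the rest is the most likely anomaly.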
3.2.6 Clustering & Tagging
Automatically organize and label content based on semantic similarity - a powerful technique that uses embedding vectors to understand the true meaning and relationships between different pieces of content. This approach goes far beyond traditional keyword matching, allowing for much more nuanced and accurate content organization.
When content is clustered, similar items naturally group together based on their semantic meaning, even if they use different terminology to express the same concepts. For example, documents about "automotive maintenance" and "car repair" would cluster together despite using different words.
This intelligent organization helps create intuitive navigation systems, improves content discovery, and makes large document collections more manageable by grouping related items together. Some key benefits include:
- Automatic tag generation based on cluster themes
- Dynamic organization that adapts as new content is added
- Improved search relevance through semantic understanding
- Better content discovery through related-item suggestions
The clustering process can be fine-tuned to create either broad categories or more granular subcategories, depending on the specific needs of your content organization system. This flexibility makes it a valuable tool for managing everything from digital libraries to enterprise knowledge bases.
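To illustrate how an organization scheme can adapt as new content arrives, the sketch below assigns a new item to the closest existing cluster without re-running the clustering. It assumes cluster centroids are already available (for example, from a previous clustering run) and uses cosine similarity as the closeness measure. The function name assign_to_nearest_cluster and the toy vectors are hypothetical, and all vectors are assumed to be non-zero.

```python
import numpy as np


def assign_to_nearest_cluster(new_embedding, centroids):
    """Return the index of the centroid most similar to the new embedding."""
    vec = np.asarray(new_embedding, dtype=float)
    cents = np.asarray(centroids, dtype=float)
    # Cosine similarity between the new vector and every centroid
    sims = (cents @ vec) / (np.linalg.norm(cents, axis=1) * np.linalg.norm(vec))
    return int(np.argmax(sims))


# Toy 2-dimensional centroids for two existing clusters
toy_centroids = [[1.0, 0.0], [0.0, 1.0]]
new_item = [0.2, 0.9]
cluster_index = assign_to_nearest_cluster(new_item, toy_centroids)
print(f"New item assigned to cluster {cluster_index}")
```

In practice, clusters would still need to be recomputed periodically, since a steady stream of new content can shift the themes that the original centroids represent.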
Example:
Let's examine a code example that demonstrates clustering and tagging using OpenAI embeddings and GPT-4o.
This script will:
- Define a collection of documents.
- Generate embeddings for the documents.
- Cluster the documents using K-Means based on their embeddings.
- For each cluster, use GPT-4o to analyze the documents within it and generate a descriptive tag or label.
- Display the documents grouped by cluster along with their AI-generated tags.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np
from sklearn.cluster import KMeans  # For clustering algorithm
import datetime
# --- Configuration ---
load_dotenv()
# Get the current date; the location is a hard-coded example value
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "San Antonio, Texas, United States"
print(f"Running Clustering & Tagging example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function to Generate Cluster Tag using GPT-4o ---
def generate_cluster_tag(client, documents_in_cluster):
    """Uses GPT-4o to suggest a tag/label for a cluster of documents."""
    if not documents_in_cluster:
        return "Empty Cluster"
    # Combine content for context, limiting total length if necessary
    # Using first few hundred chars of each doc might be enough
    max_context_length = 3000 # Limit context to avoid excessive token usage
    context = ""
    for i, doc in enumerate(documents_in_cluster):
        doc_preview = f"Document {i+1}: {doc[:300]}...\n"
        if len(context) + len(doc_preview) > max_context_length:
            break
        context += doc_preview
    if not context:
        return "Error: Could not create context"
    system_prompt = "You are an expert at identifying themes and creating concise labels."
    user_prompt = f"""Based on the following document excerpts from a single cluster, suggest a short, descriptive tag or label (2-5 words) that captures the main theme or topic of this group.
Document Excerpts:
---
{context.strip()}
---
Suggested Tag/Label:
"""
    print(f"\nGenerating tag for cluster with {len(documents_in_cluster)} documents...")
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            max_tokens=20, # Short response expected
            temperature=0.3 # More deterministic label
        )
        tag = response.choices[0].message.content.strip().replace('"', '') # Clean up quotes
        print(f"Generated tag: '{tag}'")
        return tag
    except OpenAIError as e:
        print(f"OpenAI API Error generating tag: {e}")
        return "Tagging Error"
    except Exception as e:
        print(f"An unexpected error occurred during tag generation: {e}")
        return "Tagging Error"
# --- Clustering and Tagging Implementation ---
# 1. Define your collection of documents
# Covers topics: Space Exploration, Cooking/Food, Web Development
documents = [
"NASA launches new probe to study Jupiter's moons.",
"Recipe for authentic Italian pasta carbonara.",
"JavaScript frameworks like React and Vue dominate front-end development.",
"The James Webb Space Telescope captures stunning images of distant galaxies.",
"Tips for baking the perfect sourdough bread at home.",
"Understanding asynchronous programming in Node.js.",
"SpaceX successfully lands its reusable rocket booster after launch.",
"Exploring the different types of olive oil and their uses in cooking.",
"CSS Grid vs Flexbox: Choosing the right layout module.",
"The search for habitable exoplanets continues with new telescope data.",
"How to make delicious homemade pizza from scratch.",
"Building RESTful APIs using Express.js and MongoDB."
]
print(f"\nDocument collection contains {len(documents)} documents.")
# 2. Generate embeddings for all documents
print("\nGenerating embeddings for the document collection...")
embeddings = []
valid_documents = [] # Keep track of documents for which embedding was successful
for doc in documents:
    embedding = get_embedding(client, doc)
    if embedding:
        embeddings.append(embedding)
        valid_documents.append(doc) # Add corresponding document text
    else:
        print(f"Skipping document due to embedding error: \"{doc[:70]}...\"")
if not embeddings:
    print("\nError: No embeddings were generated. Cannot perform clustering.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(valid_documents)} documents.")
# Convert embeddings list to a NumPy array for scikit-learn
embedding_matrix = np.array(embeddings)
# 3. Apply Clustering Algorithm (K-Means)
# Choose the number of clusters (k). We expect 3 topics here.
n_clusters = 3
print(f"\nApplying K-Means clustering with k={n_clusters}...")
try:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    kmeans.fit(embedding_matrix)
    cluster_labels = kmeans.labels_
    print("Clustering complete.")
except Exception as e:
    print(f"An error occurred during clustering: {e}")
    exit()
# 4. Group Documents by Cluster
print("\nGrouping documents by cluster...")
clustered_documents = {i: [] for i in range(n_clusters)}
for i, label in enumerate(cluster_labels):
    clustered_documents[label].append(valid_documents[i])
# 5. Generate Tags for Each Cluster using GPT-4o
print("\nGenerating tags for each cluster...")
cluster_tags = {}
for cluster_id, docs_in_cluster in clustered_documents.items():
    tag = generate_cluster_tag(client, docs_in_cluster)
    cluster_tags[cluster_id] = tag
# 6. Display Documents by Cluster with Generated Tags
print(f"\n--- Documents Grouped by Cluster and Tag (k={n_clusters}) ---")
for cluster_id, docs_in_cluster in clustered_documents.items():
    generated_tag = cluster_tags.get(cluster_id, "Unknown Tag")
    print(f"\nCluster {cluster_id + 1} - Suggested Tag: '{generated_tag}'")
    print("-" * (28 + len(generated_tag))) # Adjust underline length
    if not docs_in_cluster:
        print(" (No documents in this cluster)")
    else:
        for doc_text in docs_in_cluster:
            print(f" - {doc_text}") # Print full document text here
print("\nClustering and Tagging process complete.")
Code Breakdown Explanation
This script demonstrates how to automatically group similar documents by their semantic meaning using embeddings, then uses GPT-4o to generate descriptive tags for each group.
- Setup & Helpers:
  - Includes standard imports plus KMeans from sklearn.cluster.
  - Initializes the OpenAI client.
  - Includes the get_embedding helper function.
- New Helper Function: generate_cluster_tag:
  - Purpose: Takes a list of documents belonging to a single cluster and uses GPT-4o to suggest a concise tag summarizing their common theme.
  - Input: The client object and documents_in_cluster (a list of text strings).
  - Context Creation: It concatenates parts of the documents (e.g., first 300 characters) to create a context string for GPT-4o, respecting a maximum length to manage token usage.
  - Prompt Engineering: It constructs a prompt asking GPT-4o to act as an expert theme identifier and suggest a short tag (2-5 words) based on the provided document excerpts.
  - API Call: Uses client.chat.completions.create with model="gpt-4o" and the specialized prompt. A low temperature is used for more focused tag generation.
  - Output: Returns the cleaned-up tag suggested by GPT-4o, or an error message.
- Document Collection: A list named documents holds sample text content covering a few distinct topics (Space, Cooking, Web Development).
- Embedding Generation:
  - The script iterates through the documents, generates an embedding for each using get_embedding, and stores successful embeddings and corresponding text in embeddings and valid_documents.
  - The embeddings are converted to a NumPy array (embedding_matrix).
- Clustering (K-Means):
  - The number of clusters (n_clusters) is set (e.g., k=3).
  - KMeans from scikit-learn is initialized and fitted to the embedding_matrix.
  - kmeans.labels_ provides the cluster assignment for each document.
- Grouping Documents:
  - A dictionary (clustered_documents) is created to store the text of documents belonging to each cluster ID.
- Generating Cluster Tags:
  - The script iterates through the clustered_documents dictionary.
  - For each cluster_id and its list of docs_in_cluster, it calls the generate_cluster_tag helper function.
  - The suggested tag for each cluster is stored in the cluster_tags dictionary.
- Displaying Results:
  - The script iterates through the clusters again.
  - For each cluster, it retrieves the generated tag from cluster_tags.
  - It prints the cluster number, the suggested tag, and then lists the full text of all documents belonging to that cluster.
This example showcases a powerful workflow: using embeddings for unsupervised grouping of content based on meaning (clustering) and then leveraging an LLM like GPT-4o to interpret those groupings and assign meaningful labels (tagging), automating content organization.
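One detail worth checking in this workflow: K-Means groups points by Euclidean distance, while the other examples in this chapter compare embeddings by cosine similarity. For unit-length vectors the two measures agree, because the squared Euclidean distance equals 2 - 2 * cosine similarity. The sketch below verifies that identity on made-up 3-dimensional vectors (not real embeddings); the helper names normalize, cosine, and squared_euclidean are hypothetical and introduced only for this illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (hypothetical helper for this sketch)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    """Squared Euclidean distance, the quantity K-Means minimizes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors standing in for embeddings (assumption: illustrative values only)
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

lhs = squared_euclidean(a, b)
rhs = 2 - 2 * cosine(a, b)
print(f"squared distance: {lhs:.6f}, 2 - 2*cos: {rhs:.6f}")
```

OpenAI's documentation describes its embeddings as normalized to length 1, so K-Means on them should behave consistently with cosine similarity. If your vectors come from another source, normalizing them before clustering is one option to consider.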
3.2.7 Content Recommendations
Content recommendation systems powered by embeddings represent a significant advancement in personalization technology. By analyzing semantic relationships, these systems can understand the nuanced meaning and context of content in ways that traditional keyword-based systems cannot.
Here's a detailed look at how embedding-based recommendations work:
- Content Analysis:
  - The system generates sophisticated embedding vectors for each piece of content in the database
  - These vectors capture nuanced characteristics like writing style, topic depth, and emotional tone
  - Advanced algorithms analyze patterns across multiple dimensions of content features
- User Preference Modeling:
  - The system tracks detailed interaction patterns including time spent, engagement level, and sharing behavior
  - Historical preferences are weighted and combined to create multi-dimensional user profiles
  - Both explicit feedback (ratings, likes) and implicit signals (scroll depth, repeat visits) are considered
- Contextual Understanding:
  - Real-time factors like device type and location are incorporated into the recommendation algorithm
  - The system identifies patterns in content consumption based on time of day and day of week
  - Current session behavior is analyzed to understand immediate user interests
- Dynamic Adaptation:
  - Machine learning models continuously refine user profiles based on new interactions
  - The system learns from both positive and negative feedback to improve accuracy
  - Recommendation strategies are automatically adjusted based on performance metrics
This sophisticated approach enables recommendation engines to deliver highly personalized experiences through several key capabilities:
- Identify content similarities that might not be apparent through traditional metadata
  - Detects thematic connections between items even when they use different terminology
  - Recognizes similar writing styles, tone, and complexity levels across content
- Understand the progression of user interests over time
  - Tracks how preferences evolve from basic to advanced topics
  - Identifies shifts in user interests across different categories
- Make cross-domain recommendations (e.g., suggesting articles based on watched videos)
  - Connects content across different media types based on semantic relationships
  - Leverages learning from one domain to enhance recommendations in another
- Account for seasonal trends and temporal relevance
  - Adjusts recommendations based on time-sensitive factors like holidays or events
  - Considers current trends and their impact on user interests
The result is a highly personalized experience that can suggest truly relevant videos, articles, or products that match users' interests, both current and evolving. This goes far beyond simple "users who liked X also liked Y" algorithms, creating a more engaging and valuable user experience.
Example:
Here's a code example that demonstrates the core concept of content recommendations using embeddings.
This script focuses on finding semantically similar content items based on their embeddings, which is the foundation for the more advanced recommendation features described above.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np # For cosine similarity calculation
import datetime
# --- Configuration ---
load_dotenv()
# Record when the example is run; the location is a fixed label for this demo
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Orlando, Florida, United States"
print(f"Running Content Recommendation example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return np.clip(similarity, -1.0, 1.0) # Ensure value is within valid range
# --- Content Recommendation Implementation ---
# 1. Define your Content Catalog (e.g., articles, blog posts)
# In a real application, this would come from a database or CMS.
content_catalog = [
{"id": "art001", "title": "Introduction to Quantum Computing", "content": "Exploring the basics of qubits, superposition, and entanglement in quantum mechanics and their potential for computation."},
{"id": "art002", "title": "Healthy Mediterranean Diet Recipes", "content": "Delicious and easy recipes focusing on fresh vegetables, olive oil, fish, and whole grains for a heart-healthy lifestyle."},
{"id": "art003", "title": "The Future of Artificial Intelligence in Healthcare", "content": "How AI and machine learning are transforming diagnostics, drug discovery, and personalized medicine."},
{"id": "art004", "title": "Beginner's Guide to Python Programming", "content": "Learn the fundamentals of Python syntax, data types, control flow, and functions to start coding."},
{"id": "art005", "title": "Understanding Neural Networks and Deep Learning", "content": "An overview of artificial neural networks, backpropagation, and the concepts behind deep learning models."},
{"id": "art006", "title": "Travel Guide: Hiking the Swiss Alps", "content": "Tips for planning your trip, recommended trails, essential gear, and stunning viewpoints in the Swiss Alps."},
{"id": "art007", "title": "Mastering the Art of French Pastry", "content": "Techniques for creating classic French desserts like croissants, macarons, and éclairs."},
{"id": "art008", "title": "Ethical Considerations in AI Development", "content": "Discussing bias, fairness, transparency, and accountability in the development and deployment of artificial intelligence systems."}
]
print(f"\nContent catalog contains {len(content_catalog)} items.")
# 2. Generate embeddings for all content items (pre-computation)
print("\nGenerating embeddings for the content catalog...")
content_embeddings_data = []
for item in content_catalog:
    # Use title and content for embedding
    text_to_embed = f"Title: {item['title']}\nContent: {item['content']}"
    embedding = get_embedding(client, text_to_embed)
    if embedding:
        # Store item ID and its embedding
        content_embeddings_data.append({"id": item["id"], "embedding": embedding})
    else:
        print(f"Skipping item {item['id']} due to embedding error.")
if not content_embeddings_data:
    print("\nError: No embeddings were generated. Cannot provide recommendations.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(content_embeddings_data)} content items.")
# 3. Select a target item (e.g., an article the user just read)
target_item_id = "art003" # User read "The Future of Artificial Intelligence in Healthcare"
print(f"\nFinding content similar to item ID: {target_item_id}")
# Find the embedding for the target item
target_embedding = None
for item_data in content_embeddings_data:
    if item_data["id"] == target_item_id:
        target_embedding = item_data["embedding"]
        break
if target_embedding is None:
    print(f"Error: Could not find the embedding for the target item ID '{target_item_id}'.")
    exit()
# 4. Calculate similarity between the target item and all other items
recommendations = []
print("\nCalculating similarities...")
for item_data in content_embeddings_data:
    # Don't recommend the item itself
    if item_data["id"] == target_item_id:
        continue
    similarity = cosine_similarity(target_embedding, item_data["embedding"])
    recommendations.append({"id": item_data["id"], "score": similarity})
# 5. Sort potential recommendations by similarity score
recommendations.sort(key=lambda x: x["score"], reverse=True)
# 6. Display top N recommendations
print("\n--- Top Content Recommendations ---")
# Find the original title for the target item for context
target_item_info = next((item for item in content_catalog if item["id"] == target_item_id), None)
if target_item_info:
    print(f"Because you read: \"{target_item_info['title']}\"\n")
if not recommendations:
    print("No recommendations found (or error calculating similarities).")
else:
    top_n = 3
    print(f"Top {top_n} recommended items:")
    for i, rec in enumerate(recommendations[:top_n]):
        # Find the full item details from the original catalog
        rec_details = next((item for item in content_catalog if item["id"] == rec["id"]), None)
        if rec_details:
            print(f"{i+1}. ID: {rec['id']}, Similarity Score: {rec['score']:.4f}")
            print(f" Title: {rec_details['title']}")
            print(f" Content Snippet: {rec_details['content'][:100]}...") # Truncate content
            print("-" * 10)
        else:
            print(f"{i+1}. ID: {rec['id']}, Score: {rec['score']:.4f} (Details not found)")
            print("-" * 10)
    if len(recommendations) > top_n:
        print(f"(Showing top {top_n} of {len(recommendations)} potential recommendations)")
print("\nNote: This demonstrates basic content-to-content similarity.")
print("Advanced systems incorporate user profiles, interaction history, context, etc.")
Code Breakdown Explanation
This script demonstrates a fundamental approach to content recommendation using OpenAI embeddings, focusing on finding items semantically similar to a target item.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions.
- Content Catalog:
  - A list of dictionaries (content_catalog) simulates the available content (e.g., articles). Each item has an id, title, and content.
- Content Embedding Generation (Pre-computation):
  - The script iterates through each item in the content_catalog.
  - Combined Text: It creates a combined text string from the item's title and content to generate a richer embedding that captures more semantic detail.
  - It calls get_embedding for this combined text.
  - It stores the item['id'] and its embedding vector in content_embeddings_data. This pre-computation is vital for efficiency.
- Target Item Selection:
  - A target_item_id is chosen (e.g., art003), simulating an item the user has interacted with (e.g., read).
  - The script retrieves the pre-computed embedding for this target item.
- Similarity Calculation:
  - It iterates through all other items in content_embeddings_data.
  - It calculates the cosine_similarity between the target_embedding and each other item's embedding.
  - It stores the other item's id and its similarity score in the recommendations list.
- Ranking Recommendations:
  - The recommendations list is sorted by score in descending order, placing the most semantically similar content items first.
- Displaying Results:
  - The script prints the title of the target item for context ("Because you read...").
  - It displays the top N (e.g., 3) recommended items, showing their ID, similarity score, title, and a snippet of their content.
  - Contextual Note: The final print statements explicitly mention that this example shows basic content-to-content similarity. Advanced recommendation systems, as described in the section text, would integrate user profiles (embeddings based on interaction history), real-time context (time, location), explicit feedback, and potentially more complex algorithms beyond simple cosine similarity. However, the core principle of using embeddings to measure semantic relatedness remains fundamental.
This example effectively illustrates how embeddings enable recommendations based on understanding the meaning of content, allowing suggestions that go beyond simple keyword or category matching.
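To make the idea of a user profile more concrete, here is a minimal sketch of one possible approach: representing a user as the weighted average of the embeddings of items they interacted with, then ranking unseen items against that profile. Everything in it is an assumption for illustration. The 3-dimensional vectors are made up (real embeddings have many more dimensions), the interaction weights are arbitrary, and build_user_profile and recommend are hypothetical helper names, not part of the script above or of any library.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)

def build_user_profile(history, item_vectors):
    """Weighted average of the embeddings of items the user interacted with.

    history maps item id -> interaction weight (hypothetical scale,
    e.g. higher for a full read than for a quick click).
    """
    total = sum(history.values())
    if total <= 0:
        raise ValueError("history needs a positive total weight")
    dim = len(next(iter(item_vectors.values())))
    profile = [0.0] * dim
    for item_id, weight in history.items():
        for k, value in enumerate(item_vectors[item_id]):
            profile[k] += weight * value / total
    return profile

def recommend(profile, item_vectors, seen, top_n=2):
    """Rank unseen items by similarity to the user profile."""
    scored = [
        (item_id, cosine_sim(profile, vec))
        for item_id, vec in item_vectors.items()
        if item_id not in seen
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy vectors standing in for real embeddings (axes loosely: AI, cooking, travel)
item_vectors = {
    "art003": [0.9, 0.1, 0.0],
    "art005": [1.0, 0.0, 0.0],
    "art002": [0.0, 1.0, 0.1],
    "art006": [0.0, 0.1, 1.0],
    "art008": [0.8, 0.0, 0.2],
}
# The user engaged heavily with art003 and lightly with art002
history = {"art003": 3.0, "art002": 1.0}

profile = build_user_profile(history, item_vectors)
recs = recommend(profile, item_vectors, seen=set(history))
for item_id, score in recs:
    print(f"{item_id}: {score:.4f}")
```

With these toy values, the AI-related items rank first because the profile is dominated by the heavily weighted AI article. In a real system the weights might come from signals such as time spent or ratings, and could be decayed over time so that recent interests count more.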
3.2.8 Email Triage / Prioritization
Embedding technology enables sophisticated email analysis and categorization by understanding the semantic meaning of messages. This advanced system employs multiple layers of analysis to streamline email management:
- Urgency Detection
  - Identify time-sensitive matters requiring immediate attention through natural language processing
  - Recognize urgent language patterns and contextual cues by analyzing word choice, sentence structure, and historical patterns
  - Flag critical emails based on sender importance, keywords, and organizational hierarchy
- Smart Categorization
  - Group related email threads and conversations using semantic similarity matching
  - Sort messages by project, department, or business function through content analysis
  - Create dynamic folders based on emerging topics and trends
  - Apply machine learning to improve categorization accuracy over time
- Intent Classification
  - Distinguish between requests, updates, and FYI messages using advanced natural language understanding
  - Prioritize action items and delegate tasks automatically based on content and context
  - Identify follow-up requirements and set automated reminders
  - Extract key deadlines and commitments from message content
By leveraging semantic understanding, the system creates an intelligent email processing pipeline that can handle hundreds of messages simultaneously. The embedding-based analysis examines not just keywords, but the actual meaning and context of each message, considering factors such as:
- Message context within ongoing conversations
- Historical patterns of communication
- Organizational relationships and hierarchies
- Project timelines and priorities
This comprehensive approach significantly reduces the cognitive load of email management by automatically handling routine classification and prioritization tasks. The system ensures that important messages receive immediate attention while maintaining an organized structure for all communications. As a result, professionals can focus on high-value activities instead of spending hours manually sorting through their inbox, leading to improved productivity and faster response times for critical communications.
Example:
This script simulates categorizing incoming emails based on their semantic similarity to predefined categories such as "Urgent Action Required," "Project Update / Status," "Question / Request for Info," and "General Info / FYI."
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np # For cosine similarity calculation
import datetime
# --- Configuration ---
load_dotenv()
# Record when the example is run; the location is a fixed label for this demo
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Plano, Texas, United States"
print(f"Running Email Triage/Prioritization example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return np.clip(similarity, -1.0, 1.0) # Ensure value is within valid range
# --- Email Triage/Prioritization Implementation ---
# 1. Define Sample Emails (Subject + Snippet)
emails = [
{"id": "email01", "subject": "Urgent: Server Down!", "body_snippet": "The main production server seems to be unresponsive. We need immediate assistance to investigate and bring it back online."},
{"id": "email02", "subject": "Meeting Minutes - Project Phoenix Sync", "body_snippet": "Attached are the minutes from today's sync call. Key decisions included finalizing the Q3 roadmap. Action items assigned."},
{"id": "email03", "subject": "Quick Question about Report", "body_snippet": "Hi team, just had a quick question regarding the methodology used in the latest market analysis report. Can someone clarify?"},
{"id": "email04", "subject": "Fwd: Company Newsletter - April Edition", "body_snippet": "Sharing the latest company newsletter for your information."},
{"id": "email05", "subject": "Action Required: Submit Timesheet by EOD", "body_snippet": "Friendly reminder to please submit your weekly timesheet by the end of the day today. This is mandatory."},
{"id": "email06", "subject": "Update on Q2 Marketing Campaign", "body_snippet": "Just wanted to provide a brief update on the campaign performance metrics we discussed last week. See attached summary."},
{"id": "email07", "subject": "Can you approve this request ASAP?", "body_snippet": "Need your approval on the attached budget request urgently to proceed with the vendor contract."}
]
print(f"\nProcessing {len(emails)} emails.")
# 2. Define Categories/Priorities and their Semantic Representations
# We represent each category with a descriptive phrase.
categories = {
"Urgent Action Required": "Requires immediate attention, critical issue, deadline, ASAP request, mandatory task.",
"Project Update / Status": "Information about ongoing projects, progress reports, meeting minutes, status updates.",
"Question / Request for Info": "Asking for clarification, seeking information, query about details.",
"General Info / FYI": "Newsletter, announcement, sharing information, non-actionable update."
}
print(f"\nDefined categories: {list(categories.keys())}")
# 3. Generate embeddings for Categories (pre-computation recommended)
print("\nGenerating embeddings for categories...")
category_embeddings = {}
for category_name, category_description in categories.items():
    embedding = get_embedding(client, category_description)
    if embedding:
        category_embeddings[category_name] = embedding
    else:
        print(f"Skipping category '{category_name}' due to embedding error.")
if not category_embeddings:
    print("\nError: No embeddings generated for categories. Cannot triage emails.")
    exit()
# 4. Process Each Email: Generate Embedding and Find Best Category
print("\nTriaging emails...")
email_results = []
for email in emails:
    # Combine subject and body for better context
    email_content = f"Subject: {email['subject']}\nBody: {email['body_snippet']}"
    email_embedding = get_embedding(client, email_content)
    if not email_embedding:
        print(f"Skipping email {email['id']} due to embedding error.")
        continue
    # Find the category with the highest similarity
    best_category = None
    max_similarity = -1 # Cosine similarity ranges from -1 to 1
    for category_name, category_embedding in category_embeddings.items():
        similarity = cosine_similarity(email_embedding, category_embedding)
        print(f" Email {email['id']} vs Category '{category_name}': Score {similarity:.4f}")
        if similarity > max_similarity:
            max_similarity = similarity
            best_category = category_name
    email_results.append({
        "id": email["id"],
        "subject": email["subject"],
        "assigned_category": best_category,
        "score": max_similarity
    })
    print(f"-> Email {email['id']} assigned to: '{best_category}' (Score: {max_similarity:.4f})")
# 5. Display Triage Results
print("\n--- Email Triage Results ---")
if not email_results:
    print("No emails were successfully triaged.")
else:
    # Optional: Group by category for display
    results_by_category = {cat: [] for cat in categories.keys()}
    for result in email_results:
        if result["assigned_category"]: # Check if category was assigned
            results_by_category[result["assigned_category"]].append(result)
    for category_name, items in results_by_category.items():
        print(f"\nCategory: {category_name}")
        print("-" * (10 + len(category_name)))
        if not items:
            print(" (No emails assigned)")
        else:
            # Sort items within category by score if desired
            items.sort(key=lambda x: x['score'], reverse=True)
            for item in items:
                print(f" - ID: {item['id']}, Subject: \"{item['subject']}\" (Score: {item['score']:.3f})")
print("\nEmail triage process complete.")
Code Breakdown Explanation
This example shows how OpenAI embeddings can automatically sort and prioritize emails by understanding their meaning, demonstrating an intelligent email management system.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions.
- Sample Email Data:
  - A list of dictionaries (emails) simulates incoming messages. Each email has an id, subject, and a body_snippet.
- Category Definitions:
  - A dictionary (categories) defines the target categories for triage (e.g., "Urgent Action Required", "Project Update / Status").
  - Key Idea: Each category is represented by a descriptive phrase or list of keywords that captures its semantic essence. This description is what will be embedded.
- Category Embedding Generation:
  - The script iterates through the defined categories.
  - It calls get_embedding on the description associated with each category name.
  - The resulting embedding vector for each category is stored in the category_embeddings dictionary. This step would typically be pre-computed and stored.
- Email Processing Loop:
  - The script iterates through each email in the sample data.
  - Content Combination: It combines the subject and body_snippet into a single email_content string to provide richer context for the embedding.
  - Email Embedding: It calls get_embedding to get the vector representation of the current email's content.
  - Similarity Calculation:
    - It then iterates through the pre-computed category_embeddings.
    - For each category, it calculates the cosine_similarity between the email_embedding and the category_embedding.
    - It keeps track of the best_category (the one with the highest similarity score found so far) and the corresponding max_similarity score.
  - Assignment: After comparing the email to all categories, the email is assigned the best_category found. The result (email ID, subject, assigned category, score) is stored.
- Displaying Triage Results:
  - The script prints the final assignments.
  - Optional Grouping: It includes logic to group the results by the assigned category for a clearer presentation, showing which emails fell into the "Urgent," "Update," etc., buckets.
This example effectively demonstrates how embeddings allow for intelligent categorization based on meaning. An email asking for "approval ASAP" can be correctly identified as "Urgent Action Required" even without using the exact word "urgent," because its embedding will be semantically close to the embedding of the "Urgent Action Required" category description. This is far more robust than simple keyword filtering.
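One limitation of picking the highest-scoring category is that every email gets assigned somewhere, even when no category is a good match or when two categories score almost the same. A possible refinement, sketched below, is to route such cases to a fallback bucket for manual review. The function name assign_category, the fallback label, and the threshold values (min_score and min_margin) are all assumptions chosen for illustration; suitable thresholds depend on the embedding model and should be tuned on real data.

```python
def assign_category(scores, min_score=0.3, min_margin=0.02, fallback="Needs Review"):
    """Pick the best category, or the fallback when the match is weak or ambiguous.

    scores maps category name -> cosine similarity for a single email.
    min_score and min_margin are illustrative values, not recommendations.
    """
    if not scores:
        return fallback
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_score = ranked[0]
    # Weak match: nothing is similar enough to trust
    if best_score < min_score:
        return fallback
    # Ambiguous match: the runner-up is almost as similar as the winner
    if len(ranked) > 1 and best_score - ranked[1][1] < min_margin:
        return fallback
    return best_name

# Made-up similarity scores for three hypothetical emails
clear = {"Urgent Action Required": 0.52, "Project Update / Status": 0.31, "General Info / FYI": 0.22}
ambiguous = {"Urgent Action Required": 0.35, "Project Update / Status": 0.345, "General Info / FYI": 0.20}
weak = {"Urgent Action Required": 0.12, "Project Update / Status": 0.05, "General Info / FYI": 0.09}

for name, scores in [("clear", clear), ("ambiguous", ambiguous), ("weak", weak)]:
    print(f"{name}: {assign_category(scores)}")
```

In the triage script, the per-category scores computed in the inner loop could be collected into a dictionary like these and passed to such a function in place of the simple maximum.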
3.2 When to Use Embeddings
Embeddings have revolutionized how we process and understand textual information in modern AI applications. While traditional text processing methods rely on exact matches or basic keyword searching, embeddings provide a sophisticated way to capture the nuanced meanings and relationships between pieces of text. By converting words and phrases into high-dimensional numerical vectors, embeddings enable machines to understand semantic relationships and similarities in ways that more closely mirror human understanding.
Let's explore the key scenarios where embeddings prove particularly valuable, showcasing how this technology transforms various aspects of information processing and retrieval. Understanding these use cases is crucial for developers and organizations looking to leverage the full potential of embedding technology in their applications.
3.2.1 Semantic search
Finding relevant information based on meaning rather than just keywords, enabling more intelligent search results. Unlike traditional keyword-based search that matches exact words or phrases, semantic search understands the intent and contextual meaning of a query by analyzing the underlying relationships between words and concepts. This advanced approach allows the system to comprehend variations in language, context, and even user intent.
For example, a search for "natural language processing" would also return relevant results about "NLP," "computational linguistics," or "text analysis." When a user searches for "treating common cold symptoms," the system would understand and return results about "flu remedies," "reducing fever," and "cough medicine" - even if these exact phrases aren't used. This technology leverages embedding vectors to calculate similarity scores between queries and documents, transforming each piece of text into a high-dimensional numerical representation that captures its semantic meaning. This mathematical approach enables more nuanced and accurate search results that account for:
- Synonyms and related terms (like "car" and "automobile")
- Conceptual relationships (connecting "python" to both programming and snakes, depending on context)
- Multiple languages (finding relevant content even when written in different languages, provided the embedding model was trained on multilingual text)
- Contextual variations (understanding that "apple" could refer to either the fruit or the technology company)
- Intent matching (recognizing that "how to fix a flat tire" and "tire repair instructions" are seeking the same information)
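The similarity score most commonly used for this comparison, and the one the code example in this section relies on, is cosine similarity. As a short reference, for two embedding vectors a and b of dimension n (the symbols here are chosen for illustration):

```latex
\[
\operatorname{sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]

% Worked example with two small made-up vectors:
% a = (1, 2, 2), b = (2, 1, 2)
\[
\operatorname{sim}(\mathbf{a}, \mathbf{b})
  = \frac{1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2}{\sqrt{1 + 4 + 4} \; \sqrt{4 + 1 + 4}}
  = \frac{8}{3 \cdot 3} \approx 0.889
\]
```

A score near 1 means the two vectors point in almost the same direction (similar meaning), while a score near 0 means they are unrelated.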
Example:
Here is a code example demonstrating semantic search using OpenAI embeddings.
This script will:
- Define a small set of documents.
- Generate embeddings for these documents and a search query.
- Calculate the similarity between the query and each document.
- Rank the documents by relevance based on semantic similarity.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np # For cosine similarity calculation
import datetime
# --- Configuration ---
load_dotenv()
# Get the current date and location context
current_timestamp = "2025-03-22 15:22:00 CDT"
current_location = "Grapevine, Texas, United States"
print(f"Running Semantic Search example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings example)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print(f"Generating embedding for: \"{text[:50]}...\"") # Print truncated text
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        print("Embedding generation successful.")
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{text[:50]}...': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{text[:50]}...': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings example)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        # print("Error: Cannot calculate similarity with None vectors.")
        return 0.0 # Return 0 if any vector is missing
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        # print("Warning: One or both vectors have zero magnitude.")
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return similarity
# --- Semantic Search Implementation ---
# 1. Define your document store (a list of text strings)
# In a real application, this could come from a database, files, etc.
document_store = [
"The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
"Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy.",
"Artificial intelligence research focuses on creating systems capable of performing tasks that typically require human intelligence.",
"A recipe for classic French onion soup involves caramelizing onions and topping with bread and cheese.",
"Machine learning, a subset of AI, involves algorithms that allow systems to learn from data.",
"The Louvre Museum in Paris is the world's largest art museum and a historic monument.",
"Natural Language Processing (NLP) enables computers to understand and process human language.",
"Baking bread requires careful measurement of ingredients like flour, water, yeast, and salt."
]
print(f"\nDocument store contains {len(document_store)} documents.")
# 2. Generate embeddings for all documents in the store (pre-computation)
# In a real app, you'd store these embeddings alongside the documents.
print("\nGenerating embeddings for the document store...")
document_embeddings = []
for doc in document_store:
    embedding = get_embedding(client, doc)
    # Store the document text and its embedding together
    if embedding: # Only store if embedding was successful
        document_embeddings.append({"text": doc, "embedding": embedding})
    else:
        print(f"Skipping document due to embedding error: \"{doc[:50]}...\"")
print(f"\nSuccessfully generated embeddings for {len(document_embeddings)} documents.")
# 3. Define the user's search query
search_query = "What is AI?"
# search_query = "Things to see in Paris"
# search_query = "How does NLP work?"
# search_query = "Cooking instructions"
print(f"\nSearch Query: \"{search_query}\"")
# 4. Generate embedding for the search query
print("\nGenerating embedding for the search query...")
query_embedding = get_embedding(client, search_query)
# 5. Calculate similarity and rank documents
search_results = []
if query_embedding and document_embeddings:
    print("\nCalculating similarities...")
    for doc_data in document_embeddings:
        similarity = cosine_similarity(query_embedding, doc_data["embedding"])
        search_results.append({"text": doc_data["text"], "score": similarity})
    # Sort results by similarity score in descending order
    search_results.sort(key=lambda x: x["score"], reverse=True)
    # 6. Display results
    print("\n--- Semantic Search Results ---")
    print(f"Top results for query: \"{search_query}\"\n")
    if not search_results:
        print("No results found (or error calculating similarities).")
    else:
        # Display top N results (e.g., top 3)
        top_n = 3
        for i, result in enumerate(search_results[:top_n]):
            print(f"{i+1}. Score: {result['score']:.4f}")
            print(f" Text: {result['text']}")
            print("-" * 10)
        if len(search_results) > top_n:
            print(f"(Showing top {top_n} of {len(search_results)} results)")
else:
    print("\nCould not perform search.")
    if not query_embedding:
        print("Reason: Failed to generate embedding for the search query.")
    if not document_embeddings:
        print("Reason: No document embeddings were successfully generated.")
Code Breakdown Explanation:
- Setup & Helpers: Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions from the previous example.
- Document Store: A simple Python list (document_store) holds the text content of the documents we want to search through. In a real application, this data would likely come from a database or file system.
- Document Embedding Generation:
  - The script iterates through each document in the document_store.
  - It calls get_embedding for each document to get its numerical representation.
  - It stores the original document text and its corresponding embedding vector together (e.g., in a list of dictionaries). This pre-computation step is crucial for efficiency in real systems: you generate document embeddings once and store them. Error handling ensures documents are skipped if embedding fails.
- Search Query: A sample search_query string is defined.
- Query Embedding Generation: The get_embedding function is called again, this time for the search_query.
- Similarity Calculation & Ranking:
  - It checks if both the query embedding and document embeddings were successfully generated.
  - It iterates through the stored document_embeddings.
  - For each document, it calculates the cosine_similarity between the query_embedding and the document's embedding.
  - The document text and its calculated similarity score are stored in a search_results list.
  - Finally, search_results.sort(...) arranges the list based on the score in descending order (highest similarity first).
- Display Results: The script prints the top N (e.g., 3) most relevant documents from the sorted list, showing their similarity score and text content.
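The similarity and ranking steps described above compare the query with one document at a time. When all document embeddings fit in memory, the same ranking can be computed with a single matrix product. The following is a minimal sketch under that assumption; the function name rank_by_cosine and the toy three-dimensional vectors are illustrative stand-ins, not part of the original script or of any library.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_matrix, top_n=3):
    """Return (row_index, score) pairs for the top_n rows most similar to query_vec.

    Assumes no vector has zero length (real embedding vectors normally do not).
    """
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # Scale everything to unit length so a dot product equals cosine similarity.
    query_unit = query / np.linalg.norm(query)
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs_unit @ query_unit          # one score per document row
    order = np.argsort(-scores)[:top_n]      # indices of the highest scores first
    return [(int(i), float(scores[i])) for i in order]

# Toy 3-dimensional vectors standing in for real embeddings (made-up values).
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [2.0, 0.0, 0.0]

ranked = rank_by_cosine(toy_query, toy_docs, top_n=2)
for idx, score in ranked:
    print(f"doc {idx}: {score:.4f}")
```

In the script above, doc_matrix would be built from the "embedding" values stored in document_embeddings, and the returned indices would map back to the matching "text" entries.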
This example clearly illustrates the core concept of semantic search: converting both documents and queries into embeddings and then using vector similarity (like cosine similarity) to find documents that are semantically related to the query, even if they don't share the exact keywords.
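The pre-computation step in the script sends one API request per document. The embeddings endpoint also accepts a list of strings as input, so several documents can be embedded in one request. The sketch below assumes that behavior and that each returned item exposes an index field identifying the input it belongs to; get_embeddings_batch and the stand-in client are hypothetical names used only so the example runs without an API key. Request size limits apply, so check the API documentation before sending large batches.

```python
from types import SimpleNamespace

def get_embeddings_batch(client, texts, model="text-embedding-3-small"):
    """Embed several texts in one request and return the vectors in input order."""
    response = client.embeddings.create(input=list(texts), model=model)
    # Sort by the index field so each vector lines up with its input text.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

# --- Stand-in client (hypothetical) so the sketch runs offline ---
class _FakeEmbeddings:
    def create(self, input, model):
        # Fake 2-dimensional "embeddings": [text length, position in the batch].
        data = [
            SimpleNamespace(index=i, embedding=[float(len(t)), float(i)])
            for i, t in enumerate(input)
        ]
        # Return the items reversed to show why sorting by index is a safe habit.
        return SimpleNamespace(data=list(reversed(data)))

fake_client = SimpleNamespace(embeddings=_FakeEmbeddings())
vectors = get_embeddings_batch(fake_client, ["cat", "horse"])
print(vectors)
```

With the real client from the script, the call would be get_embeddings_batch(client, document_store), replacing the per-document loop.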
3.2.2 Topic clustering
Topic clustering organizes and analyzes large document collections by automatically grouping documents based on their semantic content. The system converts each document into a high-dimensional embedding vector that captures its meaning, then uses a clustering algorithm to group similar vectors together.
Embedding-based clustering lets systems:
- Identify thematic patterns across thousands of documents without manual labeling - the system can automatically detect common topics and themes across vast document collections, saving countless hours of manual categorization work
- Group similar discussions, articles, or content pieces into intuitive categories - by understanding the semantic relationships between documents, the system can create meaningful groupings that reflect natural topic divisions, even when documents use different terminology to discuss the same concepts
- Discover emerging topics and trends within large document collections - as new content is added, the system can identify new thematic clusters forming, helping organizations stay ahead of developing trends in their field
- Create dynamic content hierarchies that adapt as new documents are added - unlike traditional static categorization systems, embedding-based clustering can automatically reorganize and refine category structures as the content collection grows and evolves
For example, a news organization could use topic clustering to automatically group thousands of articles into categories like "Technology", "Politics", or "Sports", even when these topics aren't explicitly tagged. The embeddings capture the semantic relationships between articles by representing the meaning and context of the content, not just its keywords. This supports grouping that reflects subtle distinctions: articles about different programming languages all land in a "Technology" cluster despite using completely different terminology, and an article about the economic impact of sports stadiums sits between the "Sports" and "Business" groups. Note that a hard clustering algorithm such as K-means assigns each document to exactly one cluster, so placing that article in both categories requires a soft or overlapping assignment method.
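One simple way to let a document belong to more than one topic is to compare it against every cluster center and keep each topic whose score is close to the best one. The sketch below illustrates that idea only; it is not what K-means itself does, and the function name topics_for, the margin value, and the toy vectors are all assumptions made for illustration.

```python
import numpy as np

def topics_for(doc_vec, centroids, names, margin=0.1):
    """Return every topic whose centroid scores within `margin` of the best cosine score."""
    doc = np.asarray(doc_vec, dtype=float)
    cents = np.asarray(centroids, dtype=float)
    # Cosine similarity between the document and each centroid.
    scores = (cents @ doc) / (np.linalg.norm(cents, axis=1) * np.linalg.norm(doc))
    best = scores.max()
    return [name for name, s in zip(names, scores) if s >= best - margin]

# Made-up 3-dimensional centroids, one per topic.
names = ["Sports", "Business", "Technology"]
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

stadium_article = [0.7, 0.7, 0.1]   # sits between Sports and Business
match_report = [1.0, 0.1, 0.0]      # clearly Sports

stadium_topics = topics_for(stadium_article, centroids, names)
report_topics = topics_for(match_report, centroids, names)
print(stadium_topics)
print(report_topics)
```

The margin controls how generous the overlap is: a larger value assigns more documents to several topics, and a value of 0 falls back to a single best topic (apart from exact ties).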
Example:
Below is a code example that demonstrates topic clustering using OpenAI embeddings and the K-means algorithm from scikit-learn.
This code will:
- Define a list of sample documents covering different implicit topics.
- Generate embeddings for each document using OpenAI's API.
- Apply the K-Means clustering algorithm to group the embedding vectors.
- Display the documents belonging to each identified cluster.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np
from sklearn.cluster import KMeans # For clustering algorithm
import datetime
# --- Configuration ---
load_dotenv()
# Get the current date and location context
current_timestamp = "2025-03-23 15:26:00 CDT"
current_location = "Dallas, Texas, United States"
print(f"Running Topic Clustering example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    # Truncate text for printing if it's too long
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        # print("Embedding generation successful.") # Reduce verbosity
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Topic Clustering Implementation ---
# 1. Define your collection of documents
# These documents cover roughly 3 topics: AI/Tech, Travel/Geography, Food/Cooking
documents = [
"Artificial intelligence research focuses on creating systems capable of performing tasks that typically require human intelligence.",
"The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
"A recipe for classic French onion soup involves caramelizing onions and topping with bread and cheese.",
"Machine learning, a subset of AI, involves algorithms that allow systems to learn from data.",
"The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials.",
"Natural Language Processing (NLP) enables computers to understand and process human language.",
"Baking bread requires careful measurement of ingredients like flour, water, yeast, and salt.",
"The Colosseum in Rome, Italy, is an oval amphitheatre in the centre of the city.",
"Deep learning utilizes artificial neural networks with multiple layers to model complex patterns.",
"Sushi is a traditional Japanese dish of prepared vinegared rice, usually with some sugar and salt, accompanying a variety of ingredients, such as seafood, often raw, and vegetables."
]
print(f"\nDocument collection contains {len(documents)} documents.")
# 2. Generate embeddings for all documents
print("\nGenerating embeddings for the document collection...")
embeddings = []
valid_documents = [] # Keep track of documents for which embedding was successful
for doc in documents:
    embedding = get_embedding(client, doc)
    if embedding:
        embeddings.append(embedding)
        valid_documents.append(doc) # Add corresponding document text
    else:
        print(f"Skipping document due to embedding error: \"{doc[:70]}...\"")
if not embeddings:
    print("\nError: No embeddings were generated. Cannot perform clustering.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(valid_documents)} documents.")
# Convert embeddings list to a NumPy array for scikit-learn
embedding_matrix = np.array(embeddings)
# 3. Apply Clustering Algorithm (K-Means)
# We need to choose the number of clusters (k). Let's assume we expect 3 topics.
# In real applications, determining the optimal 'k' often requires experimentation
# (e.g., using the elbow method or silhouette scores).
n_clusters = 3
print(f"\nApplying K-Means clustering with k={n_clusters}...")
try:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10) # n_init suppresses warning
    kmeans.fit(embedding_matrix)
    cluster_labels = kmeans.labels_
    print("Clustering complete.")
except Exception as e:
    print(f"An error occurred during clustering: {e}")
    exit()
# 4. Display Documents by Cluster
print(f"\n--- Documents Grouped by Topic Cluster (k={n_clusters}) ---")
# Create a dictionary to hold documents for each cluster
clustered_documents = {i: [] for i in range(n_clusters)}
# Assign each document (that had a valid embedding) to its cluster
for i, label in enumerate(cluster_labels):
    clustered_documents[label].append(valid_documents[i])
# Print the contents of each cluster
for cluster_id, docs_in_cluster in clustered_documents.items():
    print(f"\nCluster {cluster_id + 1}:")
    if not docs_in_cluster:
        print(" (No documents in this cluster)")
    else:
        for doc_text in docs_in_cluster:
            # Print truncated document text for readability
            print_text = doc_text[:100] + "..." if len(doc_text) > 100 else doc_text
            print(f" - {print_text}")
    print("-" * 20)
print("\nNote: The quality of clustering depends on the data, the embedding model,")
print("and the chosen number of clusters (k). Cluster numbers are arbitrary.")
Code Breakdown Explanation:
- Setup & Helpers:
  - Includes standard imports plus KMeans from sklearn.cluster.
  - Initializes the OpenAI client.
  - Includes the get_embedding helper function (same as before).
- Document Collection: A list named documents holds the text content. The sample documents are chosen to represent a few distinct underlying topics (AI/Tech, Travel/Geography, Food/Cooking).
- Embedding Generation:
  - The script iterates through the documents.
  - It calls get_embedding for each document.
  - It stores the successful embeddings in the embeddings list and the corresponding document text in valid_documents. This ensures that the indices match later.
  - Error handling skips documents if embedding generation fails.
  - The list of embedding vectors is converted into a NumPy array (embedding_matrix), which is the standard input format for scikit-learn algorithms.
- Clustering (K-Means):
  - Choosing k: The number of clusters (n_clusters) is set (here, k=3, assuming we expect three topics based on the sample data). A comment highlights that finding the optimal k is often a separate task in real-world scenarios.
  - Initialization: A KMeans object is created. n_clusters specifies the desired number of groups. random_state ensures reproducibility. n_init=10 runs the algorithm multiple times with different starting centroids and chooses the best result; setting it explicitly also avoids the warning that some scikit-learn versions emit about the changing default.
  - Fitting: kmeans.fit(embedding_matrix) performs the K-Means clustering algorithm on the document embeddings. It finds cluster centers and assigns each embedding vector to the nearest center.
  - Labels: kmeans.labels_ contains an array where each element indicates the cluster ID (0, 1, 2, etc.) assigned to the corresponding document embedding.
- Displaying Results:
  - A dictionary (clustered_documents) is created to organize the results, with keys representing cluster IDs.
  - The script iterates through the cluster_labels assigned by K-Means. For each document's index i, it finds its assigned label and appends the corresponding text from valid_documents[i] to the list for that cluster ID in the dictionary.
  - Finally, it loops through the clustered_documents dictionary and prints the text of the documents belonging to each cluster, clearly grouping them by the topic cluster identified by the algorithm.
This example demonstrates the power of embeddings for unsupervised topic discovery. By converting text to vectors, we can use mathematical algorithms like K-Means to group semantically similar documents without needing pre-defined labels.
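Because cluster numbers are arbitrary, it often helps to show one representative document per cluster as an informal label. A minimal sketch of that idea follows: for each cluster it picks the document whose vector lies closest to the cluster's mean. The function name representative_docs and the toy two-dimensional vectors are assumptions made for illustration and do not appear in the script above.

```python
import numpy as np

def representative_docs(embeddings, labels, docs):
    """Map each cluster label to the document closest to that cluster's mean vector."""
    X = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    result = {}
    for label in np.unique(labels):
        member_idx = np.where(labels == label)[0]       # rows in this cluster
        centroid = X[member_idx].mean(axis=0)           # mean vector of the cluster
        distances = np.linalg.norm(X[member_idx] - centroid, axis=1)
        closest = member_idx[np.argmin(distances)]      # row nearest the centroid
        result[int(label)] = docs[closest]
    return result

# Toy 2-dimensional vectors and labels standing in for real embeddings and K-means output.
toy_embeddings = [[0, 0], [1, 0], [4, 0], [10, 10], [10, 13], [10, 12]]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_docs = ["doc A", "doc B", "doc C", "doc D", "doc E", "doc F"]

reps = representative_docs(toy_embeddings, toy_labels, toy_docs)
print(reps)
```

With the clustering script, the call would be representative_docs(embedding_matrix, cluster_labels, valid_documents), and the returned text could be printed next to each cluster heading.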
3.2.3 Recommendation Systems
Recommendation systems suggest related items by capturing the deeper connections between different pieces of content. Embeddings enable personalized recommendations by representing the semantic relationships between items as distances between vectors, which can surface patterns and similarities that are not immediately obvious to human observers.
Here's how recommendation systems leverage embeddings:
- Content-Based Filtering
- Systems analyze the actual content characteristics (like text descriptions, features, or attributes)
- Each item is converted into an embedding vector that represents its key features
- Similar items are found by measuring the distance between these vectors
- Collaborative Filtering Enhancement
- User behaviors and preferences are also converted into embeddings
- The system can identify patterns in user-item interactions
- This helps predict which items a user might like based on similar users' preferences
For example, a video streaming service can recommend shows not just based on genre tags, but by comparing thematic elements, storytelling styles, and narrative patterns. To the extent these qualities are described in the text being embedded (such as synopses or reviews), the embedding vectors can capture nuanced features like:
- Pacing and plot complexity
- Character development styles
- Emotional tone and atmosphere
- Visual and directorial techniques
Similarly, e-commerce platforms can suggest products by understanding the contextual similarities in product descriptions, user behavior, and item characteristics. This includes analyzing:
- Product descriptions and features
- User browsing and purchase patterns
- Price points and quality levels
- Brand relationships and market positioning
This semantic understanding leads to more accurate and relevant recommendations compared to traditional methods that rely solely on explicit categories or user ratings. The system can identify subtle connections and patterns that might be missed by conventional recommendation approaches, resulting in more engaging and personalized user experiences.
Example:
The following code example demonstrates how OpenAI embeddings can be used to build a simple content-based recommendation system.
This script will:
- Define a small catalog of items (e.g., movie descriptions).
- Generate embeddings for these items.
- Choose a target item.
- Find other items in the catalog that are semantically similar to the target item based on their embeddings.
- Present the most similar items as recommendations.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np # For cosine similarity calculation
import datetime
# --- Configuration ---
load_dotenv()
# Get the current date and location context
current_timestamp = "2025-03-24 15:29:00 CDT"
current_location = "Austin, Texas, United States"
print(f"Running Recommendation System example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    # Truncate text for printing if it's too long
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        # print("Embedding generation successful.") # Reduce verbosity
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0 # Return 0 if any vector is missing
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return similarity
# --- Recommendation System Implementation ---
# 1. Define your item catalog (e.g., movie descriptions)
# In a real application, this would come from a database.
item_catalog = [
{"id": "mov001", "title": "Space Odyssey: The Final Frontier", "description": "A visually stunning sci-fi epic exploring humanity's place in the universe, featuring complex themes and groundbreaking special effects."},
{"id": "mov002", "title": "Galactic Wars: Attack of the Clones", "description": "An action-packed space opera with laser battles, alien creatures, and a classic good versus evil storyline."},
{"id": "com001", "title": "Laugh Riot", "description": "A slapstick comedy about mistaken identities and hilarious mishaps during a weekend getaway."},
{"id": "doc001", "title": "Wonders of the Deep", "description": "An awe-inspiring documentary showcasing the beauty and mystery of marine life in the world's oceans."},
{"id": "mov003", "title": "Cyber City 2077", "description": "A gritty cyberpunk thriller set in a dystopian future, exploring themes of technology, consciousness, and rebellion."},
{"id": "com002", "title": "The Office Party", "description": "A witty ensemble comedy centered around awkward interactions and office politics during an annual holiday celebration."},
{"id": "doc002", "title": "Cosmic Journeys", "description": "A documentary exploring the vastness of space, black holes, distant galaxies, and the search for extraterrestrial life."},
{"id": "mov004", "title": "Interstellar Echoes", "description": "A philosophical science fiction film about astronauts travelling through a wormhole in search of a new home for humanity."}
]
print(f"\nItem catalog contains {len(item_catalog)} items.")
# 2. Generate embeddings for all items in the catalog (pre-computation)
print("\nGenerating embeddings for the item catalog...")
item_embeddings_data = []
for item in item_catalog:
    # Combine title and description for a richer embedding
    text_to_embed = f"{item['title']}: {item['description']}"
    embedding = get_embedding(client, text_to_embed)
    if embedding:
        # Store item ID and its embedding
        item_embeddings_data.append({"id": item["id"], "embedding": embedding})
    else:
        print(f"Skipping item {item['id']} due to embedding error.")
if not item_embeddings_data:
    print("\nError: No embeddings were generated. Cannot provide recommendations.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(item_embeddings_data)} items.")
# 3. Select a target item for which to find recommendations
target_item_id = "mov001" # Let's find movies similar to "Space Odyssey"
print(f"\nFinding recommendations similar to item ID: {target_item_id}")
# Find the embedding for the target item
target_embedding = None
for item_data in item_embeddings_data:
    if item_data["id"] == target_item_id:
        target_embedding = item_data["embedding"]
        break
if target_embedding is None:
    print(f"Error: Could not find the embedding for the target item ID '{target_item_id}'.")
    exit()
# 4. Calculate similarity between the target item and all other items
recommendations = []
print("\nCalculating similarities...")
for item_data in item_embeddings_data:
    # Don't compare the item with itself
    if item_data["id"] == target_item_id:
        continue
    similarity = cosine_similarity(target_embedding, item_data["embedding"])
    recommendations.append({"id": item_data["id"], "score": similarity})
# 5. Sort potential recommendations by similarity score
recommendations.sort(key=lambda x: x["score"], reverse=True)
# 6. Display top N recommendations
print("\n--- Top Recommendations ---")
# Find the original title/description for the target item for context
target_item_info = next((item for item in item_catalog if item["id"] == target_item_id), None)
if target_item_info:
    print(f"Based on: \"{target_item_info['title']}\"\n")
if not recommendations:
    print("No recommendations found (or error calculating similarities).")
else:
    top_n = 3
    print(f"Top {top_n} most similar items:")
    for i, rec in enumerate(recommendations[:top_n]):
        # Find the full item details from the original catalog
        rec_details = next((item for item in item_catalog if item["id"] == rec["id"]), None)
        if rec_details:
            print(f"{i+1}. ID: {rec['id']}, Score: {rec['score']:.4f}")
            print(f" Title: {rec_details['title']}")
            print(f" Description: {rec_details['description'][:100]}...") # Truncate description
            print("-" * 10)
        else:
            print(f"{i+1}. ID: {rec['id']}, Score: {rec['score']:.4f} (Details not found)")
            print("-" * 10)
    if len(recommendations) > top_n:
        print(f"(Showing top {top_n} of {len(recommendations)} potential recommendations)")
Code Breakdown Explanation
This example demonstrates how to build a straightforward content-based recommendation system by combining OpenAI embeddings with cosine similarity calculations.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions previously defined.
- Item Catalog:
  - A list of dictionaries (item_catalog) represents the items available for recommendation (e.g., movies). Each item has an id, title, and description. In a real system, this would likely be loaded from a database.
- Item Embedding Generation:
  - The script iterates through each item in the item_catalog.
  - Content Combination: It combines the title and description into a single string (text_to_embed). This provides richer context to the embedding model than using just the title or description alone.
  - It calls get_embedding for this combined text.
  - It stores the item['id'] and its corresponding embedding vector together in the item_embeddings_data list. This pre-computation step is standard practice for recommendation systems.
- Target Item Selection:
  - A target_item_id variable is set to specify the item for which we want recommendations (e.g., find items similar to mov001).
  - The script retrieves the pre-computed embedding vector for this target_item_id from the item_embeddings_data list.
- Similarity Calculation:
  - It iterates through all the items with embeddings in item_embeddings_data.
  - Exclusion: It explicitly skips the comparison if the current item's ID matches the target_item_id (an item shouldn't recommend itself).
  - For every other item, it calculates the cosine_similarity between the target_embedding and the current item's embedding.
  - It stores the other item's id and its calculated similarity score in a recommendations list.
- Ranking Recommendations:
  - The recommendations list is sorted using recommendations.sort(...) based on the score field in descending order, placing the most similar items at the beginning of the list.
- Displaying Results:
  - The script prints the title of the target item for context.
  - It then iterates through the top N (e.g., 3) items in the sorted recommendations list.
  - For each recommended item ID, it looks up the full details (title, description) from the original item_catalog.
  - It prints the rank, ID, similarity score, title, and a truncated description for each recommended item.
This example effectively shows how embeddings capture semantic meaning, allowing the system to recommend items based on content similarity (e.g., recommending other philosophical sci-fi movies similar to "Space Odyssey") rather than just explicit tags or user history.
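The script recommends items similar to a single target item. A common extension, sketched here under simple assumptions, is to represent a user by the average of the embeddings of the items they liked and then rank the remaining items against that profile vector. The function name recommend_for_user and the three-dimensional toy vectors are hypothetical; real item embeddings would come from the pre-computed item_embeddings_data.

```python
import numpy as np

def recommend_for_user(liked_ids, item_vectors, top_n=2):
    """Rank unseen items by cosine similarity to the mean vector of the user's liked items."""
    # User profile: the average of the liked items' vectors, scaled to unit length.
    profile = np.mean([item_vectors[i] for i in liked_ids], axis=0)
    profile = profile / np.linalg.norm(profile)
    scored = []
    for item_id, vec in item_vectors.items():
        if item_id in liked_ids:
            continue  # never recommend something the user already liked
        vec = np.asarray(vec, dtype=float)
        scored.append((item_id, float(vec @ profile / np.linalg.norm(vec))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Made-up vectors keyed by the catalog IDs used in the example above.
toy_vectors = {
    "mov001": [0.9, 0.1, 0.0],
    "mov004": [0.8, 0.2, 0.0],
    "doc002": [0.7, 0.0, 0.3],
    "com001": [0.0, 1.0, 0.0],
    "com002": [0.1, 0.9, 0.0],
}

recs = recommend_for_user({"mov001", "mov004"}, toy_vectors)
for item_id, score in recs:
    print(f"{item_id}: {score:.4f}")
```

Averaging is the simplest possible profile; weighting recent or highly rated items more heavily is a natural refinement, but it is outside what this sketch shows.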
3.2.4 Context retrieval for AI assistants
Helping chatbots and AI systems find and use relevant information from large knowledge bases by converting both queries and stored knowledge into embeddings. This process involves several key steps:
First, the system converts all documents in its knowledge base into embedding vectors - numerical representations that capture the semantic meaning of the text. These embeddings are stored in a vector database for quick retrieval.
When an AI assistant receives a question, it converts that query into an embedding vector using the same process. This ensures that both the stored knowledge and the incoming questions are represented in the same mathematical space.
The system then performs a similarity search to find the most relevant information. This search compares the query embedding to all stored embeddings, typically using techniques like cosine similarity or nearest neighbor search. The beauty of this approach is that it can identify semantically similar content even when the exact wording differs significantly.
For example, a query about "laptop won't turn on" might match documentation about "computer power issues" because their embeddings capture the similar underlying meaning. A keyword-based search would likely miss this match, since the two phrases share no words.
Once relevant information is identified, it can be used to generate more accurate, informed responses. This is particularly powerful for domain-specific applications where the AI needs to access technical documentation, product information, or company policies. The system can handle complex queries by combining multiple pieces of relevant context, ensuring responses are both accurate and comprehensive.
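The similarity search described above is often implemented by normalizing every stored vector to unit length once, at indexing time. After that, cosine similarity reduces to a plain dot product at query time. The sketch below illustrates the idea under stated assumptions: the helper names (normalize, dot, search), the document labels, and the toy 3-dimensional vectors are hypothetical and stand in for real embeddings and a real vector store.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def dot(vec_a, vec_b):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(vec_a, vec_b))


def search(query_vec, index, top_k=2):
    """Rank pre-normalized (doc_id, vector) pairs against a query vector."""
    q = normalize(query_vec)
    scored = [(doc_id, dot(q, vec)) for doc_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (hypothetical values)
raw_docs = {
    "computer power issues": [0.9, 0.2, 0.1],
    "password reset steps": [0.1, 0.9, 0.2],
    "charging adapter specs": [0.7, 0.1, 0.6],
}
# Normalize once at indexing time, not on every query
index = [(doc_id, normalize(vec)) for doc_id, vec in raw_docs.items()]

results = search([0.8, 0.1, 0.2], index)  # stands in for "laptop won't turn on"
for doc_id, score in results:
    print(f"{score:.4f}  {doc_id}")
```

A linear scan like this is adequate for small knowledge bases; larger collections usually rely on the approximate nearest neighbor search that vector databases provide.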
Example:
Below is a code example that demonstrates how an AI assistant can retrieve context using OpenAI embeddings.
The script walks through the essential process: embedding a knowledge base, embedding the user's query, and ranking the documents by similarity to find relevant context for the assistant's response.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np # For cosine similarity calculation
import datetime
# --- Configuration ---
load_dotenv()
# Record the run time; the location is a fixed, illustrative label
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Grapevine, Texas, United States"
print(f"Running Context Retrieval example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return similarity
# --- Context Retrieval Implementation ---
# 1. Define your Knowledge Base (list of text documents/chunks)
# This represents the information the AI assistant can draw upon.
knowledge_base = [
{"id": "doc001", "source": "troubleshooting_guide.txt", "content": "If your laptop fails to power on, first check the power adapter connection. Ensure the cable is securely plugged into both the laptop and the wall outlet. Try a different outlet if possible."},
{"id": "doc002", "source": "troubleshooting_guide.txt", "content": "A blinking power light often indicates a battery issue or a charging problem. Try removing the battery (if removable) and powering on with only the adapter connected."},
{"id": "doc003", "source": "faq.html", "content": "To reset your password, go to the login page and click the 'Forgot Password' link. Follow the instructions sent to your registered email address."},
{"id": "doc004", "source": "product_manual.pdf", "content": "The Model X laptop uses a USB-C port for charging. Ensure you are using the correct wattage power adapter (65W minimum recommended)."},
{"id": "doc005", "source": "troubleshooting_guide.txt", "content": "No display output? Check if the laptop is making any sounds (fan spinning, beeps). Try connecting an external monitor to rule out a screen issue."},
{"id": "doc006", "source": "support_articles/power_issues.md", "content": "Holding the power button down for 15-30 seconds can perform a hard reset, sometimes resolving power-on failures."},
{"id": "doc007", "source": "faq.html", "content": "Software updates can be found in the 'System Settings' under the 'Updates' section. Ensure you are connected to the internet."}
]
print(f"\nKnowledge base contains {len(knowledge_base)} documents/chunks.")
# 2. Generate embeddings for the knowledge base (pre-computation)
print("\nGenerating embeddings for the knowledge base...")
kb_embeddings_data = []
for doc in knowledge_base:
    embedding = get_embedding(client, doc["content"])
    if embedding:
        # Store document ID and its embedding
        kb_embeddings_data.append({"id": doc["id"], "embedding": embedding})
    else:
        print(f"Skipping document {doc['id']} due to embedding error.")
if not kb_embeddings_data:
    print("\nError: No embeddings were generated for the knowledge base. Cannot retrieve context.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(kb_embeddings_data)} knowledge base documents.")
# 3. Define the user's query to the AI assistant
user_query = "My computer won't start up."
# user_query = "How do I update the system software?"
# user_query = "Screen is black when I press the power button."
print(f"\nUser Query: \"{user_query}\"")
# 4. Generate embedding for the user query
print("\nGenerating embedding for the user query...")
query_embedding = get_embedding(client, user_query)
# 5. Find relevant documents from the knowledge base using similarity search
retrieved_context = []
if query_embedding and kb_embeddings_data:
    print("\nCalculating similarities to find relevant context...")
    for doc_data in kb_embeddings_data:
        similarity = cosine_similarity(query_embedding, doc_data["embedding"])
        retrieved_context.append({"id": doc_data["id"], "score": similarity})
    # Sort context documents by similarity score in descending order
    retrieved_context.sort(key=lambda x: x["score"], reverse=True)
    # 6. Select Top N relevant documents to use as context
    top_n_context = 3
    print(f"\n--- Top {top_n_context} Relevant Context Documents Found ---")
    if not retrieved_context:
        print("No relevant context found (or error calculating similarities).")
    else:
        final_context_docs = []
        for i, context_item in enumerate(retrieved_context[:top_n_context]):
            # Find the full document details from the original knowledge base
            doc_details = next((doc for doc in knowledge_base if doc["id"] == context_item["id"]), None)
            if doc_details:
                print(f"{i+1}. ID: {context_item['id']}, Score: {context_item['score']:.4f}")
                print(f"   Source: {doc_details['source']}")
                print(f"   Content: {doc_details['content'][:150]}...") # Truncate content
                print("-" * 10)
                final_context_docs.append(doc_details['content']) # Store content for next step
            else:
                print(f"{i+1}. ID: {context_item['id']}, Score: {context_item['score']:.4f} (Details not found)")
                print("-" * 10)
        if len(retrieved_context) > top_n_context:
            print(f"(Showing top {top_n_context} of {len(retrieved_context)} potential context documents)")
        # --- Next Step (Conceptual - Not coded here) ---
        print("\n--- Next Step: Generating AI Assistant Response ---")
        print("The content from the relevant documents above would now be combined")
        print("with the original user query and sent to a model like GPT-4o")
        print("as context to generate an informed and accurate response.")
        print("Example prompt structure for GPT-4o:")
        print("```")
        print("System: You are a helpful AI assistant. Answer the user's question based ONLY on the provided context documents.")
        # Preview every retrieved document; indexing [0] and [1] directly fails when fewer than two were found
        context_preview = "\n".join(f"{n}. {text[:50]}..." for n, text in enumerate(final_context_docs, start=1))
        print(f"User: Context Documents:\n{context_preview}\n\nQuestion: {user_query}\n\nAnswer:")
        print("```")
else:
    print("\nCould not retrieve context.")
    if not query_embedding:
        print("Reason: Failed to generate embedding for the user query.")
    if not kb_embeddings_data:
        print("Reason: No knowledge base embeddings were successfully generated.")
Code Breakdown Explanation
This example demonstrates the core mechanism behind context retrieval for AI assistants using embeddings – finding relevant information from a knowledge base to answer a user's query.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions.
- Knowledge Base Definition:
  - A list of dictionaries (knowledge_base) simulates the information store the AI assistant can access. Each dictionary represents a document or chunk of information and includes an id, source (optional metadata), and the actual text content.
- Knowledge Base Embedding Generation:
  - The script iterates through each doc in the knowledge_base.
  - It calls get_embedding on the doc["content"] to get its vector representation.
  - It stores the doc['id'] and its corresponding embedding vector together in kb_embeddings_data. This is the crucial pre-computation step: embeddings for the knowledge base are typically generated offline and stored (often in a specialized vector database) for fast retrieval.
- User Query:
  - A sample user_query string represents the question asked to the AI assistant.
- Query Embedding Generation:
  - The get_embedding function is called for the user_query to get its vector representation in the same embedding space as the knowledge base documents.
- Similarity Search (Context Retrieval):
  - It iterates through all the pre-computed embeddings in kb_embeddings_data.
  - For each knowledge base document, it calculates the cosine_similarity between the query_embedding and the document's embedding.
  - It stores the document's id and its similarity score relative to the query in a retrieved_context list.
- Ranking and Selection:
  - The retrieved_context list is sorted by score in descending order, bringing the most semantically relevant documents to the top.
  - The script selects the top N (e.g., 3) documents from this sorted list. These documents represent the most relevant context found in the knowledge base for the user's query.
- Displaying Retrieved Context:
  - The script prints the details (ID, score, source, content preview) of the top N context documents found.
- Conceptual Next Step (Crucial Explanation):
  - The final print statements explain the purpose of this retrieval process. The content of these final_context_docs would not be the final answer. Instead, they would be combined with the original user_query and passed as context to a large language model like GPT-4o in a subsequent API call.
  - An example prompt structure is shown, illustrating how the retrieved context grounds the AI assistant, enabling it to generate an informed response based on the relevant information found in the knowledge base, rather than relying solely on its general knowledge.
This example effectively demonstrates the retrieval part of Retrieval-Augmented Generation (RAG), showing how embeddings bridge the gap between a user's query and relevant information stored in a knowledge base, enabling more accurate and context-aware AI assistants.
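The generation step that the script only describes can be sketched as a small prompt-assembly helper. This is an illustration under assumptions rather than part of the original script: the function name build_rag_messages and the max_chars budget are hypothetical, and the character budget is only a rough stand-in for proper token counting.

```python
def build_rag_messages(user_query, context_docs, max_chars=2000):
    """Build chat messages that ground the model in retrieved context."""
    numbered = []
    used = 0
    for i, content in enumerate(context_docs, start=1):
        entry = f"{i}. {content}"
        if used + len(entry) > max_chars:
            break  # stay within a rough character budget
        numbered.append(entry)
        used += len(entry)

    context_block = "\n".join(numbered) if numbered else "(no context found)"
    system_prompt = (
        "You are a helpful AI assistant. Answer the user's question "
        "based ONLY on the provided context documents."
    )
    user_prompt = (
        f"Context Documents:\n{context_block}\n\n"
        f"Question: {user_query}\n\nAnswer:"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


# Two context snippets taken from the sample knowledge base above
messages = build_rag_messages(
    "My computer won't start up.",
    [
        "If your laptop fails to power on, first check the power adapter connection.",
        "Holding the power button down for 15-30 seconds can perform a hard reset.",
    ],
)
print(messages[1]["content"])
```

The returned list has the shape accepted by client.chat.completions.create(model="gpt-4o", messages=messages), so the retrieved documents can be passed to the model together with the user's question.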
3.2.5 Anomaly and similarity detection
Identifying unusual patterns or finding similar items in large datasets by comparing their semantic representations is a fundamental application of embedding technology. This powerful technique transforms raw data into mathematical vectors that capture the essence of their content, enabling sophisticated analysis at scale. Here's how these systems work and their key applications:
- Detect Anomalies
  - Flag unusual transactions or behaviors that deviate from normal patterns, for example detecting suspicious credit card purchases by comparing them against typical spending patterns
  - Identify potential security threats or fraud attempts, such as recognizing unusual login patterns or detecting fake accounts based on behavior analysis
  - Spot data quality issues or outliers in datasets, including incorrect data entries or unusual measurements that might indicate equipment malfunction
- Find Similarities
  - Group related documents, images, or data points based on semantic meaning. This allows systems to cluster similar content even when the exact wording differs, making it easier to organize large collections of information
  - Match similar customer inquiries or support tickets, helping customer service teams identify common issues and standardize responses to frequent problems
  - Identify duplicate or near-duplicate content, which is useful for content management systems to maintain data quality and reduce redundancy
By converting data points into embedding vectors, systems can measure how "different" or "similar" items are to each other using mathematical distance calculations. This process works by mapping each item to a point in a high-dimensional space, where similar items are positioned closer together and dissimilar items are farther apart. This mathematical representation makes it possible to automatically flag unusual patterns or group related items together at scale, enabling both anomaly detection and similarity matching in ways that are difficult to achieve with traditional rule-based systems.
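To make the distance idea concrete, the sketch below measures cosine distance (1 minus cosine similarity) between every pair of items and reports the pairs that fall under a threshold, which is one simple way to surface near-duplicates. The function names, the ticket IDs, the toy vectors, and the 0.05 threshold are all assumptions chosen for illustration; a suitable threshold depends on the embedding model and the data.

```python
import math
from itertools import combinations


def cosine_distance(vec_a, vec_b):
    """1 - cosine similarity: 0.0 for the same direction, 1.0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def find_near_duplicates(embeddings, max_distance=0.05):
    """Return (id_a, id_b, distance) for every pair within max_distance."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        distance = cosine_distance(vec_a, vec_b)
        if distance <= max_distance:
            pairs.append((id_a, id_b, distance))
    return pairs


# Hypothetical toy vectors standing in for real embeddings
toy = {
    "ticket-1": [0.90, 0.10, 0.05],
    "ticket-2": [0.88, 0.12, 0.06],  # nearly the same direction as ticket-1
    "ticket-3": [0.05, 0.20, 0.95],
}
dups = find_near_duplicates(toy)
for id_a, id_b, distance in dups:
    print(f"{id_a} ~ {id_b} (distance: {distance:.4f})")
```

Comparing all pairs grows quadratically with the number of items, so this approach suits small datasets; larger ones typically need an index that supports nearest neighbor queries.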
Example:
The following code example demonstrates similarity and anomaly detection using OpenAI embeddings.
This script will:
- Define a dataset of text items (e.g., descriptions of transactions or events).
- Generate embeddings for these items.
- Similarity Detection: Find items most similar to a given target item.
- Anomaly Detection: Identify items that are least similar (most anomalous) compared to the rest of the dataset using a simple average similarity approach.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np # For cosine similarity calculation
import datetime
# --- Configuration ---
load_dotenv()
# Record the run time; the location is a fixed, illustrative label
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Houston, Texas, United States"
print(f"Running Similarity & Anomaly Detection example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        # Clamp the value to handle potential floating point inaccuracies slightly outside [-1, 1]
        return np.clip(similarity, -1.0, 1.0)
# --- Similarity and Anomaly Detection Implementation ---
# 1. Define your dataset (e.g., transaction descriptions, log entries)
# Includes mostly normal items and a couple of potentially anomalous ones.
dataset = [
{"id": "txn001", "description": "Grocery purchase at Local Supermarket"},
{"id": "txn002", "description": "Monthly subscription fee for streaming service"},
{"id": "txn003", "description": "Dinner payment at Italian Restaurant"},
{"id": "txn004", "description": "Online order for electronics from TechStore"},
{"id": "txn005", "description": "Fuel purchase at Gas Station"},
{"id": "txn006", "description": "Purchase of fresh produce and bread"}, # Similar to txn001
{"id": "txn007", "description": "Payment for movie streaming subscription"}, # Similar to txn002
{"id": "txn008", "description": "Unusual large wire transfer to overseas account"}, # Potential Anomaly 1
{"id": "txn009", "description": "Purchase of rare antique collectible vase"}, # Potential Anomaly 2
{"id": "txn010", "description": "Coffee purchase at Cafe Central"}
]
print(f"\nDataset contains {len(dataset)} items.")
# 2. Generate embeddings for all items in the dataset (pre-computation)
print("\nGenerating embeddings for the dataset...")
dataset_embeddings_data = []
for item in dataset:
    embedding = get_embedding(client, item["description"])
    if embedding:
        # Store item ID, description, and its embedding
        dataset_embeddings_data.append({
            "id": item["id"],
            "description": item["description"],
            "embedding": embedding
        })
    else:
        print(f"Skipping item {item['id']} due to embedding error.")
if not dataset_embeddings_data:
    print("\nError: No embeddings were generated. Cannot perform analysis.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(dataset_embeddings_data)} items.")
# --- Part A: Similarity Detection ---
print("\n--- Part A: Similarity Detection ---")
# Select a target item to find similar items for
target_item_id_similarity = "txn001" # Find items similar to "Grocery purchase..."
print(f"Finding items similar to item ID: {target_item_id_similarity}")
# Find the target item's data
target_item_data = next((item for item in dataset_embeddings_data if item["id"] == target_item_id_similarity), None)
if target_item_data:
    target_embedding = target_item_data["embedding"]
    similar_items = []
    # Calculate similarity between the target and all other items
    for item_data in dataset_embeddings_data:
        if item_data["id"] == target_item_id_similarity:
            continue # Skip self-comparison
        similarity = cosine_similarity(target_embedding, item_data["embedding"])
        similar_items.append({
            "id": item_data["id"],
            "description": item_data["description"],
            "score": similarity
        })
    # Sort by similarity score
    similar_items.sort(key=lambda x: x["score"], reverse=True)
    # Display top N similar items
    print(f"\nItems most similar to: \"{target_item_data['description']}\"")
    top_n_similar = 2
    for i, item in enumerate(similar_items[:top_n_similar]):
        print(f"{i+1}. ID: {item['id']}, Score: {item['score']:.4f}")
        print(f"   Description: {item['description']}")
        print("-" * 10)
else:
    print(f"Error: Could not find data for target item ID '{target_item_id_similarity}'.")
# --- Part B: Anomaly Detection (Simple Approach) ---
print("\n--- Part B: Anomaly Detection (Low Average Similarity) ---")
# Calculate the average similarity of each item to all other items
item_avg_similarities = []
num_items = len(dataset_embeddings_data)
if num_items < 2:
    print("Need at least 2 items with embeddings to calculate average similarities.")
else:
    print("\nCalculating average similarities for anomaly detection...")
    for i in range(num_items):
        current_item = dataset_embeddings_data[i]
        total_similarity = 0
        # Compare current item to all others
        for j in range(num_items):
            if i == j: # Don't compare item to itself
                continue
            other_item = dataset_embeddings_data[j]
            similarity = cosine_similarity(current_item["embedding"], other_item["embedding"])
            total_similarity += similarity
        # Calculate average similarity (avoid division by zero if only 1 item)
        average_similarity = total_similarity / (num_items - 1) if num_items > 1 else 0
        item_avg_similarities.append({
            "id": current_item["id"],
            "description": current_item["description"],
            "avg_score": average_similarity
        })
        print(f"Item ID {current_item['id']} - Avg Similarity: {average_similarity:.4f}")
    # Sort items by average similarity in ascending order (lowest first = most anomalous)
    item_avg_similarities.sort(key=lambda x: x["avg_score"])
    # Display top N potential anomalies (items least similar to others)
    print("\nPotential Anomalies (Lowest Average Similarity):")
    top_n_anomalies = 3
    for i, item in enumerate(item_avg_similarities[:top_n_anomalies]):
        print(f"{i+1}. ID: {item['id']}, Avg Score: {item['avg_score']:.4f}")
        print(f"   Description: {item['description']}")
        print("-" * 10)
    print("\nNote: Low average similarity suggests an item is semantically")
    print("different from the majority of other items in this dataset.")
Code Breakdown Explanation
This example demonstrates using OpenAI embeddings for both finding similar items and detecting potential anomalies within a dataset based on semantic meaning.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions. The cosine_similarity function now includes np.clip to ensure the output is strictly within [-1, 1].
- Dataset Definition:
  - A list of dictionaries (dataset) simulates the data to be analyzed (e.g., transaction descriptions). Each item has an id and a text description. The sample data includes mostly common items and a few conceptually different ones intended as potential anomalies.
- Dataset Embedding Generation:
  - The script iterates through each item in the dataset.
  - It calls get_embedding on the item["description"].
  - It stores the item['id'], item['description'], and its corresponding embedding vector together in dataset_embeddings_data. This pre-computation is essential.
- Part A: Similarity Detection:
  - Target Selection: An item ID (target_item_id_similarity) is chosen to find similar items for.
  - Target Embedding Retrieval: The script finds the pre-computed embedding for the target item.
  - Comparison: It iterates through all other items in dataset_embeddings_data and calculates the cosine_similarity between the target item's embedding and each other item's embedding.
  - Ranking: The results (other item ID, description, similarity score) are stored and then sorted by score in descending order.
  - Display: The top N most similar items are printed.
- Part B: Anomaly Detection (Simple Average Similarity Approach):
  - Concept: This simple method identifies anomalies as items that have the lowest average semantic similarity to all other items in the dataset. An item that is very different conceptually from the rest will likely have low similarity scores when compared to most others.
  - Calculation:
    - The script iterates through each item (current_item) in dataset_embeddings_data.
    - For each current_item, it iterates through all other items in the dataset.
    - It calculates the cosine_similarity between the current_item and every other_item.
    - It sums these similarities and calculates the average similarity for the current_item.
  - Ranking: The items are stored along with their calculated avg_score and then sorted by this score in ascending order (lowest average similarity first).
  - Display: The top N items with the lowest average similarity scores are printed as potential anomalies. A note explains the interpretation.
This example showcases two powerful applications: finding related content (similarity) and identifying outliers (anomaly detection) by leveraging the semantic understanding captured within OpenAI embeddings.
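Printing the N lowest-scoring items always reports N "anomalies", even when nothing in the dataset is unusual. One possible refinement, shown here as a sketch and not as the script's method, is to flag only the items whose average similarity falls well below the dataset mean. The function name flag_anomalies, the z_threshold value of 1.5, and the average-similarity numbers are assumptions made for illustration; the numbers are not real model output.

```python
from statistics import mean, pstdev


def flag_anomalies(avg_scores, z_threshold=1.5):
    """Flag items whose average similarity is unusually low for this dataset."""
    values = list(avg_scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every item looks equally typical
    cutoff = mu - z_threshold * sigma
    flagged = [(item_id, score) for item_id, score in avg_scores.items() if score < cutoff]
    return sorted(flagged, key=lambda pair: pair[1])  # most anomalous first


# Hypothetical average-similarity values for the ten sample transactions
toy_avg_scores = {
    "txn001": 0.31, "txn002": 0.29, "txn003": 0.30, "txn004": 0.28,
    "txn005": 0.30, "txn006": 0.32, "txn007": 0.29, "txn008": 0.12,
    "txn009": 0.27, "txn010": 0.31,
}
flagged = flag_anomalies(toy_avg_scores)
for item_id, score in flagged:
    print(f"Potential anomaly: {item_id} (avg similarity: {score:.2f})")
```

In the script above, the input dictionary could be built from item_avg_similarities by mapping each item's id to its avg_score.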
3.2.6 Clustering & Tagging
Automatically organize and label content based on semantic similarity - a powerful technique that uses embedding vectors to understand the true meaning and relationships between different pieces of content. This approach goes far beyond traditional keyword matching, allowing for much more nuanced and accurate content organization.
When content is clustered, similar items naturally group together based on their semantic meaning, even if they use different terminology to express the same concepts. For example, documents about "automotive maintenance" and "car repair" would cluster together despite using different words.
This intelligent organization helps create intuitive navigation systems, improves content discovery, and makes large document collections more manageable by grouping related items together. Some key benefits include:
- Automatic tag generation based on cluster themes
- Dynamic organization that adapts as new content is added
- Improved search relevance through semantic understanding
- Better content discovery through related-item suggestions
The clustering process can be fine-tuned to create either broad categories or more granular subcategories, depending on the specific needs of your content organization system. This flexibility makes it a valuable tool for managing everything from digital libraries to enterprise knowledge bases.
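The effect of granularity can be illustrated with a deliberately simple technique: greedy threshold clustering, where each item joins the first cluster whose seed vector is similar enough and otherwise starts a new cluster. This is not K-Means and is not taken from the chapter's script; the function name threshold_cluster, the toy vectors, and the two thresholds are assumptions chosen only to show how one parameter moves the result from broad categories to finer ones.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def threshold_cluster(embeddings, min_similarity):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is {"seed": vector, "members": [ids]}
    for item_id, vec in embeddings.items():
        for cluster in clusters:
            if cosine_similarity(cluster["seed"], vec) >= min_similarity:
                cluster["members"].append(item_id)
                break
        else:
            # No existing cluster was close enough, so this item seeds a new one
            clusters.append({"seed": vec, "members": [item_id]})
    return [cluster["members"] for cluster in clusters]


# Hypothetical toy vectors standing in for real document embeddings
toy = {
    "car repair": [0.9, 0.1, 0.0],
    "automotive maintenance": [0.85, 0.15, 0.05],
    "engine tuning": [0.7, 0.3, 0.2],
    "pasta recipe": [0.05, 0.1, 0.95],
}
broad = threshold_cluster(toy, min_similarity=0.80)
granular = threshold_cluster(toy, min_similarity=0.98)
print(f"Broad ({len(broad)} clusters): {broad}")
print(f"Granular ({len(granular)} clusters): {granular}")
```

With K-Means, the comparable control is the number of clusters k: a smaller k yields broader categories and a larger k yields more granular ones.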
Example:
Let's examine a code example that demonstrates clustering and tagging using OpenAI embeddings and GPT-4o.
This script will:
- Define a collection of documents.
- Generate embeddings for the documents.
- Cluster the documents using K-Means based on their embeddings.
- For each cluster, use GPT-4o to analyze the documents within it and generate a descriptive tag or label.
- Display the documents grouped by cluster along with their AI-generated tags.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np
from sklearn.cluster import KMeans # For clustering algorithm
import datetime
# --- Configuration ---
load_dotenv()
# Record the run time; the location is a fixed, illustrative label
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "San Antonio, Texas, United States"
print(f"Running Clustering & Tagging example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function to Generate Cluster Tag using GPT-4o ---
def generate_cluster_tag(client, documents_in_cluster):
    """Uses GPT-4o to suggest a tag/label for a cluster of documents."""
    if not documents_in_cluster:
        return "Empty Cluster"
    # Combine content for context, limiting total length if necessary
    # Using first few hundred chars of each doc might be enough
    max_context_length = 3000 # Limit context to avoid excessive token usage
    context = ""
    for i, doc in enumerate(documents_in_cluster):
        doc_preview = f"Document {i+1}: {doc[:300]}...\n"
        if len(context) + len(doc_preview) > max_context_length:
            break
        context += doc_preview
    if not context:
        return "Error: Could not create context"
    system_prompt = "You are an expert at identifying themes and creating concise labels."
    user_prompt = f"""Based on the following document excerpts from a single cluster, suggest a short, descriptive tag or label (2-5 words) that captures the main theme or topic of this group.
Document Excerpts:
---
{context.strip()}
---
Suggested Tag/Label:
"""
    print(f"\nGenerating tag for cluster with {len(documents_in_cluster)} documents...")
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            max_tokens=20, # Short response expected
            temperature=0.3 # More deterministic label
        )
        tag = response.choices[0].message.content.strip().replace('"', '') # Clean up quotes
        print(f"Generated tag: '{tag}'")
        return tag
    except OpenAIError as e:
        print(f"OpenAI API Error generating tag: {e}")
        return "Tagging Error"
    except Exception as e:
        print(f"An unexpected error occurred during tag generation: {e}")
        return "Tagging Error"
# --- Clustering and Tagging Implementation ---
# 1. Define your collection of documents
# Covers topics: Space Exploration, Cooking/Food, Web Development
documents = [
"NASA launches new probe to study Jupiter's moons.",
"Recipe for authentic Italian pasta carbonara.",
"JavaScript frameworks like React and Vue dominate front-end development.",
"The James Webb Space Telescope captures stunning images of distant galaxies.",
"Tips for baking the perfect sourdough bread at home.",
"Understanding asynchronous programming in Node.js.",
"SpaceX successfully lands its reusable rocket booster after launch.",
"Exploring the different types of olive oil and their uses in cooking.",
"CSS Grid vs Flexbox: Choosing the right layout module.",
"The search for habitable exoplanets continues with new telescope data.",
"How to make delicious homemade pizza from scratch.",
"Building RESTful APIs using Express.js and MongoDB."
]
print(f"\nDocument collection contains {len(documents)} documents.")
# 2. Generate embeddings for all documents
print("\nGenerating embeddings for the document collection...")
embeddings = []
valid_documents = [] # Keep track of documents for which embedding was successful
for doc in documents:
    embedding = get_embedding(client, doc)
    if embedding:
        embeddings.append(embedding)
        valid_documents.append(doc) # Add corresponding document text
    else:
        print(f"Skipping document due to embedding error: \"{doc[:70]}...\"")
if not embeddings:
    print("\nError: No embeddings were generated. Cannot perform clustering.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(valid_documents)} documents.")
# Convert embeddings list to a NumPy array for scikit-learn
embedding_matrix = np.array(embeddings)
# 3. Apply Clustering Algorithm (K-Means)
# Choose the number of clusters (k). We expect 3 topics here.
n_clusters = 3
print(f"\nApplying K-Means clustering with k={n_clusters}...")
try:
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
kmeans.fit(embedding_matrix)
cluster_labels = kmeans.labels_
print("Clustering complete.")
except Exception as e:
print(f"An error occurred during clustering: {e}")
exit()
# 4. Group Documents by Cluster
print("\nGrouping documents by cluster...")
clustered_documents = {i: [] for i in range(n_clusters)}
for i, label in enumerate(cluster_labels):
clustered_documents[label].append(valid_documents[i])
# 5. Generate Tags for Each Cluster using GPT-4o
print("\nGenerating tags for each cluster...")
cluster_tags = {}
for cluster_id, docs_in_cluster in clustered_documents.items():
tag = generate_cluster_tag(client, docs_in_cluster)
cluster_tags[cluster_id] = tag
# 6. Display Documents by Cluster with Generated Tags
print(f"\n--- Documents Grouped by Cluster and Tag (k={n_clusters}) ---")
for cluster_id, docs_in_cluster in clustered_documents.items():
generated_tag = cluster_tags.get(cluster_id, "Unknown Tag")
print(f"\nCluster {cluster_id + 1} - Suggested Tag: '{generated_tag}'")
print("-" * (28 + len(generated_tag))) # Adjust underline length
if not docs_in_cluster:
print(" (No documents in this cluster)")
else:
for doc_text in docs_in_cluster:
print(f" - {doc_text}") # Print full document text here
print("\nClustering and Tagging process complete.")
Code Breakdown Explanation
This script demonstrates how to automatically group similar documents by their semantic meaning using embeddings, then uses GPT-4o to generate descriptive tags for each group.
- Setup & Helpers:
  - Includes standard imports plus KMeans from sklearn.cluster.
  - Initializes the OpenAI client.
  - Includes the get_embedding helper function.
- New Helper Function: generate_cluster_tag
  - Purpose: Takes a list of documents belonging to a single cluster and uses GPT-4o to suggest a concise tag summarizing their common theme.
  - Input: The client object and documents_in_cluster (a list of text strings).
  - Context Creation: It concatenates parts of the documents (e.g., first 300 characters) to create a context string for GPT-4o, respecting a maximum length to manage token usage.
  - Prompt Engineering: It constructs a prompt asking GPT-4o to act as an expert theme identifier and suggest a short tag (2-5 words) based on the provided document excerpts.
  - API Call: Uses client.chat.completions.create with model="gpt-4o" and the specialized prompt. A low temperature is used for more focused tag generation.
  - Output: Returns the cleaned-up tag suggested by GPT-4o, or an error message.
- Document Collection: A list named documents holds sample text content covering a few distinct topics (Space, Cooking, Web Development).
- Embedding Generation:
  - The script iterates through the documents, generates an embedding for each using get_embedding, and stores successful embeddings and corresponding text in embeddings and valid_documents.
  - The embeddings are converted to a NumPy array (embedding_matrix).
- Clustering (K-Means):
  - The number of clusters (n_clusters) is set (e.g., k=3).
  - KMeans from scikit-learn is initialized and fitted to the embedding_matrix.
  - kmeans.labels_ provides the cluster assignment for each document.
- Grouping Documents:
  - A dictionary (clustered_documents) is created to store the text of documents belonging to each cluster ID.
- Generating Cluster Tags:
  - The script iterates through the clustered_documents dictionary.
  - For each cluster_id and its list of docs_in_cluster, it calls the generate_cluster_tag helper function.
  - The suggested tag for each cluster is stored in the cluster_tags dictionary.
- Displaying Results:
  - The script iterates through the clusters again.
  - For each cluster, it retrieves the generated tag from cluster_tags.
  - It prints the cluster number, the suggested tag, and then lists the full text of all documents belonging to that cluster.
This example showcases a powerful workflow: using embeddings for unsupervised grouping of content based on meaning (clustering) and then leveraging an LLM like GPT-4o to interpret those groupings and assign meaningful labels (tagging), automating content organization.
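The context-creation step described in the breakdown (excerpting each document and capping the total length before sending it to the model) can be sketched as follows. This is a minimal illustration, not the helper's actual body: the function name build_cluster_context, the parameters per_doc_chars and max_chars, and the bullet-per-excerpt layout are all assumptions made for this example.

```python
def build_cluster_context(docs, per_doc_chars=300, max_chars=1500):
    """Join document excerpts into one context string within a character budget.

    Hypothetical helper: names and default limits are illustrative assumptions.
    """
    parts = []
    used = 0
    for doc in docs:
        # Keep only the first part of each document to limit token usage
        excerpt = doc[:per_doc_chars].strip()
        line = f"- {excerpt}"
        # +1 accounts for the newline joining this line to the previous ones
        cost = len(line) + (1 if parts else 0)
        if used + cost > max_chars:
            break  # Stop once the overall budget would be exceeded
        parts.append(line)
        used += cost
    return "\n".join(parts)


sample_cluster = [
    "NASA launches new probe to study Jupiter's moons.",
    "SpaceX successfully lands its reusable rocket booster after launch.",
    "The search for habitable exoplanets continues with new telescope data.",
]
context = build_cluster_context(sample_cluster, per_doc_chars=40, max_chars=100)
print(context)
```

Budgeting in characters is only a rough proxy for tokens; a stricter limit would count tokens with a tokenizer instead.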
3.2.7 Content Recommendations
Content recommendation systems powered by embeddings represent a significant advancement in personalization technology. By analyzing semantic relationships, these systems can understand the nuanced meaning and context of content in ways that traditional keyword-based systems cannot.
Here's a detailed look at how embedding-based recommendations work:
- Content Analysis:
  - The system generates sophisticated embedding vectors for each piece of content in the database
  - These vectors capture nuanced characteristics like writing style, topic depth, and emotional tone
  - Advanced algorithms analyze patterns across multiple dimensions of content features
- User Preference Modeling:
  - The system tracks detailed interaction patterns including time spent, engagement level, and sharing behavior
  - Historical preferences are weighted and combined to create multi-dimensional user profiles
  - Both explicit feedback (ratings, likes) and implicit signals (scroll depth, repeat visits) are considered
- Contextual Understanding:
  - Real-time factors like device type and location are incorporated into the recommendation algorithm
  - The system identifies patterns in content consumption based on time of day and day of week
  - Current session behavior is analyzed to understand immediate user interests
- Dynamic Adaptation:
  - Machine learning models continuously refine user profiles based on new interactions
  - The system learns from both positive and negative feedback to improve accuracy
  - Recommendation strategies are automatically adjusted based on performance metrics
This sophisticated approach enables recommendation engines to deliver highly personalized experiences through several key capabilities:
- Identify content similarities that might not be apparent through traditional metadata
  - Detects thematic connections between items even when they use different terminology
  - Recognizes similar writing styles, tone, and complexity levels across content
- Understand the progression of user interests over time
  - Tracks how preferences evolve from basic to advanced topics
  - Identifies shifts in user interests across different categories
- Make cross-domain recommendations (e.g., suggesting articles based on watched videos)
  - Connects content across different media types based on semantic relationships
  - Leverages learning from one domain to enhance recommendations in another
- Account for seasonal trends and temporal relevance
  - Adjusts recommendations based on time-sensitive factors like holidays or events
  - Considers current trends and their impact on user interests
The result is a highly personalized experience that can suggest truly relevant videos, articles, or products that match users' interests, both current and evolving. This goes far beyond simple "users who liked X also liked Y" algorithms, creating a more engaging and valuable user experience.
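One common way to turn the "weighted historical preferences" idea above into something concrete is to average the embeddings of items a user interacted with, weighted by engagement, and rank unseen items against that profile vector. The sketch below is an assumption-laden illustration: the 3-dimensional vectors are hand-made stand-ins for real embeddings, and the names build_user_profile and recommend, as well as the interaction weights, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_user_profile(interactions, item_vectors):
    """Weighted average of the vectors of items the user interacted with."""
    dim = len(next(iter(item_vectors.values())))
    profile = [0.0] * dim
    total_weight = 0.0
    for item_id, weight in interactions:
        vec = item_vectors[item_id]
        for i in range(dim):
            profile[i] += weight * vec[i]
        total_weight += weight
    if total_weight == 0:
        return profile
    return [v / total_weight for v in profile]


def recommend(profile, item_vectors, seen, top_n=2):
    """Rank items the user has not seen by similarity to the profile."""
    scored = [
        (item_id, cosine_similarity(profile, vec))
        for item_id, vec in item_vectors.items()
        if item_id not in seen
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors (assumed axes: tech, food, travel), not real embeddings
item_vectors = {
    "ai_healthcare": [0.9, 0.1, 0.0],
    "neural_networks": [1.0, 0.0, 0.0],
    "french_pastry": [0.0, 1.0, 0.1],
    "swiss_alps": [0.0, 0.1, 1.0],
    "ai_ethics": [0.8, 0.0, 0.1],
}
# (item_id, weight): a full read counts more than a brief visit
interactions = [("ai_healthcare", 1.0), ("french_pastry", 0.2)]
profile = build_user_profile(interactions, item_vectors)
recs = recommend(profile, item_vectors, seen={"ai_healthcare", "french_pastry"})
for item_id, score in recs:
    print(f"{item_id}: {score:.3f}")
```

With real embeddings the same averaging applies unchanged; how interaction signals map to weights is a design decision that needs tuning against actual engagement data.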
Example:
Here's a code example that demonstrates the core concept of content recommendations using embeddings.
This script focuses on finding semantically similar content items based on their embeddings, which is the foundation for the more advanced recommendation features described above.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation
import datetime

# --- Configuration ---
load_dotenv()

# Get the current date and location context
current_timestamp = "2024-11-30 15:52:00 CDT"
current_location = "Orlando, Florida, United States"
print(f"Running Content Recommendation example at: {current_timestamp}")
print(f"Location Context: {current_location}")

# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()

# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"

# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None

# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return np.clip(similarity, -1.0, 1.0)  # Ensure value is within valid range

# --- Content Recommendation Implementation ---

# 1. Define your Content Catalog (e.g., articles, blog posts)
# In a real application, this would come from a database or CMS.
content_catalog = [
    {"id": "art001", "title": "Introduction to Quantum Computing", "content": "Exploring the basics of qubits, superposition, and entanglement in quantum mechanics and their potential for computation."},
    {"id": "art002", "title": "Healthy Mediterranean Diet Recipes", "content": "Delicious and easy recipes focusing on fresh vegetables, olive oil, fish, and whole grains for a heart-healthy lifestyle."},
    {"id": "art003", "title": "The Future of Artificial Intelligence in Healthcare", "content": "How AI and machine learning are transforming diagnostics, drug discovery, and personalized medicine."},
    {"id": "art004", "title": "Beginner's Guide to Python Programming", "content": "Learn the fundamentals of Python syntax, data types, control flow, and functions to start coding."},
    {"id": "art005", "title": "Understanding Neural Networks and Deep Learning", "content": "An overview of artificial neural networks, backpropagation, and the concepts behind deep learning models."},
    {"id": "art006", "title": "Travel Guide: Hiking the Swiss Alps", "content": "Tips for planning your trip, recommended trails, essential gear, and stunning viewpoints in the Swiss Alps."},
    {"id": "art007", "title": "Mastering the Art of French Pastry", "content": "Techniques for creating classic French desserts like croissants, macarons, and éclairs."},
    {"id": "art008", "title": "Ethical Considerations in AI Development", "content": "Discussing bias, fairness, transparency, and accountability in the development and deployment of artificial intelligence systems."}
]
print(f"\nContent catalog contains {len(content_catalog)} items.")

# 2. Generate embeddings for all content items (pre-computation)
print("\nGenerating embeddings for the content catalog...")
content_embeddings_data = []
for item in content_catalog:
    # Use title and content for embedding
    text_to_embed = f"Title: {item['title']}\nContent: {item['content']}"
    embedding = get_embedding(client, text_to_embed)
    if embedding:
        # Store item ID and its embedding
        content_embeddings_data.append({"id": item["id"], "embedding": embedding})
    else:
        print(f"Skipping item {item['id']} due to embedding error.")
if not content_embeddings_data:
    print("\nError: No embeddings were generated. Cannot provide recommendations.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(content_embeddings_data)} content items.")

# 3. Select a target item (e.g., an article the user just read)
target_item_id = "art003"  # User read "The Future of Artificial Intelligence in Healthcare"
print(f"\nFinding content similar to item ID: {target_item_id}")
# Find the embedding for the target item
target_embedding = None
for item_data in content_embeddings_data:
    if item_data["id"] == target_item_id:
        target_embedding = item_data["embedding"]
        break
if target_embedding is None:
    print(f"Error: Could not find the embedding for the target item ID '{target_item_id}'.")
    exit()

# 4. Calculate similarity between the target item and all other items
recommendations = []
print("\nCalculating similarities...")
for item_data in content_embeddings_data:
    # Don't recommend the item itself
    if item_data["id"] == target_item_id:
        continue
    similarity = cosine_similarity(target_embedding, item_data["embedding"])
    recommendations.append({"id": item_data["id"], "score": similarity})

# 5. Sort potential recommendations by similarity score
recommendations.sort(key=lambda x: x["score"], reverse=True)

# 6. Display top N recommendations
print("\n--- Top Content Recommendations ---")
# Find the original title for the target item for context
target_item_info = next((item for item in content_catalog if item["id"] == target_item_id), None)
if target_item_info:
    print(f"Because you read: \"{target_item_info['title']}\"\n")
if not recommendations:
    print("No recommendations found (or error calculating similarities).")
else:
    top_n = 3
    print(f"Top {top_n} recommended items:")
    for i, rec in enumerate(recommendations[:top_n]):
        # Find the full item details from the original catalog
        rec_details = next((item for item in content_catalog if item["id"] == rec["id"]), None)
        if rec_details:
            print(f"{i+1}. ID: {rec['id']}, Similarity Score: {rec['score']:.4f}")
            print(f"   Title: {rec_details['title']}")
            print(f"   Content Snippet: {rec_details['content'][:100]}...")  # Truncate content
            print("-" * 10)
        else:
            print(f"{i+1}. ID: {rec['id']}, Score: {rec['score']:.4f} (Details not found)")
            print("-" * 10)
    if len(recommendations) > top_n:
        print(f"(Showing top {top_n} of {len(recommendations)} potential recommendations)")
print("\nNote: This demonstrates basic content-to-content similarity.")
print("Advanced systems incorporate user profiles, interaction history, context, etc.")
Code Breakdown Explanation
This script demonstrates a fundamental approach to content recommendation using OpenAI embeddings, focusing on finding items semantically similar to a target item.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions.
- Content Catalog:
  - A list of dictionaries (content_catalog) simulates the available content (e.g., articles). Each item has an id, title, and content.
- Content Embedding Generation (Pre-computation):
  - The script iterates through each item in the content_catalog.
  - Combined Text: It creates a combined text string from the item's title and content to generate a richer embedding that captures more semantic detail.
  - It calls get_embedding for this combined text.
  - It stores the item['id'] and its embedding vector in content_embeddings_data. This pre-computation is vital for efficiency.
- Target Item Selection:
  - A target_item_id is chosen (e.g., art003), simulating an item the user has interacted with (e.g., read).
  - The script retrieves the pre-computed embedding for this target item.
- Similarity Calculation:
  - It iterates through all other items in content_embeddings_data.
  - It calculates the cosine_similarity between the target_embedding and each other item's embedding.
  - It stores the other item's id and its similarity score in the recommendations list.
- Ranking Recommendations:
  - The recommendations list is sorted by score in descending order, placing the most semantically similar content items first.
- Displaying Results:
  - The script prints the title of the target item for context ("Because you read...").
  - It displays the top N (e.g., 3) recommended items, showing their ID, similarity score, title, and a snippet of their content.
  - Contextual Note: The final print statements explicitly mention that this example shows basic content-to-content similarity. Advanced recommendation systems, as described in the section text, would integrate user profiles (embeddings based on interaction history), real-time context (time, location), explicit feedback, and potentially more complex algorithms beyond simple cosine similarity. However, the core principle of using embeddings to measure semantic relatedness remains fundamental.
This example effectively illustrates how embeddings enable recommendations based on understanding the meaning of content, allowing suggestions that go beyond simple keyword or category matching.
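Because the breakdown stresses pre-computing embeddings, here is a minimal sketch of one way to persist them so that unchanged content is never re-embedded. It assumes a JSON file on local disk keyed by a hash of the model name and text; the helper names (text_key, load_cache, save_cache, get_embedding_cached) are hypothetical, and fake_embed is a stand-in for a real embedding call such as the get_embedding helper.

```python
import hashlib
import json
import os
import tempfile


def text_key(text, model):
    """Stable cache key derived from the model name and the exact text."""
    return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()


def load_cache(path):
    """Load the cache dictionary from disk, or start empty if none exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}


def save_cache(path, cache):
    """Write the cache dictionary to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cache, f)


def get_embedding_cached(text, model, cache, embed_fn):
    """Return a cached vector if present; otherwise call embed_fn and store it."""
    key = text_key(text, model)
    if key in cache:
        return cache[key]
    vector = embed_fn(text)
    if vector is None:
        return None  # Do not cache failed calls
    cache[key] = vector
    return vector


# Stand-in for a real embedding call (assumption for this sketch)
calls = []

def fake_embed(text):
    calls.append(text)
    return [float(len(text)), 1.0]


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings_cache.json")
cache = load_cache(cache_path)
first = get_embedding_cached("Introduction to Quantum Computing", "demo-model", cache, fake_embed)
second = get_embedding_cached("Introduction to Quantum Computing", "demo-model", cache, fake_embed)
save_cache(cache_path, cache)
reloaded = load_cache(cache_path)
print(f"Embedding calls made: {len(calls)}")
```

Including the model name in the key matters because vectors from different embedding models are not comparable. For larger catalogs a database or vector store would usually replace the JSON file.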
3.2.8 Email Triage / Prioritization
Embedding technology enables sophisticated email analysis and categorization by understanding the semantic meaning of messages. This advanced system employs multiple layers of analysis to streamline email management:
- Urgency Detection
  - Identify time-sensitive matters requiring immediate attention through natural language processing
  - Recognize urgent language patterns and contextual cues by analyzing word choice, sentence structure, and historical patterns
  - Flag critical emails based on sender importance, keywords, and organizational hierarchy
- Smart Categorization
  - Group related email threads and conversations using semantic similarity matching
  - Sort messages by project, department, or business function through content analysis
  - Create dynamic folders based on emerging topics and trends
  - Apply machine learning to improve categorization accuracy over time
- Intent Classification
  - Distinguish between requests, updates, and FYI messages using advanced natural language understanding
  - Prioritize action items and delegate tasks automatically based on content and context
  - Identify follow-up requirements and set automated reminders
  - Extract key deadlines and commitments from message content
By leveraging semantic understanding, the system creates an intelligent email processing pipeline that can handle hundreds of messages simultaneously. The embedding-based analysis examines not just keywords, but the actual meaning and context of each message, considering factors such as:
- Message context within ongoing conversations
- Historical patterns of communication
- Organizational relationships and hierarchies
- Project timelines and priorities
This comprehensive approach significantly reduces the cognitive load of email management by automatically handling routine classification and prioritization tasks. The system ensures that important messages receive immediate attention while maintaining an organized structure for all communications. As a result, professionals can focus on high-value activities instead of spending hours manually sorting through their inbox, leading to improved productivity and faster response times for critical communications.
Example:
This script simulates categorizing incoming emails based on their semantic similarity to predefined categories such as "Urgent Action Required," "Project Update / Status," "Question / Request for Info," and "General Info / FYI."
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation
import datetime

# --- Configuration ---
load_dotenv()

# Get the current date and location context
current_timestamp = "2024-10-31 15:54:00 CDT"
current_location = "Plano, Texas, United States"
print(f"Running Email Triage/Prioritization example at: {current_timestamp}")
print(f"Location Context: {current_location}")

# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()

# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"

# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None

# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return np.clip(similarity, -1.0, 1.0)  # Ensure value is within valid range

# --- Email Triage/Prioritization Implementation ---

# 1. Define Sample Emails (Subject + Snippet)
emails = [
    {"id": "email01", "subject": "Urgent: Server Down!", "body_snippet": "The main production server seems to be unresponsive. We need immediate assistance to investigate and bring it back online."},
    {"id": "email02", "subject": "Meeting Minutes - Project Phoenix Sync", "body_snippet": "Attached are the minutes from today's sync call. Key decisions included finalizing the Q3 roadmap. Action items assigned."},
    {"id": "email03", "subject": "Quick Question about Report", "body_snippet": "Hi team, just had a quick question regarding the methodology used in the latest market analysis report. Can someone clarify?"},
    {"id": "email04", "subject": "Fwd: Company Newsletter - April Edition", "body_snippet": "Sharing the latest company newsletter for your information."},
    {"id": "email05", "subject": "Action Required: Submit Timesheet by EOD", "body_snippet": "Friendly reminder to please submit your weekly timesheet by the end of the day today. This is mandatory."},
    {"id": "email06", "subject": "Update on Q2 Marketing Campaign", "body_snippet": "Just wanted to provide a brief update on the campaign performance metrics we discussed last week. See attached summary."},
    {"id": "email07", "subject": "Can you approve this request ASAP?", "body_snippet": "Need your approval on the attached budget request urgently to proceed with the vendor contract."}
]
print(f"\nProcessing {len(emails)} emails.")

# 2. Define Categories/Priorities and their Semantic Representations
# We represent each category with a descriptive phrase.
categories = {
    "Urgent Action Required": "Requires immediate attention, critical issue, deadline, ASAP request, mandatory task.",
    "Project Update / Status": "Information about ongoing projects, progress reports, meeting minutes, status updates.",
    "Question / Request for Info": "Asking for clarification, seeking information, query about details.",
    "General Info / FYI": "Newsletter, announcement, sharing information, non-actionable update."
}
print(f"\nDefined categories: {list(categories.keys())}")

# 3. Generate embeddings for Categories (pre-computation recommended)
print("\nGenerating embeddings for categories...")
category_embeddings = {}
for category_name, category_description in categories.items():
    embedding = get_embedding(client, category_description)
    if embedding:
        category_embeddings[category_name] = embedding
    else:
        print(f"Skipping category '{category_name}' due to embedding error.")
if not category_embeddings:
    print("\nError: No embeddings generated for categories. Cannot triage emails.")
    exit()

# 4. Process Each Email: Generate Embedding and Find Best Category
print("\nTriaging emails...")
email_results = []
for email in emails:
    # Combine subject and body for better context
    email_content = f"Subject: {email['subject']}\nBody: {email['body_snippet']}"
    email_embedding = get_embedding(client, email_content)
    if not email_embedding:
        print(f"Skipping email {email['id']} due to embedding error.")
        continue
    # Find the category with the highest similarity
    best_category = None
    max_similarity = -1  # Cosine similarity ranges from -1 to 1
    for category_name, category_embedding in category_embeddings.items():
        similarity = cosine_similarity(email_embedding, category_embedding)
        print(f"  Email {email['id']} vs Category '{category_name}': Score {similarity:.4f}")
        if similarity > max_similarity:
            max_similarity = similarity
            best_category = category_name
    email_results.append({
        "id": email["id"],
        "subject": email["subject"],
        "assigned_category": best_category,
        "score": max_similarity
    })
    print(f"-> Email {email['id']} assigned to: '{best_category}' (Score: {max_similarity:.4f})")

# 5. Display Triage Results
print("\n--- Email Triage Results ---")
if not email_results:
    print("No emails were successfully triaged.")
else:
    # Optional: Group by category for display
    results_by_category = {cat: [] for cat in categories.keys()}
    for result in email_results:
        if result["assigned_category"]:  # Check if category was assigned
            results_by_category[result["assigned_category"]].append(result)
    for category_name, items in results_by_category.items():
        print(f"\nCategory: {category_name}")
        print("-" * (10 + len(category_name)))
        if not items:
            print("  (No emails assigned)")
        else:
            # Sort items within category by score if desired
            items.sort(key=lambda x: x['score'], reverse=True)
            for item in items:
                print(f"  - ID: {item['id']}, Subject: \"{item['subject']}\" (Score: {item['score']:.3f})")
print("\nEmail triage process complete.")
Code Breakdown Explanation
This example shows how OpenAI embeddings can automatically sort and prioritize emails by understanding their meaning, demonstrating an intelligent email management system.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions.
- Sample Email Data:
  - A list of dictionaries (emails) simulates incoming messages. Each email has an id, subject, and a body_snippet.
- Category Definitions:
  - A dictionary (categories) defines the target categories for triage (e.g., "Urgent Action Required", "Project Update / Status").
  - Key Idea: Each category is represented by a descriptive phrase or list of keywords that captures its semantic essence. This description is what will be embedded.
- Category Embedding Generation:
  - The script iterates through the defined categories.
  - It calls get_embedding on the description associated with each category name.
  - The resulting embedding vector for each category is stored in the category_embeddings dictionary. This step would typically be pre-computed and stored.
- Email Processing Loop:
  - The script iterates through each email in the sample data.
  - Content Combination: It combines the subject and body_snippet into a single email_content string to provide richer context for the embedding.
  - Email Embedding: It calls get_embedding to get the vector representation of the current email's content.
  - Similarity Calculation:
    - It then iterates through the pre-computed category_embeddings.
    - For each category, it calculates the cosine_similarity between the email_embedding and the category_embedding.
    - It keeps track of the best_category (the one with the highest similarity score found so far) and the corresponding max_similarity score.
  - Assignment: After comparing the email to all categories, the email is assigned the best_category found. The result (email ID, subject, assigned category, score) is stored.
- Displaying Triage Results:
  - The script prints the final assignments.
  - Optional Grouping: It includes logic to group the results by the assigned category for a clearer presentation, showing which emails fell into the "Urgent," "Update," etc., buckets.
This example effectively demonstrates how embeddings allow for intelligent categorization based on meaning. An email asking for "approval ASAP" can be correctly identified as "Urgent Action Required" even without using the exact word "urgent," because its embedding will be semantically close to the embedding of the "Urgent Action Required" category description. This is far more robust than simple keyword filtering.
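One limitation of picking the highest-scoring category is that every email gets a label, even when no category is a good match or two categories are nearly tied. A possible refinement, sketched below under stated assumptions, routes such cases to a fallback bucket. The function name assign_category, the fallback label, and the threshold values are hypothetical; suitable thresholds depend on the embedding model and should be tuned on labeled examples.

```python
def assign_category(scores, min_score=0.30, min_margin=0.02, fallback="Needs Review"):
    """Pick the best category unless the match is weak or ambiguous.

    scores: dict mapping category name -> cosine similarity.
    min_score and min_margin are illustrative values, not recommendations.
    """
    if not scores:
        return fallback, 0.0
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    too_weak = best_score < min_score
    too_close = (best_score - runner_up) < min_margin
    if too_weak or too_close:
        return fallback, best_score
    return best_name, best_score


# Made-up similarity scores for three emails
clear_case = {"Urgent Action Required": 0.55, "General Info / FYI": 0.30}
weak_case = {"Urgent Action Required": 0.21, "General Info / FYI": 0.19}
tied_case = {"Urgent Action Required": 0.410, "Project Update / Status": 0.405}

for name, scores in [("clear", clear_case), ("weak", weak_case), ("tied", tied_case)]:
    category, score = assign_category(scores)
    print(f"{name}: {category} ({score:.3f})")
```

Emails landing in the fallback bucket could be left for manual sorting or passed to a chat model for a second opinion.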
3.2 When to Use Embeddings
Embeddings have revolutionized how we process and understand textual information in modern AI applications. While traditional text processing methods rely on exact matches or basic keyword searching, embeddings provide a sophisticated way to capture the nuanced meanings and relationships between pieces of text. By converting words and phrases into high-dimensional numerical vectors, embeddings enable machines to understand semantic relationships and similarities in ways that more closely mirror human understanding.
Let's explore the key scenarios where embeddings prove particularly valuable, showcasing how this technology transforms various aspects of information processing and retrieval. Understanding these use cases is crucial for developers and organizations looking to leverage the full potential of embedding technology in their applications.
3.2.1 Semantic search
Semantic search finds relevant information based on meaning rather than just keywords, enabling more intelligent search results. Unlike traditional keyword-based search that matches exact words or phrases, semantic search understands the intent and contextual meaning of a query by analyzing the underlying relationships between words and concepts. This approach allows the system to handle variations in language, context, and even user intent.
For example, a search for "natural language processing" would also return relevant results about "NLP," "computational linguistics," or "text analysis." When a user searches for "treating common cold symptoms," the system would understand and return results about "flu remedies," "reducing fever," and "cough medicine" - even if these exact phrases aren't used. This technology leverages embedding vectors to calculate similarity scores between queries and documents, transforming each piece of text into a high-dimensional numerical representation that captures its semantic meaning. This mathematical approach enables more nuanced and accurate search results that account for:
- Synonyms and related terms (like "car" and "automobile")
- Conceptual relationships (connecting "python" to both programming and snakes, depending on context)
- Multiple languages (finding relevant content even when written in different languages)
- Contextual variations (understanding that "apple" could refer to either the fruit or the technology company)
- Intent matching (recognizing that "how to fix a flat tire" and "tire repair instructions" are seeking the same information)
Example:
Here is a code example demonstrating semantic search using OpenAI embeddings.
This script will:
- Define a small set of documents.
- Generate embeddings for these documents and a search query.
- Calculate the similarity between the query and each document.
- Rank the documents by relevance based on semantic similarity.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation
# --- Configuration ---
load_dotenv()
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings example)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print(f"Generating embedding for: \"{text[:50]}...\"")  # Print truncated text
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        print("Embedding generation successful.")
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{text[:50]}...': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{text[:50]}...': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings example)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        # print("Error: Cannot calculate similarity with None vectors.")
        return 0.0  # Return 0 if any vector is missing
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        # print("Warning: One or both vectors have zero magnitude.")
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return similarity
# --- Semantic Search Implementation ---
# 1. Define your document store (a list of text strings)
# In a real application, this could come from a database, files, etc.
document_store = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
    "Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy.",
    "Artificial intelligence research focuses on creating systems capable of performing tasks that typically require human intelligence.",
    "A recipe for classic French onion soup involves caramelizing onions and topping with bread and cheese.",
    "Machine learning, a subset of AI, involves algorithms that allow systems to learn from data.",
    "The Louvre Museum in Paris is the world's largest art museum and a historic monument.",
    "Natural Language Processing (NLP) enables computers to understand and process human language.",
    "Baking bread requires careful measurement of ingredients like flour, water, yeast, and salt."
]
print(f"\nDocument store contains {len(document_store)} documents.")
# 2. Generate embeddings for all documents in the store (pre-computation)
# In a real app, you'd store these embeddings alongside the documents.
print("\nGenerating embeddings for the document store...")
document_embeddings = []
for doc in document_store:
    embedding = get_embedding(client, doc)
    # Store the document text and its embedding together
    if embedding:  # Only store if embedding was successful
        document_embeddings.append({"text": doc, "embedding": embedding})
    else:
        print(f"Skipping document due to embedding error: \"{doc[:50]}...\"")
print(f"\nSuccessfully generated embeddings for {len(document_embeddings)} documents.")
# 3. Define the user's search query
search_query = "What is AI?"
# search_query = "Things to see in Paris"
# search_query = "How does NLP work?"
# search_query = "Cooking instructions"
print(f"\nSearch Query: \"{search_query}\"")
# 4. Generate embedding for the search query
print("\nGenerating embedding for the search query...")
query_embedding = get_embedding(client, search_query)
# 5. Calculate similarity and rank documents
search_results = []
if query_embedding and document_embeddings:
    print("\nCalculating similarities...")
    for doc_data in document_embeddings:
        similarity = cosine_similarity(query_embedding, doc_data["embedding"])
        search_results.append({"text": doc_data["text"], "score": similarity})
    # Sort results by similarity score in descending order
    search_results.sort(key=lambda x: x["score"], reverse=True)
    # 6. Display results
    print("\n--- Semantic Search Results ---")
    print(f"Top results for query: \"{search_query}\"\n")
    if not search_results:
        print("No results found (or error calculating similarities).")
    else:
        # Display top N results (e.g., top 3)
        top_n = 3
        for i, result in enumerate(search_results[:top_n]):
            print(f"{i+1}. Score: {result['score']:.4f}")
            print(f"   Text: {result['text']}")
            print("-" * 10)
        if len(search_results) > top_n:
            print(f"(Showing top {top_n} of {len(search_results)} results)")
else:
    print("\nCould not perform search.")
    if not query_embedding:
        print("Reason: Failed to generate embedding for the search query.")
    if not document_embeddings:
        print("Reason: No document embeddings were successfully generated.")
Code Breakdown Explanation:
- Setup & Helpers: Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions from the previous example.
- Document Store: A simple Python list (document_store) holds the text content of the documents we want to search through. In a real application, this data would likely come from a database or file system.
- Document Embedding Generation:
  - The script iterates through each document in the document_store.
  - It calls get_embedding for each document to get its numerical representation.
  - It stores the original document text and its corresponding embedding vector together (e.g., in a list of dictionaries). This pre-computation step is crucial for efficiency in real systems: you generate document embeddings once and store them. Error handling ensures documents are skipped if embedding fails.
- Search Query: A sample search_query string is defined.
- Query Embedding Generation: The get_embedding function is called again, this time for the search_query.
- Similarity Calculation & Ranking:
  - It checks if both the query embedding and document embeddings were successfully generated.
  - It iterates through the stored document_embeddings.
  - For each document, it calculates the cosine_similarity between the query_embedding and the document's embedding.
  - The document text and its calculated similarity score are stored in a search_results list.
  - Finally, search_results.sort(...) arranges the list based on the score in descending order (highest similarity first).
- Display Results: The script prints the top N (e.g., 3) most relevant documents from the sorted list, showing their similarity score and text content.
This example clearly illustrates the core concept of semantic search: converting both documents and queries into embeddings and then using vector similarity (like cosine similarity) to find documents that are semantically related to the query, even if they don't share the exact keywords.
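The breakdown notes that document embeddings should be generated once and stored. As a rough sketch of that idea, the snippet below saves a matrix of embeddings to disk, reloads it, and ranks all documents with a single matrix-vector product instead of a Python loop. The file name, the helper names (save_index, load_index, top_k), and the tiny four-dimensional vectors are assumptions for illustration; in practice the vectors would come from get_embedding, and a larger system would more likely use a dedicated vector database.

```python
import os
import tempfile
import numpy as np

def save_index(path, texts, vectors):
    """Persist texts and L2-normalized vectors in one .npz file.

    Assumes every vector is non-zero (true for real embeddings).
    """
    matrix = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    matrix = matrix / norms  # unit-length rows: dot product equals cosine similarity
    np.savez(path, texts=np.array(texts), matrix=matrix)

def load_index(path):
    """Load the texts and the normalized embedding matrix."""
    with np.load(path) as data:
        return [str(t) for t in data["texts"]], data["matrix"]

def top_k(query_vector, matrix, k=3):
    """Return (row_index, score) pairs for the k most similar rows."""
    q = np.asarray(query_vector, dtype=float)
    q = q / np.linalg.norm(q)
    scores = matrix @ q  # one cosine score per stored document
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy data; the vectors are invented, not real model output.
texts = ["AI research", "Paris landmarks", "Bread baking"]
vectors = [[0.9, 0.1, 0.0, 0.1], [0.1, 0.9, 0.1, 0.0], [0.0, 0.1, 0.9, 0.2]]

index_path = os.path.join(tempfile.mkdtemp(), "doc_index.npz")
save_index(index_path, texts, vectors)

loaded_texts, loaded_matrix = load_index(index_path)
results = top_k([0.8, 0.2, 0.1, 0.0], loaded_matrix, k=2)
for idx, score in results:
    print(f"{score:.3f}  {loaded_texts[idx]}")
```

Normalizing once at save time means each query needs only one normalization and one matrix product, which is usually much faster than calling a similarity function per document as the collection grows.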
3.2.2 Topic clustering
Topic clustering is a sophisticated technique for organizing and analyzing large document collections by automatically grouping them based on their semantic content. This advanced application of embeddings transforms the way we process and understand large-scale document collections, offering a powerful solution for content organization. The system works by converting each document into a high-dimensional embedding vector that captures its meaning, then using clustering algorithms to group similar vectors together.
This powerful application of embeddings empowers systems to:
- Identify thematic patterns across thousands of documents without manual labeling - the system can automatically detect common topics and themes across vast document collections, saving countless hours of manual categorization work
- Group similar discussions, articles, or content pieces into intuitive categories - by understanding the semantic relationships between documents, the system can create meaningful groupings that reflect natural topic divisions, even when documents use different terminology to discuss the same concepts
- Discover emerging topics and trends within large document collections - as new content is added, the system can identify new thematic clusters forming, helping organizations stay ahead of developing trends in their field
- Create dynamic content hierarchies that adapt as new documents are added - unlike traditional static categorization systems, embedding-based clustering can automatically reorganize and refine category structures as the content collection grows and evolves
For example, a news organization could use topic clustering to automatically group thousands of articles into categories like "Technology", "Politics", or "Sports", even when these topics aren't explicitly tagged. The embeddings capture the semantic relationships between articles by representing the meaning and context of the content, not just keywords. This enables grouping that reflects subtle distinctions: for instance, an article about the economic impact of sports stadiums will typically sit close to both "Sports" and "Business" content (a hard-assignment algorithm such as K-Means places it in only one cluster, but its distances to the other cluster centers reveal the overlap), and articles about different programming languages tend to land in the same "Technology" cluster despite using very different terminology.
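The stadium example hints at a practical detail: overlap between topics has to be read from how close a document sits to each cluster center. The sketch below is one way to surface that, under stated assumptions: the centroid vectors, the document vector, and the helper name topic_affinities are all hypothetical, and the softmax temperature is an arbitrary choice rather than a recommended value.

```python
import numpy as np

def topic_affinities(doc_vector, centroids, temperature=0.1):
    """Turn cosine similarities to each centroid into weights that sum to 1."""
    doc = np.asarray(doc_vector, dtype=float)
    doc = doc / np.linalg.norm(doc)
    names = list(centroids)
    matrix = np.array([centroids[n] for n in names], dtype=float)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = matrix @ doc
    # Softmax with a temperature; lower values sharpen the distribution.
    logits = sims / temperature
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()
    return dict(zip(names, (float(w) for w in weights)))

# Hypothetical centroids in a toy 3-dimensional space.
centroids = {
    "Sports":     [1.0, 0.0, 0.0],
    "Business":   [0.0, 1.0, 0.0],
    "Technology": [0.0, 0.0, 1.0],
}
# A made-up vector for an article about the economics of sports stadiums.
stadium_article = [0.7, 0.7, 0.1]

affinities = topic_affinities(stadium_article, centroids)
for topic, weight in sorted(affinities.items(), key=lambda kv: -kv[1]):
    print(f"{topic}: {weight:.2f}")
```

With these toy numbers the article splits its weight almost evenly between "Sports" and "Business", which a single hard cluster label would hide.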
Example:
Below is a code example that demonstrates topic clustering using OpenAI embeddings and the K-Means algorithm from scikit-learn.
This code will:
- Define a list of sample documents covering different implicit topics.
- Generate embeddings for each document using OpenAI's API.
- Apply the K-Means clustering algorithm to group the embedding vectors.
- Display the documents belonging to each identified cluster.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np
from sklearn.cluster import KMeans  # For clustering algorithm
# --- Configuration ---
load_dotenv()
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    # Truncate text for printing if it's too long
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        # print("Embedding generation successful.") # Reduce verbosity
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Topic Clustering Implementation ---
# 1. Define your collection of documents
# These documents cover roughly 3 topics: AI/Tech, Travel/Geography, Food/Cooking
documents = [
    "Artificial intelligence research focuses on creating systems capable of performing tasks that typically require human intelligence.",
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
    "A recipe for classic French onion soup involves caramelizing onions and topping with bread and cheese.",
    "Machine learning, a subset of AI, involves algorithms that allow systems to learn from data.",
    "The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials.",
    "Natural Language Processing (NLP) enables computers to understand and process human language.",
    "Baking bread requires careful measurement of ingredients like flour, water, yeast, and salt.",
    "The Colosseum in Rome, Italy, is an oval amphitheatre in the centre of the city.",
    "Deep learning utilizes artificial neural networks with multiple layers to model complex patterns.",
    "Sushi is a traditional Japanese dish of prepared vinegared rice, usually with some sugar and salt, accompanying a variety of ingredients, such as seafood, often raw, and vegetables."
]
print(f"\nDocument collection contains {len(documents)} documents.")
# 2. Generate embeddings for all documents
print("\nGenerating embeddings for the document collection...")
embeddings = []
valid_documents = []  # Keep track of documents for which embedding was successful
for doc in documents:
    embedding = get_embedding(client, doc)
    if embedding:
        embeddings.append(embedding)
        valid_documents.append(doc)  # Add corresponding document text
    else:
        print(f"Skipping document due to embedding error: \"{doc[:70]}...\"")
if not embeddings:
    print("\nError: No embeddings were generated. Cannot perform clustering.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(valid_documents)} documents.")
# Convert embeddings list to a NumPy array for scikit-learn
embedding_matrix = np.array(embeddings)
# 3. Apply Clustering Algorithm (K-Means)
# We need to choose the number of clusters (k). Let's assume we expect 3 topics.
# In real applications, determining the optimal 'k' often requires experimentation
# (e.g., using the elbow method or silhouette scores).
n_clusters = 3
print(f"\nApplying K-Means clustering with k={n_clusters}...")
try:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)  # n_init suppresses warning
    kmeans.fit(embedding_matrix)
    cluster_labels = kmeans.labels_
    print("Clustering complete.")
except Exception as e:
    print(f"An error occurred during clustering: {e}")
    exit()
# 4. Display Documents by Cluster
print(f"\n--- Documents Grouped by Topic Cluster (k={n_clusters}) ---")
# Create a dictionary to hold documents for each cluster
clustered_documents = {i: [] for i in range(n_clusters)}
# Assign each document (that had a valid embedding) to its cluster
for i, label in enumerate(cluster_labels):
    clustered_documents[label].append(valid_documents[i])
# Print the contents of each cluster
for cluster_id, docs_in_cluster in clustered_documents.items():
    print(f"\nCluster {cluster_id + 1}:")
    if not docs_in_cluster:
        print("  (No documents in this cluster)")
    else:
        for doc_text in docs_in_cluster:
            # Print truncated document text for readability
            print_text = doc_text[:100] + "..." if len(doc_text) > 100 else doc_text
            print(f"  - {print_text}")
    print("-" * 20)
print("\nNote: The quality of clustering depends on the data, the embedding model,")
print("and the chosen number of clusters (k). Cluster numbers are arbitrary.")
Code Breakdown Explanation:
- Setup & Helpers:
  - Includes standard imports plus KMeans from sklearn.cluster.
  - Initializes the OpenAI client.
  - Includes the get_embedding helper function (same as before).
- Document Collection: A list named documents holds the text content. The sample documents are chosen to represent a few distinct underlying topics (AI/Tech, Travel/Geography, Food/Cooking).
- Embedding Generation:
  - The script iterates through the documents.
  - It calls get_embedding for each document.
  - It stores the successful embeddings in the embeddings list and the corresponding document text in valid_documents. This ensures that the indices match later.
  - Error handling skips documents if embedding generation fails.
  - The list of embedding vectors is converted into a NumPy array (embedding_matrix), which is the standard input format for scikit-learn algorithms.
- Clustering (K-Means):
  - Choosing k: The number of clusters (n_clusters) is set (here, k=3, assuming we expect three topics based on the sample data). A comment highlights that finding the optimal k is often a separate task in real-world scenarios.
  - Initialization: A KMeans object is created. n_clusters specifies the desired number of groups. random_state ensures reproducibility. n_init=10 runs the algorithm multiple times with different starting centroids and chooses the best result (suppresses a future warning).
  - Fitting: kmeans.fit(embedding_matrix) performs the K-Means clustering algorithm on the document embeddings. It finds cluster centers and assigns each embedding vector to the nearest center.
  - Labels: kmeans.labels_ contains an array where each element indicates the cluster ID (0, 1, 2, etc.) assigned to the corresponding document embedding.
- Displaying Results:
  - A dictionary (clustered_documents) is created to organize the results, with keys representing cluster IDs.
  - The script iterates through the cluster_labels assigned by K-Means. For each document's index i, it finds its assigned label and appends the corresponding text from valid_documents[i] to the list for that cluster ID in the dictionary.
  - Finally, it loops through the clustered_documents dictionary and prints the text of the documents belonging to each cluster, clearly grouping them by the topic cluster identified by the algorithm.
This example demonstrates the power of embeddings for unsupervised topic discovery. By converting text to vectors, we can use mathematical algorithms like K-Means to group semantically similar documents without needing pre-defined labels.
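The clustering script fixes k at 3 and notes in a comment that silhouette scores are one way to choose it. As a hedged sketch of that approach, the snippet below tries several values of k on synthetic vectors and keeps the one with the highest silhouette score. The synthetic data (three well-separated groups) and the helper name pick_k are assumptions for illustration; with real embeddings the scores are usually lower and the best k is less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(matrix, candidate_ks):
    """Return (best_k, scores), where scores maps each k to its silhouette score."""
    scores = {}
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(matrix)
        scores[k] = float(silhouette_score(matrix, labels))
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic stand-in for embeddings: three tight groups around distinct centers.
rng = np.random.default_rng(0)
centers = np.array([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]])
synthetic = np.vstack([c + rng.normal(scale=0.5, size=(20, 3)) for c in centers])

best_k, scores = pick_k(synthetic, candidate_ks=range(2, 6))
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Selected k={best_k}")
```

The silhouette score rewards clusters that are compact and far from their neighbors, so it tends to peak at the number of groups that are genuinely present in the data. It is a heuristic, and inspecting sample documents from each cluster remains a useful sanity check.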
3.2.3 Recommendation Systems
Recommendation systems suggest related items by understanding the deeper connections between different pieces of content. Embeddings support this by letting a system compare items, and optionally users, in a shared vector space, so that recommendations reflect semantic relationships rather than only explicit tags. The embedding vectors can capture patterns and similarities that are not obvious from surface attributes.
Here's how recommendation systems leverage embeddings:
- Content-Based Filtering
  - Systems analyze the actual content characteristics (like text descriptions, features, or attributes)
  - Each item is converted into an embedding vector that represents its key features
  - Similar items are found by measuring the distance between these vectors
- Collaborative Filtering Enhancement
  - User behaviors and preferences are also converted into embeddings
  - The system can identify patterns in user-item interactions
  - This helps predict which items a user might like based on similar users' preferences
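One simple way to picture user embeddings is to represent a user as the average of the embeddings of items they have liked, then rank unseen items against that profile vector. This is only a sketch under strong assumptions: the item IDs and vectors are invented, the helper names (build_user_profile, recommend_for_user) are hypothetical, and averaging is just one of several ways to build a user vector. It is also closer to a content-based user profile than to full collaborative filtering, which would draw on other users' interactions as well.

```python
import numpy as np

def build_user_profile(liked_ids, item_vectors):
    """Average the vectors of liked items and normalize the result."""
    liked = np.array([item_vectors[i] for i in liked_ids], dtype=float)
    profile = liked.mean(axis=0)
    return profile / np.linalg.norm(profile)

def recommend_for_user(profile, item_vectors, exclude, top_n=2):
    """Rank items the user has not interacted with by cosine similarity."""
    scored = []
    for item_id, vector in item_vectors.items():
        if item_id in exclude:
            continue
        v = np.asarray(vector, dtype=float)
        # profile is already unit length, so only v needs normalizing.
        score = float(np.dot(profile, v) / np.linalg.norm(v))
        scored.append((item_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Invented 3-dimensional item vectors (axes loosely: sci-fi, comedy, documentary).
item_vectors = {
    "scifi_a":   [0.9, 0.0, 0.2],
    "scifi_b":   [0.8, 0.1, 0.3],
    "comedy_a":  [0.0, 0.9, 0.1],
    "space_doc": [0.5, 0.0, 0.8],
    "comedy_b":  [0.1, 0.8, 0.0],
}

liked = ["scifi_a", "scifi_b"]
profile = build_user_profile(liked, item_vectors)
recommendations = recommend_for_user(profile, item_vectors, exclude=set(liked))
for item_id, score in recommendations:
    print(f"{item_id}: {score:.3f}")
```

Because the user liked two science fiction titles, the space documentary ranks above both comedies in this toy catalog.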
For example, a video streaming service can recommend shows not just based on genre tags, but by understanding thematic elements, storytelling styles, and narrative patterns. Provided these qualities are reflected in the text being embedded (for example, in synopses or reviews), the embedding vectors can capture nuanced features like:
- Pacing and plot complexity
- Character development styles
- Emotional tone and atmosphere
- Visual and directorial techniques
Similarly, e-commerce platforms can suggest products by understanding the contextual similarities in product descriptions, user behavior, and item characteristics. This includes analyzing:
- Product descriptions and features
- User browsing and purchase patterns
- Price points and quality levels
- Brand relationships and market positioning
This semantic understanding leads to more accurate and relevant recommendations compared to traditional methods that rely solely on explicit categories or user ratings. The system can identify subtle connections and patterns that might be missed by conventional recommendation approaches, resulting in more engaging and personalized user experiences.
Example:
The following code example demonstrates how OpenAI embeddings can be used to build a simple content-based recommendation system.
This script will:
- Define a small catalog of items (e.g., movie descriptions).
- Generate embeddings for these items.
- Choose a target item.
- Find other items in the catalog that are semantically similar to the target item based on their embeddings.
- Present the most similar items as recommendations.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation
# --- Configuration ---
load_dotenv()
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    # Truncate text for printing if it's too long
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        # print("Embedding generation successful.") # Reduce verbosity
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0  # Return 0 if any vector is missing
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return similarity
# --- Recommendation System Implementation ---
# 1. Define your item catalog (e.g., movie descriptions)
# In a real application, this would come from a database.
item_catalog = [
    {"id": "mov001", "title": "Space Odyssey: The Final Frontier", "description": "A visually stunning sci-fi epic exploring humanity's place in the universe, featuring complex themes and groundbreaking special effects."},
    {"id": "mov002", "title": "Galactic Wars: Attack of the Clones", "description": "An action-packed space opera with laser battles, alien creatures, and a classic good versus evil storyline."},
    {"id": "com001", "title": "Laugh Riot", "description": "A slapstick comedy about mistaken identities and hilarious mishaps during a weekend getaway."},
    {"id": "doc001", "title": "Wonders of the Deep", "description": "An awe-inspiring documentary showcasing the beauty and mystery of marine life in the world's oceans."},
    {"id": "mov003", "title": "Cyber City 2077", "description": "A gritty cyberpunk thriller set in a dystopian future, exploring themes of technology, consciousness, and rebellion."},
    {"id": "com002", "title": "The Office Party", "description": "A witty ensemble comedy centered around awkward interactions and office politics during an annual holiday celebration."},
    {"id": "doc002", "title": "Cosmic Journeys", "description": "A documentary exploring the vastness of space, black holes, distant galaxies, and the search for extraterrestrial life."},
    {"id": "mov004", "title": "Interstellar Echoes", "description": "A philosophical science fiction film about astronauts travelling through a wormhole in search of a new home for humanity."}
]
print(f"\nItem catalog contains {len(item_catalog)} items.")
# 2. Generate embeddings for all items in the catalog (pre-computation)
print("\nGenerating embeddings for the item catalog...")
item_embeddings_data = []
for item in item_catalog:
    # Combine title and description for a richer embedding
    text_to_embed = f"{item['title']}: {item['description']}"
    embedding = get_embedding(client, text_to_embed)
    if embedding:
        # Store item ID and its embedding
        item_embeddings_data.append({"id": item["id"], "embedding": embedding})
    else:
        print(f"Skipping item {item['id']} due to embedding error.")
if not item_embeddings_data:
    print("\nError: No embeddings were generated. Cannot provide recommendations.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(item_embeddings_data)} items.")
# 3. Select a target item for which to find recommendations
target_item_id = "mov001" # Let's find movies similar to "Space Odyssey"
print(f"\nFinding recommendations similar to item ID: {target_item_id}")
# Find the embedding for the target item
target_embedding = None
for item_data in item_embeddings_data:
    if item_data["id"] == target_item_id:
        target_embedding = item_data["embedding"]
        break
if target_embedding is None:
    print(f"Error: Could not find the embedding for the target item ID '{target_item_id}'.")
    exit()
# 4. Calculate similarity between the target item and all other items
recommendations = []
print("\nCalculating similarities...")
for item_data in item_embeddings_data:
    # Don't compare the item with itself
    if item_data["id"] == target_item_id:
        continue
    similarity = cosine_similarity(target_embedding, item_data["embedding"])
    recommendations.append({"id": item_data["id"], "score": similarity})
# 5. Sort potential recommendations by similarity score
recommendations.sort(key=lambda x: x["score"], reverse=True)
# 6. Display top N recommendations
print("\n--- Top Recommendations ---")
# Find the original title/description for the target item for context
target_item_info = next((item for item in item_catalog if item["id"] == target_item_id), None)
if target_item_info:
    print(f"Based on: \"{target_item_info['title']}\"\n")
if not recommendations:
    print("No recommendations found (or error calculating similarities).")
else:
    top_n = 3
    print(f"Top {top_n} most similar items:")
    for i, rec in enumerate(recommendations[:top_n]):
        # Find the full item details from the original catalog
        rec_details = next((item for item in item_catalog if item["id"] == rec["id"]), None)
        if rec_details:
            print(f"{i+1}. ID: {rec['id']}, Score: {rec['score']:.4f}")
            print(f"   Title: {rec_details['title']}")
            print(f"   Description: {rec_details['description'][:100]}...")  # Truncate description
            print("-" * 10)
        else:
            print(f"{i+1}. ID: {rec['id']}, Score: {rec['score']:.4f} (Details not found)")
            print("-" * 10)
    if len(recommendations) > top_n:
        print(f"(Showing top {top_n} of {len(recommendations)} potential recommendations)")
Code Breakdown Explanation
This example demonstrates how to build a straightforward content-based recommendation system by combining OpenAI embeddings with cosine similarity calculations.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions previously defined.
- Item Catalog:
  - A list of dictionaries (item_catalog) represents the items available for recommendation (e.g., movies). Each item has an id, title, and description. In a real system, this would likely be loaded from a database.
- Item Embedding Generation:
  - The script iterates through each item in the item_catalog.
  - Content Combination: It combines the title and description into a single string (text_to_embed). This provides richer context to the embedding model than using just the title or description alone.
  - It calls get_embedding for this combined text.
  - It stores the item['id'] and its corresponding embedding vector together in the item_embeddings_data list. This pre-computation step is standard practice for recommendation systems.
- Target Item Selection:
  - A target_item_id variable is set to specify the item for which we want recommendations (e.g., find items similar to mov001).
  - The script retrieves the pre-computed embedding vector for this target_item_id from the item_embeddings_data list.
- Similarity Calculation:
  - It iterates through all the items with embeddings in item_embeddings_data.
  - Exclusion: It explicitly skips the comparison if the current item's ID matches the target_item_id (an item shouldn't recommend itself).
  - For every other item, it calculates the cosine_similarity between the target_embedding and the current item's embedding.
  - It stores the other item's id and its calculated similarity score in a recommendations list.
- Ranking Recommendations:
  - The recommendations list is sorted using recommendations.sort(...) based on the score field in descending order, placing the most similar items at the beginning of the list.
- Displaying Results:
  - The script prints the title of the target item for context.
  - It then iterates through the top N (e.g., 3) items in the sorted recommendations list.
  - For each recommended item ID, it looks up the full details (title, description) from the original item_catalog.
  - It prints the rank, ID, similarity score, title, and a truncated description for each recommended item.
This example effectively shows how embeddings capture semantic meaning, allowing the system to recommend items based on content similarity (e.g., recommending other philosophical sci-fi movies similar to "Space Odyssey") rather than just explicit tags or user history.
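The scripts in this section call the embeddings endpoint once per item. The endpoint also accepts a list of strings as input, which reduces the number of requests during pre-computation. The sketch below wraps that in a helper. Note the assumptions: embed_batch is a hypothetical helper name, and FakeClient is a stand-in used here only so the snippet runs without an API key or network access; its "embeddings" are meaningless placeholder numbers. With a real OpenAI client you would pass client in place of the FakeClient instance.

```python
from types import SimpleNamespace

def embed_batch(client, texts, model="text-embedding-3-small"):
    """Embed several texts in one request and return vectors in input order."""
    response = client.embeddings.create(input=list(texts), model=model)
    # Each returned item carries the index of the input it belongs to.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

class FakeClient:
    """Offline stand-in that mimics the shape of client.embeddings.create()."""
    def __init__(self):
        self.embeddings = self  # so that fake.embeddings.create(...) works
        self.calls = 0

    def create(self, input, model):
        self.calls += 1
        # Placeholder 2-dimensional "embedding": [characters, words].
        data = [
            SimpleNamespace(index=i, embedding=[float(len(t)), float(len(t.split()))])
            for i, t in enumerate(input)
        ]
        return SimpleNamespace(data=data)

fake = FakeClient()
vectors = embed_batch(fake, ["Space Odyssey", "Laugh Riot", "Wonders of the Deep"])
print(f"{len(vectors)} vectors from {fake.calls} request(s)")
```

Batching matters most when embedding a whole catalog up front; very large collections still need to be split into several requests to stay within the API's per-request limits.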
3.2.4 Context retrieval for AI assistants
Context retrieval helps chatbots and AI systems find and use relevant information from large knowledge bases by converting both queries and stored knowledge into embeddings. This process involves several key steps:
First, the system converts all documents in its knowledge base into embedding vectors - numerical representations that capture the semantic meaning of the text. These embeddings are stored in a vector database for quick retrieval.
When an AI assistant receives a question, it converts that query into an embedding vector using the same process. This ensures that both the stored knowledge and the incoming questions are represented in the same mathematical space.
The system then performs a similarity search to find the most relevant information. This search compares the query embedding to all stored embeddings, typically using techniques like cosine similarity or nearest neighbor search. The beauty of this approach is that it can identify semantically similar content even when the exact wording differs significantly.
For example, a query about "laptop won't turn on" might match documentation about "computer power issues" because their embeddings capture the similar underlying meaning. Keyword-based search would miss this match because the two phrases share no terms.
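The similarity search described above can be sketched with NumPy alone. This is a minimal illustration, not a production retrieval system: the function name top_k_similar and the three-dimensional toy vectors are assumptions made up for this sketch (real embeddings have far more dimensions), and the code assumes no vector is all zeros.

```python
import numpy as np

def top_k_similar(query_vec, doc_matrix, k=2):
    """Return (index, score) pairs for the k rows of doc_matrix most similar to query_vec."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # Normalize to unit length so that a dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ query
    # argsort is ascending, so reverse it and keep the first k indices.
    top_indices = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top_indices]

# Hypothetical toy vectors: rows 0 and 2 point roughly the same way as the query.
toy_docs = [
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.0, 0.1],
]
toy_query = [1.0, 0.0, 0.0]
results = top_k_similar(toy_query, toy_docs, k=2)
print(results)
```

Comparing against every stored vector like this is fine for small collections; a vector database typically replaces the exhaustive scan with an approximate nearest neighbor index.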
Once relevant information is identified, it can be used to generate more accurate, informed responses. This is particularly powerful for domain-specific applications where the AI needs to access technical documentation, product information, or company policies. The system can handle complex queries by combining multiple pieces of relevant context, ensuring responses are both accurate and comprehensive.
Example:
Below is a code example that demonstrates how AI assistants can retrieve context using OpenAI embeddings, implementing the steps described above.
The script illustrates the essential process of searching a knowledge base to provide relevant context for an AI assistant's responses.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation
import datetime

# --- Configuration ---
load_dotenv()

# Get the current date and location context
# (the timestamp is read from the system clock; the location is a fixed example label)
current_timestamp = datetime.datetime.now().astimezone().strftime("%Y-%m-%d %H:%M:%S %Z")
current_location = "Grapevine, Texas, United States"
print(f"Running Context Retrieval example at: {current_timestamp}")
print(f"Location Context: {current_location}")

# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()

# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"

# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None

# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return similarity

# --- Context Retrieval Implementation ---

# 1. Define your Knowledge Base (list of text documents/chunks)
# This represents the information the AI assistant can draw upon.
knowledge_base = [
    {"id": "doc001", "source": "troubleshooting_guide.txt", "content": "If your laptop fails to power on, first check the power adapter connection. Ensure the cable is securely plugged into both the laptop and the wall outlet. Try a different outlet if possible."},
    {"id": "doc002", "source": "troubleshooting_guide.txt", "content": "A blinking power light often indicates a battery issue or a charging problem. Try removing the battery (if removable) and powering on with only the adapter connected."},
    {"id": "doc003", "source": "faq.html", "content": "To reset your password, go to the login page and click the 'Forgot Password' link. Follow the instructions sent to your registered email address."},
    {"id": "doc004", "source": "product_manual.pdf", "content": "The Model X laptop uses a USB-C port for charging. Ensure you are using the correct wattage power adapter (65W minimum recommended)."},
    {"id": "doc005", "source": "troubleshooting_guide.txt", "content": "No display output? Check if the laptop is making any sounds (fan spinning, beeps). Try connecting an external monitor to rule out a screen issue."},
    {"id": "doc006", "source": "support_articles/power_issues.md", "content": "Holding the power button down for 15-30 seconds can perform a hard reset, sometimes resolving power-on failures."},
    {"id": "doc007", "source": "faq.html", "content": "Software updates can be found in the 'System Settings' under the 'Updates' section. Ensure you are connected to the internet."}
]
print(f"\nKnowledge base contains {len(knowledge_base)} documents/chunks.")

# 2. Generate embeddings for the knowledge base (pre-computation)
print("\nGenerating embeddings for the knowledge base...")
kb_embeddings_data = []
for doc in knowledge_base:
    embedding = get_embedding(client, doc["content"])
    if embedding:
        # Store document ID and its embedding
        kb_embeddings_data.append({"id": doc["id"], "embedding": embedding})
    else:
        print(f"Skipping document {doc['id']} due to embedding error.")

if not kb_embeddings_data:
    print("\nError: No embeddings were generated for the knowledge base. Cannot retrieve context.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(kb_embeddings_data)} knowledge base documents.")

# 3. Define the user's query to the AI assistant
user_query = "My computer won't start up."
# user_query = "How do I update the system software?"
# user_query = "Screen is black when I press the power button."
print(f"\nUser Query: \"{user_query}\"")

# 4. Generate embedding for the user query
print("\nGenerating embedding for the user query...")
query_embedding = get_embedding(client, user_query)

# 5. Find relevant documents from the knowledge base using similarity search
retrieved_context = []
if query_embedding and kb_embeddings_data:
    print("\nCalculating similarities to find relevant context...")
    for doc_data in kb_embeddings_data:
        similarity = cosine_similarity(query_embedding, doc_data["embedding"])
        retrieved_context.append({"id": doc_data["id"], "score": similarity})

    # Sort context documents by similarity score in descending order
    retrieved_context.sort(key=lambda x: x["score"], reverse=True)

    # 6. Select Top N relevant documents to use as context
    top_n_context = 3
    print(f"\n--- Top {top_n_context} Relevant Context Documents Found ---")
    if not retrieved_context:
        print("No relevant context found (or error calculating similarities).")
    else:
        final_context_docs = []
        for i, context_item in enumerate(retrieved_context[:top_n_context]):
            # Find the full document details from the original knowledge base
            doc_details = next((doc for doc in knowledge_base if doc["id"] == context_item["id"]), None)
            if doc_details:
                print(f"{i+1}. ID: {context_item['id']}, Score: {context_item['score']:.4f}")
                print(f"   Source: {doc_details['source']}")
                print(f"   Content: {doc_details['content'][:150]}...")  # Truncate content
                print("-" * 10)
                final_context_docs.append(doc_details['content'])  # Store content for next step
            else:
                print(f"{i+1}. ID: {context_item['id']}, Score: {context_item['score']:.4f} (Details not found)")
                print("-" * 10)
        if len(retrieved_context) > top_n_context:
            print(f"(Showing top {top_n_context} of {len(retrieved_context)} potential context documents)")

        # --- Next Step (Conceptual - Not coded here) ---
        print("\n--- Next Step: Generating AI Assistant Response ---")
        print("The content from the relevant documents above would now be combined")
        print("with the original user query and sent to a model like GPT-4o")
        print("as context to generate an informed and accurate response.")
        print("Example prompt structure for GPT-4o:")
        print("```")
        print("System: You are a helpful AI assistant. Answer the user's question based ONLY on the provided context documents.")
        # Build the preview from however many documents were retrieved,
        # so the script does not fail when fewer than two are available.
        context_preview = "\n".join(f"{n}. {text[:50]}..." for n, text in enumerate(final_context_docs, start=1))
        print(f"User: Context Documents:\n{context_preview}\n\nQuestion: {user_query}\n\nAnswer:")
        print("```")
else:
    print("\nCould not retrieve context.")
    if not query_embedding:
        print("Reason: Failed to generate embedding for the user query.")
    if not kb_embeddings_data:
        print("Reason: No knowledge base embeddings were successfully generated.")
Code Breakdown Explanation
This example demonstrates the core mechanism behind context retrieval for AI assistants using embeddings – finding relevant information from a knowledge base to answer a user's query.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions.
- Knowledge Base Definition:
  - A list of dictionaries (knowledge_base) simulates the information store the AI assistant can access. Each dictionary represents a document or chunk of information and includes an id, source (optional metadata), and the actual text content.
- Knowledge Base Embedding Generation:
  - The script iterates through each doc in the knowledge_base.
  - It calls get_embedding on the doc["content"] to get its vector representation.
  - It stores the doc['id'] and its corresponding embedding vector together in kb_embeddings_data. This is the crucial pre-computation step: embeddings for the knowledge base are typically generated offline and stored (often in a specialized vector database) for fast retrieval.
- User Query:
  - A sample user_query string represents the question asked to the AI assistant.
- Query Embedding Generation:
  - The get_embedding function is called for the user_query to get its vector representation in the same embedding space as the knowledge base documents.
- Similarity Search (Context Retrieval):
  - It iterates through all the pre-computed embeddings in kb_embeddings_data.
  - For each knowledge base document, it calculates the cosine_similarity between the query_embedding and the document's embedding.
  - It stores the document's id and its similarity score relative to the query in a retrieved_context list.
- Ranking and Selection:
  - The retrieved_context list is sorted by score in descending order, bringing the most semantically relevant documents to the top.
  - The script selects the top N (e.g., 3) documents from this sorted list. These documents represent the most relevant context found in the knowledge base for the user's query.
- Displaying Retrieved Context:
  - The script prints the details (ID, score, source, content preview) of the top N context documents found.
- Conceptual Next Step (Crucial Explanation):
  - The final print statements explain the purpose of this retrieval process. The content of these final_context_docs would not be the final answer. Instead, they would be combined with the original user_query and passed as context to a large language model like GPT-4o in a subsequent API call.
  - An example prompt structure is shown, illustrating how the retrieved context grounds the AI assistant, enabling it to generate an informed response based on the relevant information found in the knowledge base, rather than relying solely on its general knowledge.
This example effectively demonstrates the retrieval part of Retrieval-Augmented Generation (RAG), showing how embeddings bridge the gap between a user's query and relevant information stored in a knowledge base, enabling more accurate and context-aware AI assistants.
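The generation step that the script leaves as a conceptual next step can be sketched as a small prompt-assembly helper. This is one possible shape under stated assumptions, not the book's implementation: the function name build_rag_messages, the max_docs parameter, and the extra instruction to admit when the context lacks an answer are all hypothetical choices made for this illustration.

```python
def build_rag_messages(user_query, context_docs, max_docs=3):
    """Assemble chat messages that ground the model in retrieved context."""
    selected = context_docs[:max_docs]
    # Number each document so the model (and a reader) can refer to it.
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(selected, start=1))
    system_prompt = (
        "You are a helpful AI assistant. Answer the user's question based ONLY "
        "on the provided context documents. If the context does not contain "
        "the answer, say so."  # Assumption: this fallback instruction is our addition.
    )
    user_prompt = f"Context Documents:\n{numbered}\n\nQuestion: {user_query}\n\nAnswer:"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

# Example with two snippets shortened from the sample knowledge base.
example_messages = build_rag_messages(
    "My computer won't start up.",
    [
        "If your laptop fails to power on, first check the power adapter connection.",
        "Holding the power button down for 15-30 seconds can perform a hard reset.",
    ],
)
print(example_messages[1]["content"])
```

The returned list could then be passed as the messages argument of client.chat.completions.create with model="gpt-4o", the same chat call used elsewhere in this chapter.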
3.2.5 Anomaly and similarity detection
Identifying unusual patterns or finding similar items in large datasets by comparing their semantic representations is a fundamental application of embedding technology. This powerful technique transforms raw data into mathematical vectors that capture the essence of their content, enabling sophisticated analysis at scale. Here's how these systems work and their key applications:
- Detect Anomalies
  - Flag unusual transactions or behaviors that deviate from normal patterns, for example detecting suspicious credit card purchases by comparing them against typical spending patterns
  - Identify potential security threats or fraud attempts, such as recognizing unusual login patterns or detecting fake accounts based on behavior analysis
  - Spot data quality issues or outliers in datasets, including incorrect data entries or unusual measurements that might indicate equipment malfunction
- Find Similarities
  - Group related documents, images, or data points based on semantic meaning. This allows systems to cluster similar content even when the exact wording differs, making it easier to organize large collections of information
  - Match similar customer inquiries or support tickets, helping customer service teams identify common issues and standardize responses to frequent problems
  - Identify duplicate or near-duplicate content, which helps content management systems maintain data quality and reduce redundancy
By converting data points into embedding vectors, systems can measure how "different" or "similar" items are to each other using mathematical distance calculations. Each item is mapped to a point in a high-dimensional space, where similar items sit closer together and dissimilar items sit farther apart. This representation makes it possible to flag unusual patterns or group related items automatically and at scale, covering cases that hand-written rules in traditional rule-based systems struggle to anticipate.
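As a small illustration of the distance idea, near-duplicate detection can be sketched as a threshold on pairwise cosine similarity. The function name find_near_duplicates, the toy vectors, and the 0.95 threshold are assumptions chosen for this sketch; a suitable threshold depends on the embedding model and the data, and the code assumes no vector is all zeros.

```python
import numpy as np

def find_near_duplicates(vectors, threshold=0.95):
    """Return index pairs (i, j) with i < j whose cosine similarity is >= threshold."""
    matrix = np.asarray(vectors, dtype=float)
    # Normalize rows so the matrix product below yields cosine similarities.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit.T
    pairs = []
    n = len(unit)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                pairs.append((i, j))
    return pairs

# Hypothetical toy vectors standing in for real embeddings.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # nearly the same direction as row 0
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
duplicate_pairs = find_near_duplicates(toy_vectors, threshold=0.95)
print(duplicate_pairs)
```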
Example:
The following code example demonstrates similarity and anomaly detection using OpenAI embeddings.
This script will:
- Define a dataset of text items (e.g., descriptions of transactions or events).
- Generate embeddings for these items.
- Similarity Detection: Find items most similar to a given target item.
- Anomaly Detection: Identify items that are least similar (most anomalous) compared to the rest of the dataset using a simple average similarity approach.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation
import datetime

# --- Configuration ---
load_dotenv()

# Get the current date and location context
# (the timestamp is read from the system clock; the location is a fixed example label)
current_timestamp = datetime.datetime.now().astimezone().strftime("%Y-%m-%d %H:%M:%S %Z")
current_location = "Houston, Texas, United States"
print(f"Running Similarity & Anomaly Detection example at: {current_timestamp}")
print(f"Location Context: {current_location}")

# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()

# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"

# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None

# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        # Clamp the value to handle potential floating point inaccuracies slightly outside [-1, 1]
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return np.clip(similarity, -1.0, 1.0)

# --- Similarity and Anomaly Detection Implementation ---

# 1. Define your dataset (e.g., transaction descriptions, log entries)
# Includes mostly normal items and a couple of potentially anomalous ones.
dataset = [
    {"id": "txn001", "description": "Grocery purchase at Local Supermarket"},
    {"id": "txn002", "description": "Monthly subscription fee for streaming service"},
    {"id": "txn003", "description": "Dinner payment at Italian Restaurant"},
    {"id": "txn004", "description": "Online order for electronics from TechStore"},
    {"id": "txn005", "description": "Fuel purchase at Gas Station"},
    {"id": "txn006", "description": "Purchase of fresh produce and bread"},  # Similar to txn001
    {"id": "txn007", "description": "Payment for movie streaming subscription"},  # Similar to txn002
    {"id": "txn008", "description": "Unusual large wire transfer to overseas account"},  # Potential Anomaly 1
    {"id": "txn009", "description": "Purchase of rare antique collectible vase"},  # Potential Anomaly 2
    {"id": "txn010", "description": "Coffee purchase at Cafe Central"}
]
print(f"\nDataset contains {len(dataset)} items.")

# 2. Generate embeddings for all items in the dataset (pre-computation)
print("\nGenerating embeddings for the dataset...")
dataset_embeddings_data = []
for item in dataset:
    embedding = get_embedding(client, item["description"])
    if embedding:
        # Store item ID, description, and its embedding
        dataset_embeddings_data.append({
            "id": item["id"],
            "description": item["description"],
            "embedding": embedding
        })
    else:
        print(f"Skipping item {item['id']} due to embedding error.")

if not dataset_embeddings_data:
    print("\nError: No embeddings were generated. Cannot perform analysis.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(dataset_embeddings_data)} items.")

# --- Part A: Similarity Detection ---
print("\n--- Part A: Similarity Detection ---")

# Select a target item to find similar items for
target_item_id_similarity = "txn001"  # Find items similar to "Grocery purchase..."
print(f"Finding items similar to item ID: {target_item_id_similarity}")

# Find the target item's data
target_item_data = next((item for item in dataset_embeddings_data if item["id"] == target_item_id_similarity), None)
if target_item_data:
    target_embedding = target_item_data["embedding"]
    similar_items = []
    # Calculate similarity between the target and all other items
    for item_data in dataset_embeddings_data:
        if item_data["id"] == target_item_id_similarity:
            continue  # Skip self-comparison
        similarity = cosine_similarity(target_embedding, item_data["embedding"])
        similar_items.append({
            "id": item_data["id"],
            "description": item_data["description"],
            "score": similarity
        })

    # Sort by similarity score
    similar_items.sort(key=lambda x: x["score"], reverse=True)

    # Display top N similar items
    print(f"\nItems most similar to: \"{target_item_data['description']}\"")
    top_n_similar = 2
    for i, item in enumerate(similar_items[:top_n_similar]):
        print(f"{i+1}. ID: {item['id']}, Score: {item['score']:.4f}")
        print(f"   Description: {item['description']}")
        print("-" * 10)
else:
    print(f"Error: Could not find data for target item ID '{target_item_id_similarity}'.")

# --- Part B: Anomaly Detection (Simple Approach) ---
print("\n--- Part B: Anomaly Detection (Low Average Similarity) ---")

# Calculate the average similarity of each item to all other items
item_avg_similarities = []
num_items = len(dataset_embeddings_data)
if num_items < 2:
    print("Need at least 2 items with embeddings to calculate average similarities.")
else:
    print("\nCalculating average similarities for anomaly detection...")
    for i in range(num_items):
        current_item = dataset_embeddings_data[i]
        total_similarity = 0
        # Compare current item to all others
        for j in range(num_items):
            if i == j:  # Don't compare item to itself
                continue
            other_item = dataset_embeddings_data[j]
            similarity = cosine_similarity(current_item["embedding"], other_item["embedding"])
            total_similarity += similarity
        # Calculate average similarity (avoid division by zero if only 1 item)
        average_similarity = total_similarity / (num_items - 1) if num_items > 1 else 0
        item_avg_similarities.append({
            "id": current_item["id"],
            "description": current_item["description"],
            "avg_score": average_similarity
        })
        print(f"Item ID {current_item['id']} - Avg Similarity: {average_similarity:.4f}")

    # Sort items by average similarity in ascending order (lowest first = most anomalous)
    item_avg_similarities.sort(key=lambda x: x["avg_score"])

    # Display top N potential anomalies (items least similar to others)
    print("\nPotential Anomalies (Lowest Average Similarity):")
    top_n_anomalies = 3
    for i, item in enumerate(item_avg_similarities[:top_n_anomalies]):
        print(f"{i+1}. ID: {item['id']}, Avg Score: {item['avg_score']:.4f}")
        print(f"   Description: {item['description']}")
        print("-" * 10)
    print("\nNote: Low average similarity suggests an item is semantically")
    print("different from the majority of other items in this dataset.")
Code Breakdown Explanation
This example demonstrates using OpenAI embeddings for both finding similar items and detecting potential anomalies within a dataset based on semantic meaning.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions. The cosine_similarity function now includes np.clip so that floating-point rounding cannot push the output outside [-1, 1].
- Dataset Definition:
  - A list of dictionaries (dataset) simulates the data to be analyzed (e.g., transaction descriptions). Each item has an id and a text description. The sample data includes mostly common items and a few conceptually different ones intended as potential anomalies.
- Dataset Embedding Generation:
  - The script iterates through each item in the dataset.
  - It calls get_embedding on the item["description"].
  - It stores the item['id'], item['description'], and its corresponding embedding vector together in dataset_embeddings_data. This pre-computation is essential.
- Part A: Similarity Detection:
  - Target Selection: An item ID (target_item_id_similarity) is chosen to find similar items for.
  - Target Embedding Retrieval: The script finds the pre-computed embedding for the target item.
  - Comparison: It iterates through all other items in dataset_embeddings_data and calculates the cosine_similarity between the target item's embedding and each other item's embedding.
  - Ranking: The results (other item ID, description, similarity score) are stored and then sorted by score in descending order.
  - Display: The top N most similar items are printed.
- Part B: Anomaly Detection (Simple Average Similarity Approach):
  - Concept: This simple method identifies anomalies as items that have the lowest average semantic similarity to all other items in the dataset. An item that is very different conceptually from the rest will likely have low similarity scores when compared to most others.
  - Calculation:
    - The script iterates through each item (current_item) in dataset_embeddings_data.
    - For each current_item, it iterates through all other items in the dataset.
    - It calculates the cosine_similarity between the current_item and every other_item.
    - It sums these similarities and calculates the average similarity for the current_item.
  - Ranking: The items are stored along with their calculated avg_score and then sorted by this score in ascending order (lowest average similarity first).
  - Display: The top N items with the lowest average similarity scores are printed as potential anomalies. A note explains the interpretation.
This example showcases two powerful applications: finding related content (similarity) and identifying outliers (anomaly detection) by leveraging the semantic understanding captured within OpenAI embeddings.
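For larger datasets, the nested loops in Part B can be replaced by a single matrix product. The following is a minimal sketch of that idea, assuming at least two non-zero vectors; the function name average_similarity_scores and the two-dimensional toy vectors are hypothetical and stand in for real embeddings.

```python
import numpy as np

def average_similarity_scores(vectors):
    """Average cosine similarity of each row to every other row."""
    matrix = np.asarray(vectors, dtype=float)
    # Normalize rows so the matrix product yields all pairwise cosine similarities.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = sims.shape[0]
    # Each row sum includes the self-similarity; subtract the diagonal to exclude it.
    return (sims.sum(axis=1) - np.diag(sims)) / (n - 1)

# Hypothetical toy vectors: the last row points in a different direction from the rest.
toy_vectors = [
    [1.0, 0.1],
    [0.9, 0.2],
    [1.0, 0.0],
    [0.0, 1.0],
]
avg_scores = average_similarity_scores(toy_vectors)
most_anomalous = int(np.argmin(avg_scores))
print(avg_scores, most_anomalous)
```

The lowest average score marks the candidate anomaly, matching the ascending sort in the script above. The pairwise matrix grows quadratically with the number of items, so this form suits datasets that fit comfortably in memory.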
3.2.6 Clustering & Tagging
Automatically organize and label content based on semantic similarity - a powerful technique that uses embedding vectors to understand the true meaning and relationships between different pieces of content. This approach goes far beyond traditional keyword matching, allowing for much more nuanced and accurate content organization.
When content is clustered, similar items naturally group together based on their semantic meaning, even if they use different terminology to express the same concepts. For example, documents about "automotive maintenance" and "car repair" would cluster together despite using different words.
This intelligent organization helps create intuitive navigation systems, improves content discovery, and makes large document collections more manageable by grouping related items together. Some key benefits include:
- Automatic tag generation based on cluster themes
- Dynamic organization that adapts as new content is added
- Improved search relevance through semantic understanding
- Better content discovery through related-item suggestions
The clustering process can be fine-tuned to create either broad categories or more granular subcategories, depending on the specific needs of your content organization system. This flexibility makes it a valuable tool for managing everything from digital libraries to enterprise knowledge bases.
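One way to keep the organization current as new content arrives is to assign each new item to the nearest existing cluster centroid. The sketch below assumes the centroids are already known (in practice they could come from a fitted model, such as the cluster_centers_ attribute of scikit-learn's KMeans); the function name assign_to_nearest_cluster and the toy centroids are hypothetical.

```python
import numpy as np

def assign_to_nearest_cluster(new_vec, centroids):
    """Return the index of the centroid closest to new_vec (Euclidean distance)."""
    point = np.asarray(new_vec, dtype=float)
    centers = np.asarray(centroids, dtype=float)
    # Distance from the new point to every centroid at once.
    distances = np.linalg.norm(centers - point, axis=1)
    return int(np.argmin(distances))

# Hypothetical two-dimensional centroids standing in for real cluster centers.
toy_centroids = [
    [1.0, 0.0],  # e.g. a "space" cluster
    [0.0, 1.0],  # e.g. a "cooking" cluster
]
cluster_index = assign_to_nearest_cluster([0.2, 0.9], toy_centroids)
print(cluster_index)
```

This lookup only places new items into existing groups. When the content mix shifts noticeably, re-running the clustering is the safer choice.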
Example:
Let's examine a code example that demonstrates clustering and tagging using OpenAI embeddings and GPT-4o.
This script will:
- Define a collection of documents.
- Generate embeddings for the documents.
- Cluster the documents using K-Means based on their embeddings.
- For each cluster, use GPT-4o to analyze the documents within it and generate a descriptive tag or label.
- Display the documents grouped by cluster along with their AI-generated tags.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np
from sklearn.cluster import KMeans  # For clustering algorithm
import datetime

# --- Configuration ---
load_dotenv()

# Get the current date and location context
# (the timestamp is read from the system clock; the location is a fixed example label)
current_timestamp = datetime.datetime.now().astimezone().strftime("%Y-%m-%d %H:%M:%S %Z")
current_location = "San Antonio, Texas, United States"
print(f"Running Clustering & Tagging example at: {current_timestamp}")
print(f"Location Context: {current_location}")

# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()

# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"

# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None

# --- Helper Function to Generate Cluster Tag using GPT-4o ---
def generate_cluster_tag(client, documents_in_cluster):
    """Uses GPT-4o to suggest a tag/label for a cluster of documents."""
    if not documents_in_cluster:
        return "Empty Cluster"

    # Combine content for context, limiting total length if necessary
    # Using first few hundred chars of each doc might be enough
    max_context_length = 3000  # Limit context to avoid excessive token usage
    context = ""
    for i, doc in enumerate(documents_in_cluster):
        doc_preview = f"Document {i+1}: {doc[:300]}...\n"
        if len(context) + len(doc_preview) > max_context_length:
            break
        context += doc_preview
    if not context:
        return "Error: Could not create context"

    system_prompt = "You are an expert at identifying themes and creating concise labels."
    user_prompt = f"""Based on the following document excerpts from a single cluster, suggest a short, descriptive tag or label (2-5 words) that captures the main theme or topic of this group.
Document Excerpts:
---
{context.strip()}
---
Suggested Tag/Label:
"""
    print(f"\nGenerating tag for cluster with {len(documents_in_cluster)} documents...")
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            max_tokens=20,  # Short response expected
            temperature=0.3  # More deterministic label
        )
        tag = response.choices[0].message.content.strip().replace('"', '')  # Clean up quotes
        print(f"Generated tag: '{tag}'")
        return tag
    except OpenAIError as e:
        print(f"OpenAI API Error generating tag: {e}")
        return "Tagging Error"
    except Exception as e:
        print(f"An unexpected error occurred during tag generation: {e}")
        return "Tagging Error"

# --- Clustering and Tagging Implementation ---

# 1. Define your collection of documents
# Covers topics: Space Exploration, Cooking/Food, Web Development
documents = [
    "NASA launches new probe to study Jupiter's moons.",
    "Recipe for authentic Italian pasta carbonara.",
    "JavaScript frameworks like React and Vue dominate front-end development.",
    "The James Webb Space Telescope captures stunning images of distant galaxies.",
    "Tips for baking the perfect sourdough bread at home.",
    "Understanding asynchronous programming in Node.js.",
    "SpaceX successfully lands its reusable rocket booster after launch.",
    "Exploring the different types of olive oil and their uses in cooking.",
    "CSS Grid vs Flexbox: Choosing the right layout module.",
    "The search for habitable exoplanets continues with new telescope data.",
    "How to make delicious homemade pizza from scratch.",
    "Building RESTful APIs using Express.js and MongoDB."
]
print(f"\nDocument collection contains {len(documents)} documents.")

# 2. Generate embeddings for all documents
print("\nGenerating embeddings for the document collection...")
embeddings = []
valid_documents = []  # Keep track of documents for which embedding was successful
for doc in documents:
    embedding = get_embedding(client, doc)
    if embedding:
        embeddings.append(embedding)
        valid_documents.append(doc)  # Add corresponding document text
    else:
        print(f"Skipping document due to embedding error: \"{doc[:70]}...\"")

if not embeddings:
    print("\nError: No embeddings were generated. Cannot perform clustering.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(valid_documents)} documents.")

# Convert embeddings list to a NumPy array for scikit-learn
embedding_matrix = np.array(embeddings)

# 3. Apply Clustering Algorithm (K-Means)
# Choose the number of clusters (k). We expect 3 topics here.
n_clusters = 3
print(f"\nApplying K-Means clustering with k={n_clusters}...")
try:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    kmeans.fit(embedding_matrix)
    cluster_labels = kmeans.labels_
    print("Clustering complete.")
except Exception as e:
    print(f"An error occurred during clustering: {e}")
    exit()

# 4. Group Documents by Cluster
print("\nGrouping documents by cluster...")
clustered_documents = {i: [] for i in range(n_clusters)}
for i, label in enumerate(cluster_labels):
    clustered_documents[label].append(valid_documents[i])

# 5. Generate Tags for Each Cluster using GPT-4o
print("\nGenerating tags for each cluster...")
cluster_tags = {}
for cluster_id, docs_in_cluster in clustered_documents.items():
    tag = generate_cluster_tag(client, docs_in_cluster)
    cluster_tags[cluster_id] = tag

# 6. Display Documents by Cluster with Generated Tags
print(f"\n--- Documents Grouped by Cluster and Tag (k={n_clusters}) ---")
for cluster_id, docs_in_cluster in clustered_documents.items():
    generated_tag = cluster_tags.get(cluster_id, "Unknown Tag")
    print(f"\nCluster {cluster_id + 1} - Suggested Tag: '{generated_tag}'")
    print("-" * (28 + len(generated_tag)))  # Adjust underline length
    if not docs_in_cluster:
        print("  (No documents in this cluster)")
    else:
        for doc_text in docs_in_cluster:
            print(f"  - {doc_text}")  # Print full document text here

print("\nClustering and Tagging process complete.")
Code Breakdown Explanation
This script demonstrates how to automatically group similar documents by their semantic meaning using embeddings, then uses GPT-4o to generate descriptive tags for each group.
- Setup & Helpers:
  - Includes standard imports plus KMeans from sklearn.cluster.
  - Initializes the OpenAI client.
  - Includes the get_embedding helper function.
- New Helper Function: generate_cluster_tag
  - Purpose: Takes a list of documents belonging to a single cluster and uses GPT-4o to suggest a concise tag summarizing their common theme.
  - Input: The client object and documents_in_cluster (a list of text strings).
  - Context Creation: It concatenates parts of the documents (e.g., first 300 characters) to create a context string for GPT-4o, respecting a maximum length to manage token usage.
  - Prompt Engineering: It constructs a prompt asking GPT-4o to act as an expert theme identifier and suggest a short tag (2-5 words) based on the provided document excerpts.
  - API Call: Uses client.chat.completions.create with model="gpt-4o" and the specialized prompt. A low temperature is used for more focused tag generation.
  - Output: Returns the cleaned-up tag suggested by GPT-4o, or an error message.
- Document Collection: A list named documents holds sample text content covering a few distinct topics (Space, Cooking, Web Development).
- Embedding Generation:
  - The script iterates through the documents, generates an embedding for each using get_embedding, and stores successful embeddings and corresponding text in embeddings and valid_documents.
  - The embeddings are converted to a NumPy array (embedding_matrix).
- Clustering (K-Means):
  - The number of clusters (n_clusters) is set (e.g., k=3).
  - KMeans from scikit-learn is initialized and fitted to the embedding_matrix.
  - kmeans.labels_ provides the cluster assignment for each document.
- Grouping Documents:
  - A dictionary (clustered_documents) is created to store the text of documents belonging to each cluster ID.
- Generating Cluster Tags:
  - The script iterates through the clustered_documents dictionary.
  - For each cluster_id and its list of docs_in_cluster, it calls the generate_cluster_tag helper function.
  - The suggested tag for each cluster is stored in the cluster_tags dictionary.
- Displaying Results:
  - The script iterates through the clusters again.
  - For each cluster, it retrieves the generated tag from cluster_tags.
  - It prints the cluster number, the suggested tag, and then lists the full text of all documents belonging to that cluster.
This example showcases a powerful workflow: using embeddings for unsupervised grouping of content based on meaning (clustering) and then leveraging an LLM like GPT-4o to interpret those groupings and assign meaningful labels (tagging), automating content organization.
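Before sending clusters to an LLM for tagging, it can help to check whether each cluster is actually tight. The following is a minimal sketch of one such check, assuming a simple cohesion measure: the mean cosine similarity between each member and its cluster centroid. The function name cluster_cohesion and the toy 2-D vectors are hypothetical stand-ins for real embeddings and K-Means labels, and the sketch uses plain Python so it runs without extra dependencies.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_cohesion(vectors, labels):
    """Return {label: mean cosine similarity of members to their cluster centroid}."""
    # Group the vectors by their assigned cluster label.
    groups = {}
    for vec, label in zip(vectors, labels):
        groups.setdefault(label, []).append(vec)

    cohesion = {}
    for label, members in groups.items():
        dim = len(members[0])
        # The centroid is the component-wise mean of the cluster's vectors.
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        scores = [cosine_similarity(m, centroid) for m in members]
        cohesion[label] = sum(scores) / len(scores)
    return cohesion


# Toy 2-D vectors standing in for real embeddings (hypothetical data).
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 1.0]]
toy_labels = [0, 0, 1, 1]

scores = cluster_cohesion(toy_vectors, toy_labels)
for label in sorted(scores):
    print(f"Cluster {label}: cohesion {scores[label]:.3f}")
```

In the script above, the same function could be called with embeddings and cluster_labels. A cluster with noticeably lower cohesion than the others may be mixing topics, which suggests trying a different value of k before generating tags.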
3.2.7 Content Recommendations
Content recommendation systems powered by embeddings represent a significant advancement in personalization technology. By analyzing semantic relationships, these systems can understand the nuanced meaning and context of content in ways that traditional keyword-based systems cannot.
Here's a detailed look at how embedding-based recommendations work:
- Content Analysis:
  - The system generates an embedding vector for each piece of content in the database
  - These vectors can reflect characteristics such as topic, writing style, and tone
  - Similarity measures compare items across the many dimensions of these vectors
- User Preference Modeling:
  - The system tracks detailed interaction patterns including time spent, engagement level, and sharing behavior
  - Historical preferences are weighted and combined to create multi-dimensional user profiles
  - Both explicit feedback (ratings, likes) and implicit signals (scroll depth, repeat visits) are considered
- Contextual Understanding:
  - Real-time factors like device type and location are incorporated into the recommendation algorithm
  - The system identifies patterns in content consumption based on time of day and day of week
  - Current session behavior is analyzed to understand immediate user interests
- Dynamic Adaptation:
  - Machine learning models continuously refine user profiles based on new interactions
  - The system learns from both positive and negative feedback to improve accuracy
  - Recommendation strategies are automatically adjusted based on performance metrics
This approach enables recommendation engines to deliver personalized experiences through several key capabilities:
- Identify content similarities that might not be apparent through traditional metadata
  - Detect thematic connections between items even when they use different terminology
  - Recognize similar writing styles, tone, and complexity levels across content
- Understand the progression of user interests over time
  - Track how preferences evolve from basic to advanced topics
  - Identify shifts in user interests across different categories
- Make cross-domain recommendations (e.g., suggesting articles based on watched videos)
  - Connect content across different media types based on semantic relationships
  - Leverage learning from one domain to enhance recommendations in another
- Account for seasonal trends and temporal relevance
  - Adjust recommendations based on time-sensitive factors like holidays or events
  - Consider current trends and their impact on user interests
The result is a highly personalized experience that can suggest truly relevant videos, articles, or products that match users' interests, both current and evolving. This goes far beyond simple "users who liked X also liked Y" algorithms, creating a more engaging and valuable user experience.
Example:
Here's a code example that demonstrates the core concept of content recommendations using embeddings.
This script focuses on finding semantically similar content items based on their embeddings, which is the foundation for the more advanced recommendation features described above.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation

# --- Configuration ---
load_dotenv()

# Static run context printed for reference (hardcoded, not read from the system)
current_timestamp = "2024-11-30 15:52:00 CDT"
current_location = "Orlando, Florida, United States"
print(f"Running Content Recommendation example at: {current_timestamp}")
print(f"Location Context: {current_location}")

# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()

# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"

# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None

# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return np.clip(similarity, -1.0, 1.0)  # Ensure value is within valid range

# --- Content Recommendation Implementation ---
# 1. Define your Content Catalog (e.g., articles, blog posts)
# In a real application, this would come from a database or CMS.
content_catalog = [
    {"id": "art001", "title": "Introduction to Quantum Computing", "content": "Exploring the basics of qubits, superposition, and entanglement in quantum mechanics and their potential for computation."},
    {"id": "art002", "title": "Healthy Mediterranean Diet Recipes", "content": "Delicious and easy recipes focusing on fresh vegetables, olive oil, fish, and whole grains for a heart-healthy lifestyle."},
    {"id": "art003", "title": "The Future of Artificial Intelligence in Healthcare", "content": "How AI and machine learning are transforming diagnostics, drug discovery, and personalized medicine."},
    {"id": "art004", "title": "Beginner's Guide to Python Programming", "content": "Learn the fundamentals of Python syntax, data types, control flow, and functions to start coding."},
    {"id": "art005", "title": "Understanding Neural Networks and Deep Learning", "content": "An overview of artificial neural networks, backpropagation, and the concepts behind deep learning models."},
    {"id": "art006", "title": "Travel Guide: Hiking the Swiss Alps", "content": "Tips for planning your trip, recommended trails, essential gear, and stunning viewpoints in the Swiss Alps."},
    {"id": "art007", "title": "Mastering the Art of French Pastry", "content": "Techniques for creating classic French desserts like croissants, macarons, and éclairs."},
    {"id": "art008", "title": "Ethical Considerations in AI Development", "content": "Discussing bias, fairness, transparency, and accountability in the development and deployment of artificial intelligence systems."}
]
print(f"\nContent catalog contains {len(content_catalog)} items.")

# 2. Generate embeddings for all content items (pre-computation)
print("\nGenerating embeddings for the content catalog...")
content_embeddings_data = []
for item in content_catalog:
    # Use title and content for embedding
    text_to_embed = f"Title: {item['title']}\nContent: {item['content']}"
    embedding = get_embedding(client, text_to_embed)
    if embedding:
        # Store item ID and its embedding
        content_embeddings_data.append({"id": item["id"], "embedding": embedding})
    else:
        print(f"Skipping item {item['id']} due to embedding error.")
if not content_embeddings_data:
    print("\nError: No embeddings were generated. Cannot provide recommendations.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(content_embeddings_data)} content items.")

# 3. Select a target item (e.g., an article the user just read)
target_item_id = "art003"  # User read "The Future of Artificial Intelligence in Healthcare"
print(f"\nFinding content similar to item ID: {target_item_id}")

# Find the embedding for the target item
target_embedding = None
for item_data in content_embeddings_data:
    if item_data["id"] == target_item_id:
        target_embedding = item_data["embedding"]
        break
if target_embedding is None:
    print(f"Error: Could not find the embedding for the target item ID '{target_item_id}'.")
    exit()

# 4. Calculate similarity between the target item and all other items
recommendations = []
print("\nCalculating similarities...")
for item_data in content_embeddings_data:
    # Don't recommend the item itself
    if item_data["id"] == target_item_id:
        continue
    similarity = cosine_similarity(target_embedding, item_data["embedding"])
    recommendations.append({"id": item_data["id"], "score": similarity})

# 5. Sort potential recommendations by similarity score
recommendations.sort(key=lambda x: x["score"], reverse=True)

# 6. Display top N recommendations
print("\n--- Top Content Recommendations ---")
# Find the original title for the target item for context
target_item_info = next((item for item in content_catalog if item["id"] == target_item_id), None)
if target_item_info:
    print(f"Because you read: \"{target_item_info['title']}\"\n")
if not recommendations:
    print("No recommendations found (or error calculating similarities).")
else:
    top_n = 3
    print(f"Top {top_n} recommended items:")
    for i, rec in enumerate(recommendations[:top_n]):
        # Find the full item details from the original catalog
        rec_details = next((item for item in content_catalog if item["id"] == rec["id"]), None)
        if rec_details:
            print(f"{i+1}. ID: {rec['id']}, Similarity Score: {rec['score']:.4f}")
            print(f" Title: {rec_details['title']}")
            print(f" Content Snippet: {rec_details['content'][:100]}...")  # Truncate content
            print("-" * 10)
        else:
            print(f"{i+1}. ID: {rec['id']}, Score: {rec['score']:.4f} (Details not found)")
            print("-" * 10)
    if len(recommendations) > top_n:
        print(f"(Showing top {top_n} of {len(recommendations)} potential recommendations)")
print("\nNote: This demonstrates basic content-to-content similarity.")
print("Advanced systems incorporate user profiles, interaction history, context, etc.")
Code Breakdown Explanation
This script demonstrates a fundamental approach to content recommendation using OpenAI embeddings, focusing on finding items semantically similar to a target item.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions.
- Content Catalog:
  - A list of dictionaries (content_catalog) simulates the available content (e.g., articles). Each item has an id, title, and content.
- Content Embedding Generation (Pre-computation):
  - The script iterates through each item in the content_catalog.
  - Combined Text: It creates a combined text string from the item's title and content to generate a richer embedding that captures more semantic detail.
  - It calls get_embedding for this combined text.
  - It stores the item['id'] and its embedding vector in content_embeddings_data. This pre-computation is vital for efficiency.
- Target Item Selection:
  - A target_item_id is chosen (e.g., art003), simulating an item the user has interacted with (e.g., read).
  - The script retrieves the pre-computed embedding for this target item.
- Similarity Calculation:
  - It iterates through all other items in content_embeddings_data.
  - It calculates the cosine_similarity between the target_embedding and each other item's embedding.
  - It stores the other item's id and its similarity score in the recommendations list.
- Ranking Recommendations:
  - The recommendations list is sorted by score in descending order, placing the most semantically similar content items first.
- Displaying Results:
  - The script prints the title of the target item for context ("Because you read...").
  - It displays the top N (e.g., 3) recommended items, showing their ID, similarity score, title, and a snippet of their content.
- Contextual Note: The final print statements explicitly mention that this example shows basic content-to-content similarity. Advanced recommendation systems, as described in the section text, would integrate user profiles (embeddings based on interaction history), real-time context (time, location), explicit feedback, and potentially more complex algorithms beyond simple cosine similarity. However, the core principle of using embeddings to measure semantic relatedness remains fundamental.
This example effectively illustrates how embeddings enable recommendations based on understanding the meaning of content, allowing suggestions that go beyond simple keyword or category matching.
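One step from item-to-item similarity toward the user profiles mentioned above is to represent a user as a weighted average of the embeddings of items they interacted with. The sketch below is illustrative only: the functions build_user_profile and recommend, the interaction weights, and the 3-D toy vectors are assumptions made for this example (real embeddings from text-embedding-3-small have far more dimensions), and a weighted average is just one simple way to build a profile.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_user_profile(item_vectors, interactions):
    """Weighted average of the vectors of items the user interacted with.

    item_vectors: dict mapping item ID -> embedding (list of floats)
    interactions: dict mapping item ID -> positive weight (e.g., derived from read time)
    Returns None if no usable interaction is found.
    """
    if not item_vectors:
        return None
    dim = len(next(iter(item_vectors.values())))
    profile = [0.0] * dim
    total_weight = 0.0
    for item_id, weight in interactions.items():
        vec = item_vectors.get(item_id)
        if vec is None or weight <= 0:
            continue  # Unknown items and non-positive weights are ignored in this sketch
        for d in range(dim):
            profile[d] += weight * vec[d]
        total_weight += weight
    if total_weight == 0:
        return None
    return [x / total_weight for x in profile]


def recommend(item_vectors, interactions, top_n=3):
    """Rank items the user has not interacted with by similarity to their profile."""
    profile = build_user_profile(item_vectors, interactions)
    if profile is None:
        return []
    scored = [
        (item_id, cosine_similarity(profile, vec))
        for item_id, vec in item_vectors.items()
        if item_id not in interactions
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy 3-D vectors standing in for real embeddings (hypothetical data).
toy_vectors = {
    "art003": [0.9, 0.1, 0.0],
    "art005": [0.8, 0.2, 0.0],
    "art008": [0.7, 0.3, 0.1],
    "art002": [0.0, 0.1, 0.9],
    "art007": [0.1, 0.0, 0.9],
}
# The user read art003 fully and skimmed art005 (weights are made up).
toy_interactions = {"art003": 1.0, "art005": 0.5}

top_items = recommend(toy_vectors, toy_interactions, top_n=2)
for item_id, score in top_items:
    print(f"{item_id}: {score:.4f}")
```

With real data, item_vectors could be built from content_embeddings_data in the script above. Averaging keeps the profile in the same vector space as the items, so the same cosine similarity ranking applies unchanged.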
3.2.8 Email Triage / Prioritization
Embedding technology enables sophisticated email analysis and categorization by understanding the semantic meaning of messages. This advanced system employs multiple layers of analysis to streamline email management:
- Urgency Detection
  - Identify time-sensitive matters requiring immediate attention through natural language processing
  - Recognize urgent language patterns and contextual cues by analyzing word choice, sentence structure, and historical patterns
  - Flag critical emails based on sender importance, keywords, and organizational hierarchy
- Smart Categorization
  - Group related email threads and conversations using semantic similarity matching
  - Sort messages by project, department, or business function through content analysis
  - Create dynamic folders based on emerging topics and trends
  - Apply machine learning to improve categorization accuracy over time
- Intent Classification
  - Distinguish between requests, updates, and FYI messages using advanced natural language understanding
  - Prioritize action items and delegate tasks automatically based on content and context
  - Identify follow-up requirements and set automated reminders
  - Extract key deadlines and commitments from message content
By leveraging semantic understanding, the system creates an intelligent email processing pipeline that can handle hundreds of messages simultaneously. The embedding-based analysis examines not just keywords, but the actual meaning and context of each message, considering factors such as:
- Message context within ongoing conversations
- Historical patterns of communication
- Organizational relationships and hierarchies
- Project timelines and priorities
This comprehensive approach significantly reduces the cognitive load of email management by automatically handling routine classification and prioritization tasks. The system ensures that important messages receive immediate attention while maintaining an organized structure for all communications. As a result, professionals can focus on high-value activities instead of spending hours manually sorting through their inbox, leading to improved productivity and faster response times for critical communications.
Example:
This script simulates categorizing incoming emails based on their semantic similarity to predefined categories such as "Urgent Action Required," "Project Update / Status," "Question / Request for Info," and "General Info / FYI."
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation

# --- Configuration ---
load_dotenv()

# Static run context printed for reference (hardcoded, not read from the system)
current_timestamp = "2024-10-31 15:54:00 CDT"
current_location = "Plano, Texas, United States"
print(f"Running Email Triage/Prioritization example at: {current_timestamp}")
print(f"Location Context: {current_location}")

# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()

# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"

# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None

# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return np.clip(similarity, -1.0, 1.0)  # Ensure value is within valid range

# --- Email Triage/Prioritization Implementation ---
# 1. Define Sample Emails (Subject + Snippet)
emails = [
    {"id": "email01", "subject": "Urgent: Server Down!", "body_snippet": "The main production server seems to be unresponsive. We need immediate assistance to investigate and bring it back online."},
    {"id": "email02", "subject": "Meeting Minutes - Project Phoenix Sync", "body_snippet": "Attached are the minutes from today's sync call. Key decisions included finalizing the Q3 roadmap. Action items assigned."},
    {"id": "email03", "subject": "Quick Question about Report", "body_snippet": "Hi team, just had a quick question regarding the methodology used in the latest market analysis report. Can someone clarify?"},
    {"id": "email04", "subject": "Fwd: Company Newsletter - April Edition", "body_snippet": "Sharing the latest company newsletter for your information."},
    {"id": "email05", "subject": "Action Required: Submit Timesheet by EOD", "body_snippet": "Friendly reminder to please submit your weekly timesheet by the end of the day today. This is mandatory."},
    {"id": "email06", "subject": "Update on Q2 Marketing Campaign", "body_snippet": "Just wanted to provide a brief update on the campaign performance metrics we discussed last week. See attached summary."},
    {"id": "email07", "subject": "Can you approve this request ASAP?", "body_snippet": "Need your approval on the attached budget request urgently to proceed with the vendor contract."}
]
print(f"\nProcessing {len(emails)} emails.")

# 2. Define Categories/Priorities and their Semantic Representations
# We represent each category with a descriptive phrase.
categories = {
    "Urgent Action Required": "Requires immediate attention, critical issue, deadline, ASAP request, mandatory task.",
    "Project Update / Status": "Information about ongoing projects, progress reports, meeting minutes, status updates.",
    "Question / Request for Info": "Asking for clarification, seeking information, query about details.",
    "General Info / FYI": "Newsletter, announcement, sharing information, non-actionable update."
}
print(f"\nDefined categories: {list(categories.keys())}")

# 3. Generate embeddings for Categories (pre-computation recommended)
print("\nGenerating embeddings for categories...")
category_embeddings = {}
for category_name, category_description in categories.items():
    embedding = get_embedding(client, category_description)
    if embedding:
        category_embeddings[category_name] = embedding
    else:
        print(f"Skipping category '{category_name}' due to embedding error.")
if not category_embeddings:
    print("\nError: No embeddings generated for categories. Cannot triage emails.")
    exit()

# 4. Process Each Email: Generate Embedding and Find Best Category
print("\nTriaging emails...")
email_results = []
for email in emails:
    # Combine subject and body for better context
    email_content = f"Subject: {email['subject']}\nBody: {email['body_snippet']}"
    email_embedding = get_embedding(client, email_content)
    if not email_embedding:
        print(f"Skipping email {email['id']} due to embedding error.")
        continue
    # Find the category with the highest similarity
    best_category = None
    max_similarity = -1  # Cosine similarity ranges from -1 to 1
    for category_name, category_embedding in category_embeddings.items():
        similarity = cosine_similarity(email_embedding, category_embedding)
        print(f" Email {email['id']} vs Category '{category_name}': Score {similarity:.4f}")
        if similarity > max_similarity:
            max_similarity = similarity
            best_category = category_name
    email_results.append({
        "id": email["id"],
        "subject": email["subject"],
        "assigned_category": best_category,
        "score": max_similarity
    })
    print(f"-> Email {email['id']} assigned to: '{best_category}' (Score: {max_similarity:.4f})")

# 5. Display Triage Results
print("\n--- Email Triage Results ---")
if not email_results:
    print("No emails were successfully triaged.")
else:
    # Optional: Group by category for display
    results_by_category = {cat: [] for cat in categories.keys()}
    for result in email_results:
        if result["assigned_category"]:  # Check if category was assigned
            results_by_category[result["assigned_category"]].append(result)
    for category_name, items in results_by_category.items():
        print(f"\nCategory: {category_name}")
        print("-" * (10 + len(category_name)))
        if not items:
            print(" (No emails assigned)")
        else:
            # Sort items within category by score if desired
            items.sort(key=lambda x: x['score'], reverse=True)
            for item in items:
                print(f" - ID: {item['id']}, Subject: \"{item['subject']}\" (Score: {item['score']:.3f})")
print("\nEmail triage process complete.")
Code Breakdown Explanation
This example shows how OpenAI embeddings can automatically sort and prioritize emails by understanding their meaning, demonstrating an intelligent email management system.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions.
- Sample Email Data:
  - A list of dictionaries (emails) simulates incoming messages. Each email has an id, subject, and a body_snippet.
- Category Definitions:
  - A dictionary (categories) defines the target categories for triage (e.g., "Urgent Action Required", "Project Update / Status").
  - Key Idea: Each category is represented by a descriptive phrase or list of keywords that captures its semantic essence. This description is what will be embedded.
- Category Embedding Generation:
  - The script iterates through the defined categories.
  - It calls get_embedding on the description associated with each category name.
  - The resulting embedding vector for each category is stored in the category_embeddings dictionary. This step would typically be pre-computed and stored.
- Email Processing Loop:
  - The script iterates through each email in the sample data.
  - Content Combination: It combines the subject and body_snippet into a single email_content string to provide richer context for the embedding.
  - Email Embedding: It calls get_embedding to get the vector representation of the current email's content.
  - Similarity Calculation:
    - It then iterates through the pre-computed category_embeddings.
    - For each category, it calculates the cosine_similarity between the email_embedding and the category_embedding.
    - It keeps track of the best_category (the one with the highest similarity score found so far) and the corresponding max_similarity score.
  - Assignment: After comparing the email to all categories, the email is assigned the best_category found. The result (email ID, subject, assigned category, score) is stored.
- Displaying Triage Results:
  - The script prints the final assignments.
  - Optional Grouping: It includes logic to group the results by the assigned category for a clearer presentation, showing which emails fell into the "Urgent," "Update," etc., buckets.
This example effectively demonstrates how embeddings allow for intelligent categorization based on meaning. An email asking for "approval ASAP" can be correctly identified as "Urgent Action Required" even without using the exact word "urgent," because its embedding will be semantically close to the embedding of the "Urgent Action Required" category description. This is far more robust than simple keyword filtering.
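The script above always assigns the highest-scoring category, even when no category is a good match or when two categories score almost the same. A minimal sketch of one way to handle those cases follows, assuming a fallback bucket for low-confidence results. The function assign_category, the "Needs Manual Review" label, and the min_score and min_margin values are all hypothetical; suitable thresholds depend on the embedding model and the category descriptions and would need tuning on real emails.

```python
def assign_category(similarities, min_score=0.30, min_margin=0.02,
                    fallback="Needs Manual Review"):
    """Pick the best category, or a fallback when the match is weak or ambiguous.

    similarities: dict mapping category name -> cosine similarity score
    min_score:    lowest acceptable score for the winning category (assumed value)
    min_margin:   required gap between the best and second-best score (assumed value)
    Returns a (category, best_score) tuple.
    """
    if not similarities:
        return fallback, 0.0

    # Rank categories from most to least similar.
    ranked = sorted(similarities.items(), key=lambda pair: pair[1], reverse=True)
    best_name, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else float("-inf")

    too_weak = best_score < min_score
    too_close = (best_score - runner_up_score) < min_margin
    if too_weak or too_close:
        return fallback, best_score
    return best_name, best_score


# Made-up similarity scores for three emails (hypothetical data).
clear_case = {"Urgent Action Required": 0.52, "General Info / FYI": 0.21}
weak_case = {"Urgent Action Required": 0.25, "General Info / FYI": 0.10}
ambiguous_case = {"Project Update / Status": 0.41, "General Info / FYI": 0.40}

for name, scores in [("clear", clear_case), ("weak", weak_case), ("ambiguous", ambiguous_case)]:
    category, score = assign_category(scores)
    print(f"{name}: {category} (best score {score:.2f})")
```

In the triage loop, the per-category scores for each email could be collected into a dictionary and passed to a function like this instead of tracking max_similarity directly, so uncertain emails are routed to a person rather than silently misfiled.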
3.2 When to Use Embeddings
Embeddings have revolutionized how we process and understand textual information in modern AI applications. While traditional text processing methods rely on exact matches or basic keyword searching, embeddings provide a sophisticated way to capture the nuanced meanings and relationships between pieces of text. By converting words and phrases into high-dimensional numerical vectors, embeddings enable machines to understand semantic relationships and similarities in ways that more closely mirror human understanding.
Let's explore the key scenarios where embeddings prove particularly valuable, showcasing how this technology transforms various aspects of information processing and retrieval. Understanding these use cases is crucial for developers and organizations looking to leverage the full potential of embedding technology in their applications.
3.2.1 Semantic search
Semantic search finds relevant information based on meaning rather than just keywords, enabling more intelligent search results. Unlike traditional keyword-based search that matches exact words or phrases, semantic search understands the intent and contextual meaning of a query by analyzing the underlying relationships between words and concepts. This approach allows the system to handle variations in language, context, and user intent.
For example, a search for "natural language processing" would also return relevant results about "NLP," "computational linguistics," or "text analysis." When a user searches for "treating common cold symptoms," the system would understand and return results about "flu remedies," "reducing fever," and "cough medicine" - even if these exact phrases aren't used. This technology leverages embedding vectors to calculate similarity scores between queries and documents, transforming each piece of text into a high-dimensional numerical representation that captures its semantic meaning. This mathematical approach enables more nuanced and accurate search results that account for:
- Synonyms and related terms (like "car" and "automobile")
- Conceptual relationships (connecting "python" to both programming and snakes, depending on context)
- Multiple languages (finding relevant content even when written in different languages)
- Contextual variations (understanding that "apple" could refer to either the fruit or the technology company)
- Intent matching (recognizing that "how to fix a flat tire" and "tire repair instructions" are seeking the same information)
Example:
Here is a code example demonstrating semantic search using OpenAI embeddings.
This script will:
- Define a small set of documents.
- Generate embeddings for these documents and a search query.
- Calculate the similarity between the query and each document.
- Rank the documents by relevance based on semantic similarity.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation

# --- Configuration ---
load_dotenv()

# Static run context printed for reference (hardcoded, not read from the system)
current_timestamp = "2025-03-22 15:22:00 CDT"
current_location = "Grapevine, Texas, United States"
print(f"Running Semantic Search example at: {current_timestamp}")
print(f"Location Context: {current_location}")

# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()

# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"

# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings example)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print(f"Generating embedding for: \"{text[:50]}...\"")  # Print truncated text
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        print("Embedding generation successful.")
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{text[:50]}...': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{text[:50]}...': {e}")
        return None

# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings example)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0  # Return 0 if any vector is missing
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Zero-magnitude vectors have no defined direction
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return similarity
# --- Semantic Search Implementation ---
# 1. Define your document store (a list of text strings)
# In a real application, this could come from a database, files, etc.
document_store = [
"The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
"Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy.",
"Artificial intelligence research focuses on creating systems capable of performing tasks that typically require human intelligence.",
"A recipe for classic French onion soup involves caramelizing onions and topping with bread and cheese.",
"Machine learning, a subset of AI, involves algorithms that allow systems to learn from data.",
"The Louvre Museum in Paris is the world's largest art museum and a historic monument.",
"Natural Language Processing (NLP) enables computers to understand and process human language.",
"Baking bread requires careful measurement of ingredients like flour, water, yeast, and salt."
]
print(f"\nDocument store contains {len(document_store)} documents.")
# 2. Generate embeddings for all documents in the store (pre-computation)
# In a real app, you'd store these embeddings alongside the documents.
print("\nGenerating embeddings for the document store...")
document_embeddings = []
for doc in document_store:
    embedding = get_embedding(client, doc)
    # Store the document text and its embedding together
    if embedding: # Only store if embedding was successful
        document_embeddings.append({"text": doc, "embedding": embedding})
    else:
        print(f"Skipping document due to embedding error: \"{doc[:50]}...\"")
print(f"\nSuccessfully generated embeddings for {len(document_embeddings)} documents.")
# 3. Define the user's search query
search_query = "What is AI?"
# search_query = "Things to see in Paris"
# search_query = "How does NLP work?"
# search_query = "Cooking instructions"
print(f"\nSearch Query: \"{search_query}\"")
# 4. Generate embedding for the search query
print("\nGenerating embedding for the search query...")
query_embedding = get_embedding(client, search_query)
# 5. Calculate similarity and rank documents
search_results = []
if query_embedding and document_embeddings:
    print("\nCalculating similarities...")
    for doc_data in document_embeddings:
        similarity = cosine_similarity(query_embedding, doc_data["embedding"])
        search_results.append({"text": doc_data["text"], "score": similarity})
    # Sort results by similarity score in descending order
    search_results.sort(key=lambda x: x["score"], reverse=True)
    # 6. Display results
    print("\n--- Semantic Search Results ---")
    print(f"Top results for query: \"{search_query}\"\n")
    if not search_results:
        print("No results found (or error calculating similarities).")
    else:
        # Display top N results (e.g., top 3)
        top_n = 3
        for i, result in enumerate(search_results[:top_n]):
            print(f"{i+1}. Score: {result['score']:.4f}")
            print(f" Text: {result['text']}")
            print("-" * 10)
        if len(search_results) > top_n:
            print(f"(Showing top {top_n} of {len(search_results)} results)")
else:
    print("\nCould not perform search.")
    if not query_embedding:
        print("Reason: Failed to generate embedding for the search query.")
    if not document_embeddings:
        print("Reason: No document embeddings were successfully generated.")
Code Breakdown Explanation:
- Setup & Helpers: Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions from the previous example.
- Document Store: A simple Python list (document_store) holds the text content of the documents we want to search through. In a real application, this data would likely come from a database or file system.
- Document Embedding Generation:
  - The script iterates through each document in the document_store.
  - It calls get_embedding for each document to get its numerical representation.
  - It stores the original document text and its corresponding embedding vector together (e.g., in a list of dictionaries). This pre-computation step is crucial for efficiency in real systems: you generate document embeddings once and store them. Error handling ensures documents are skipped if embedding fails.
- Search Query: A sample search_query string is defined.
- Query Embedding Generation: The get_embedding function is called again, this time for the search_query.
- Similarity Calculation & Ranking:
  - It checks if both the query embedding and document embeddings were successfully generated.
  - It iterates through the stored document_embeddings.
  - For each document, it calculates the cosine_similarity between the query_embedding and the document's embedding.
  - The document text and its calculated similarity score are stored in a search_results list.
  - Finally, search_results.sort(...) arranges the list based on the score in descending order (highest similarity first).
- Display Results: The script prints the top N (e.g., 3) most relevant documents from the sorted list, showing their similarity score and text content.
This example clearly illustrates the core concept of semantic search: converting both documents and queries into embeddings and then using vector similarity (like cosine similarity) to find documents that are semantically related to the query, even if they don't share the exact keywords.
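Because document embeddings are meant to be generated once and reused, a small on-disk cache is a natural companion to the script above. The following is a minimal sketch under stated assumptions, not part of the original example: the helper names (text_key, load_cache, embed_with_cache), the JSON file layout, and the fake_embed stand-in are all hypothetical, chosen so the sketch runs without an API key. In practice the embed_fn argument could wrap a function such as get_embedding.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path

def text_key(text):
    """Stable cache key for a document: SHA-256 of its UTF-8 bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def load_cache(path):
    """Read the cache file if it exists, otherwise start with an empty cache."""
    p = Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}

def embed_with_cache(texts, embed_fn, path):
    """Return one vector per text, calling embed_fn only for texts not yet cached."""
    cache = load_cache(path)
    vectors = []
    for text in texts:
        key = text_key(text)
        vec = cache.get(key)
        if vec is None:
            vec = embed_fn(text)      # the only place a real API call would happen
            if vec is not None:       # do not cache failed calls
                cache[key] = vec
        vectors.append(vec)
    Path(path).write_text(json.dumps(cache), encoding="utf-8")
    return vectors

# Hypothetical stand-in for a real embedding call, so the sketch runs offline.
calls = []
def fake_embed(text):
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]

cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
docs = ["The Louvre Museum is in Paris.", "Baking bread requires yeast."]
first = embed_with_cache(docs, fake_embed, cache_path)
second = embed_with_cache(docs, fake_embed, cache_path)  # served from the cache
print(f"embed_fn was called {len(calls)} times for {2 * len(docs)} lookups")
```

Keying the cache on a hash of the text means an edited document is re-embedded automatically, while unchanged documents cost nothing on later runs. A production system would more likely keep the vectors in a database or vector store than in a JSON file.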
3.2.2 Topic clustering
Topic clustering is a sophisticated technique for organizing and analyzing large document collections by automatically grouping them based on their semantic content. This advanced application of embeddings transforms the way we process and understand large-scale document collections, offering a powerful solution for content organization. The system works by converting each document into a high-dimensional embedding vector that captures its meaning, then using clustering algorithms to group similar vectors together.
This powerful application of embeddings empowers systems to:
- Identify thematic patterns across thousands of documents without manual labeling - the system can automatically detect common topics and themes across vast document collections, saving countless hours of manual categorization work
- Group similar discussions, articles, or content pieces into intuitive categories - by understanding the semantic relationships between documents, the system can create meaningful groupings that reflect natural topic divisions, even when documents use different terminology to discuss the same concepts
- Discover emerging topics and trends within large document collections - as new content is added, the system can identify new thematic clusters forming, helping organizations stay ahead of developing trends in their field
- Create dynamic content hierarchies that adapt as new documents are added - unlike traditional static categorization systems, embedding-based clustering can automatically reorganize and refine category structures as the content collection grows and evolves
For example, a news organization could use topic clustering to automatically group thousands of articles into categories like "Technology", "Politics", or "Sports", even when these topics aren't explicitly tagged. The embeddings capture the semantic relationships between articles by representing the meaning and context of the content, not just its keywords. This enables more sophisticated grouping that can pick up subtle distinctions. For instance, an article about the economic impact of sports stadiums will sit close to both "Sports" and "Business" content (a hard-assignment algorithm such as K-Means places it in exactly one cluster, while soft-clustering methods can assign it to both), and articles about different programming languages can all land in a "Technology" cluster despite using completely different terminology.
Example:
Below is a code example that demonstrates topic clustering using OpenAI embeddings and the K-means algorithm from scikit-learn.
This code will:
- Define a list of sample documents covering different implicit topics.
- Generate embeddings for each document using OpenAI's API.
- Apply the K-Means clustering algorithm to group the embedding vectors.
- Display the documents belonging to each identified cluster.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np
from sklearn.cluster import KMeans # For clustering algorithm
import datetime
# --- Configuration ---
load_dotenv()
# Get the current date and location context
current_timestamp = "2025-03-23 15:26:00 CDT"
current_location = "Dallas, Texas, United States"
print(f"Running Topic Clustering example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    # Truncate text for printing if it's too long
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        # print("Embedding generation successful.") # Reduce verbosity
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Topic Clustering Implementation ---
# 1. Define your collection of documents
# These documents cover roughly 3 topics: AI/Tech, Travel/Geography, Food/Cooking
documents = [
"Artificial intelligence research focuses on creating systems capable of performing tasks that typically require human intelligence.",
"The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
"A recipe for classic French onion soup involves caramelizing onions and topping with bread and cheese.",
"Machine learning, a subset of AI, involves algorithms that allow systems to learn from data.",
"The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials.",
"Natural Language Processing (NLP) enables computers to understand and process human language.",
"Baking bread requires careful measurement of ingredients like flour, water, yeast, and salt.",
"The Colosseum in Rome, Italy, is an oval amphitheatre in the centre of the city.",
"Deep learning utilizes artificial neural networks with multiple layers to model complex patterns.",
"Sushi is a traditional Japanese dish of prepared vinegared rice, usually with some sugar and salt, accompanying a variety of ingredients, such as seafood, often raw, and vegetables."
]
print(f"\nDocument collection contains {len(documents)} documents.")
# 2. Generate embeddings for all documents
print("\nGenerating embeddings for the document collection...")
embeddings = []
valid_documents = [] # Keep track of documents for which embedding was successful
for doc in documents:
    embedding = get_embedding(client, doc)
    if embedding:
        embeddings.append(embedding)
        valid_documents.append(doc) # Add corresponding document text
    else:
        print(f"Skipping document due to embedding error: \"{doc[:70]}...\"")
if not embeddings:
    print("\nError: No embeddings were generated. Cannot perform clustering.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(valid_documents)} documents.")
# Convert embeddings list to a NumPy array for scikit-learn
embedding_matrix = np.array(embeddings)
# 3. Apply Clustering Algorithm (K-Means)
# We need to choose the number of clusters (k). Let's assume we expect 3 topics.
# In real applications, determining the optimal 'k' often requires experimentation
# (e.g., using the elbow method or silhouette scores).
n_clusters = 3
print(f"\nApplying K-Means clustering with k={n_clusters}...")
try:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10) # n_init suppresses warning
    kmeans.fit(embedding_matrix)
    cluster_labels = kmeans.labels_
    print("Clustering complete.")
except Exception as e:
    print(f"An error occurred during clustering: {e}")
    exit()
# 4. Display Documents by Cluster
print(f"\n--- Documents Grouped by Topic Cluster (k={n_clusters}) ---")
# Create a dictionary to hold documents for each cluster
clustered_documents = {i: [] for i in range(n_clusters)}
# Assign each document (that had a valid embedding) to its cluster
for i, label in enumerate(cluster_labels):
    clustered_documents[label].append(valid_documents[i])
# Print the contents of each cluster
for cluster_id, docs_in_cluster in clustered_documents.items():
    print(f"\nCluster {cluster_id + 1}:")
    if not docs_in_cluster:
        print(" (No documents in this cluster)")
    else:
        for doc_text in docs_in_cluster:
            # Print truncated document text for readability
            print_text = doc_text[:100] + "..." if len(doc_text) > 100 else doc_text
            print(f" - {print_text}")
    print("-" * 20)
print("\nNote: The quality of clustering depends on the data, the embedding model,")
print("and the chosen number of clusters (k). Cluster numbers are arbitrary.")
Code Breakdown Explanation:
- Setup & Helpers:
  - Includes standard imports plus KMeans from sklearn.cluster.
  - Initializes the OpenAI client.
  - Includes the get_embedding helper function (same as before).
- Document Collection: A list named documents holds the text content. The sample documents are chosen to represent a few distinct underlying topics (AI/Tech, Travel/Geography, Food/Cooking).
- Embedding Generation:
  - The script iterates through the documents.
  - It calls get_embedding for each document.
  - It stores the successful embeddings in the embeddings list and the corresponding document text in valid_documents. This ensures that the indices match later.
  - Error handling skips documents if embedding generation fails.
  - The list of embedding vectors is converted into a NumPy array (embedding_matrix), which is the standard input format for scikit-learn algorithms.
- Clustering (K-Means):
  - Choosing k: The number of clusters (n_clusters) is set (here, k=3, assuming we expect three topics based on the sample data). A comment highlights that finding the optimal k is often a separate task in real-world scenarios.
  - Initialization: A KMeans object is created. n_clusters specifies the desired number of groups. random_state ensures reproducibility. n_init=10 runs the algorithm multiple times with different starting centroids and chooses the best result (suppresses a future warning).
  - Fitting: kmeans.fit(embedding_matrix) performs the K-Means clustering algorithm on the document embeddings. It finds cluster centers and assigns each embedding vector to the nearest center.
  - Labels: kmeans.labels_ contains an array where each element indicates the cluster ID (0, 1, 2, etc.) assigned to the corresponding document embedding.
- Displaying Results:
  - A dictionary (clustered_documents) is created to organize the results, with keys representing cluster IDs.
  - The script iterates through the cluster_labels assigned by K-Means. For each document's index i, it finds its assigned label and appends the corresponding text from valid_documents[i] to the list for that cluster ID in the dictionary.
  - Finally, it loops through the clustered_documents dictionary and prints the text of the documents belonging to each cluster, clearly grouping them by the topic cluster identified by the algorithm.
This example demonstrates the power of embeddings for unsupervised topic discovery. By converting text to vectors, we can use mathematical algorithms like K-Means to group semantically similar documents without needing pre-defined labels.
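Since cluster numbers are arbitrary, it often helps to label each cluster with a representative document. One simple heuristic, sketched below as an assumption rather than as part of the original script, is to pick the document whose vector lies closest to the cluster's centroid. The helper names and the two-dimensional toy vectors are hypothetical; with scikit-learn, the fitted kmeans.cluster_centers_ attribute could take the place of the centroid helper.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def representative_docs(docs, vectors, labels):
    """Map each cluster label to the document closest to that cluster's centroid."""
    by_label = {}
    for doc, vec, label in zip(docs, vectors, labels):
        by_label.setdefault(label, []).append((doc, vec))
    result = {}
    for label, members in by_label.items():
        center = centroid([vec for _, vec in members])
        result[label] = min(members, key=lambda m: euclidean(m[1], center))[0]
    return result

# Toy 2-D vectors and cluster labels, invented for illustration only.
docs = [
    "ML learns from data", "Deep learning uses neural nets", "NLP handles language",
    "Onion soup recipe", "Baking bread", "Sushi rice",
]
vectors = [
    [0.9, 0.1], [1.0, 0.0], [0.8, 0.2],
    [0.1, 0.9], [0.2, 0.8], [0.0, 1.0],
]
labels = [0, 0, 0, 1, 1, 1]

representatives = representative_docs(docs, vectors, labels)
for label, doc in sorted(representatives.items()):
    print(f"Cluster {label}: {doc}")
```

The representative document gives a human reader a quick hint about what each cluster contains, which is often enough to assign a readable name such as "AI/Tech" or "Food/Cooking".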
3.2.3 Recommendation Systems
Suggesting related items by understanding the deeper connections between different pieces of content. This powerful application of embeddings enables systems to provide personalized recommendations by analyzing the semantic relationships between items. The embedding vectors capture subtle patterns and similarities that might not be immediately obvious to human observers.
Here's how recommendation systems leverage embeddings:
- Content-Based Filtering
- Systems analyze the actual content characteristics (like text descriptions, features, or attributes)
- Each item is converted into an embedding vector that represents its key features
- Similar items are found by measuring the distance between these vectors
- Collaborative Filtering Enhancement
- User behaviors and preferences are also converted into embeddings
- The system can identify patterns in user-item interactions
- This helps predict which items a user might like based on similar users' preferences
For example, a video streaming service can recommend shows not just based on genre tags, but by understanding thematic elements, storytelling styles, and complex narrative patterns. The embedding vectors can capture nuanced features like:
- Pacing and plot complexity
- Character development styles
- Emotional tone and atmosphere
- Visual and directorial techniques
Similarly, e-commerce platforms can suggest products by understanding the contextual similarities in product descriptions, user behavior, and item characteristics. This includes analyzing:
- Product descriptions and features
- User browsing and purchase patterns
- Price points and quality levels
- Brand relationships and market positioning
This semantic understanding leads to more accurate and relevant recommendations compared to traditional methods that rely solely on explicit categories or user ratings. The system can identify subtle connections and patterns that might be missed by conventional recommendation approaches, resulting in more engaging and personalized user experiences.
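The idea of representing user preferences as embeddings can be sketched very simply: average the vectors of the items a user liked to form a "taste profile", then rank the remaining items by similarity to that profile. This is only one possible approach and an assumption of this sketch, not a description of how any particular platform works. The item IDs, the three-dimensional vectors, and the function names are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def recommend_for_user(item_vectors, liked_ids, top_n=2):
    """Rank items the user has not liked by similarity to their taste profile."""
    profile = mean_vector([item_vectors[i] for i in liked_ids])
    candidates = [
        (item_id, cosine_similarity(profile, vec))
        for item_id, vec in item_vectors.items()
        if item_id not in liked_ids
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_n]

# Toy item vectors; the axes loosely mean (sci-fi, comedy, documentary).
item_vectors = {
    "mov001": [0.9, 0.0, 0.2],
    "mov004": [0.8, 0.1, 0.3],
    "com001": [0.0, 0.9, 0.1],
    "doc002": [0.6, 0.0, 0.8],
    "com002": [0.1, 0.8, 0.0],
}
liked = {"mov001", "mov004"}
recs = recommend_for_user(item_vectors, liked)
for item_id, score in recs:
    print(f"{item_id}: {score:.3f}")
```

A user who liked two sci-fi titles ends up with a profile that sits nearest to the space documentary, ahead of either comedy. Averaging is the crudest way to build a profile; weighting recent or highly rated items more heavily is a common refinement.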
Example:
The following code example demonstrates how OpenAI embeddings can be used to build a simple content-based recommendation system.
This script will:
- Define a small catalog of items (e.g., movie descriptions).
- Generate embeddings for these items.
- Choose a target item.
- Find other items in the catalog that are semantically similar to the target item based on their embeddings.
- Present the most similar items as recommendations.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np # For cosine similarity calculation
import datetime
# --- Configuration ---
load_dotenv()
# Get the current date and location context
current_timestamp = "2025-03-24 15:29:00 CDT"
current_location = "Austin, Texas, United States"
print(f"Running Recommendation System example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    # Truncate text for printing if it's too long
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        # print("Embedding generation successful.") # Reduce verbosity
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0 # Return 0 if any vector is missing
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return similarity
# --- Recommendation System Implementation ---
# 1. Define your item catalog (e.g., movie descriptions)
# In a real application, this would come from a database.
item_catalog = [
{"id": "mov001", "title": "Space Odyssey: The Final Frontier", "description": "A visually stunning sci-fi epic exploring humanity's place in the universe, featuring complex themes and groundbreaking special effects."},
{"id": "mov002", "title": "Galactic Wars: Attack of the Clones", "description": "An action-packed space opera with laser battles, alien creatures, and a classic good versus evil storyline."},
{"id": "com001", "title": "Laugh Riot", "description": "A slapstick comedy about mistaken identities and hilarious mishaps during a weekend getaway."},
{"id": "doc001", "title": "Wonders of the Deep", "description": "An awe-inspiring documentary showcasing the beauty and mystery of marine life in the world's oceans."},
{"id": "mov003", "title": "Cyber City 2077", "description": "A gritty cyberpunk thriller set in a dystopian future, exploring themes of technology, consciousness, and rebellion."},
{"id": "com002", "title": "The Office Party", "description": "A witty ensemble comedy centered around awkward interactions and office politics during an annual holiday celebration."},
{"id": "doc002", "title": "Cosmic Journeys", "description": "A documentary exploring the vastness of space, black holes, distant galaxies, and the search for extraterrestrial life."},
{"id": "mov004", "title": "Interstellar Echoes", "description": "A philosophical science fiction film about astronauts travelling through a wormhole in search of a new home for humanity."}
]
print(f"\nItem catalog contains {len(item_catalog)} items.")
# 2. Generate embeddings for all items in the catalog (pre-computation)
print("\nGenerating embeddings for the item catalog...")
item_embeddings_data = []
for item in item_catalog:
    # Combine title and description for a richer embedding
    text_to_embed = f"{item['title']}: {item['description']}"
    embedding = get_embedding(client, text_to_embed)
    if embedding:
        # Store item ID and its embedding
        item_embeddings_data.append({"id": item["id"], "embedding": embedding})
    else:
        print(f"Skipping item {item['id']} due to embedding error.")
if not item_embeddings_data:
    print("\nError: No embeddings were generated. Cannot provide recommendations.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(item_embeddings_data)} items.")
# 3. Select a target item for which to find recommendations
target_item_id = "mov001" # Let's find movies similar to "Space Odyssey"
print(f"\nFinding recommendations similar to item ID: {target_item_id}")
# Find the embedding for the target item
target_embedding = None
for item_data in item_embeddings_data:
    if item_data["id"] == target_item_id:
        target_embedding = item_data["embedding"]
        break
if target_embedding is None:
    print(f"Error: Could not find the embedding for the target item ID '{target_item_id}'.")
    exit()
# 4. Calculate similarity between the target item and all other items
recommendations = []
print("\nCalculating similarities...")
for item_data in item_embeddings_data:
    # Don't compare the item with itself
    if item_data["id"] == target_item_id:
        continue
    similarity = cosine_similarity(target_embedding, item_data["embedding"])
    recommendations.append({"id": item_data["id"], "score": similarity})
# 5. Sort potential recommendations by similarity score
recommendations.sort(key=lambda x: x["score"], reverse=True)
# 6. Display top N recommendations
print("\n--- Top Recommendations ---")
# Find the original title/description for the target item for context
target_item_info = next((item for item in item_catalog if item["id"] == target_item_id), None)
if target_item_info:
    print(f"Based on: \"{target_item_info['title']}\"\n")
if not recommendations:
    print("No recommendations found (or error calculating similarities).")
else:
    top_n = 3
    print(f"Top {top_n} most similar items:")
    for i, rec in enumerate(recommendations[:top_n]):
        # Find the full item details from the original catalog
        rec_details = next((item for item in item_catalog if item["id"] == rec["id"]), None)
        if rec_details:
            print(f"{i+1}. ID: {rec['id']}, Score: {rec['score']:.4f}")
            print(f" Title: {rec_details['title']}")
            print(f" Description: {rec_details['description'][:100]}...") # Truncate description
            print("-" * 10)
        else:
            print(f"{i+1}. ID: {rec['id']}, Score: {rec['score']:.4f} (Details not found)")
            print("-" * 10)
    if len(recommendations) > top_n:
        print(f"(Showing top {top_n} of {len(recommendations)} potential recommendations)")
Code Breakdown Explanation
This example demonstrates how to build a straightforward content-based recommendation system by combining OpenAI embeddings with cosine similarity calculations.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions previously defined.
- Item Catalog:
  - A list of dictionaries (item_catalog) represents the items available for recommendation (e.g., movies). Each item has an id, title, and description. In a real system, this would likely be loaded from a database.
- Item Embedding Generation:
  - The script iterates through each item in the item_catalog.
  - Content Combination: It combines the title and description into a single string (text_to_embed). This provides richer context to the embedding model than using just the title or description alone.
  - It calls get_embedding for this combined text.
  - It stores the item['id'] and its corresponding embedding vector together in the item_embeddings_data list. This pre-computation step is standard practice for recommendation systems.
- Target Item Selection:
  - A target_item_id variable is set to specify the item for which we want recommendations (e.g., find items similar to mov001).
  - The script retrieves the pre-computed embedding vector for this target_item_id from the item_embeddings_data list.
- Similarity Calculation:
  - It iterates through all the items with embeddings in item_embeddings_data.
  - Exclusion: It explicitly skips the comparison if the current item's ID matches the target_item_id (an item shouldn't recommend itself).
  - For every other item, it calculates the cosine_similarity between the target_embedding and the current item's embedding.
  - It stores the other item's id and its calculated similarity score in a recommendations list.
- Ranking Recommendations:
  - The recommendations list is sorted using recommendations.sort(...) based on the score field in descending order, placing the most similar items at the beginning of the list.
- Displaying Results:
  - The script prints the title of the target item for context.
  - It then iterates through the top N (e.g., 3) items in the sorted recommendations list.
  - For each recommended item ID, it looks up the full details (title, description) from the original item_catalog.
  - It prints the rank, ID, similarity score, title, and a truncated description for each recommended item.
This example effectively shows how embeddings capture semantic meaning, allowing the system to recommend items based on content similarity (e.g., recommending other philosophical sci-fi movies similar to "Space Odyssey") rather than just explicit tags or user history.
3.2.4 Context retrieval for AI assistants
Helping chatbots and AI systems find and use relevant information from large knowledge bases by converting both queries and stored knowledge into embeddings. This process involves several key steps:
First, the system converts all documents in its knowledge base into embedding vectors - numerical representations that capture the semantic meaning of the text. These embeddings are stored in a vector database for quick retrieval.
When an AI assistant receives a question, it converts that query into an embedding vector using the same process. This ensures that both the stored knowledge and the incoming questions are represented in the same mathematical space.
The system then performs a similarity search to find the most relevant information. This search compares the query embedding to all stored embeddings, typically using techniques like cosine similarity or nearest neighbor search. The beauty of this approach is that it can identify semantically similar content even when the exact wording differs significantly.
For example, a query about "laptop won't turn on" might match documentation about "computer power issues" because their embeddings capture the similar underlying meaning. This semantic matching is far more powerful than traditional keyword-based search.
Once relevant information is identified, it can be used to generate more accurate, informed responses. This is particularly powerful for domain-specific applications where the AI needs to access technical documentation, product information, or company policies. The system can handle complex queries by combining multiple pieces of relevant context, ensuring responses are both accurate and comprehensive.
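The last step, turning retrieved passages into context for the assistant, can be sketched as a small prompt-assembly function. This is a minimal illustration under stated assumptions: the build_prompt name, the 0.3 similarity threshold, the 500-character budget, the prompt wording, and the sample passages are all hypothetical choices, and the (similarity, text) pairs are assumed to come from a similarity search like the one described above.

```python
def build_prompt(question, scored_chunks, min_score=0.3, max_chars=500):
    """Assemble a prompt from retrieved chunks.

    scored_chunks is a list of (similarity, text) pairs. Chunks scoring below
    min_score are dropped; the rest are added best-first, skipping any chunk
    that would push the context past max_chars.
    """
    selected, used = [], 0
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    for score, text in ranked:
        if score < min_score:
            break  # everything after this point scores even lower
        if used + len(text) > max_chars:
            continue  # too long for the remaining budget; try shorter chunks
        selected.append(text)
        used += len(text)
    if selected:
        context = "\n".join(f"- {text}" for text in selected)
    else:
        context = "(no relevant context found)"
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical search results: (similarity score, passage text).
retrieved = [
    (0.82, "If the laptop shows no lights, check the power adapter and battery."),
    (0.74, "Hold the power button for 10 seconds to force a hardware reset."),
    (0.12, "Our office is closed on public holidays."),
]
prompt = build_prompt("My laptop won't turn on. What should I try?", retrieved)
print(prompt)
```

The threshold keeps weakly related passages out of the prompt, and the character budget keeps the context within whatever limit the downstream model imposes. Both values would need tuning for a real knowledge base.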
Example:
Below is a code example that demonstrates how AI assistants can retrieve context using OpenAI embeddings, implementing the steps described above.
The script illustrates the essential process of searching a knowledge base to provide relevant context for an AI assistant's responses.
import os
import datetime
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation
# --- Configuration ---
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Grapevine, Texas, United States"
print(f"Running Context Retrieval example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return similarity
# --- Context Retrieval Implementation ---
# 1. Define your Knowledge Base (list of text documents/chunks)
# This represents the information the AI assistant can draw upon.
knowledge_base = [
    {"id": "doc001", "source": "troubleshooting_guide.txt", "content": "If your laptop fails to power on, first check the power adapter connection. Ensure the cable is securely plugged into both the laptop and the wall outlet. Try a different outlet if possible."},
    {"id": "doc002", "source": "troubleshooting_guide.txt", "content": "A blinking power light often indicates a battery issue or a charging problem. Try removing the battery (if removable) and powering on with only the adapter connected."},
    {"id": "doc003", "source": "faq.html", "content": "To reset your password, go to the login page and click the 'Forgot Password' link. Follow the instructions sent to your registered email address."},
    {"id": "doc004", "source": "product_manual.pdf", "content": "The Model X laptop uses a USB-C port for charging. Ensure you are using the correct wattage power adapter (65W minimum recommended)."},
    {"id": "doc005", "source": "troubleshooting_guide.txt", "content": "No display output? Check if the laptop is making any sounds (fan spinning, beeps). Try connecting an external monitor to rule out a screen issue."},
    {"id": "doc006", "source": "support_articles/power_issues.md", "content": "Holding the power button down for 15-30 seconds can perform a hard reset, sometimes resolving power-on failures."},
    {"id": "doc007", "source": "faq.html", "content": "Software updates can be found in the 'System Settings' under the 'Updates' section. Ensure you are connected to the internet."}
]
print(f"\nKnowledge base contains {len(knowledge_base)} documents/chunks.")
# 2. Generate embeddings for the knowledge base (pre-computation)
print("\nGenerating embeddings for the knowledge base...")
kb_embeddings_data = []
for doc in knowledge_base:
    embedding = get_embedding(client, doc["content"])
    if embedding:
        # Store document ID and its embedding
        kb_embeddings_data.append({"id": doc["id"], "embedding": embedding})
    else:
        print(f"Skipping document {doc['id']} due to embedding error.")
if not kb_embeddings_data:
    print("\nError: No embeddings were generated for the knowledge base. Cannot retrieve context.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(kb_embeddings_data)} knowledge base documents.")
# 3. Define the user's query to the AI assistant
user_query = "My computer won't start up."
# user_query = "How do I update the system software?"
# user_query = "Screen is black when I press the power button."
print(f"\nUser Query: \"{user_query}\"")
# 4. Generate embedding for the user query
print("\nGenerating embedding for the user query...")
query_embedding = get_embedding(client, user_query)
# 5. Find relevant documents from the knowledge base using similarity search
retrieved_context = []
if query_embedding and kb_embeddings_data:
    print("\nCalculating similarities to find relevant context...")
    for doc_data in kb_embeddings_data:
        similarity = cosine_similarity(query_embedding, doc_data["embedding"])
        retrieved_context.append({"id": doc_data["id"], "score": similarity})
    # Sort context documents by similarity score in descending order
    retrieved_context.sort(key=lambda x: x["score"], reverse=True)
    # 6. Select Top N relevant documents to use as context
    top_n_context = 3
    print(f"\n--- Top {top_n_context} Relevant Context Documents Found ---")
    if not retrieved_context:
        print("No relevant context found (or error calculating similarities).")
    else:
        final_context_docs = []
        for i, context_item in enumerate(retrieved_context[:top_n_context]):
            # Find the full document details from the original knowledge base
            doc_details = next((doc for doc in knowledge_base if doc["id"] == context_item["id"]), None)
            if doc_details:
                print(f"{i+1}. ID: {context_item['id']}, Score: {context_item['score']:.4f}")
                print(f" Source: {doc_details['source']}")
                print(f" Content: {doc_details['content'][:150]}...")  # Truncate content
                print("-" * 10)
                final_context_docs.append(doc_details['content'])  # Store content for next step
            else:
                print(f"{i+1}. ID: {context_item['id']}, Score: {context_item['score']:.4f} (Details not found)")
                print("-" * 10)
        if len(retrieved_context) > top_n_context:
            print(f"(Showing top {top_n_context} of {len(retrieved_context)} potential context documents)")
        # --- Next Step (Conceptual - Not coded here) ---
        print("\n--- Next Step: Generating AI Assistant Response ---")
        print("The content from the relevant documents above would now be combined")
        print("with the original user query and sent to a model like GPT-4o")
        print("as context to generate an informed and accurate response.")
        print("Example prompt structure for GPT-4o:")
        print("```")
        print("System: You are a helpful AI assistant. Answer the user's question based ONLY on the provided context documents.")
        # Number however many documents were retrieved, so a short list cannot raise an IndexError
        context_preview = "\n".join(
            f"{n}. {text[:50]}..." for n, text in enumerate(final_context_docs, start=1)
        )
        print(f"User: Context Documents:\n{context_preview}\n\nQuestion: {user_query}\n\nAnswer:")
        print("```")
else:
    print("\nCould not retrieve context.")
    if not query_embedding:
        print("Reason: Failed to generate embedding for the user query.")
    if not kb_embeddings_data:
        print("Reason: No knowledge base embeddings were successfully generated.")
Code Breakdown Explanation
This example demonstrates the core mechanism behind context retrieval for AI assistants using embeddings – finding relevant information from a knowledge base to answer a user's query.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions.
- Knowledge Base Definition:
  - A list of dictionaries (knowledge_base) simulates the information store the AI assistant can access. Each dictionary represents a document or chunk of information and includes an id, source (optional metadata), and the actual text content.
- Knowledge Base Embedding Generation:
  - The script iterates through each doc in the knowledge_base.
  - It calls get_embedding on the doc["content"] to get its vector representation.
  - It stores the doc['id'] and its corresponding embedding vector together in kb_embeddings_data. This is the crucial pre-computation step: embeddings for the knowledge base are typically generated offline and stored (often in a specialized vector database) for fast retrieval.
- User Query:
  - A sample user_query string represents the question asked to the AI assistant.
- Query Embedding Generation:
  - The get_embedding function is called for the user_query to get its vector representation in the same embedding space as the knowledge base documents.
- Similarity Search (Context Retrieval):
  - It iterates through all the pre-computed embeddings in kb_embeddings_data.
  - For each knowledge base document, it calculates the cosine_similarity between the query_embedding and the document's embedding.
  - It stores the document's id and its similarity score relative to the query in a retrieved_context list.
- Ranking and Selection:
  - The retrieved_context list is sorted by score in descending order, bringing the most semantically relevant documents to the top.
  - The script selects the top N (e.g., 3) documents from this sorted list. These documents represent the most relevant context found in the knowledge base for the user's query.
- Displaying Retrieved Context:
  - The script prints the details (ID, score, source, content preview) of the top N context documents found.
- Conceptual Next Step (Crucial Explanation):
  - The final print statements explain the purpose of this retrieval process. The content of these final_context_docs would not be the final answer. Instead, they would be combined with the original user_query and passed as context to a large language model like GPT-4o in a subsequent API call.
  - An example prompt structure is shown, illustrating how the retrieved context grounds the AI assistant, enabling it to generate an informed response based on the relevant information found in the knowledge base, rather than relying solely on its general knowledge.
This example effectively demonstrates the retrieval part of Retrieval-Augmented Generation (RAG), showing how embeddings bridge the gap between a user's query and relevant information stored in a knowledge base, enabling more accurate and context-aware AI assistants.
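The generation step that the script leaves conceptual could be prepared along the following lines. This is a minimal sketch: the helper name build_rag_messages and the max_chars_per_doc limit are hypothetical, and the prompt wording simply mirrors the example prompt structure printed by the script.

```python
def build_rag_messages(user_query, context_docs, max_chars_per_doc=500):
    """Assemble chat messages that ground the model in the retrieved context."""
    # Number each retrieved document and trim it to a fixed character budget
    numbered = [
        f"{n}. {text[:max_chars_per_doc]}"
        for n, text in enumerate(context_docs, start=1)
    ]
    context_block = "\n".join(numbered) if numbered else "(no context retrieved)"
    system_msg = (
        "You are a helpful AI assistant. Answer the user's question "
        "based ONLY on the provided context documents."
    )
    user_msg = (
        f"Context Documents:\n{context_block}\n\n"
        f"Question: {user_query}\n\nAnswer:"
    )
    return [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ]

messages = build_rag_messages(
    "My computer won't start up.",
    [
        "If your laptop fails to power on, first check the power adapter connection.",
        "Holding the power button down for 15-30 seconds can perform a hard reset.",
    ],
)
for message in messages:
    print(f"[{message['role']}]\n{message['content']}\n")
```

The returned list has the shape expected by the messages parameter of client.chat.completions.create, so it could be passed to a chat model in a follow-up call.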
3.2.5 Anomaly and similarity detection
Identifying unusual patterns or finding similar items in large datasets by comparing their semantic representations is a fundamental application of embedding technology. This powerful technique transforms raw data into mathematical vectors that capture the essence of their content, enabling sophisticated analysis at scale. Here's how these systems work and their key applications:
- Detect Anomalies
  - Flag unusual transactions or behaviors that deviate from normal patterns. For example, detecting suspicious credit card purchases by comparing them against typical spending patterns.
  - Identify potential security threats or fraud attempts, such as recognizing unusual login patterns or detecting fake accounts based on behavior analysis.
  - Spot data quality issues or outliers in datasets, including incorrect data entries or unusual measurements that might indicate equipment malfunction.
- Find Similarities
  - Group related documents, images, or data points based on semantic meaning. This allows systems to cluster similar content even when the exact wording differs, making it easier to organize large collections of information.
  - Match similar customer inquiries or support tickets, helping customer service teams identify common issues and standardize responses to frequent problems.
  - Identify duplicate or near-duplicate content, which is useful for content management systems that need to maintain data quality and reduce redundancy.
By converting data points into embedding vectors, systems can measure how "different" or "similar" items are to each other using mathematical distance calculations. This process works by mapping each item to a point in a high-dimensional space, where similar items are positioned closer together and dissimilar items are farther apart. This mathematical representation makes it possible to automatically flag unusual patterns or group related items together at scale, enabling both anomaly detection and similarity matching in ways that would be impossible with traditional rule-based systems.
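As a small illustration of the "closer together means more similar" idea, the sketch below flags near-duplicate pairs whose cosine similarity meets a cutoff. The function name find_near_duplicates, the 0.98 threshold, and the toy vectors are all assumptions chosen for the example; a sensible threshold depends on the embedding model and the data.

```python
import numpy as np

def find_near_duplicates(vectors, threshold=0.98):
    """Return (i, j, score) for every pair whose cosine similarity meets the threshold."""
    matrix = np.asarray(vectors, dtype=float)
    # Unit-normalize rows so the matrix product yields pairwise cosine similarities
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sim = unit @ unit.T
    pairs = []
    n = sim.shape[0]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: each pair once, no self-pairs
            if sim[i, j] >= threshold:
                pairs.append((i, j, float(sim[i, j])))
    return pairs

# Toy vectors standing in for real embeddings (hypothetical values)
toy_vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [1.0, 0.0, 0.1],
    [0.0, 0.1, 1.0],  # points in a clearly different direction
]
pairs = find_near_duplicates(toy_vectors, threshold=0.98)
for i, j, score in pairs:
    print(f"items {i} and {j}: {score:.4f}")
```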
Example:
The following code example demonstrates similarity and anomaly detection using OpenAI embeddings.
This script will:
- Define a dataset of text items (e.g., descriptions of transactions or events).
- Generate embeddings for these items.
- Similarity Detection: Find items most similar to a given target item.
- Anomaly Detection: Identify items that are least similar (most anomalous) compared to the rest of the dataset using a simple average similarity approach.
import os
import datetime
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation
# --- Configuration ---
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "Houston, Texas, United States"
print(f"Running Similarity & Anomaly Detection example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        # Clamp the value to handle potential floating point inaccuracies slightly outside [-1, 1]
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return np.clip(similarity, -1.0, 1.0)
# --- Similarity and Anomaly Detection Implementation ---
# 1. Define your dataset (e.g., transaction descriptions, log entries)
# Includes mostly normal items and a couple of potentially anomalous ones.
dataset = [
    {"id": "txn001", "description": "Grocery purchase at Local Supermarket"},
    {"id": "txn002", "description": "Monthly subscription fee for streaming service"},
    {"id": "txn003", "description": "Dinner payment at Italian Restaurant"},
    {"id": "txn004", "description": "Online order for electronics from TechStore"},
    {"id": "txn005", "description": "Fuel purchase at Gas Station"},
    {"id": "txn006", "description": "Purchase of fresh produce and bread"},  # Similar to txn001
    {"id": "txn007", "description": "Payment for movie streaming subscription"},  # Similar to txn002
    {"id": "txn008", "description": "Unusual large wire transfer to overseas account"},  # Potential Anomaly 1
    {"id": "txn009", "description": "Purchase of rare antique collectible vase"},  # Potential Anomaly 2
    {"id": "txn010", "description": "Coffee purchase at Cafe Central"}
]
print(f"\nDataset contains {len(dataset)} items.")
# 2. Generate embeddings for all items in the dataset (pre-computation)
print("\nGenerating embeddings for the dataset...")
dataset_embeddings_data = []
for item in dataset:
    embedding = get_embedding(client, item["description"])
    if embedding:
        # Store item ID, description, and its embedding
        dataset_embeddings_data.append({
            "id": item["id"],
            "description": item["description"],
            "embedding": embedding
        })
    else:
        print(f"Skipping item {item['id']} due to embedding error.")
if not dataset_embeddings_data:
    print("\nError: No embeddings were generated. Cannot perform analysis.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(dataset_embeddings_data)} items.")
# --- Part A: Similarity Detection ---
print("\n--- Part A: Similarity Detection ---")
# Select a target item to find similar items for
target_item_id_similarity = "txn001" # Find items similar to "Grocery purchase..."
print(f"Finding items similar to item ID: {target_item_id_similarity}")
# Find the target item's data
target_item_data = next((item for item in dataset_embeddings_data if item["id"] == target_item_id_similarity), None)
if target_item_data:
    target_embedding = target_item_data["embedding"]
    similar_items = []
    # Calculate similarity between the target and all other items
    for item_data in dataset_embeddings_data:
        if item_data["id"] == target_item_id_similarity:
            continue  # Skip self-comparison
        similarity = cosine_similarity(target_embedding, item_data["embedding"])
        similar_items.append({
            "id": item_data["id"],
            "description": item_data["description"],
            "score": similarity
        })
    # Sort by similarity score
    similar_items.sort(key=lambda x: x["score"], reverse=True)
    # Display top N similar items
    print(f"\nItems most similar to: \"{target_item_data['description']}\"")
    top_n_similar = 2
    for i, item in enumerate(similar_items[:top_n_similar]):
        print(f"{i+1}. ID: {item['id']}, Score: {item['score']:.4f}")
        print(f" Description: {item['description']}")
        print("-" * 10)
else:
    print(f"Error: Could not find data for target item ID '{target_item_id_similarity}'.")
# --- Part B: Anomaly Detection (Simple Approach) ---
print("\n--- Part B: Anomaly Detection (Low Average Similarity) ---")
# Calculate the average similarity of each item to all other items
item_avg_similarities = []
num_items = len(dataset_embeddings_data)
if num_items < 2:
    print("Need at least 2 items with embeddings to calculate average similarities.")
else:
    print("\nCalculating average similarities for anomaly detection...")
    for i in range(num_items):
        current_item = dataset_embeddings_data[i]
        total_similarity = 0
        # Compare current item to all others
        for j in range(num_items):
            if i == j:  # Don't compare item to itself
                continue
            other_item = dataset_embeddings_data[j]
            similarity = cosine_similarity(current_item["embedding"], other_item["embedding"])
            total_similarity += similarity
        # Calculate average similarity (avoid division by zero if only 1 item)
        average_similarity = total_similarity / (num_items - 1) if num_items > 1 else 0
        item_avg_similarities.append({
            "id": current_item["id"],
            "description": current_item["description"],
            "avg_score": average_similarity
        })
        print(f"Item ID {current_item['id']} - Avg Similarity: {average_similarity:.4f}")
    # Sort items by average similarity in ascending order (lowest first = most anomalous)
    item_avg_similarities.sort(key=lambda x: x["avg_score"])
    # Display top N potential anomalies (items least similar to others)
    print("\nPotential Anomalies (Lowest Average Similarity):")
    top_n_anomalies = 3
    for i, item in enumerate(item_avg_similarities[:top_n_anomalies]):
        print(f"{i+1}. ID: {item['id']}, Avg Score: {item['avg_score']:.4f}")
        print(f" Description: {item['description']}")
        print("-" * 10)
    print("\nNote: Low average similarity suggests an item is semantically")
    print("different from the majority of other items in this dataset.")
Code Breakdown Explanation
This example demonstrates using OpenAI embeddings for both finding similar items and detecting potential anomalies within a dataset based on semantic meaning.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions. The cosine_similarity function now includes np.clip to ensure the output is strictly within [-1, 1].
- Dataset Definition:
  - A list of dictionaries (dataset) simulates the data to be analyzed (e.g., transaction descriptions). Each item has an id and a text description. The sample data includes mostly common items and a few conceptually different ones intended as potential anomalies.
- Dataset Embedding Generation:
  - The script iterates through each item in the dataset.
  - It calls get_embedding on the item["description"].
  - It stores the item['id'], item['description'], and its corresponding embedding vector together in dataset_embeddings_data. This pre-computation is essential.
- Part A: Similarity Detection:
  - Target Selection: An item ID (target_item_id_similarity) is chosen to find similar items for.
  - Target Embedding Retrieval: The script finds the pre-computed embedding for the target item.
  - Comparison: It iterates through all other items in dataset_embeddings_data, calculates the cosine_similarity between the target item's embedding and each other item's embedding.
  - Ranking: The results (other item ID, description, similarity score) are stored and then sorted by score in descending order.
  - Display: The top N most similar items are printed.
- Part B: Anomaly Detection (Simple Average Similarity Approach):
  - Concept: This simple method identifies anomalies as items that have the lowest average semantic similarity to all other items in the dataset. An item that is very different conceptually from the rest will likely have low similarity scores when compared to most others.
  - Calculation:
    - The script iterates through each item (current_item) in dataset_embeddings_data.
    - For each current_item, it iterates through all other items in the dataset.
    - It calculates the cosine_similarity between the current_item and every other_item.
    - It sums these similarities and calculates the average similarity for the current_item.
  - Ranking: The items are stored along with their calculated avg_score and then sorted by this score in ascending order (lowest average similarity first).
  - Display: The top N items with the lowest average similarity scores are printed as potential anomalies. A note explains the interpretation.
This example showcases two powerful applications: finding related content (similarity) and identifying outliers (anomaly detection) by leveraging the semantic understanding captured within OpenAI embeddings.
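The nested loops in Part B can also be written as a single matrix operation. The following is a sketch of that idea rather than a drop-in replacement: the function name average_similarity_to_others and the toy vectors are hypothetical, and all vectors are assumed to be non-zero.

```python
import numpy as np

def average_similarity_to_others(vectors):
    """Mean cosine similarity of each row to every other row."""
    matrix = np.asarray(vectors, dtype=float)
    n = matrix.shape[0]
    if n < 2:
        raise ValueError("Need at least 2 vectors to compare.")
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sim = unit @ unit.T          # full pairwise cosine similarity matrix
    np.fill_diagonal(sim, 0.0)   # exclude each item's similarity to itself
    return sim.sum(axis=1) / (n - 1)

# Toy vectors standing in for real embeddings (hypothetical values)
toy_vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [1.0, 0.0, 0.1],
    [0.0, 0.1, 1.0],  # the odd one out
]
avg_scores = average_similarity_to_others(toy_vectors)
most_anomalous = int(np.argmin(avg_scores))
print(f"Lowest average similarity: item {most_anomalous}")
```

The pairwise matrix grows with the square of the number of items, so this approach suits small and medium datasets; larger ones generally need sampling or an index-based method.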
3.2.6 Clustering & Tagging
Automatically organize and label content based on semantic similarity - a powerful technique that uses embedding vectors to understand the true meaning and relationships between different pieces of content. This approach goes far beyond traditional keyword matching, allowing for much more nuanced and accurate content organization.
When content is clustered, similar items naturally group together based on their semantic meaning, even if they use different terminology to express the same concepts. For example, documents about "automotive maintenance" and "car repair" would cluster together despite using different words.
This intelligent organization helps create intuitive navigation systems, improves content discovery, and makes large document collections more manageable by grouping related items together. Some key benefits include:
- Automatic tag generation based on cluster themes
- Dynamic organization that adapts as new content is added
- Improved search relevance through semantic understanding
- Better content discovery through related-item suggestions
The clustering process can be fine-tuned to create either broad categories or more granular subcategories, depending on the specific needs of your content organization system. This flexibility makes it a valuable tool for managing everything from digital libraries to enterprise knowledge bases.
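One common way to decide between broad and granular groupings is to compare candidate cluster counts with a quality metric. The sketch below uses scikit-learn's silhouette score for this; the helper name pick_k_by_silhouette and the synthetic two-dimensional data are assumptions for illustration, and real embedding matrices may not show such a clear optimum.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k_by_silhouette(embedding_matrix, candidate_ks):
    """Fit K-Means for each candidate k and return the k with the highest silhouette score."""
    scores = {}
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(embedding_matrix)
        scores[k] = float(silhouette_score(embedding_matrix, labels))
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic stand-in for an embedding matrix: three tight, well-separated groups
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [0.0, 5.0], [-5.0, -5.0]])
toy_matrix = np.vstack([c + rng.normal(scale=0.3, size=(10, 2)) for c in centers])

best_k, scores = pick_k_by_silhouette(toy_matrix, candidate_ks=[2, 3, 4, 5])
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Selected k: {best_k}")
```

A smaller k yields broader categories and a larger k yields finer subcategories; the silhouette score is only one heuristic, and the final choice often depends on how the categories will be used.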
Example:
Let's examine a code example that demonstrates clustering and tagging using OpenAI embeddings and GPT-4o.
This script will:
- Define a collection of documents.
- Generate embeddings for the documents.
- Cluster the documents using K-Means based on their embeddings.
- For each cluster, use GPT-4o to analyze the documents within it and generate a descriptive tag or label.
- Display the documents grouped by cluster along with their AI-generated tags.
import os
import datetime
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np
from sklearn.cluster import KMeans  # For clustering algorithm
# --- Configuration ---
load_dotenv()
# Get the current date and location context
current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
current_location = "San Antonio, Texas, United States"
print(f"Running Clustering & Tagging example at: {current_timestamp}")
print(f"Location Context: {current_location}")
# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()
# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"
# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None
# --- Helper Function to Generate Cluster Tag using GPT-4o ---
def generate_cluster_tag(client, documents_in_cluster):
    """Uses GPT-4o to suggest a tag/label for a cluster of documents."""
    if not documents_in_cluster:
        return "Empty Cluster"
    # Combine content for context, limiting total length if necessary
    # Using first few hundred chars of each doc might be enough
    max_context_length = 3000  # Limit context to avoid excessive token usage
    context = ""
    for i, doc in enumerate(documents_in_cluster):
        doc_preview = f"Document {i+1}: {doc[:300]}...\n"
        if len(context) + len(doc_preview) > max_context_length:
            break
        context += doc_preview
    if not context:
        return "Error: Could not create context"
    system_prompt = "You are an expert at identifying themes and creating concise labels."
    user_prompt = f"""Based on the following document excerpts from a single cluster, suggest a short, descriptive tag or label (2-5 words) that captures the main theme or topic of this group.
Document Excerpts:
---
{context.strip()}
---
Suggested Tag/Label:
"""
    print(f"\nGenerating tag for cluster with {len(documents_in_cluster)} documents...")
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            max_tokens=20,  # Short response expected
            temperature=0.3  # More deterministic label
        )
        tag = response.choices[0].message.content.strip().replace('"', '')  # Clean up quotes
        print(f"Generated tag: '{tag}'")
        return tag
    except OpenAIError as e:
        print(f"OpenAI API Error generating tag: {e}")
        return "Tagging Error"
    except Exception as e:
        print(f"An unexpected error occurred during tag generation: {e}")
        return "Tagging Error"
# --- Clustering and Tagging Implementation ---
# 1. Define your collection of documents
# Covers topics: Space Exploration, Cooking/Food, Web Development
documents = [
    "NASA launches new probe to study Jupiter's moons.",
    "Recipe for authentic Italian pasta carbonara.",
    "JavaScript frameworks like React and Vue dominate front-end development.",
    "The James Webb Space Telescope captures stunning images of distant galaxies.",
    "Tips for baking the perfect sourdough bread at home.",
    "Understanding asynchronous programming in Node.js.",
    "SpaceX successfully lands its reusable rocket booster after launch.",
    "Exploring the different types of olive oil and their uses in cooking.",
    "CSS Grid vs Flexbox: Choosing the right layout module.",
    "The search for habitable exoplanets continues with new telescope data.",
    "How to make delicious homemade pizza from scratch.",
    "Building RESTful APIs using Express.js and MongoDB."
]
print(f"\nDocument collection contains {len(documents)} documents.")
# 2. Generate embeddings for all documents
print("\nGenerating embeddings for the document collection...")
embeddings = []
valid_documents = []  # Keep track of documents for which embedding was successful
for doc in documents:
    embedding = get_embedding(client, doc)
    if embedding:
        embeddings.append(embedding)
        valid_documents.append(doc)  # Add corresponding document text
    else:
        print(f"Skipping document due to embedding error: \"{doc[:70]}...\"")
if not embeddings:
    print("\nError: No embeddings were generated. Cannot perform clustering.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(valid_documents)} documents.")
# Convert embeddings list to a NumPy array for scikit-learn
embedding_matrix = np.array(embeddings)
# 3. Apply Clustering Algorithm (K-Means)
# Choose the number of clusters (k). We expect 3 topics here.
n_clusters = 3
print(f"\nApplying K-Means clustering with k={n_clusters}...")
try:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    kmeans.fit(embedding_matrix)
    cluster_labels = kmeans.labels_
    print("Clustering complete.")
except Exception as e:
    print(f"An error occurred during clustering: {e}")
    exit()
# 4. Group Documents by Cluster
print("\nGrouping documents by cluster...")
clustered_documents = {i: [] for i in range(n_clusters)}
for i, label in enumerate(cluster_labels):
    clustered_documents[label].append(valid_documents[i])
# 5. Generate Tags for Each Cluster using GPT-4o
print("\nGenerating tags for each cluster...")
cluster_tags = {}
for cluster_id, docs_in_cluster in clustered_documents.items():
    tag = generate_cluster_tag(client, docs_in_cluster)
    cluster_tags[cluster_id] = tag
# 6. Display Documents by Cluster with Generated Tags
print(f"\n--- Documents Grouped by Cluster and Tag (k={n_clusters}) ---")
for cluster_id, docs_in_cluster in clustered_documents.items():
    generated_tag = cluster_tags.get(cluster_id, "Unknown Tag")
    print(f"\nCluster {cluster_id + 1} - Suggested Tag: '{generated_tag}'")
    print("-" * (28 + len(generated_tag)))  # Adjust underline length
    if not docs_in_cluster:
        print(" (No documents in this cluster)")
    else:
        for doc_text in docs_in_cluster:
            print(f" - {doc_text}")  # Print full document text here
print("\nClustering and Tagging process complete.")
Code Breakdown Explanation
This script demonstrates how to automatically group similar documents by their semantic meaning using embeddings, then uses GPT-4o to generate descriptive tags for each group.
- Setup & Helpers:
  - Includes standard imports plus KMeans from sklearn.cluster.
  - Initializes the OpenAI client.
  - Includes the get_embedding helper function.
- New Helper Function: generate_cluster_tag:
  - Purpose: Takes a list of documents belonging to a single cluster and uses GPT-4o to suggest a concise tag summarizing their common theme.
  - Input: The client object and documents_in_cluster (a list of text strings).
  - Context Creation: It concatenates parts of the documents (e.g., first 300 characters) to create a context string for GPT-4o, respecting a maximum length to manage token usage.
  - Prompt Engineering: It constructs a prompt asking GPT-4o to act as an expert theme identifier and suggest a short tag (2-5 words) based on the provided document excerpts.
  - API Call: Uses client.chat.completions.create with model="gpt-4o" and the specialized prompt. A low temperature is used for more focused tag generation.
  - Output: Returns the cleaned-up tag suggested by GPT-4o, or an error message.
- Document Collection: A list named documents holds sample text content covering a few distinct topics (Space, Cooking, Web Development).
- Embedding Generation:
  - The script iterates through the documents, generates an embedding for each using get_embedding, and stores successful embeddings and corresponding text in embeddings and valid_documents.
  - The embeddings are converted to a NumPy array (embedding_matrix).
- Clustering (K-Means):
  - The number of clusters (n_clusters) is set (e.g., k=3).
  - KMeans from scikit-learn is initialized and fitted to the embedding_matrix.
  - kmeans.labels_ provides the cluster assignment for each document.
- Grouping Documents:
  - A dictionary (clustered_documents) is created to store the text of documents belonging to each cluster ID.
- Generating Cluster Tags:
  - The script iterates through the clustered_documents dictionary.
  - For each cluster_id and its list of docs_in_cluster, it calls the generate_cluster_tag helper function.
  - The suggested tag for each cluster is stored in the cluster_tags dictionary.
- Displaying Results:
  - The script iterates through the clusters again.
  - For each cluster, it retrieves the generated tag from cluster_tags.
  - It prints the cluster number, the suggested tag, and then lists the full text of all documents belonging to that cluster.
This example showcases a powerful workflow: using embeddings for unsupervised grouping of content based on meaning (clustering) and then leveraging an LLM like GPT-4o to interpret those groupings and assign meaningful labels (tagging), automating content organization.
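The context-creation step described for generate_cluster_tag (truncating each document and capping the total length) can be sketched on its own. This is a minimal illustration, not the helper's actual code: the function name build_tag_context, the 2000-character cap, and the "- " line prefix are assumptions chosen for the example; only the 300-character excerpt length comes from the description above.

```python
def build_tag_context(documents, per_doc_chars=300, max_total_chars=2000):
    """Join truncated document excerpts into one prompt context string.

    Hypothetical helper: the limits are illustrative defaults, not values
    taken from the original script.
    """
    lines = []
    used = 0
    for doc in documents:
        excerpt = doc.strip()[:per_doc_chars]
        if not excerpt:
            continue  # skip empty or whitespace-only documents
        line = f"- {excerpt}"
        # The extra 1 accounts for the newline joining this line to earlier ones
        cost = len(line) + (1 if lines else 0)
        if used + cost > max_total_chars:
            break  # stop once the character budget would be exceeded
        lines.append(line)
        used += cost
    return "\n".join(lines)


docs = [
    "The Hubble telescope captured a new galaxy image. " * 20,
    "NASA plans a crewed mission to Mars.",
    "   ",
]
context = build_tag_context(docs)
print(context)
```

Capping by characters is only a rough proxy for token usage; a tokenizer-based limit would be more precise if the budget is tight.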
3.2.7 Content Recommendations
Content recommendation systems powered by embeddings represent a significant advancement in personalization technology. By analyzing semantic relationships, these systems can understand the nuanced meaning and context of content in ways that traditional keyword-based systems cannot.
Here's a detailed look at how embedding-based recommendations work:
- Content Analysis:
  - The system generates sophisticated embedding vectors for each piece of content in the database
  - These vectors capture nuanced characteristics like writing style, topic depth, and emotional tone
  - Advanced algorithms analyze patterns across multiple dimensions of content features
- User Preference Modeling:
  - The system tracks detailed interaction patterns including time spent, engagement level, and sharing behavior
  - Historical preferences are weighted and combined to create multi-dimensional user profiles
  - Both explicit feedback (ratings, likes) and implicit signals (scroll depth, repeat visits) are considered
- Contextual Understanding:
  - Real-time factors like device type and location are incorporated into the recommendation algorithm
  - The system identifies patterns in content consumption based on time of day and day of week
  - Current session behavior is analyzed to understand immediate user interests
- Dynamic Adaptation:
  - Machine learning models continuously refine user profiles based on new interactions
  - The system learns from both positive and negative feedback to improve accuracy
  - Recommendation strategies are automatically adjusted based on performance metrics
This sophisticated approach enables recommendation engines to deliver highly personalized experiences through several key capabilities:
- Identify content similarities that might not be apparent through traditional metadata
  - Detects thematic connections between items even when they use different terminology
  - Recognizes similar writing styles, tone, and complexity levels across content
- Understand the progression of user interests over time
  - Tracks how preferences evolve from basic to advanced topics
  - Identifies shifts in user interests across different categories
- Make cross-domain recommendations (e.g., suggesting articles based on watched videos)
  - Connects content across different media types based on semantic relationships
  - Leverages learning from one domain to enhance recommendations in another
- Account for seasonal trends and temporal relevance
  - Adjusts recommendations based on time-sensitive factors like holidays or events
  - Considers current trends and their impact on user interests
The result is a highly personalized experience that can suggest truly relevant videos, articles, or products that match users' interests, both current and evolving. This goes far beyond simple "users who liked X also liked Y" algorithms, creating a more engaging and valuable user experience.
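The "weighted and combined" user profile mentioned under User Preference Modeling can be illustrated with a recency-weighted average of item embeddings. This is a sketch under stated assumptions: build_user_profile, the 30-day half-life, and the tiny 3-dimensional vectors are all hypothetical stand-ins (real embeddings from text-embedding-3-small have far more dimensions), and production systems may combine signals quite differently.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity for plain Python lists; returns 0.0 for zero vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_user_profile(interactions, half_life_days=30.0):
    """Average item embeddings, weighting recent interactions more heavily.

    interactions: list of (embedding, age_in_days) pairs.
    The weight halves every half_life_days (an assumed decay schedule).
    """
    if not interactions:
        return None
    dim = len(interactions[0][0])
    profile = [0.0] * dim
    total_weight = 0.0
    for embedding, age_days in interactions:
        weight = 0.5 ** (age_days / half_life_days)
        for i, value in enumerate(embedding):
            profile[i] += weight * value
        total_weight += weight
    return [value / total_weight for value in profile]


# Toy history: one item read today, one read 30 days ago (made-up vectors)
history = [([1.0, 0.0, 0.0], 0), ([0.0, 1.0, 0.0], 30)]
profile = build_user_profile(history)

# Toy candidate items (made-up vectors)
candidates = {
    "ai_article": [0.9, 0.1, 0.0],
    "python_guide": [0.2, 0.9, 0.0],
    "cooking_post": [0.0, 0.0, 1.0],
}
ranked = sorted(
    ((item_id, cosine_similarity(profile, vec)) for item_id, vec in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for item_id, score in ranked:
    print(f"{item_id}: {score:.3f}")
```

The profile vector lives in the same space as the content embeddings, so the same cosine-similarity ranking used for item-to-item comparison applies to user-to-item comparison.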
Example:
Here's a code example that demonstrates the core concept of content recommendations using embeddings.
This script focuses on finding semantically similar content items based on their embeddings, which is the foundation for the more advanced recommendation features described above.
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation

# --- Configuration ---
load_dotenv()

# Illustrative run context: hard-coded sample values, not read from the system
current_timestamp = "2024-11-30 15:52:00 CDT"
current_location = "Orlando, Florida, United States"
print(f"Running Content Recommendation example at: {current_timestamp}")
print(f"Location Context: {current_location}")

# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()

# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"

# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None

# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return np.clip(similarity, -1.0, 1.0)  # Ensure value is within valid range

# --- Content Recommendation Implementation ---
# 1. Define your Content Catalog (e.g., articles, blog posts)
# In a real application, this would come from a database or CMS.
content_catalog = [
    {"id": "art001", "title": "Introduction to Quantum Computing", "content": "Exploring the basics of qubits, superposition, and entanglement in quantum mechanics and their potential for computation."},
    {"id": "art002", "title": "Healthy Mediterranean Diet Recipes", "content": "Delicious and easy recipes focusing on fresh vegetables, olive oil, fish, and whole grains for a heart-healthy lifestyle."},
    {"id": "art003", "title": "The Future of Artificial Intelligence in Healthcare", "content": "How AI and machine learning are transforming diagnostics, drug discovery, and personalized medicine."},
    {"id": "art004", "title": "Beginner's Guide to Python Programming", "content": "Learn the fundamentals of Python syntax, data types, control flow, and functions to start coding."},
    {"id": "art005", "title": "Understanding Neural Networks and Deep Learning", "content": "An overview of artificial neural networks, backpropagation, and the concepts behind deep learning models."},
    {"id": "art006", "title": "Travel Guide: Hiking the Swiss Alps", "content": "Tips for planning your trip, recommended trails, essential gear, and stunning viewpoints in the Swiss Alps."},
    {"id": "art007", "title": "Mastering the Art of French Pastry", "content": "Techniques for creating classic French desserts like croissants, macarons, and éclairs."},
    {"id": "art008", "title": "Ethical Considerations in AI Development", "content": "Discussing bias, fairness, transparency, and accountability in the development and deployment of artificial intelligence systems."}
]
print(f"\nContent catalog contains {len(content_catalog)} items.")

# 2. Generate embeddings for all content items (pre-computation)
print("\nGenerating embeddings for the content catalog...")
content_embeddings_data = []
for item in content_catalog:
    # Use title and content for embedding
    text_to_embed = f"Title: {item['title']}\nContent: {item['content']}"
    embedding = get_embedding(client, text_to_embed)
    if embedding:
        # Store item ID and its embedding
        content_embeddings_data.append({"id": item["id"], "embedding": embedding})
    else:
        print(f"Skipping item {item['id']} due to embedding error.")

if not content_embeddings_data:
    print("\nError: No embeddings were generated. Cannot provide recommendations.")
    exit()
print(f"\nSuccessfully generated embeddings for {len(content_embeddings_data)} content items.")

# 3. Select a target item (e.g., an article the user just read)
target_item_id = "art003"  # User read "The Future of Artificial Intelligence in Healthcare"
print(f"\nFinding content similar to item ID: {target_item_id}")

# Find the embedding for the target item
target_embedding = None
for item_data in content_embeddings_data:
    if item_data["id"] == target_item_id:
        target_embedding = item_data["embedding"]
        break
if target_embedding is None:
    print(f"Error: Could not find the embedding for the target item ID '{target_item_id}'.")
    exit()

# 4. Calculate similarity between the target item and all other items
recommendations = []
print("\nCalculating similarities...")
for item_data in content_embeddings_data:
    # Don't recommend the item itself
    if item_data["id"] == target_item_id:
        continue
    similarity = cosine_similarity(target_embedding, item_data["embedding"])
    recommendations.append({"id": item_data["id"], "score": similarity})

# 5. Sort potential recommendations by similarity score
recommendations.sort(key=lambda x: x["score"], reverse=True)

# 6. Display top N recommendations
print("\n--- Top Content Recommendations ---")
# Find the original title for the target item for context
target_item_info = next((item for item in content_catalog if item["id"] == target_item_id), None)
if target_item_info:
    print(f"Because you read: \"{target_item_info['title']}\"\n")

if not recommendations:
    print("No recommendations found (or error calculating similarities).")
else:
    top_n = 3
    print(f"Top {top_n} recommended items:")
    for i, rec in enumerate(recommendations[:top_n]):
        # Find the full item details from the original catalog
        rec_details = next((item for item in content_catalog if item["id"] == rec["id"]), None)
        if rec_details:
            print(f"{i+1}. ID: {rec['id']}, Similarity Score: {rec['score']:.4f}")
            print(f"   Title: {rec_details['title']}")
            print(f"   Content Snippet: {rec_details['content'][:100]}...")  # Truncate content
            print("-" * 10)
        else:
            print(f"{i+1}. ID: {rec['id']}, Score: {rec['score']:.4f} (Details not found)")
            print("-" * 10)
    if len(recommendations) > top_n:
        print(f"(Showing top {top_n} of {len(recommendations)} potential recommendations)")

print("\nNote: This demonstrates basic content-to-content similarity.")
print("Advanced systems incorporate user profiles, interaction history, context, etc.")
Code Breakdown Explanation
This script demonstrates a fundamental approach to content recommendation using OpenAI embeddings, focusing on finding items semantically similar to a target item.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions.
- Content Catalog:
  - A list of dictionaries (content_catalog) simulates the available content (e.g., articles). Each item has an id, title, and content.
- Content Embedding Generation (Pre-computation):
  - The script iterates through each item in the content_catalog.
  - Combined Text: It creates a combined text string from the item's title and content to generate a richer embedding that captures more semantic detail.
  - It calls get_embedding for this combined text.
  - It stores the item['id'] and its embedding vector in content_embeddings_data. This pre-computation is vital for efficiency.
- Target Item Selection:
  - A target_item_id is chosen (e.g., art003), simulating an item the user has interacted with (e.g., read).
  - The script retrieves the pre-computed embedding for this target item.
- Similarity Calculation:
  - It iterates through all other items in content_embeddings_data.
  - It calculates the cosine_similarity between the target_embedding and each other item's embedding.
  - It stores the other item's id and its similarity score in the recommendations list.
- Ranking Recommendations:
  - The recommendations list is sorted by score in descending order, placing the most semantically similar content items first.
- Displaying Results:
  - The script prints the title of the target item for context ("Because you read...").
  - It displays the top N (e.g., 3) recommended items, showing their ID, similarity score, title, and a snippet of their content.
- Contextual Note: The final print statements explicitly mention that this example shows basic content-to-content similarity. Advanced recommendation systems, as described in the section text, would integrate user profiles (embeddings based on interaction history), real-time context (time, location), explicit feedback, and potentially more complex algorithms beyond simple cosine similarity. However, the core principle of using embeddings to measure semantic relatedness remains fundamental.
This example effectively illustrates how embeddings enable recommendations based on understanding the meaning of content, allowing suggestions that go beyond simple keyword or category matching.
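Because the breakdown stresses that pre-computing embeddings matters for efficiency, one way to avoid re-embedding unchanged content between runs is a small on-disk cache keyed by a hash of the text. The sketch below is an assumption-laden illustration: the EmbeddingCache class, the JSON file layout, and the fake_embed stand-in are all hypothetical; in the script above, the embedding function would instead wrap get_embedding(client, text).

```python
import hashlib
import json
import os
import tempfile


class EmbeddingCache:
    """Hypothetical JSON-backed cache mapping text hashes to embedding vectors."""

    def __init__(self, path, embed_fn):
        self.path = path
        self.embed_fn = embed_fn  # callable: text -> list of floats, or None on failure
        self.store = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.store = json.load(f)

    @staticmethod
    def _key(text):
        # Hash the exact text so any edit to the content invalidates the entry
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text):
        key = self._key(text)
        if key not in self.store:
            vector = self.embed_fn(text)
            if vector is None:
                return None  # do not cache failures
            self.store[key] = vector
        return self.store[key]

    def save(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.store, f)


# Stand-in for a real embedding call; records how often it is invoked
calls = []

def fake_embed(text):
    calls.append(text)
    return [float(len(text)), 1.0]


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
cache = EmbeddingCache(cache_path, fake_embed)
text = "Title: Intro\nContent: Basics"
first = cache.get(text)
second = cache.get(text)   # served from memory, no second call
cache.save()

reloaded = EmbeddingCache(cache_path, fake_embed)
third = reloaded.get(text)  # served from disk, still no new call
print(f"Embedding function called {len(calls)} time(s)")
```

A flat JSON file is enough for a small catalog; larger catalogs would more plausibly keep vectors in a database or a dedicated vector store.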
3.2.8 Email Triage / Prioritization
Embedding technology enables sophisticated email analysis and categorization by understanding the semantic meaning of messages. This advanced system employs multiple layers of analysis to streamline email management:
- Urgency Detection
  - Identify time-sensitive matters requiring immediate attention through natural language processing
  - Recognize urgent language patterns and contextual cues by analyzing word choice, sentence structure, and historical patterns
  - Flag critical emails based on sender importance, keywords, and organizational hierarchy
- Smart Categorization
  - Group related email threads and conversations using semantic similarity matching
  - Sort messages by project, department, or business function through content analysis
  - Create dynamic folders based on emerging topics and trends
  - Apply machine learning to improve categorization accuracy over time
- Intent Classification
  - Distinguish between requests, updates, and FYI messages using advanced natural language understanding
  - Prioritize action items and delegate tasks automatically based on content and context
  - Identify follow-up requirements and set automated reminders
  - Extract key deadlines and commitments from message content
By leveraging semantic understanding, the system creates an intelligent email processing pipeline that can handle hundreds of messages simultaneously. The embedding-based analysis examines not just keywords, but the actual meaning and context of each message, considering factors such as:
- Message context within ongoing conversations
- Historical patterns of communication
- Organizational relationships and hierarchies
- Project timelines and priorities
This comprehensive approach significantly reduces the cognitive load of email management by automatically handling routine classification and prioritization tasks. The system ensures that important messages receive immediate attention while maintaining an organized structure for all communications. As a result, professionals can focus on high-value activities instead of spending hours manually sorting through their inbox, leading to improved productivity and faster response times for critical communications.
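The "group related email threads using semantic similarity matching" idea listed under Smart Categorization can be sketched with a simple greedy pass over message embeddings. This is one possible approach, not a method the text prescribes: group_by_similarity, the 0.8 threshold, and the 2-dimensional toy vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity for plain Python lists; returns 0.0 for zero vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_similarity(items, threshold=0.8):
    """Greedy grouping: each item joins the first group whose seed it resembles.

    items: list of (item_id, embedding) pairs.
    The first member of each group acts as its seed (an assumed simplification).
    """
    groups = []  # each entry: {"seed": embedding, "ids": [item ids]}
    for item_id, vector in items:
        for group in groups:
            if cosine_similarity(vector, group["seed"]) >= threshold:
                group["ids"].append(item_id)
                break
        else:
            # No existing group was similar enough, so start a new one
            groups.append({"seed": vector, "ids": [item_id]})
    return [group["ids"] for group in groups]


# Made-up 2-D vectors standing in for real email embeddings
emails = [
    ("email01", [1.0, 0.1]),
    ("email02", [0.1, 1.0]),
    ("email03", [0.9, 0.2]),
    ("email04", [0.0, 1.0]),
]
groups = group_by_similarity(emails)
print(groups)
```

The result depends on processing order and on the threshold, so a proper clustering algorithm is a better fit when groups need to be stable.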
Example:
This script simulates categorizing incoming emails based on their semantic similarity to predefined categories such as "Urgent Action Required," "Project Update / Status," "Question / Request for Info," and "General Info / FYI."
import os
from openai import OpenAI, OpenAIError
from dotenv import load_dotenv
import numpy as np  # For cosine similarity calculation

# --- Configuration ---
load_dotenv()

# Illustrative run context: hard-coded sample values, not read from the system
current_timestamp = "2024-10-31 15:54:00 CDT"
current_location = "Plano, Texas, United States"
print(f"Running Email Triage/Prioritization example at: {current_timestamp}")
print(f"Location Context: {current_location}")

# Initialize the OpenAI client
try:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not found in environment variables.")
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized.")
except ValueError as e:
    print(f"Configuration Error: {e}")
    exit()
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    exit()

# Define the embedding model
EMBEDDING_MODEL = "text-embedding-3-small"

# --- Helper Function to Generate Embedding ---
# (Same as in previous embeddings examples)
def get_embedding(client, text, model=EMBEDDING_MODEL):
    """Generates an embedding for the given text using the specified model."""
    print_text = text[:70] + "..." if len(text) > 70 else text
    print(f"Generating embedding for: \"{print_text}\"")
    try:
        response = client.embeddings.create(
            input=text,
            model=model
        )
        embedding_vector = response.data[0].embedding
        return embedding_vector
    except OpenAIError as e:
        print(f"OpenAI API Error generating embedding for text '{print_text}': {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred during embedding generation for text '{print_text}': {e}")
        return None

# --- Helper Function for Cosine Similarity ---
# (Same as in previous embeddings examples)
def cosine_similarity(vec_a, vec_b):
    """Calculates the cosine similarity between two vectors."""
    if vec_a is None or vec_b is None:
        return 0.0
    vec_a = np.array(vec_a)
    vec_b = np.array(vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    else:
        similarity = np.dot(vec_a, vec_b) / (norm_a * norm_b)
        return np.clip(similarity, -1.0, 1.0)  # Ensure value is within valid range

# --- Email Triage/Prioritization Implementation ---
# 1. Define Sample Emails (Subject + Snippet)
emails = [
    {"id": "email01", "subject": "Urgent: Server Down!", "body_snippet": "The main production server seems to be unresponsive. We need immediate assistance to investigate and bring it back online."},
    {"id": "email02", "subject": "Meeting Minutes - Project Phoenix Sync", "body_snippet": "Attached are the minutes from today's sync call. Key decisions included finalizing the Q3 roadmap. Action items assigned."},
    {"id": "email03", "subject": "Quick Question about Report", "body_snippet": "Hi team, just had a quick question regarding the methodology used in the latest market analysis report. Can someone clarify?"},
    {"id": "email04", "subject": "Fwd: Company Newsletter - April Edition", "body_snippet": "Sharing the latest company newsletter for your information."},
    {"id": "email05", "subject": "Action Required: Submit Timesheet by EOD", "body_snippet": "Friendly reminder to please submit your weekly timesheet by the end of the day today. This is mandatory."},
    {"id": "email06", "subject": "Update on Q2 Marketing Campaign", "body_snippet": "Just wanted to provide a brief update on the campaign performance metrics we discussed last week. See attached summary."},
    {"id": "email07", "subject": "Can you approve this request ASAP?", "body_snippet": "Need your approval on the attached budget request urgently to proceed with the vendor contract."}
]
print(f"\nProcessing {len(emails)} emails.")

# 2. Define Categories/Priorities and their Semantic Representations
# We represent each category with a descriptive phrase.
categories = {
    "Urgent Action Required": "Requires immediate attention, critical issue, deadline, ASAP request, mandatory task.",
    "Project Update / Status": "Information about ongoing projects, progress reports, meeting minutes, status updates.",
    "Question / Request for Info": "Asking for clarification, seeking information, query about details.",
    "General Info / FYI": "Newsletter, announcement, sharing information, non-actionable update."
}
print(f"\nDefined categories: {list(categories.keys())}")

# 3. Generate embeddings for Categories (pre-computation recommended)
print("\nGenerating embeddings for categories...")
category_embeddings = {}
for category_name, category_description in categories.items():
    embedding = get_embedding(client, category_description)
    if embedding:
        category_embeddings[category_name] = embedding
    else:
        print(f"Skipping category '{category_name}' due to embedding error.")

if not category_embeddings:
    print("\nError: No embeddings generated for categories. Cannot triage emails.")
    exit()

# 4. Process Each Email: Generate Embedding and Find Best Category
print("\nTriaging emails...")
email_results = []
for email in emails:
    # Combine subject and body for better context
    email_content = f"Subject: {email['subject']}\nBody: {email['body_snippet']}"
    email_embedding = get_embedding(client, email_content)
    if not email_embedding:
        print(f"Skipping email {email['id']} due to embedding error.")
        continue

    # Find the category with the highest similarity
    best_category = None
    max_similarity = -1  # Cosine similarity ranges from -1 to 1
    for category_name, category_embedding in category_embeddings.items():
        similarity = cosine_similarity(email_embedding, category_embedding)
        print(f"  Email {email['id']} vs Category '{category_name}': Score {similarity:.4f}")
        if similarity > max_similarity:
            max_similarity = similarity
            best_category = category_name

    email_results.append({
        "id": email["id"],
        "subject": email["subject"],
        "assigned_category": best_category,
        "score": max_similarity
    })
    print(f"-> Email {email['id']} assigned to: '{best_category}' (Score: {max_similarity:.4f})")

# 5. Display Triage Results
print("\n--- Email Triage Results ---")
if not email_results:
    print("No emails were successfully triaged.")
else:
    # Optional: Group by category for display
    results_by_category = {cat: [] for cat in categories.keys()}
    for result in email_results:
        if result["assigned_category"]:  # Check if category was assigned
            results_by_category[result["assigned_category"]].append(result)

    for category_name, items in results_by_category.items():
        print(f"\nCategory: {category_name}")
        print("-" * (10 + len(category_name)))
        if not items:
            print("  (No emails assigned)")
        else:
            # Sort items within category by score if desired
            items.sort(key=lambda x: x['score'], reverse=True)
            for item in items:
                print(f"  - ID: {item['id']}, Subject: \"{item['subject']}\" (Score: {item['score']:.3f})")

print("\nEmail triage process complete.")
Code Breakdown Explanation
This example shows how OpenAI embeddings can automatically sort and prioritize emails by understanding their meaning, demonstrating an intelligent email management system.
- Setup & Helpers:
  - Includes standard imports (openai, os, dotenv, numpy), client initialization, and the get_embedding and cosine_similarity helper functions.
- Sample Email Data:
  - A list of dictionaries (emails) simulates incoming messages. Each email has an id, subject, and a body_snippet.
- Category Definitions:
  - A dictionary (categories) defines the target categories for triage (e.g., "Urgent Action Required", "Project Update / Status").
  - Key Idea: Each category is represented by a descriptive phrase or list of keywords that captures its semantic essence. This description is what will be embedded.
- Category Embedding Generation:
  - The script iterates through the defined categories.
  - It calls get_embedding on the description associated with each category name.
  - The resulting embedding vector for each category is stored in the category_embeddings dictionary. This step would typically be pre-computed and stored.
- Email Processing Loop:
  - The script iterates through each email in the sample data.
  - Content Combination: It combines the subject and body_snippet into a single email_content string to provide richer context for the embedding.
  - Email Embedding: It calls get_embedding to get the vector representation of the current email's content.
  - Similarity Calculation:
    - It then iterates through the pre-computed category_embeddings.
    - For each category, it calculates the cosine_similarity between the email_embedding and the category_embedding.
    - It keeps track of the best_category (the one with the highest similarity score found so far) and the corresponding max_similarity score.
  - Assignment: After comparing the email to all categories, the email is assigned the best_category found. The result (email ID, subject, assigned category, score) is stored.
- Displaying Triage Results:
  - The script prints the final assignments.
  - Optional Grouping: It includes logic to group the results by the assigned category for a clearer presentation, showing which emails fell into the "Urgent," "Update," etc., buckets.
This example effectively demonstrates how embeddings allow for intelligent categorization based on meaning. An email asking for "approval ASAP" can be correctly identified as "Urgent Action Required" even without using the exact word "urgent," because its embedding will be semantically close to the embedding of the "Urgent Action Required" category description. This is far more robust than simple keyword filtering.
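The script always assigns the highest-scoring category, even when that score is low or nearly tied with the runner-up. A possible refinement, sketched here as an assumption rather than part of the original script, is to route weak or ambiguous matches to a review bucket. The function name assign_category, the "Needs Review" label, the thresholds, and the scores below are all hypothetical and would need tuning against real data.

```python
def assign_category(scores, min_score=0.30, min_margin=0.02, fallback="Needs Review"):
    """Pick the best-scoring category, or the fallback when the match is weak or ambiguous.

    scores: dict mapping category name -> cosine similarity for one email.
    min_score and min_margin are illustrative thresholds, not recommended values.
    """
    if not scores:
        return fallback, 0.0
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    best_name, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else float("-inf")
    too_weak = best_score < min_score
    too_close = best_score - runner_up_score < min_margin
    if too_weak or too_close:
        return fallback, best_score
    return best_name, best_score


# Made-up similarity scores for three emails
clear = {"Urgent Action Required": 0.52, "Project Update / Status": 0.31, "General Info / FYI": 0.28}
ambiguous = {"Project Update / Status": 0.400, "General Info / FYI": 0.395}
weak = {"Urgent Action Required": 0.10, "General Info / FYI": 0.05}

for scores in (clear, ambiguous, weak):
    print(assign_category(scores))
```

In the triage loop, the per-category similarities could be collected into such a dictionary before deciding, instead of tracking only best_category and max_similarity.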