Chapter 6: Practical Machine Learning Projects
6.3 Project 3: Customer Segmentation Using K-Means Clustering
In this project, we'll dive deep into the world of unsupervised learning to segment customers based on their purchasing behavior. Customer segmentation is a crucial technique in marketing and business strategy, allowing companies to tailor their approaches to different customer groups effectively.
Why is customer segmentation important?
- Personalized Marketing: Tailor marketing strategies to specific customer groups.
- Product Development: Identify needs of different customer segments.
- Customer Retention: Focus efforts on high-value customer segments.
- Resource Allocation: Optimize resource distribution across customer groups.
In this project, we will:
- Load and explore a customer dataset
- Preprocess and prepare the data for clustering
- Apply K-Means clustering to segment customers
- Visualize and interpret the resulting clusters
- Evaluate the clustering performance
- Discuss potential improvements and future work
6.3.1 Load and Explore the Dataset
We'll begin our analysis by importing the customer dataset into our working environment. This initial step is crucial as it sets the foundation for our entire project. Once the data is loaded, we'll conduct a comprehensive exploration to gain insights into its structure, features, and overall characteristics.
This exploratory phase is essential for understanding the nature of our data, identifying potential patterns or anomalies, and informing our subsequent analytical decisions. By thoroughly examining the dataset's composition, we'll be better equipped to choose appropriate preprocessing techniques and analytical methods in the later stages of our project.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Load the customer dataset
url = 'https://example.com/customer_data.csv' # Replace with the actual URL or file path; the column names below match the popular "Mall Customers" dataset
customer_df = pd.read_csv(url)
# Display the first few rows of the dataset
print(customer_df.head())
# Display basic information about the dataset
print(customer_df.info())
# Summary statistics
print(customer_df.describe())
# Check for missing values
print(customer_df.isnull().sum())
# Visualize the distribution of annual income and spending score
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(customer_df['Annual Income (k$)'], kde=True)
plt.title('Distribution of Annual Income')
plt.subplot(1, 2, 2)
sns.histplot(customer_df['Spending Score (1-100)'], kde=True)
plt.title('Distribution of Spending Score')
plt.tight_layout()
plt.show()
# Scatter plot of Annual Income vs Spending Score
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=customer_df)
plt.title('Customer Distribution: Annual Income vs Spending Score')
plt.show()
Here's a breakdown of its main components:
- Importing libraries: The script imports necessary Python libraries for data manipulation (pandas), visualization (matplotlib, seaborn), and machine learning (sklearn).
- Loading data: It loads a customer dataset from a CSV file using pandas.
- Data exploration: The code displays the first few rows of the dataset, basic information about the dataset, summary statistics, and checks for missing values.
- Data visualization: It creates two types of visualizations:
- Histograms: Shows the distribution of annual income and spending score.
- Scatter plot: Displays the relationship between annual income and spending score.
This code forms the initial data exploration phase of our customer segmentation project: it loads the dataset, displays basic information, checks for missing values, and creates visualizations that help us understand the distribution of our key features before we proceed to clustering.
6.3.2 Data Preprocessing
Before we can apply the K-Means clustering algorithm, it's crucial to properly prepare our dataset. This preparatory phase, known as data preprocessing, involves several important steps to ensure our data is in the optimal format for analysis. First, we need to address any missing values in our dataset, as these can significantly impact our results.
This might involve either removing rows with missing data or using various imputation techniques to fill in the gaps. Next, we'll carefully select the most relevant features for our clustering analysis, focusing on those that are most likely to reveal meaningful patterns in customer behavior.
Finally, we'll scale our data to ensure all features are on a comparable scale, which is particularly important for distance-based algorithms like K-Means. This scaling process helps prevent features with larger magnitudes from dominating the clustering results, allowing for a more balanced and accurate analysis of our customer segments.
# Select relevant features for clustering
features = ['Annual Income (k$)', 'Spending Score (1-100)']
# Check for missing values in selected features
print(customer_df[features].isnull().sum())
# If there are missing values, we can either drop them or impute them
# For this example, we'll drop any rows with missing values
customer_df_clean = customer_df.dropna(subset=features).copy()  # .copy() avoids a pandas SettingWithCopyWarning when we add columns later
# Scale the features
scaler = StandardScaler()
customer_df_scaled = scaler.fit_transform(customer_df_clean[features])
# Convert scaled features back to a DataFrame for easier handling
customer_df_scaled = pd.DataFrame(customer_df_scaled, columns=features)
print("Scaled data:")
print(customer_df_scaled.head())
# Visualize the scaled data
plt.figure(figsize=(10, 6))
sns.scatterplot(x=features[0], y=features[1], data=customer_df_scaled)
plt.title('Scaled Customer Distribution: Annual Income vs Spending Score')
plt.show()
Here's a breakdown of what the code does:
- Feature selection: It selects two relevant features for clustering: 'Annual Income (k$)' and 'Spending Score (1-100)'.
- Handling missing values: It checks for missing values in the selected features and removes any rows with missing data.
- Data scaling: It uses StandardScaler to scale the features, which is crucial for K-means clustering as it ensures all features contribute equally to the distance calculations.
- Data conversion: The scaled data is converted back to a DataFrame for easier handling.
- Visualization: It creates a scatter plot of the scaled data to visualize the distribution of customers based on their annual income and spending score.
This preprocessing step prepares the data for the K-Means algorithm: we've selected our relevant features, handled any missing values, and scaled the data with StandardScaler so that all features contribute equally to the distance calculations.
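In the code above we simply dropped rows with missing values. If we preferred to keep every customer, a minimal alternative sketch using scikit-learn's SimpleImputer (assuming the same features list defined above) could look like this:
from sklearn.impute import SimpleImputer

# Fill missing values with the column median instead of dropping rows
# (the median is more robust to outliers than the mean)
imputer = SimpleImputer(strategy='median')
customer_df_imputed = customer_df.copy()
customer_df_imputed[features] = imputer.fit_transform(customer_df_imputed[features])

# Verify that no missing values remain in the selected features
print(customer_df_imputed[features].isnull().sum())
Whichever route we take, the resulting DataFrame would then be scaled exactly as shown above.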
6.3.3 Apply K-Means Clustering
With our data now properly preprocessed, we're ready to apply the K-Means clustering algorithm to our customer dataset. This powerful unsupervised learning technique will help us identify distinct groups within our customer base. To ensure we're using the optimal number of clusters for our analysis, we'll employ the elbow method.
This approach involves running the K-Means algorithm with different numbers of clusters and plotting the resulting inertia (within-cluster sum of squares) against the number of clusters. The "elbow" in this plot - where the rate of decrease in inertia starts to level off - will indicate the ideal number of clusters for our dataset.
# Elbow Method to find the optimal number of clusters
inertias = []
k_range = range(1, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(customer_df_scaled)
    inertias.append(kmeans.inertia_)
# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertias, 'bx-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('The Elbow Method showing the optimal k')
plt.show()
# Based on the elbow curve, let's choose the optimal number of clusters
optimal_k = 5 # This should be determined from the elbow curve
# Apply K-Means with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
customer_df_clean['Cluster'] = kmeans.fit_predict(customer_df_scaled)
# Visualize the clusters
plt.figure(figsize=(12, 8))
scatter = plt.scatter(customer_df_clean['Annual Income (k$)'],
                      customer_df_clean['Spending Score (1-100)'],
                      c=customer_df_clean['Cluster'],
                      cmap='viridis')
plt.colorbar(scatter)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segments')
plt.show()
Here's a breakdown of its main components:
- Elbow Method: This technique is used to determine the optimal number of clusters. It involves:
- Running K-Means with different numbers of clusters (1 to 10)
- Calculating the inertia (within-cluster sum of squares) for each
- Plotting the inertia against the number of clusters
- The "elbow" in this plot indicates the ideal number of clusters
- Applying K-Means: Once the optimal number of clusters is determined (set to 5 in this example), the algorithm is applied to the scaled customer data.
- Visualization: The resulting clusters are visualized in a scatter plot, with:
- Annual Income on the x-axis
- Spending Score on the y-axis
- Different colors representing different clusters
This process helps identify distinct groups of customers based on their income and spending behavior, which can be used for targeted marketing strategies.
In this step, we've used the elbow method to determine the optimal number of clusters, applied K-Means clustering with this optimal number, and visualized the resulting clusters.
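Reading the elbow off the plot is inherently subjective. As a programmatic cross-check, one common heuristic normalizes both axes and picks the k whose point lies farthest from the straight line joining the first and last points of the inertia curve. Here is a minimal sketch that reuses the k_range and inertias computed above:
# Heuristic elbow detection: normalize both axes to [0, 1], then find the k
# whose point is farthest from the line joining the curve's two endpoints
ks = np.array(list(k_range), dtype=float)
ys = np.array(inertias, dtype=float)
ks_n = (ks - ks.min()) / (ks.max() - ks.min())
ys_n = (ys - ys.min()) / (ys.max() - ys.min())

# Unit vector along the line from the first to the last point
line_vec = np.array([ks_n[-1] - ks_n[0], ys_n[-1] - ys_n[0]])
line_vec /= np.linalg.norm(line_vec)

# Perpendicular distance of each point from that line
points = np.column_stack([ks_n - ks_n[0], ys_n - ys_n[0]])
projections = np.outer(points @ line_vec, line_vec)
distances = np.linalg.norm(points - projections, axis=1)

elbow_k = int(ks[np.argmax(distances)])
print(f"Heuristic elbow estimate: k = {elbow_k}")
This estimate should agree with a visual reading of the elbow curve; if it doesn't, the curve probably has no sharp elbow, which is itself a hint that the segments overlap.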
6.3.4 Interpret the Clusters
Now that we have successfully applied K-Means clustering to our customer dataset, it's time to delve deeper into the results and extract meaningful insights. Let's carefully examine and interpret the clusters we've identified to gain a comprehensive understanding of our customer segments.
This analysis will provide valuable information about the distinct groups within our customer base, allowing us to tailor our strategies and approaches more effectively.
# Calculate cluster centroids
centroids = customer_df_clean.groupby('Cluster')[features].mean()
print("Cluster Centroids:")
print(centroids)
# Analyze cluster sizes
cluster_sizes = customer_df_clean['Cluster'].value_counts().sort_index()
print("\nCluster Sizes:")
print(cluster_sizes)
# Visualize cluster characteristics
plt.figure(figsize=(12, 6))
sns.boxplot(x='Cluster', y='Annual Income (k$)', data=customer_df_clean)
plt.title('Annual Income Distribution by Cluster')
plt.show()
plt.figure(figsize=(12, 6))
sns.boxplot(x='Cluster', y='Spending Score (1-100)', data=customer_df_clean)
plt.title('Spending Score Distribution by Cluster')
plt.show()
This code snippet is part of the customer segmentation project using K-Means clustering. It focuses on interpreting the clusters that have been created. Here's a breakdown of what the code does:
- Calculate cluster centroids: It computes the mean values of the features for each cluster, giving us a central point that represents each cluster.
- Analyze cluster sizes: It counts how many customers are in each cluster, which helps understand the distribution of customers across segments.
- Visualize cluster characteristics: It creates two box plots:
- One showing the distribution of Annual Income for each cluster
- Another showing the distribution of Spending Score for each cluster
This analysis is crucial for gaining insights into the distinct groups within the customer base, which can then be used to tailor marketing strategies and improve customer engagement.
Based on these visualizations and statistics, we can interpret our clusters:
- Cluster 0: High income, high spending score - "Premium Customers"
- Cluster 1: Low income, high spending score - "Carefree Spenders"
- Cluster 2: Medium income, medium spending score - "Average Customers"
- Cluster 3: High income, low spending score - "Potential Savers"
- Cluster 4: Low income, low spending score - "Budget Conscious"
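One caveat: K-Means assigns cluster indices arbitrarily, so which numeric label corresponds to "high income, high spending" can change between runs. Rather than hard-coding the mapping above, we can derive it from the centroids. The following sketch is an assumption-laden illustration: the thresholds (the overall feature means) are our choice, and a centroid near both means, like the medium/medium "Average Customers" group, will simply fall into whichever quadrant it leans toward.
# Recover the centroids in the original units as a cross-check against the
# groupby means computed above
centroids_original = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=features
)
print(centroids_original)

# Overall means used as (assumed) thresholds between "low" and "high"
income_mean = customer_df_clean['Annual Income (k$)'].mean()
score_mean = customer_df_clean['Spending Score (1-100)'].mean()

def name_segment(income, score):
    # Map a centroid to one of the illustrative segment names from this section
    if income >= income_mean and score >= score_mean:
        return 'Premium Customers'
    if income >= income_mean:
        return 'Potential Savers'
    if score >= score_mean:
        return 'Carefree Spenders'
    return 'Budget Conscious'

segment_names = {
    cluster: name_segment(row['Annual Income (k$)'], row['Spending Score (1-100)'])
    for cluster, row in centroids_original.iterrows()
}
customer_df_clean['Segment'] = customer_df_clean['Cluster'].map(segment_names)
print(customer_df_clean['Segment'].value_counts())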
6.3.5 Evaluate Clustering Performance
To evaluate the effectiveness of our clustering approach, we'll employ the silhouette score, a powerful metric that quantifies how well each data point fits within its assigned cluster. This score provides valuable insights by measuring the similarity of an object to its own cluster in comparison to other clusters.
By analyzing these scores, we can gain a comprehensive understanding of our clustering quality and identify potential areas for improvement.
from sklearn.metrics import silhouette_score, silhouette_samples
# Ensure there are at least 2 clusters
if len(set(customer_df_clean['Cluster'])) > 1:
    # Calculate silhouette score
    silhouette_avg = silhouette_score(customer_df_scaled, customer_df_clean['Cluster'])
    print(f"The average silhouette score is: {silhouette_avg:.4f}")

    # Compute silhouette scores for each sample
    silhouette_values = silhouette_samples(customer_df_scaled, customer_df_clean['Cluster'])

    # Visualize silhouette scores
    plt.figure(figsize=(10, 6))
    plt.hist(silhouette_values, bins=20, alpha=0.7, edgecolor="black")
    plt.axvline(silhouette_avg, color="red", linestyle="--", label=f"Average Silhouette Score: {silhouette_avg:.4f}")
    plt.xlabel("Silhouette Score")
    plt.ylabel("Frequency")
    plt.title("Distribution of Silhouette Scores")
    plt.legend()
    plt.show()
else:
    print("Silhouette score cannot be computed with only one cluster.")
Here's a breakdown of what the code does:
- Calculate the silhouette score: This is done using the silhouette_score function, which measures how similar an object is to its own cluster compared to other clusters. The average silhouette score for all data points is calculated and printed.
- Calculate individual silhouette scores: The silhouette_samples function is used to compute the silhouette score for each data point.
- Visualize the distribution of silhouette scores: A histogram is created to show the distribution of silhouette scores across all data points. This helps in understanding the overall quality of the clustering.
- Add a vertical line for the average score: A red dashed line is added to the histogram to indicate the average silhouette score, making it easy to compare individual scores to the overall average.
The silhouette score ranges from -1 to 1, with higher values indicating better clustering. A score above 0.5 is generally considered good. This visualization helps in assessing the quality of the clustering and identifying potential areas for improvement.
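Because the silhouette score is defined for any number of clusters from two upward, it can also serve as a second opinion on the elbow method's choice of k. A short sketch that sweeps the same range of cluster counts used earlier:
# Compare the average silhouette score across candidate cluster counts
# (the silhouette score requires at least 2 clusters, so start at k=2)
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(customer_df_scaled)
    score = silhouette_score(customer_df_scaled, labels)
    print(f"k = {k}: average silhouette score = {score:.4f}")
If the k that maximizes the silhouette score disagrees with the elbow estimate, that disagreement is itself useful information about how cleanly separated the segments really are.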
6.3.6 Potential Improvements and Future Work
While our current model provides valuable insights, there are several ways we could potentially improve its performance:
- Feature Engineering: Create new features or transform existing ones to capture more complex relationships. For example, we could create a feature that represents the ratio of spending score to annual income (see the sketch after this list).
- Try Other Algorithms: Experiment with more advanced clustering algorithms like DBSCAN or Gaussian Mixture Models, which can handle clusters of different shapes and densities (also sketched after this list).
- Dimensionality Reduction: If we have more features, we could use techniques like PCA to reduce dimensionality before clustering.
- Incorporate More Data: If possible, include more customer attributes like age, gender, or purchase history to create more nuanced segments.
- Time Series Analysis: If we have data over time, we could analyze how customers move between segments.
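To make the first two suggestions concrete, here is a hedged sketch, assuming customer_df_clean and the scaled feature matrix from earlier are still in scope. The ratio feature and the GaussianMixture settings are illustrative choices, not the only reasonable ones:
from sklearn.mixture import GaussianMixture

# Feature engineering: spending score relative to annual income
# (an illustrative ratio feature)
customer_df_clean['Score_per_Income'] = (
    customer_df_clean['Spending Score (1-100)']
    / customer_df_clean['Annual Income (k$)']
)

# Alternative algorithm: a Gaussian Mixture Model with soft assignments
gmm = GaussianMixture(n_components=5, random_state=42)
gmm_labels = gmm.fit_predict(customer_df_scaled)

# Unlike K-Means, a GMM also yields membership probabilities per customer,
# which can flag customers who sit between two segments
probabilities = gmm.predict_proba(customer_df_scaled)
print("Maximum membership probability for the first five customers:")
print(probabilities[:5].max(axis=1))
Comparing gmm_labels with the K-Means clusters is a quick way to see whether the segment boundaries are robust to the choice of algorithm.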
6.3.7 Conclusion
In this project, we have successfully implemented K-Means clustering to segment customers based on their annual income and spending score. Our journey encompassed the entire data science process, from initial data loading and meticulous preprocessing to sophisticated model evaluation and in-depth interpretation. We navigated through each step with precision, ensuring the integrity and reliability of our analysis.
The customer segments we've uncovered through this process are not merely statistical groupings, but rather provide profound insights into our customer base. These segments offer a nuanced understanding of different customer behaviors and preferences, which can be leveraged to significantly enhance our business strategies. By tailoring our marketing approaches to these distinct groups, we can create more personalized and effective campaigns that resonate with each segment's unique characteristics.
Furthermore, these insights extend beyond marketing, potentially influencing product development, customer service approaches, and overall business decision-making. The ability to engage with customers in a more targeted manner, based on their segmentation, can lead to improved customer satisfaction, increased loyalty, and ultimately, better business outcomes. As we move forward, these customer segments will serve as a valuable foundation for data-driven decision making across various aspects of our operations.
6.3 Project 3: Customer Segmentation Using K-Means Clustering
In this project, we'll dive deep into the world of unsupervised learning to segment customers based on their purchasing behavior. Customer segmentation is a crucial technique in marketing and business strategy, allowing companies to tailor their approaches to different customer groups effectively.
Why is customer segmentation important?
- Personalized Marketing: Tailor marketing strategies to specific customer groups.
- Product Development: Identify needs of different customer segments.
- Customer Retention: Focus efforts on high-value customer segments.
- Resource Allocation: Optimize resource distribution across customer groups.
In this project, we will:
- Load and explore a customer dataset
- Preprocess and prepare the data for clustering
- Apply K-Means clustering to segment customers
- Visualize and interpret the resulting clusters
- Evaluate the clustering performance
- Discuss potential improvements and future work
6.3.1 Load and Explore the Dataset
We'll begin our analysis by importing the customer dataset into our working environment. This initial step is crucial as it sets the foundation for our entire project. Once the data is loaded, we'll conduct a comprehensive exploration to gain insights into its structure, features, and overall characteristics.
This exploratory phase is essential for understanding the nature of our data, identifying potential patterns or anomalies, and informing our subsequent analytical decisions. By thoroughly examining the dataset's composition, we'll be better equipped to choose appropriate preprocessing techniques and analytical methods in the later stages of our project.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Load the customer dataset
url = 'https://example.com/customer_data.csv' # Replace with actual URL or file path
customer_df = pd.read_csv(url)
# Display the first few rows of the dataset
print(customer_df.head())
# Display basic information about the dataset
print(customer_df.info())
# Summary statistics
print(customer_df.describe())
# Check for missing values
print(customer_df.isnull().sum())
# Visualize the distribution of annual income and spending score
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(customer_df['Annual Income (k$)'], kde=True)
plt.title('Distribution of Annual Income')
plt.subplot(1, 2, 2)
sns.histplot(customer_df['Spending Score (1-100)'], kde=True)
plt.title('Distribution of Spending Score')
plt.tight_layout()
plt.show()
# Scatter plot of Annual Income vs Spending Score
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=customer_df)
plt.title('Customer Distribution: Annual Income vs Spending Score')
plt.show()
Here's a breakdown of its main components:
- Importing libraries: The script imports necessary Python libraries for data manipulation (pandas), visualization (matplotlib, seaborn), and machine learning (sklearn).
- Loading data: It loads a customer dataset from a CSV file using pandas.
- Data exploration: The code displays the first few rows of the dataset, basic information about the dataset, summary statistics, and checks for missing values.
- Data visualization: It creates two types of visualizations:
- Histograms: Shows the distribution of annual income and spending score.
- Scatter plot: Displays the relationship between annual income and spending score.
This code is part of the initial data exploration phase of a customer segmentation project using K-Means clustering. It helps in understanding the structure and characteristics of the dataset before proceeding with further analysis and clustering.
This code snippet loads the dataset, displays basic information, checks for missing values, and creates visualizations to help us understand the distribution of our key features.
6.3.2 Data Preprocessing
Before we can apply the K-Means clustering algorithm, it's crucial to properly prepare our dataset. This preparatory phase, known as data preprocessing, involves several important steps to ensure our data is in the optimal format for analysis. First, we need to address any missing values in our dataset, as these can significantly impact our results.
This might involve either removing rows with missing data or using various imputation techniques to fill in the gaps. Next, we'll carefully select the most relevant features for our clustering analysis, focusing on those that are most likely to reveal meaningful patterns in customer behavior.
Finally, we'll scale our data to ensure all features are on a comparable scale, which is particularly important for distance-based algorithms like K-Means. This scaling process helps prevent features with larger magnitudes from dominating the clustering results, allowing for a more balanced and accurate analysis of our customer segments.
# Select relevant features for clustering
features = ['Annual Income (k$)', 'Spending Score (1-100)']
# Check for missing values in selected features
print(customer_df[features].isnull().sum())
# If there are missing values, we can either drop them or impute them
# For this example, we'll drop any rows with missing values
customer_df_clean = customer_df.dropna(subset=features)
# Scale the features
scaler = StandardScaler()
customer_df_scaled = scaler.fit_transform(customer_df_clean[features])
# Convert scaled features back to a DataFrame for easier handling
customer_df_scaled = pd.DataFrame(customer_df_scaled, columns=features)
print("Scaled data:")
print(customer_df_scaled.head())
# Visualize the scaled data
plt.figure(figsize=(10, 6))
sns.scatterplot(x=features[0], y=features[1], data=customer_df_scaled)
plt.title('Scaled Customer Distribution: Annual Income vs Spending Score')
plt.show()
Here's a breakdown of what the code does:
- Feature selection: It selects two relevant features for clustering: 'Annual Income (k$)' and 'Spending Score (1-100)'.
- Handling missing values: It checks for missing values in the selected features and removes any rows with missing data.
- Data scaling: It uses StandardScaler to scale the features, which is crucial for K-means clustering as it ensures all features contribute equally to the distance calculations.
- Data conversion: The scaled data is converted back to a DataFrame for easier handling.
- Visualization: It creates a scatter plot of the scaled data to visualize the distribution of customers based on their annual income and spending score.
This preprocessing step is essential as it prepares the data for the K-means clustering algorithm, ensuring that the analysis will be balanced and accurate.
In this step, we've selected our relevant features, handled any missing values, and scaled our data using StandardScaler. Scaling is crucial for K-Means clustering as it ensures all features contribute equally to the distance calculations.
6.3.3 Apply K-Means Clustering
With our data now properly preprocessed, we're ready to apply the K-Means clustering algorithm to our customer dataset. This powerful unsupervised learning technique will help us identify distinct groups within our customer base. To ensure we're using the optimal number of clusters for our analysis, we'll employ the elbow method.
This approach involves running the K-Means algorithm with different numbers of clusters and plotting the resulting inertia (within-cluster sum of squares) against the number of clusters. The "elbow" in this plot - where the rate of decrease in inertia starts to level off - will indicate the ideal number of clusters for our dataset.
# Elbow Method to find the optimal number of clusters
inertias = []
k_range = range(1, 11)
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(customer_df_scaled)
inertias.append(kmeans.inertia_)
# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertias, 'bx-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('The Elbow Method showing the optimal k')
plt.show()
# Based on the elbow curve, let's choose the optimal number of clusters
optimal_k = 5 # This should be determined from the elbow curve
# Apply K-Means with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
customer_df_clean['Cluster'] = kmeans.fit_predict(customer_df_scaled)
# Visualize the clusters
plt.figure(figsize=(12, 8))
scatter = plt.scatter(customer_df_clean['Annual Income (k$)'],
customer_df_clean['Spending Score (1-100)'],
c=customer_df_clean['Cluster'],
cmap='viridis')
plt.colorbar(scatter)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segments')
plt.show()
Here's a breakdown of its main components:
- Elbow Method: This technique is used to determine the optimal number of clusters. It involves:
- Running K-Means with different numbers of clusters (1 to 10)
- Calculating the inertia (within-cluster sum of squares) for each
- Plotting the inertia against the number of clusters
- The "elbow" in this plot indicates the ideal number of clusters
- Applying K-Means: Once the optimal number of clusters is determined (set to 5 in this example), the algorithm is applied to the scaled customer data.
- Visualization: The resulting clusters are visualized in a scatter plot, with:
- Annual Income on the x-axis
- Spending Score on the y-axis
- Different colors representing different clusters
This process helps identify distinct groups of customers based on their income and spending behavior, which can be used for targeted marketing strategies.
In this step, we've used the elbow method to determine the optimal number of clusters, applied K-Means clustering with this optimal number, and visualized the resulting clusters.
6.3.4 Interpret the Clusters
Now that we have successfully applied K-Means clustering to our customer dataset, it's time to delve deeper into the results and extract meaningful insights. Let's carefully examine and interpret the clusters we've identified to gain a comprehensive understanding of our customer segments.
This analysis will provide valuable information about the distinct groups within our customer base, allowing us to tailor our strategies and approaches more effectively.
# Calculate cluster centroids
centroids = customer_df_clean.groupby('Cluster')[features].mean()
print("Cluster Centroids:")
print(centroids)
# Analyze cluster sizes
cluster_sizes = customer_df_clean['Cluster'].value_counts().sort_index()
print("\nCluster Sizes:")
print(cluster_sizes)
# Visualize cluster characteristics
plt.figure(figsize=(12, 6))
sns.boxplot(x='Cluster', y='Annual Income (k$)', data=customer_df_clean)
plt.title('Annual Income Distribution by Cluster')
plt.show()
plt.figure(figsize=(12, 6))
sns.boxplot(x='Cluster', y='Spending Score (1-100)', data=customer_df_clean)
plt.title('Spending Score Distribution by Cluster')
plt.show()
This code snippet is part of the customer segmentation project using K-Means clustering. It focuses on interpreting the clusters that have been created. Here's a breakdown of what the code does:
- Calculate cluster centroids: It computes the mean values of the features for each cluster, giving us a central point that represents each cluster.
- Analyze cluster sizes: It counts how many customers are in each cluster, which helps understand the distribution of customers across segments.
- Visualize cluster characteristics: It creates two box plots:
- - One showing the distribution of Annual Income for each cluster
- - Another showing the distribution of Spending Score for each cluster
This analysis is crucial for gaining insights into the distinct groups within the customer base, which can then be used to tailor marketing strategies and improve customer engagement.
Based on these visualizations and statistics, we can interpret our clusters:
- Cluster 0: High income, high spending score - "Premium Customers"
- Cluster 1: Low income, high spending score - "Careful Spenders"
- Cluster 2: Medium income, medium spending score - "Average Customers"
- Cluster 3: High income, low spending score - "Potential Savers"
- Cluster 4: Low income, low spending score - "Budget Conscious"
6.3.5 Evaluate Clustering Performance
To evaluate the effectiveness of our clustering approach, we'll employ the silhouette score, a powerful metric that quantifies how well each data point fits within its assigned cluster. This score provides valuable insights by measuring the similarity of an object to its own cluster in comparison to other clusters.
By analyzing these scores, we can gain a comprehensive understanding of our clustering quality and identify potential areas for improvement.
from sklearn.metrics import silhouette_score, silhouette_samples
# Ensure there are at least 2 clusters
if len(set(customer_df_clean['Cluster'])) > 1:
# Calculate silhouette score
silhouette_avg = silhouette_score(customer_df_scaled, customer_df_clean['Cluster'])
print(f"The average silhouette score is: {silhouette_avg:.4f}")
# Compute silhouette scores for each sample
silhouette_values = silhouette_samples(customer_df_scaled, customer_df_clean['Cluster'])
# Visualize silhouette scores
plt.figure(figsize=(10, 6))
plt.hist(silhouette_values, bins=20, alpha=0.7, edgecolor="black")
plt.axvline(silhouette_avg, color="red", linestyle="--", label=f"Average Silhouette Score: {silhouette_avg:.4f}")
plt.xlabel("Silhouette Score")
plt.ylabel("Frequency")
plt.title("Distribution of Silhouette Scores")
plt.legend()
plt.show()
else:
print("Silhouette score cannot be computed with only one cluster.")
Here's a breakdown of what the code does:
- Calculate the silhouette score: This is done using the
silhouette_score
function, which measures how similar an object is to its own cluster compared to other clusters. The average silhouette score for all data points is calculated and printed. - Calculate individual silhouette scores: The
silhouette_samples
function is used to compute the silhouette score for each data point. - Visualize the distribution of silhouette scores: A histogram is created to show the distribution of silhouette scores across all data points. This helps in understanding the overall quality of the clustering.
- Add a vertical line for the average score: A red dashed line is added to the histogram to indicate the average silhouette score, making it easy to compare individual scores to the overall average.
The silhouette score ranges from -1 to 1, with higher values indicating better clustering. A score above 0.5 is generally considered good. This visualization helps in assessing the quality of the clustering and identifying potential areas for improvement.
6.3.6 Potential Improvements and Future Work
While our current model provides valuable insights, there are several ways we could potentially improve its performance:
- Feature Engineering: Create new features or transform existing ones to capture more complex relationships. For example, we could create a feature that represents the ratio of spending score to annual income.
- Try Other Algorithms: Experiment with more advanced clustering algorithms like DBSCAN or Gaussian Mixture Models, which can handle clusters of different shapes and densities.
- Dimensionality Reduction: If we have more features, we could use techniques like PCA to reduce dimensionality before clustering.
- Incorporate More Data: If possible, include more customer attributes like age, gender, or purchase history to create more nuanced segments.
- Time Series Analysis: If we have data over time, we could analyze how customers move between segments.
6.3.7 Conclusion
In this project, we have successfully implemented K-Means clustering to segment customers based on their annual income and spending score. Our journey encompassed the entire data science process, from initial data loading and meticulous preprocessing to sophisticated model evaluation and in-depth interpretation. We navigated through each step with precision, ensuring the integrity and reliability of our analysis.
The customer segments we've uncovered through this process are not merely statistical groupings, but rather provide profound insights into our customer base. These segments offer a nuanced understanding of different customer behaviors and preferences, which can be leveraged to significantly enhance our business strategies. By tailoring our marketing approaches to these distinct groups, we can create more personalized and effective campaigns that resonate with each segment's unique characteristics.
Furthermore, these insights extend beyond marketing, potentially influencing product development, customer service approaches, and overall business decision-making. The ability to engage with customers in a more targeted manner, based on their segmentation, can lead to improved customer satisfaction, increased loyalty, and ultimately, better business outcomes. As we move forward, these customer segments will serve as a valuable foundation for data-driven decision making across various aspects of our operations.
6.3 Project 3: Customer Segmentation Using K-Means Clustering
In this project, we'll dive deep into the world of unsupervised learning to segment customers based on their purchasing behavior. Customer segmentation is a crucial technique in marketing and business strategy, allowing companies to tailor their approaches to different customer groups effectively.
Why is customer segmentation important?
- Personalized Marketing: Tailor marketing strategies to specific customer groups.
- Product Development: Identify needs of different customer segments.
- Customer Retention: Focus efforts on high-value customer segments.
- Resource Allocation: Optimize resource distribution across customer groups.
In this project, we will:
- Load and explore a customer dataset
- Preprocess and prepare the data for clustering
- Apply K-Means clustering to segment customers
- Visualize and interpret the resulting clusters
- Evaluate the clustering performance
- Discuss potential improvements and future work
6.3.1 Load and Explore the Dataset
We'll begin our analysis by importing the customer dataset into our working environment. This initial step is crucial as it sets the foundation for our entire project. Once the data is loaded, we'll conduct a comprehensive exploration to gain insights into its structure, features, and overall characteristics.
This exploratory phase is essential for understanding the nature of our data, identifying potential patterns or anomalies, and informing our subsequent analytical decisions. By thoroughly examining the dataset's composition, we'll be better equipped to choose appropriate preprocessing techniques and analytical methods in the later stages of our project.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Load the customer dataset
url = 'https://example.com/customer_data.csv' # Replace with actual URL or file path
customer_df = pd.read_csv(url)
# Display the first few rows of the dataset
print(customer_df.head())
# Display basic information about the dataset
print(customer_df.info())
# Summary statistics
print(customer_df.describe())
# Check for missing values
print(customer_df.isnull().sum())
# Visualize the distribution of annual income and spending score
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(customer_df['Annual Income (k$)'], kde=True)
plt.title('Distribution of Annual Income')
plt.subplot(1, 2, 2)
sns.histplot(customer_df['Spending Score (1-100)'], kde=True)
plt.title('Distribution of Spending Score')
plt.tight_layout()
plt.show()
# Scatter plot of Annual Income vs Spending Score
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=customer_df)
plt.title('Customer Distribution: Annual Income vs Spending Score')
plt.show()
Here's a breakdown of its main components:
- Importing libraries: The script imports necessary Python libraries for data manipulation (pandas), visualization (matplotlib, seaborn), and machine learning (sklearn).
- Loading data: It loads a customer dataset from a CSV file using pandas.
- Data exploration: The code displays the first few rows of the dataset, basic information about the dataset, summary statistics, and checks for missing values.
- Data visualization: It creates two types of visualizations:
- Histograms: Shows the distribution of annual income and spending score.
- Scatter plot: Displays the relationship between annual income and spending score.
This code is part of the initial data exploration phase of a customer segmentation project using K-Means clustering. It helps in understanding the structure and characteristics of the dataset before proceeding with further analysis and clustering.
This code snippet loads the dataset, displays basic information, checks for missing values, and creates visualizations to help us understand the distribution of our key features.
6.3.2 Data Preprocessing
Before we can apply the K-Means clustering algorithm, it's crucial to properly prepare our dataset. This preparatory phase, known as data preprocessing, involves several important steps to ensure our data is in the optimal format for analysis. First, we need to address any missing values in our dataset, as these can significantly impact our results.
This might involve either removing rows with missing data or using various imputation techniques to fill in the gaps. Next, we'll carefully select the most relevant features for our clustering analysis, focusing on those that are most likely to reveal meaningful patterns in customer behavior.
Finally, we'll scale our data to ensure all features are on a comparable scale, which is particularly important for distance-based algorithms like K-Means. This scaling process helps prevent features with larger magnitudes from dominating the clustering results, allowing for a more balanced and accurate analysis of our customer segments.
# Select relevant features for clustering
features = ['Annual Income (k$)', 'Spending Score (1-100)']
# Check for missing values in selected features
print(customer_df[features].isnull().sum())
# If there are missing values, we can either drop them or impute them
# For this example, we'll drop any rows with missing values
customer_df_clean = customer_df.dropna(subset=features)
# Scale the features
scaler = StandardScaler()
customer_df_scaled = scaler.fit_transform(customer_df_clean[features])
# Convert scaled features back to a DataFrame for easier handling
customer_df_scaled = pd.DataFrame(customer_df_scaled, columns=features)
print("Scaled data:")
print(customer_df_scaled.head())
# Visualize the scaled data
plt.figure(figsize=(10, 6))
sns.scatterplot(x=features[0], y=features[1], data=customer_df_scaled)
plt.title('Scaled Customer Distribution: Annual Income vs Spending Score')
plt.show()
Here's a breakdown of what the code does:
- Feature selection: It selects two relevant features for clustering: 'Annual Income (k$)' and 'Spending Score (1-100)'.
- Handling missing values: It checks for missing values in the selected features and removes any rows with missing data.
- Data scaling: It uses StandardScaler to scale the features, which is crucial for K-means clustering as it ensures all features contribute equally to the distance calculations.
- Data conversion: The scaled data is converted back to a DataFrame for easier handling.
- Visualization: It creates a scatter plot of the scaled data to visualize the distribution of customers based on their annual income and spending score.
This preprocessing step is essential as it prepares the data for the K-means clustering algorithm, ensuring that the analysis will be balanced and accurate.
In this step, we've selected our relevant features, handled any missing values, and scaled our data using StandardScaler. Scaling is crucial for K-Means clustering as it ensures all features contribute equally to the distance calculations.
6.3.3 Apply K-Means Clustering
With our data now properly preprocessed, we're ready to apply the K-Means clustering algorithm to our customer dataset. This powerful unsupervised learning technique will help us identify distinct groups within our customer base. To ensure we're using the optimal number of clusters for our analysis, we'll employ the elbow method.
This approach involves running the K-Means algorithm with different numbers of clusters and plotting the resulting inertia (within-cluster sum of squares) against the number of clusters. The "elbow" in this plot - where the rate of decrease in inertia starts to level off - will indicate the ideal number of clusters for our dataset.
# Elbow Method to find the optimal number of clusters
inertias = []
k_range = range(1, 11)
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(customer_df_scaled)
inertias.append(kmeans.inertia_)
# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertias, 'bx-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('The Elbow Method showing the optimal k')
plt.show()
# Based on the elbow curve, let's choose the optimal number of clusters
optimal_k = 5 # This should be determined from the elbow curve
# Apply K-Means with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
customer_df_clean['Cluster'] = kmeans.fit_predict(customer_df_scaled)
# Visualize the clusters
plt.figure(figsize=(12, 8))
scatter = plt.scatter(customer_df_clean['Annual Income (k$)'],
customer_df_clean['Spending Score (1-100)'],
c=customer_df_clean['Cluster'],
cmap='viridis')
plt.colorbar(scatter)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segments')
plt.show()
Here's a breakdown of its main components:
- Elbow Method: This technique is used to determine the optimal number of clusters. It involves:
- Running K-Means with different numbers of clusters (1 to 10)
- Calculating the inertia (within-cluster sum of squares) for each
- Plotting the inertia against the number of clusters
- The "elbow" in this plot indicates the ideal number of clusters
- Applying K-Means: Once the optimal number of clusters is determined (set to 5 in this example), the algorithm is applied to the scaled customer data.
- Visualization: The resulting clusters are visualized in a scatter plot, with:
- Annual Income on the x-axis
- Spending Score on the y-axis
- Different colors representing different clusters
This process helps identify distinct groups of customers based on their income and spending behavior, which can be used for targeted marketing strategies.
In this step, we've used the elbow method to determine the optimal number of clusters, applied K-Means clustering with this optimal number, and visualized the resulting clusters.
6.3.4 Interpret the Clusters
Now that we have successfully applied K-Means clustering to our customer dataset, it's time to delve deeper into the results and extract meaningful insights. Let's carefully examine and interpret the clusters we've identified to gain a comprehensive understanding of our customer segments.
This analysis will provide valuable information about the distinct groups within our customer base, allowing us to tailor our strategies and approaches more effectively.
# Calculate cluster centroids
centroids = customer_df_clean.groupby('Cluster')[features].mean()
print("Cluster Centroids:")
print(centroids)
# Analyze cluster sizes
cluster_sizes = customer_df_clean['Cluster'].value_counts().sort_index()
print("\nCluster Sizes:")
print(cluster_sizes)
# Visualize cluster characteristics
plt.figure(figsize=(12, 6))
sns.boxplot(x='Cluster', y='Annual Income (k$)', data=customer_df_clean)
plt.title('Annual Income Distribution by Cluster')
plt.show()
plt.figure(figsize=(12, 6))
sns.boxplot(x='Cluster', y='Spending Score (1-100)', data=customer_df_clean)
plt.title('Spending Score Distribution by Cluster')
plt.show()
This code snippet is part of the customer segmentation project using K-Means clustering. It focuses on interpreting the clusters that have been created. Here's a breakdown of what the code does:
- Calculate cluster centroids: It computes the mean values of the features for each cluster, giving us a central point that represents each cluster.
- Analyze cluster sizes: It counts how many customers are in each cluster, which helps understand the distribution of customers across segments.
- Visualize cluster characteristics: It creates two box plots:
- - One showing the distribution of Annual Income for each cluster
- - Another showing the distribution of Spending Score for each cluster
This analysis is crucial for gaining insights into the distinct groups within the customer base, which can then be used to tailor marketing strategies and improve customer engagement.
Based on these visualizations and statistics, we can interpret our clusters:
- Cluster 0: High income, high spending score - "Premium Customers"
- Cluster 1: Low income, high spending score - "Careful Spenders"
- Cluster 2: Medium income, medium spending score - "Average Customers"
- Cluster 3: High income, low spending score - "Potential Savers"
- Cluster 4: Low income, low spending score - "Budget Conscious"
6.3.5 Evaluate Clustering Performance
To evaluate the effectiveness of our clustering approach, we'll employ the silhouette score, a powerful metric that quantifies how well each data point fits within its assigned cluster. This score provides valuable insights by measuring the similarity of an object to its own cluster in comparison to other clusters.
By analyzing these scores, we can gain a comprehensive understanding of our clustering quality and identify potential areas for improvement.
from sklearn.metrics import silhouette_score, silhouette_samples
# Ensure there are at least 2 clusters
if len(set(customer_df_clean['Cluster'])) > 1:
# Calculate silhouette score
silhouette_avg = silhouette_score(customer_df_scaled, customer_df_clean['Cluster'])
print(f"The average silhouette score is: {silhouette_avg:.4f}")
# Compute silhouette scores for each sample
silhouette_values = silhouette_samples(customer_df_scaled, customer_df_clean['Cluster'])
# Visualize silhouette scores
plt.figure(figsize=(10, 6))
plt.hist(silhouette_values, bins=20, alpha=0.7, edgecolor="black")
plt.axvline(silhouette_avg, color="red", linestyle="--", label=f"Average Silhouette Score: {silhouette_avg:.4f}")
plt.xlabel("Silhouette Score")
plt.ylabel("Frequency")
plt.title("Distribution of Silhouette Scores")
plt.legend()
plt.show()
else:
print("Silhouette score cannot be computed with only one cluster.")
Here's a breakdown of what the code does:
- Calculate the silhouette score: This is done using the
silhouette_score
function, which measures how similar an object is to its own cluster compared to other clusters. The average silhouette score for all data points is calculated and printed. - Calculate individual silhouette scores: The
silhouette_samples
function is used to compute the silhouette score for each data point. - Visualize the distribution of silhouette scores: A histogram is created to show the distribution of silhouette scores across all data points. This helps in understanding the overall quality of the clustering.
- Add a vertical line for the average score: A red dashed line is added to the histogram to indicate the average silhouette score, making it easy to compare individual scores to the overall average.
The silhouette score ranges from -1 to 1, with higher values indicating better clustering. A score above 0.5 is generally considered good. This visualization helps in assessing the quality of the clustering and identifying potential areas for improvement.
6.3.6 Potential Improvements and Future Work
While our current model provides valuable insights, there are several ways we could potentially improve its performance:
- Feature Engineering: Create new features or transform existing ones to capture more complex relationships. For example, we could create a feature that represents the ratio of spending score to annual income.
- Try Other Algorithms: Experiment with more advanced clustering algorithms like DBSCAN or Gaussian Mixture Models, which can handle clusters of different shapes and densities.
- Dimensionality Reduction: If we have more features, we could use techniques like PCA to reduce dimensionality before clustering.
- Incorporate More Data: If possible, include more customer attributes like age, gender, or purchase history to create more nuanced segments.
- Time Series Analysis: If we have data over time, we could analyze how customers move between segments.
6.3.7 Conclusion
In this project, we have successfully implemented K-Means clustering to segment customers based on their annual income and spending score. Our journey encompassed the entire data science process, from initial data loading and meticulous preprocessing to sophisticated model evaluation and in-depth interpretation. We navigated through each step with precision, ensuring the integrity and reliability of our analysis.
The customer segments we've uncovered through this process are not merely statistical groupings, but rather provide profound insights into our customer base. These segments offer a nuanced understanding of different customer behaviors and preferences, which can be leveraged to significantly enhance our business strategies. By tailoring our marketing approaches to these distinct groups, we can create more personalized and effective campaigns that resonate with each segment's unique characteristics.
Furthermore, these insights extend beyond marketing, potentially influencing product development, customer service approaches, and overall business decision-making. The ability to engage with customers in a more targeted manner, based on their segmentation, can lead to improved customer satisfaction, increased loyalty, and ultimately, better business outcomes. As we move forward, these customer segments will serve as a valuable foundation for data-driven decision making across various aspects of our operations.
6.3 Project 3: Customer Segmentation Using K-Means Clustering
In this project, we'll dive deep into the world of unsupervised learning to segment customers based on their purchasing behavior. Customer segmentation is a crucial technique in marketing and business strategy, allowing companies to tailor their approaches to different customer groups effectively.
Why is customer segmentation important?
- Personalized Marketing: Tailor marketing strategies to specific customer groups.
- Product Development: Identify needs of different customer segments.
- Customer Retention: Focus efforts on high-value customer segments.
- Resource Allocation: Optimize resource distribution across customer groups.
In this project, we will:
- Load and explore a customer dataset
- Preprocess and prepare the data for clustering
- Apply K-Means clustering to segment customers
- Visualize and interpret the resulting clusters
- Evaluate the clustering performance
- Discuss potential improvements and future work
6.3.1 Load and Explore the Dataset
We'll begin our analysis by importing the customer dataset into our working environment. This initial step is crucial as it sets the foundation for our entire project. Once the data is loaded, we'll conduct a comprehensive exploration to gain insights into its structure, features, and overall characteristics.
This exploratory phase is essential for understanding the nature of our data, identifying potential patterns or anomalies, and informing our subsequent analytical decisions. By thoroughly examining the dataset's composition, we'll be better equipped to choose appropriate preprocessing techniques and analytical methods in the later stages of our project.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Load the customer dataset
url = 'https://example.com/customer_data.csv' # Replace with actual URL or file path
customer_df = pd.read_csv(url)
# Display the first few rows of the dataset
print(customer_df.head())
# Display basic information about the dataset
print(customer_df.info())
# Summary statistics
print(customer_df.describe())
# Check for missing values
print(customer_df.isnull().sum())
# Visualize the distribution of annual income and spending score
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(customer_df['Annual Income (k$)'], kde=True)
plt.title('Distribution of Annual Income')
plt.subplot(1, 2, 2)
sns.histplot(customer_df['Spending Score (1-100)'], kde=True)
plt.title('Distribution of Spending Score')
plt.tight_layout()
plt.show()
# Scatter plot of Annual Income vs Spending Score
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=customer_df)
plt.title('Customer Distribution: Annual Income vs Spending Score')
plt.show()
Here's a breakdown of its main components:
- Importing libraries: The script imports necessary Python libraries for data manipulation (pandas), visualization (matplotlib, seaborn), and machine learning (sklearn).
- Loading data: It loads a customer dataset from a CSV file using pandas.
- Data exploration: The code displays the first few rows of the dataset, basic information about the dataset, summary statistics, and checks for missing values.
- Data visualization: It creates two types of visualizations:
- Histograms: Shows the distribution of annual income and spending score.
- Scatter plot: Displays the relationship between annual income and spending score.
This code is part of the initial data exploration phase of a customer segmentation project using K-Means clustering. It helps in understanding the structure and characteristics of the dataset before proceeding with further analysis and clustering.
This code snippet loads the dataset, displays basic information, checks for missing values, and creates visualizations to help us understand the distribution of our key features.
6.3.2 Data Preprocessing
Before we can apply the K-Means clustering algorithm, it's crucial to properly prepare our dataset. This preparatory phase, known as data preprocessing, involves several important steps to ensure our data is in the optimal format for analysis. First, we need to address any missing values in our dataset, as these can significantly impact our results.
This might involve either removing rows with missing data or using various imputation techniques to fill in the gaps. Next, we'll carefully select the most relevant features for our clustering analysis, focusing on those that are most likely to reveal meaningful patterns in customer behavior.
Finally, we'll scale our data to ensure all features are on a comparable scale, which is particularly important for distance-based algorithms like K-Means. This scaling process helps prevent features with larger magnitudes from dominating the clustering results, allowing for a more balanced and accurate analysis of our customer segments.
# Select relevant features for clustering
features = ['Annual Income (k$)', 'Spending Score (1-100)']
# Check for missing values in selected features
print(customer_df[features].isnull().sum())
# If there are missing values, we can either drop them or impute them
# For this example, we'll drop any rows with missing values
customer_df_clean = customer_df.dropna(subset=features)
# Scale the features
scaler = StandardScaler()
customer_df_scaled = scaler.fit_transform(customer_df_clean[features])
# Convert scaled features back to a DataFrame for easier handling
customer_df_scaled = pd.DataFrame(customer_df_scaled, columns=features)
print("Scaled data:")
print(customer_df_scaled.head())
# Visualize the scaled data
plt.figure(figsize=(10, 6))
sns.scatterplot(x=features[0], y=features[1], data=customer_df_scaled)
plt.title('Scaled Customer Distribution: Annual Income vs Spending Score')
plt.show()
Here's a breakdown of what the code does:
- Feature selection: It selects two relevant features for clustering: 'Annual Income (k$)' and 'Spending Score (1-100)'.
- Handling missing values: It checks for missing values in the selected features and removes any rows with missing data.
- Data scaling: It uses StandardScaler to scale the features, which is crucial for K-Means clustering as it ensures all features contribute equally to the distance calculations.
- Data conversion: The scaled data is converted back to a DataFrame for easier handling.
- Visualization: It creates a scatter plot of the scaled data to visualize the distribution of customers based on their annual income and spending score.
This preprocessing step prepares the data for K-Means: we selected the relevant features, handled missing values by dropping incomplete rows (an imputation-based alternative is sketched below), and scaled everything with StandardScaler so that no single feature dominates the distance calculations.
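If dropping rows would discard too much data, imputation is a common alternative. Below is a minimal sketch using scikit-learn's SimpleImputer, assuming median imputation suits these features; the customer_df_imputed name is ours, not part of the pipeline above.
from sklearn.impute import SimpleImputer
# Fill missing values in the selected features with the column median,
# which is robust to skewed distributions such as income
imputer = SimpleImputer(strategy='median')
customer_df_imputed = customer_df.copy()
customer_df_imputed[features] = imputer.fit_transform(customer_df[features])
print(customer_df_imputed[features].isnull().sum())  # should now be all zeros
Whether median, mean, or a more sophisticated strategy is appropriate depends on the shape of your data; inspect the distributions from 6.3.1 before choosing.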
6.3.3 Apply K-Means Clustering
With our data now properly preprocessed, we're ready to apply the K-Means clustering algorithm to our customer dataset. This powerful unsupervised learning technique will help us identify distinct groups within our customer base. To ensure we're using the optimal number of clusters for our analysis, we'll employ the elbow method.
This approach involves running the K-Means algorithm with different numbers of clusters and plotting the resulting inertia (within-cluster sum of squares) against the number of clusters. The "elbow" in this plot - where the rate of decrease in inertia starts to level off - will indicate the ideal number of clusters for our dataset.
# Elbow Method to find the optimal number of clusters
inertias = []
k_range = range(1, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(customer_df_scaled)
    inertias.append(kmeans.inertia_)
# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertias, 'bx-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('The Elbow Method showing the optimal k')
plt.show()
# Based on the elbow curve, let's choose the optimal number of clusters
optimal_k = 5 # This should be determined from the elbow curve
# Apply K-Means with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
customer_df_clean['Cluster'] = kmeans.fit_predict(customer_df_scaled)
# Visualize the clusters
plt.figure(figsize=(12, 8))
scatter = plt.scatter(customer_df_clean['Annual Income (k$)'],
                      customer_df_clean['Spending Score (1-100)'],
                      c=customer_df_clean['Cluster'],
                      cmap='viridis')
plt.colorbar(scatter)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segments')
plt.show()
Here's a breakdown of its main components:
- Elbow Method: This technique is used to determine the optimal number of clusters. It involves:
- Running K-Means with different numbers of clusters (1 to 10)
- Calculating the inertia (within-cluster sum of squares) for each
- Plotting the inertia against the number of clusters
- The "elbow" in this plot indicates the ideal number of clusters
- Applying K-Means: Once the optimal number of clusters is determined (set to 5 in this example), the algorithm is applied to the scaled customer data.
- Visualization: The resulting clusters are visualized in a scatter plot, with:
- Annual Income on the x-axis
- Spending Score on the y-axis
- Different colors representing different clusters
In this step, we used the elbow method to choose the number of clusters, applied K-Means with that value, and visualized the resulting segments: distinct groups of customers, based on income and spending behavior, that can inform targeted marketing strategies.
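As a cross-check on the elbow method (a sketch, not part of the original pipeline), we can also compare the average silhouette score across candidate values of k; since the silhouette score requires at least two clusters, the range starts at 2.
# Average silhouette score for each candidate k (requires k >= 2)
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(customer_df_scaled)
    print(f"k={k}: average silhouette score = {silhouette_score(customer_df_scaled, labels):.4f}")
A k where the silhouette score peaks and the elbow curve flattens is a reassuring sign that both criteria agree.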
6.3.4 Interpret the Clusters
Now that we have successfully applied K-Means clustering to our customer dataset, it's time to delve deeper into the results and extract meaningful insights. Let's carefully examine and interpret the clusters we've identified to gain a comprehensive understanding of our customer segments.
This analysis will provide valuable information about the distinct groups within our customer base, allowing us to tailor our strategies and approaches more effectively.
# Calculate cluster centroids
centroids = customer_df_clean.groupby('Cluster')[features].mean()
print("Cluster Centroids:")
print(centroids)
# Analyze cluster sizes
cluster_sizes = customer_df_clean['Cluster'].value_counts().sort_index()
print("\nCluster Sizes:")
print(cluster_sizes)
# Visualize cluster characteristics
plt.figure(figsize=(12, 6))
sns.boxplot(x='Cluster', y='Annual Income (k$)', data=customer_df_clean)
plt.title('Annual Income Distribution by Cluster')
plt.show()
plt.figure(figsize=(12, 6))
sns.boxplot(x='Cluster', y='Spending Score (1-100)', data=customer_df_clean)
plt.title('Spending Score Distribution by Cluster')
plt.show()
This code snippet is part of the customer segmentation project using K-Means clustering. It focuses on interpreting the clusters that have been created. Here's a breakdown of what the code does:
- Calculate cluster centroids: It computes the mean values of the features for each cluster, giving us a central point that represents each cluster.
- Analyze cluster sizes: It counts how many customers are in each cluster, which helps understand the distribution of customers across segments.
- Visualize cluster characteristics: It creates two box plots:
- One showing the distribution of Annual Income for each cluster
- Another showing the distribution of Spending Score for each cluster
This analysis is crucial for gaining insights into the distinct groups within the customer base, which can then be used to tailor marketing strategies and improve customer engagement.
Based on these visualizations and statistics, we can interpret our clusters (note that K-Means assigns cluster numbers arbitrarily, so they can differ between runs; always read the labels off your own centroids, and see the sketch after this list for a programmatic approach):
- Cluster 0: High income, high spending score - "Premium Customers"
- Cluster 1: Low income, high spending score - "Careful Spenders"
- Cluster 2: Medium income, medium spending score - "Average Customers"
- Cluster 3: High income, low spending score - "Potential Savers"
- Cluster 4: Low income, low spending score - "Budget Conscious"
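To keep such labels consistent across runs, one option is to derive them from the centroids rather than hard-coding cluster numbers. The sketch below is illustrative only: it distinguishes high from low relative to the overall medians, so "medium" segments like "Average Customers" would need a finer rule.
# Derive coarse labels from centroid positions relative to overall medians
# (uses the centroids DataFrame computed earlier; labels are illustrative)
income_median = customer_df_clean['Annual Income (k$)'].median()
score_median = customer_df_clean['Spending Score (1-100)'].median()
for cluster_id, row in centroids.iterrows():
    income_level = 'High' if row['Annual Income (k$)'] > income_median else 'Low'
    score_level = 'High' if row['Spending Score (1-100)'] > score_median else 'Low'
    print(f"Cluster {cluster_id}: {income_level} income, {score_level} spending")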
6.3.5 Evaluate Clustering Performance
To evaluate the effectiveness of our clustering approach, we'll employ the silhouette score, a powerful metric that quantifies how well each data point fits within its assigned cluster. This score provides valuable insights by measuring the similarity of an object to its own cluster in comparison to other clusters.
By analyzing these scores, we can gain a comprehensive understanding of our clustering quality and identify potential areas for improvement.
from sklearn.metrics import silhouette_score, silhouette_samples
# Ensure there are at least 2 clusters
if len(set(customer_df_clean['Cluster'])) > 1:
    # Calculate silhouette score
    silhouette_avg = silhouette_score(customer_df_scaled, customer_df_clean['Cluster'])
    print(f"The average silhouette score is: {silhouette_avg:.4f}")
    # Compute silhouette scores for each sample
    silhouette_values = silhouette_samples(customer_df_scaled, customer_df_clean['Cluster'])
    # Visualize silhouette scores
    plt.figure(figsize=(10, 6))
    plt.hist(silhouette_values, bins=20, alpha=0.7, edgecolor="black")
    plt.axvline(silhouette_avg, color="red", linestyle="--", label=f"Average Silhouette Score: {silhouette_avg:.4f}")
    plt.xlabel("Silhouette Score")
    plt.ylabel("Frequency")
    plt.title("Distribution of Silhouette Scores")
    plt.legend()
    plt.show()
else:
    print("Silhouette score cannot be computed with only one cluster.")
Here's a breakdown of what the code does:
- Calculate the silhouette score: This is done using the silhouette_score function, which measures how similar an object is to its own cluster compared to other clusters. The average silhouette score for all data points is calculated and printed.
- Calculate individual silhouette scores: The silhouette_samples function is used to compute the silhouette score for each data point.
- Visualize the distribution of silhouette scores: A histogram is created to show the distribution of silhouette scores across all data points. This helps in understanding the overall quality of the clustering.
- Add a vertical line for the average score: A red dashed line is added to the histogram to indicate the average silhouette score, making it easy to compare individual scores to the overall average.
The silhouette score ranges from -1 to 1, with higher values indicating better clustering. A score above 0.5 is generally considered good. This visualization helps in assessing the quality of the clustering and identifying potential areas for improvement.
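Beyond the overall average, a per-cluster breakdown can reveal which individual segments are poorly separated. A short sketch, assuming silhouette_values from the block above is available (i.e., more than one cluster was found):
# Average silhouette score per cluster
sil_by_cluster = pd.DataFrame({
    'Cluster': customer_df_clean['Cluster'].to_numpy(),
    'Silhouette': silhouette_values
}).groupby('Cluster')['Silhouette'].mean()
print(sil_by_cluster)
A cluster whose average is well below the overall score is a candidate for merging with a neighbor or for revisiting the choice of k.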
6.3.6 Potential Improvements and Future Work
While our current model provides valuable insights, there are several ways we could potentially improve its performance:
- Feature Engineering: Create new features or transform existing ones to capture more complex relationships. For example, we could create a feature that represents the ratio of spending score to annual income (a brief sketch follows this list).
- Try Other Algorithms: Experiment with more advanced clustering algorithms like DBSCAN or Gaussian Mixture Models, which can handle clusters of different shapes and densities (see the sketch after this list).
- Dimensionality Reduction: If we have more features, we could use techniques like PCA to reduce dimensionality before clustering.
- Incorporate More Data: If possible, include more customer attributes like age, gender, or purchase history to create more nuanced segments.
- Time Series Analysis: If we have data over time, we could analyze how customers move between segments.
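To make the first two ideas concrete, here is a brief, hedged sketch: the Spend_Income_Ratio feature name is hypothetical, and the Gaussian Mixture Model is fitted with the same number of components we chose for K-Means, which may not be the best setting for a GMM.
from sklearn.mixture import GaussianMixture
# 1. Feature engineering: a hypothetical spending-to-income ratio feature
customer_df_clean['Spend_Income_Ratio'] = (customer_df_clean['Spending Score (1-100)'] /
                                           customer_df_clean['Annual Income (k$)'])
# 2. Alternative algorithm: a Gaussian Mixture Model on the scaled features
gmm = GaussianMixture(n_components=optimal_k, random_state=42)
gmm_labels = gmm.fit_predict(customer_df_scaled)
# Compare segment sizes with the K-Means result
print(pd.Series(gmm_labels).value_counts().sort_index())
Unlike K-Means, a fitted GaussianMixture also exposes soft assignments via gmm.predict_proba, which can flag customers who sit between two segments.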
6.3.7 Conclusion
In this project, we implemented K-Means clustering to segment customers by annual income and spending score, working through the full workflow: loading and exploring the data, preprocessing it, choosing and fitting the model, and evaluating and interpreting the results.
The customer segments we've uncovered are not merely statistical groupings; they offer a nuanced view of distinct customer behaviors and preferences that can be leveraged to sharpen our business strategies. By tailoring marketing approaches to these groups, we can build more personalized and effective campaigns that resonate with each segment's characteristics.
Furthermore, these insights extend beyond marketing, potentially influencing product development, customer service approaches, and overall business decision-making. The ability to engage with customers in a more targeted manner, based on their segmentation, can lead to improved customer satisfaction, increased loyalty, and ultimately, better business outcomes. As we move forward, these customer segments will serve as a valuable foundation for data-driven decision making across various aspects of our operations.