Feature Engineering for Modern Machine Learning with Scikit-Learn

Chapter 1: Real-World Data Analysis Projects

1.3 Practical Exercises for Chapter 1

These exercises provide hands-on practice for customer segmentation and data analysis techniques in retail and healthcare data. Each exercise is designed to strengthen your understanding of data preparation, exploration, and clustering techniques. Solutions with code are included for guidance.

Exercise 1: Handling Missing Values in Retail Data

You have a retail dataset with columns such as CustomerID, Age, Total Spend, and Purchase Frequency. Your task is to handle missing values as follows:

  1. Drop rows with missing CustomerID.
  2. Fill missing Age values with the median age.
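  3. Drop columns with more than 50% missing values.

import math

import pandas as pd

# Sample retail data with missing values
data = {'CustomerID': [1, 2, None, 4, 5],
        'Age': [25, 34, None, 45, 28],
        'Total Spend': [2000, 1500, 3000, None, 1800],
        'Purchase Frequency': [15, 10, 8, 5, None]}
df = pd.DataFrame(data)

# Solution: Handle missing values
# 1. Drop rows with a missing CustomerID
df.dropna(subset=['CustomerID'], inplace=True)
# 2. Fill missing Age values with the median age
#    (assign the result back; inplace fillna on a selected column is unreliable)
df['Age'] = df['Age'].fillna(df['Age'].median())
# 3. Keep only columns with at least 50% non-missing values (thresh expects an integer)
df = df.dropna(thresh=math.ceil(len(df) * 0.5), axis=1)

print("Data after handling missing values:")
print(df)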

In this solution:

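Rows with missing CustomerID are removed, Age values are filled with the median, and columns with more than 50% missing data are dropped. In this sample, the only row with a missing Age is also the row with the missing CustomerID, so nothing is left to fill after step 1, and no column exceeds the 50% threshold, so all four columns are kept.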
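Before dropping columns, it can help to measure how much of each column is actually missing. The following is a minimal sketch using the same sample data; the names missing_share and cols_to_drop are illustrative and not part of the original solution.

```python
import pandas as pd

# Same sample retail data as in the exercise
data = {'CustomerID': [1, 2, None, 4, 5],
        'Age': [25, 34, None, 45, 28],
        'Total Spend': [2000, 1500, 3000, None, 1800],
        'Purchase Frequency': [15, 10, 8, 5, None]}
df = pd.DataFrame(data)

# Step 1 of the exercise: remove rows without a CustomerID
df = df.dropna(subset=['CustomerID'])

# isna() yields booleans, so the column mean is the share of missing values
missing_share = df.isna().mean()

# Columns where more than 50% of the values are missing
cols_to_drop = missing_share[missing_share > 0.5].index.tolist()

print(missing_share)
print("Columns to drop:", cols_to_drop)
```

With this data, no column has more than a quarter of its values missing, so cols_to_drop comes back empty.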

Exercise 2: Encoding Categorical Variables in Healthcare Data

Given a healthcare dataset with columns like Gender and Diagnosis, apply one-hot encoding to convert these categorical variables into dummy variables.

import pandas as pd

# Sample healthcare data
data = {'PatientID': [1, 2, 3, 4],
        'Gender': ['Male', 'Female', 'Female', 'Male'],
        'Diagnosis': ['Diabetes', 'Hypertension', 'Diabetes', 'Heart Disease']}
df = pd.DataFrame(data)

# Solution: Apply one-hot encoding
df_encoded = pd.get_dummies(df, columns=['Gender', 'Diagnosis'], drop_first=True)

print("Data after encoding categorical variables:")
print(df_encoded)

In this solution:

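One-hot encoding converts Gender and Diagnosis into dummy variables. Setting drop_first=True removes the first category of each variable, which is implied whenever all of its remaining dummies are 0; this avoids perfect multicollinearity among the dummy columns.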
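To see which categories drop_first=True removes, one option is to compare the encoded columns with and without it. This is a small sketch on the same sample data; the variable names full, reduced, and dropped are illustrative.

```python
import pandas as pd

# Same sample healthcare data as in the exercise
df = pd.DataFrame({
    'PatientID': [1, 2, 3, 4],
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Diagnosis': ['Diabetes', 'Hypertension', 'Diabetes', 'Heart Disease'],
})

# Encode once keeping every category, once dropping the first of each
full = pd.get_dummies(df, columns=['Gender', 'Diagnosis'])
reduced = pd.get_dummies(df, columns=['Gender', 'Diagnosis'], drop_first=True)

# Columns present in the full encoding but absent from the reduced one
dropped = [col for col in full.columns if col not in reduced.columns]

print("Kept:", list(reduced.columns))
print("Dropped:", dropped)
```

The dropped columns are the alphabetically first category of each variable (Female and Diabetes), which become the implicit baseline.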

Exercise 3: Standardizing Features for Clustering

Using a retail dataset with columns Total Spend, Purchase Frequency, and Age, standardize these features to prepare for clustering.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Sample retail data
data = {'Total Spend': [2000, 3000, 2500, 1800, 3500],
        'Purchase Frequency': [15, 10, 20, 5, 12],
        'Age': [25, 30, 35, 40, 29]}
df = pd.DataFrame(data)

# Solution: Standardize features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df)

print("Standardized Features:")
print(scaled_features)

In this solution:

StandardScaler rescales Total Spend, Purchase Frequency, and Age to zero mean and unit variance, so that no single feature dominates the distance calculations used in clustering.
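As a quick check of what standardization does, the sketch below recomputes the z-scores by hand and compares them with the StandardScaler output. It assumes the same sample data; the names scaled and manual are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same sample retail data as in the exercise
df = pd.DataFrame({'Total Spend': [2000, 3000, 2500, 1800, 3500],
                   'Purchase Frequency': [15, 10, 20, 5, 12],
                   'Age': [25, 30, 35, 40, 29]})

scaled = StandardScaler().fit_transform(df)

# Manual z-score: subtract the column mean, divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses
manual = (df - df.mean()) / df.std(ddof=0)

print("Matches StandardScaler:", np.allclose(scaled, manual.to_numpy()))
print("Column means:", scaled.mean(axis=0).round(6))
print("Column std devs:", scaled.std(axis=0).round(6))
```

Note that pandas defaults to the sample standard deviation (ddof=1), so the ddof=0 argument is needed for the two results to agree.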

Exercise 4: Applying K-means for Customer Segmentation

Using a dataset with Total Spend, Purchase Frequency, and Age columns, apply K-means clustering with three clusters to segment customers.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Sample retail data
data = {'Total Spend': [2000, 3000, 2500, 1800, 3500],
        'Purchase Frequency': [15, 10, 20, 5, 12],
        'Age': [25, 30, 35, 40, 29]}
df = pd.DataFrame(data)

# Standardize features before clustering
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df)

# Solution: Apply K-means with 3 clusters
# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(scaled_features)

print("Clustered Data with K-means:")
print(df)

In this solution:

K-means clustering with n_clusters=3 segments customers, and each customer is assigned to a cluster based on their spending, frequency, and age.
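Cluster labels are easier to interpret alongside the cluster centers. The sketch below, which assumes the same sample data, maps the centroids back to the original units with inverse_transform and counts the customers per cluster; the names features, centers, and sizes are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same sample retail data as in the exercise
df = pd.DataFrame({'Total Spend': [2000, 3000, 2500, 1800, 3500],
                   'Purchase Frequency': [15, 10, 20, 5, 12],
                   'Age': [25, 30, 35, 40, 29]})
features = list(df.columns)

scaler = StandardScaler()
scaled_features = scaler.fit_transform(df)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(scaled_features)

# Centroids are in standardized space; convert them back to original units
centers = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_),
                       columns=features)

# Number of customers assigned to each cluster
sizes = df['Cluster'].value_counts().sort_index()

print(centers.round(1))
print(sizes)
```

The label numbers themselves carry no meaning; only the grouping and the center values describe the segments.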

Exercise 5: Using the Elbow Method to Select Optimal K

Using the same retail dataset, use the Elbow Method to determine the optimal number of clusters (K) for K-means clustering.

import matplotlib.pyplot as plt

# Reuses KMeans and scaled_features from Exercise 4
inertia_values = []
K_range = range(1, 6)

# Calculate inertia for each K
for k in K_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(scaled_features)
    inertia_values.append(kmeans.inertia_)

# Plot inertia values
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertia_values, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()

In this solution:

The loop records the inertia (the within-cluster sum of squared distances) for each K, and the plot lets us look for the “elbow”: the point where adding clusters stops significantly reducing inertia.
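The elbow can also be read from the numbers by looking at how much inertia falls with each added cluster. This is a rough sketch on the same sample data, not a formal selection rule; the names inertia_values and drops are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same sample retail data as in the exercise
df = pd.DataFrame({'Total Spend': [2000, 3000, 2500, 1800, 3500],
                   'Purchase Frequency': [15, 10, 20, 5, 12],
                   'Age': [25, 30, 35, 40, 29]})
scaled_features = StandardScaler().fit_transform(df)

# Inertia for K = 1..5
inertia_values = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled_features)
    inertia_values.append(model.inertia_)

# Reduction in inertia gained by moving from K-1 to K clusters
drops = [inertia_values[i - 1] - inertia_values[i]
         for i in range(1, len(inertia_values))]

for k, drop in zip(range(2, 6), drops):
    print(f"K={k}: inertia falls by {drop:.3f}")
```

With only five customers, K=5 places each customer in its own cluster and inertia approaches zero, so this toy dataset illustrates the mechanics rather than a meaningful elbow.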

These exercises cover data preparation, feature encoding, standardization, K-means clustering, and selecting the optimal number of clusters. By practicing these steps, you’ll gain a strong understanding of data analysis and segmentation techniques in retail and healthcare contexts.
