Chapter 6: Encoding Categorical Variables
6.3 Practical Exercises for Chapter 6
In this practical section, we will reinforce what you’ve learned in Chapter 6 by implementing various encoding techniques, including Target Encoding, Frequency Encoding, and Ordinal Encoding. These exercises will help you gain hands-on experience and apply the concepts to real-world scenarios.
Exercise 1: Target Encoding
You are working with a dataset that contains a Neighborhood column and a house-price target variable, SalePrice. Your task is to:
Apply Target Encoding to the Neighborhood column by replacing each neighborhood with the mean SalePrice of its rows.
Solution:
import pandas as pd
# Sample data
data = {'Neighborhood': ['A', 'B', 'A', 'C', 'B'],
        'SalePrice': [300000, 450000, 350000, 500000, 470000]}
df = pd.DataFrame(data)
# Calculate the mean SalePrice for each neighborhood
neighborhood_mean = df.groupby('Neighborhood')['SalePrice'].mean()
# Apply Target Encoding by mapping the mean SalePrice to the Neighborhood column
df['NeighborhoodEncoded'] = df['Neighborhood'].map(neighborhood_mean)
# View the encoded dataframe
print(df[['Neighborhood', 'SalePrice', 'NeighborhoodEncoded']])
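The mapping learned above only covers neighborhoods that appear in the data it was computed from. The following is a minimal sketch of applying it to new rows, assuming you want unseen neighborhoods to fall back to the global mean. The new_data frame, the neighborhood 'D', and the fallback policy are illustrative assumptions, not part of the exercise.

```python
import pandas as pd

# Training data from Exercise 1
train = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'A', 'C', 'B'],
    'SalePrice': [300000, 450000, 350000, 500000, 470000],
})

# Learn the encoding from the training rows only
neighborhood_mean = train.groupby('Neighborhood')['SalePrice'].mean()
global_mean = train['SalePrice'].mean()

# Hypothetical new rows; 'D' never appeared in the training data
new_data = pd.DataFrame({'Neighborhood': ['A', 'D']})

# Unseen categories map to NaN, so fall back to the global mean (an assumed policy)
new_data['NeighborhoodEncoded'] = (
    new_data['Neighborhood'].map(neighborhood_mean).fillna(global_mean)
)
print(new_data)
```

Computing the means on training rows only, and reusing them for new rows, keeps target values from the new data out of the feature.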
Exercise 2: Target Encoding with Smoothing
Working with the same dataset as Exercise 1, apply Target Encoding with Smoothing to reduce the risk of overfitting. Smoothing pulls each neighborhood's mean toward the global mean, and the pull is strongest for neighborhoods with few rows. Use the smoothing parameter alpha = 5.
Solution:
# Smoothing parameter: acts like alpha extra rows priced at the global mean
alpha = 5
# Global mean SalePrice
global_mean = df['SalePrice'].mean()
# Per-neighborhood mean and row count, computed once instead of once per row
stats = df.groupby('Neighborhood')['SalePrice'].agg(['mean', 'count'])
# Smoothed mean: weighted average of the neighborhood mean and the global mean
smoothed_mean = (stats['mean'] * stats['count'] + global_mean * alpha) / (stats['count'] + alpha)
df['NeighborhoodEncoded'] = df['Neighborhood'].map(smoothed_mean)
# View the smoothed encoded dataframe
print(df[['Neighborhood', 'NeighborhoodEncoded']])
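To see what the smoothing formula does, it can help to work through the numbers by hand. The sketch below recomputes the smoothed value for each neighborhood from its mean and row count in the sample data. The helper name smoothed_mean is introduced here for illustration only.

```python
def smoothed_mean(category_mean, count, global_mean, alpha):
    """Weighted average of the category mean and the global mean."""
    return (category_mean * count + global_mean * alpha) / (count + alpha)

# Global mean of the five sample prices
global_mean = (300000 + 450000 + 350000 + 500000 + 470000) / 5  # 414000.0

# (neighborhood mean, number of rows) for each neighborhood in the sample data
stats = {'A': (325000, 2), 'B': (460000, 2), 'C': (500000, 1)}

for name, (mean, count) in stats.items():
    value = smoothed_mean(mean, count, global_mean, alpha=5)
    print(f"{name}: {value:.2f}")
# A: 388571.43
# B: 427142.86
# C: 428333.33
```

With alpha = 5 and at most two rows per neighborhood, every encoded value ends up much closer to the global mean (414000) than to the raw neighborhood mean. Neighborhood C, with a single row, moves the furthest.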
Exercise 3: Frequency Encoding
You are working with a dataset that includes the City column. Your task is to:
Apply Frequency Encoding to the City column by replacing each city with its frequency (the number of times it appears) in the dataset.
Solution:
# Sample data
data = {'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Houston', 'Los Angeles']}
df = pd.DataFrame(data)
# Perform frequency encoding
df['City_Frequency'] = df.groupby('City')['City'].transform('count')
# View the encoded dataframe
print(df)
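The solution above encodes each city as a raw count. A common variation, shown here only as a sketch, is to use the share of rows instead, so that the encoded values do not depend on the size of the dataset. The column name City_RelativeFrequency is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago',
                            'New York', 'Houston', 'Los Angeles']})

# Share of rows belonging to each city (the shares sum to 1 across cities)
city_share = df['City'].value_counts(normalize=True)

# Replace each city with its share of the dataset
df['City_RelativeFrequency'] = df['City'].map(city_share)
print(df)
```

Here New York and Los Angeles each map to 2/6, while Chicago and Houston each map to 1/6.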
Exercise 4: Ordinal Encoding
You are working with a dataset that contains an EducationLevel column, with categories like High School, Bachelor, Master, and PhD. Your task is to:
Apply Ordinal Encoding to the EducationLevel column, mapping the education levels to integers that follow their natural order, from the lowest level (High School) to the highest (PhD).
Solution:
# Sample data
data = {'EducationLevel': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor']}
df = pd.DataFrame(data)
# Define the ordinal mapping
education_order = {'High School': 1, 'Bachelor': 2, 'Master': 3, 'PhD': 4}
# Apply Ordinal Encoding
df['EducationLevelEncoded'] = df['EducationLevel'].map(education_order)
# View the encoded dataframe
print(df)
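An alternative to a hand-written dictionary is to let pandas carry the ordering itself through an ordered categorical type. The following is a minimal sketch of that approach, assuming the same four levels as the exercise. The levels list is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'EducationLevel': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor']})

# Declare the levels from lowest to highest
levels = ['High School', 'Bachelor', 'Master', 'PhD']
df['EducationLevel'] = pd.Categorical(df['EducationLevel'], categories=levels, ordered=True)

# cat.codes starts at 0; add 1 to match the 1-4 mapping used in the solution
df['EducationLevelEncoded'] = df['EducationLevel'].cat.codes + 1
print(df)
```

Because the column is ordered, comparisons such as df['EducationLevel'] >= 'Master' also work. Note that a value missing from levels becomes NaN with code -1, so it would show up as 0 after the + 1 shift.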
Exercise 5: Handling High Cardinality with Frequency Encoding
You are working with a dataset that includes a Product Category column, with hundreds of unique product categories. Your task is to:
Apply Frequency Encoding to the Product Category column to handle the high cardinality and simplify the dataset.
Solution:
# Small sample for illustration; a real high-cardinality column would have hundreds of categories
data = {'ProductCategory': ['Electronics', 'Furniture', 'Electronics', 'Clothing', 'Furniture', 'Clothing', 'Electronics']}
df = pd.DataFrame(data)
# Perform frequency encoding
df['ProductCategory_Frequency'] = df.groupby('ProductCategory')['ProductCategory'].transform('count')
# View the encoded dataframe
print(df)
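One caveat is worth checking when you apply Frequency Encoding to many categories: categories that occur equally often receive the same encoded value and become indistinguishable to the model. The sketch below lists such collisions in the sample data. The variable names counts and shared are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({'ProductCategory': ['Electronics', 'Furniture', 'Electronics',
                                       'Clothing', 'Furniture', 'Clothing', 'Electronics']})

# Number of rows per category
counts = df['ProductCategory'].value_counts()

# Keep only the categories whose count is shared with at least one other category
shared = counts[counts.duplicated(keep=False)]
print(shared)
```

In this sample, Furniture and Clothing both appear twice, so both are encoded as 2. Whether that matters depends on the model and on whether the two categories behave differently with respect to the target.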
These practical exercises give you hands-on experience with various encoding techniques, including Target Encoding, Frequency Encoding, and Ordinal Encoding. By practicing these methods, you will be well-equipped to handle categorical variables in machine learning models efficiently, especially when dealing with high-cardinality features or when there are meaningful relationships between the categorical variables and the target variable. Keep practicing and experimenting with different datasets to strengthen your understanding of these techniques!