Chapter 3: The Role of Feature Engineering in Machine Learning
3.3 Practical Exercises for Chapter 3
Now that you’ve completed Chapter 3, it’s time to practice the feature engineering techniques you’ve learned. Each of the following exercises comes with a worked solution. Together they cover creating interaction features, extracting time-based features, binning numerical features, target encoding, and calculating time differences.
Exercise 1: Creating an Interaction Feature
You are working with a dataset of car sales. The dataset contains columns for EngineSize (in liters) and HorsePower. Your task is to:
Create a new feature called PowerToEngineRatio that represents the ratio of horsepower to engine size.
Solution:
import pandas as pd
# Sample data: Car sales
data = {'CarID': [1, 2, 3, 4, 5],
'EngineSize': [2.0, 3.0, 4.0, 2.5, 3.5],
'HorsePower': [150, 200, 250, 180, 220]}
df = pd.DataFrame(data)
# Create an interaction feature: PowerToEngineRatio
df['PowerToEngineRatio'] = df['HorsePower'] / df['EngineSize']
# View the result
print(df[['EngineSize', 'HorsePower', 'PowerToEngineRatio']])
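A ratio feature is only defined when the denominator is non-zero. The sample data above has no zero engine sizes, but real data might (for example, an electric car recorded with EngineSize 0). The following is a minimal sketch of one way to guard against that; the `cars` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the third car is recorded with EngineSize 0
cars = pd.DataFrame({'EngineSize': [2.0, 3.0, 0.0],
                     'HorsePower': [150, 200, 300]})

# Replace zero engine sizes with NaN so the ratio is missing rather than infinite
safe_engine_size = cars['EngineSize'].replace(0, np.nan)
cars['PowerToEngineRatio'] = cars['HorsePower'] / safe_engine_size

print(cars)
```

Leaving the ratio missing keeps the decision about how to fill it (or whether to drop the row) explicit, rather than passing an infinite value to a model.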
Exercise 2: Handling Time-Based Features
You are given a dataset containing sales transaction data. The dataset includes a TransactionDate column. Your task is to:
- Convert the TransactionDate column to a datetime format.
- Extract the year, month, and day of the week from the TransactionDate column.
Solution:
import pandas as pd
# Sample data: Sales transactions
data = {'TransactionID': [101, 102, 103, 104, 105],
'TransactionDate': ['2022-05-15', '2023-03-10', '2023-07-22', '2022-12-01', '2023-01-14']}
df = pd.DataFrame(data)
# Convert the TransactionDate column to datetime format
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'])
# Extract year, month, and day of the week (dayofweek: 0 = Monday, ..., 6 = Sunday)
df['Year'] = df['TransactionDate'].dt.year
df['Month'] = df['TransactionDate'].dt.month
df['DayOfWeek'] = df['TransactionDate'].dt.dayofweek
# View the result
print(df[['TransactionDate', 'Year', 'Month', 'DayOfWeek']])
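The integer day of the week can be hard to read at a glance. As a small sketch, the same accessor can also return day names, and the integer can be turned into a weekend flag. The `features` frame and the `IsWeekend` column are illustrative names, not part of the exercise.

```python
import pandas as pd

# Three of the sample transaction dates from the exercise
dates = pd.to_datetime(pd.Series(['2022-05-15', '2023-03-10', '2023-07-22']))

features = pd.DataFrame({'TransactionDate': dates})
features['DayOfWeek'] = dates.dt.dayofweek        # 0 = Monday, 6 = Sunday
features['DayName'] = dates.dt.day_name()         # human-readable day name
features['IsWeekend'] = features['DayOfWeek'] >= 5  # Saturday (5) or Sunday (6)

print(features)
```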
Exercise 3: Binning Numerical Features
You are working with a dataset of customer purchases, which includes a PurchaseAmount column. Your task is to:
Bin the PurchaseAmount into three categories: Low, Medium, and High. Use the following bin ranges:
- Low: less than $100
- Medium: $100 to less than $500
- High: $500 or more
Solution:
import pandas as pd
# Sample data: Customer purchases
data = {'CustomerID': [1, 2, 3, 4, 5],
'PurchaseAmount': [50, 150, 700, 300, 600]}
df = pd.DataFrame(data)
# Define the bin edges and labels
bins = [0, 100, 500, float('inf')]
labels = ['Low', 'Medium', 'High']
# Bin the PurchaseAmount into categories.
# right=False makes each bin include its left edge and exclude its right edge,
# so exactly $100 is Medium and exactly $500 is High.
df['PurchaseCategory'] = pd.cut(df['PurchaseAmount'], bins=bins, labels=labels, right=False)
# View the result
print(df[['PurchaseAmount', 'PurchaseCategory']])
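Values that sit exactly on a bin edge are where the `right` parameter of pd.cut matters. The sketch below uses two made-up boundary amounts to compare the default right-closed bins with left-closed bins (`right=False`); the variable names are illustrative.

```python
import pandas as pd

bins = [0, 100, 500, float('inf')]
labels = ['Low', 'Medium', 'High']

# Made-up amounts that fall exactly on the bin edges
edge_amounts = pd.Series([100, 500])

# Default (right=True): bins are (0, 100], (100, 500], (500, inf]
right_closed = pd.cut(edge_amounts, bins=bins, labels=labels).astype(str).tolist()

# right=False: bins are [0, 100), [100, 500), [500, inf)
left_closed = pd.cut(edge_amounts, bins=bins, labels=labels, right=False).astype(str).tolist()

print(right_closed)
print(left_closed)
```

Checking a few edge values like this is a quick way to confirm that the code matches the bin definitions you intended.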
Exercise 4: Target Encoding for Categorical Variables
You are working with a dataset of house prices. The dataset includes a Neighborhood column and a SalePrice column. Your task is to:
Perform target encoding on the Neighborhood column by replacing each neighborhood with the average SalePrice for that neighborhood.
Solution:
import pandas as pd
# Sample data: House prices
data = {'HouseID': [1, 2, 3, 4, 5],
'Neighborhood': ['A', 'B', 'A', 'C', 'B'],
'SalePrice': [300000, 450000, 350000, 500000, 470000]}
df = pd.DataFrame(data)
# Calculate the average SalePrice for each neighborhood
neighborhood_avg_price = df.groupby('Neighborhood')['SalePrice'].mean()
# Perform target encoding by mapping the average prices back to the Neighborhood column
df['NeighborhoodEncoded'] = df['Neighborhood'].map(neighborhood_avg_price)
# View the result
print(df[['Neighborhood', 'SalePrice', 'NeighborhoodEncoded']])
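The solution above computes the averages from every row, which is fine for practice. When the encoded column feeds a model, the averages are usually computed from training rows only, so that the target values of held-out rows do not leak into their own features. The sketch below assumes a hypothetical split of the sample data (first three houses for training, last two held out) and falls back to the overall training mean for neighborhoods not seen in training.

```python
import pandas as pd

houses = pd.DataFrame({'Neighborhood': ['A', 'B', 'A', 'C', 'B'],
                       'SalePrice': [300000, 450000, 350000, 500000, 470000]})

# Hypothetical split: first three rows for training, last two held out
train = houses.iloc[:3].copy()
test = houses.iloc[3:].copy()

# Learn the encoding from the training rows only
train_means = train.groupby('Neighborhood')['SalePrice'].mean()
global_mean = train['SalePrice'].mean()

# Apply it to both sets; unseen neighborhoods (here 'C') get the global mean
train['NeighborhoodEncoded'] = train['Neighborhood'].map(train_means)
test['NeighborhoodEncoded'] = test['Neighborhood'].map(train_means).fillna(global_mean)

print(test)
```

Falling back to the global mean is one simple choice among several; smoothing toward the global mean or out-of-fold encoding are common alternatives.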
Exercise 5: Calculating Time Differences
You are given a dataset of property listings, with columns ListingDate and SaleDate. Your task is to:
Calculate the number of days a property has been on the market (i.e., the difference between the SaleDate and ListingDate).
Solution:
import pandas as pd
# Sample data: Property listings
data = {'PropertyID': [1, 2, 3, 4, 5],
'ListingDate': ['2023-01-01', '2023-02-15', '2023-03-01', '2023-04-01', '2023-05-01'],
'SaleDate': ['2023-03-15', '2023-04-01', '2023-03-20', '2023-05-15', '2023-06-01']}
df = pd.DataFrame(data)
# Convert ListingDate and SaleDate to datetime format
df['ListingDate'] = pd.to_datetime(df['ListingDate'])
df['SaleDate'] = pd.to_datetime(df['SaleDate'])
# Calculate the number of days on market
df['DaysOnMarket'] = (df['SaleDate'] - df['ListingDate']).dt.days
# View the result
print(df[['ListingDate', 'SaleDate', 'DaysOnMarket']])
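In the sample data every property has a SaleDate. If some listings were still unsold, their SaleDate would be missing, and the subtraction would propagate that as a missing value. The sketch below illustrates this with a made-up unsold listing; the `listings` frame is hypothetical.

```python
import pandas as pd

# Hypothetical listings: the second property has not sold yet
listings = pd.DataFrame({'ListingDate': ['2023-01-01', '2023-02-15'],
                         'SaleDate': ['2023-03-15', None]})

listings['ListingDate'] = pd.to_datetime(listings['ListingDate'])
listings['SaleDate'] = pd.to_datetime(listings['SaleDate'])  # None becomes NaT

# A missing SaleDate yields a missing DaysOnMarket (NaN) instead of an error
listings['DaysOnMarket'] = (listings['SaleDate'] - listings['ListingDate']).dt.days

print(listings)
```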
These practical exercises reinforce the feature engineering techniques covered in Chapter 3. By creating interaction features, extracting time-based features, binning numerical variables, applying target encoding, and calculating time differences, you’ve gained hands-on experience in transforming raw data into features that can improve model performance. Keep practicing these techniques on your own datasets to strengthen your machine learning workflows.