Data Engineering Foundations

Chapter 2: Optimizing Data Workflows

2.4 Practical Exercises for Chapter 2: Optimizing Data Workflows

Now that you've completed Chapter 2, it's time to practice what you've learned. The following exercises help you apply advanced data manipulation techniques in Pandas, speed up computation with NumPy, and combine tools into efficient analysis workflows. Each exercise includes a solution code block so you can check your work.

Exercise 1: Advanced Data Manipulation with Pandas

You are given a dataset of online orders from an e-commerce store. Your task is to:

  1. Filter the dataset to include only orders where the order amount is greater than $200.
  2. Group the dataset by both Category and CustomerID to calculate the total and average order amounts for each group.
  3. Pivot the dataset so that each Category is a column, and the rows represent each CustomerID.

import pandas as pd

# Sample data: Online orders
data = {'OrderID': [1, 2, 3, 4, 5],
        'CustomerID': [101, 102, 103, 101, 104],
        'Category': ['Electronics', 'Clothing', 'Electronics', 'Furniture', 'Furniture'],
        'OrderAmount': [250, 120, 300, 400, 500]}

df = pd.DataFrame(data)

# Solution
# Step 1: Filter orders where OrderAmount > 200
filtered_df = df[df['OrderAmount'] > 200]

# Step 2: Group by Category and CustomerID, and calculate total and average order amounts
grouped_df = filtered_df.groupby(['Category', 'CustomerID']).agg(
    TotalAmount=('OrderAmount', 'sum'),
    AvgAmount=('OrderAmount', 'mean')
).reset_index()

# Step 3: Pivot the dataset so that Category is a column
pivot_df = grouped_df.pivot(index='CustomerID', columns='Category', values='TotalAmount').fillna(0)

print(pivot_df)
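
A quick note on pivot versus pivot_table: pivot raises a ValueError if the same (CustomerID, Category) pair appears more than once, because it cannot decide which value to keep. When duplicates are possible, pivot_table aggregates them for you. Here is a minimal sketch that starts from filtered_df and folds the grouping into the aggregation (the variable name pivot_alt is ours):

# pivot_table aggregates duplicate (CustomerID, Category) pairs instead of raising an error
pivot_alt = filtered_df.pivot_table(index='CustomerID',
                                    columns='Category',
                                    values='OrderAmount',
                                    aggfunc='sum',
                                    fill_value=0)

print(pivot_alt)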

Exercise 2: Enhancing Performance with NumPy

Given an array of product prices, your task is to:

  1. Apply a logarithmic transformation to normalize the prices.
  2. Use broadcasting to apply a 20% discount to each price.
  3. Calculate the average discounted price using NumPy's vectorized functions.

import numpy as np

# Sample data: Product prices
prices = np.array([100, 150, 200, 250, 300])

# Solution
# Step 1: Apply a logarithmic transformation
log_prices = np.log(prices)

# Step 2: Apply a 20% discount using broadcasting
discounted_prices = prices * 0.80

# Step 3: Calculate the average discounted price
average_discounted_price = np.mean(discounted_prices)

print("Logarithmic Prices:", log_prices)
print("Discounted Prices:", discounted_prices)
print("Average Discounted Price:", average_discounted_price)

Exercise 3: Combining Pandas and NumPy for Feature Engineering

You have a dataset of customer transactions, including the purchase amount and discount received. Your task is to:

  1. Fill missing values in the PurchaseAmount and Discount columns with the mean of each column.
  2. Create a new feature, NetPurchase, which is the purchase amount after applying the discount.
  3. Use NumPy to create an interaction feature by multiplying the PurchaseAmount and Discount columns.

import pandas as pd
import numpy as np

# Sample data: Customer transactions
data = {'CustomerID': [1, 2, 3, 4, 5],
        'PurchaseAmount': [250, np.nan, 300, 400, np.nan],
        'Discount': [10, 15, 20, np.nan, 5]}

df = pd.DataFrame(data)

# Solution
# Step 1: Fill missing values
df['PurchaseAmount'] = df['PurchaseAmount'].fillna(df['PurchaseAmount'].mean())
df['Discount'] = df['Discount'].fillna(df['Discount'].mean())

# Step 2: Create NetPurchase feature
df['NetPurchase'] = df['PurchaseAmount'] - df['Discount']

# Step 3: Create an interaction feature using NumPy
df['Interaction_Purchase_Discount'] = df['PurchaseAmount'] * df['Discount']

print(df)
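
NumPy also makes conditional features cheap to build. As a sketch (the $300 threshold is purely illustrative), np.where flags large purchases in a single vectorized pass, with no Python loop:

# Flag purchases above an illustrative threshold, vectorized over the whole column
df['LargePurchase'] = np.where(df['PurchaseAmount'] > 300, 1, 0)

print(df[['PurchaseAmount', 'LargePurchase']])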

Exercise 4: Building a Classification Model with Scikit-learn

You have a dataset of customer transactions. Your task is to:

  1. Create a target variable that flags purchases greater than $300 as high value.
  2. Use Scikit-learn to split the data into training and testing sets.
  3. Build a Random Forest classification model to predict high-value purchases.
  4. Evaluate the model by calculating the accuracy on the test set.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Sample data: Customer transactions
data = {'CustomerID': [1, 2, 3, 4, 5],
        'PurchaseAmount': [250, 350, 300, 400, 150],
        'Discount': [10, 15, 20, 5, 5]}

df = pd.DataFrame(data)

# Solution
# Step 1: Create target variable (high value if PurchaseAmount > 300)
df['HighValue'] = (df['PurchaseAmount'] > 300).astype(int)

# Step 2: Define features and target
X = df[['PurchaseAmount', 'Discount']]
y = df['HighValue']

# Step 3: Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Build and train a Random Forest model
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 5: Predict and evaluate accuracy on the test set
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.2f}")

Exercise 5: Using Scikit-learn Pipelines for Streamlined Workflows

You are tasked with creating a streamlined workflow for customer transaction data. Your task is to:

  1. Create a Scikit-learn pipeline that imputes missing values, scales the features, and trains a Random Forest model.
  2. Train the pipeline on the dataset and evaluate the model's performance.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Sample data: Customer transactions
data = {'CustomerID': [1, 2, 3, 4, 5],
        'PurchaseAmount': [250, np.nan, 300, 400, np.nan],
        'Discount': [10, 15, 20, np.nan, 5]}

df = pd.DataFrame(data)

# Solution
# Step 1: Define the target and features
# Note: rows with a missing PurchaseAmount compare as False and are labeled 0
df['HighValue'] = (df['PurchaseAmount'] > 300).astype(int)
X = df[['PurchaseAmount', 'Discount']]
y = df['HighValue']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 2: Create the pipeline
pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values
    ('scaler', StandardScaler()),  # Scale features
    ('classifier', RandomForestClassifier(random_state=42))  # Train Random Forest model
])

# Step 3: Train the pipeline
pipeline.fit(X_train, y_train)

# Step 4: Make predictions and evaluate the model
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Pipeline Model Accuracy: {accuracy:.2f}")

These practical exercises allow you to apply the concepts covered in Chapter 2, giving you hands-on experience with advanced data manipulation, performance enhancement using NumPy, and creating efficient workflows using Scikit-learn. Keep practicing to deepen your understanding!
