Data Engineering Foundations

Chapter 1: Introduction: Moving Beyond the Basics

1.4 Practical Exercises for Chapter 1: Introduction: Moving Beyond the Basics

Now that you've completed Chapter 1, it's time to apply what you've learned. These exercises reinforce the concepts discussed and put them into practice. Each exercise states a problem and is followed by a solution code block. Try to work through each exercise on your own before checking the solution.

Exercise 1: Data Filtering and Aggregation with Pandas

You are given a dataset of customer purchases at different stores. Your task is to:

  1. Filter the dataset to show only transactions where the purchase amount exceeds $200.
  2. Group the transactions by store and calculate the total and average purchase amount per store.
import pandas as pd

# Sample data
data = {'TransactionID': [101, 102, 103, 104, 105],
        'Store': ['A', 'B', 'A', 'C', 'B'],
        'PurchaseAmount': [250, 120, 340, 400, 200],
        'Discount': [10, 15, 20, 25, 5]}

df = pd.DataFrame(data)

# Solution
# Step 1: Filter transactions where PurchaseAmount > 200
filtered_df = df[df['PurchaseAmount'] > 200]

# Step 2: Group by Store and calculate total and average purchase amounts.
# Amounts are taken net of discount (PurchaseAmount - Discount) and are
# computed over all transactions, not only the filtered ones.
df['NetPurchase'] = df['PurchaseAmount'] - df['Discount']
agg_purchases = df.groupby('Store').agg(
    TotalPurchase=('NetPurchase', 'sum'),
    AvgPurchase=('NetPurchase', 'mean')
)

print(filtered_df)
print(agg_purchases)
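
The solution above aggregates the net amount over every transaction. If the intent is instead to summarize only the filtered transactions by their gross PurchaseAmount, one possible variation is sketched below. The name filtered_summary is hypothetical and is not part of the original solution.

```python
import pandas as pd

data = {'TransactionID': [101, 102, 103, 104, 105],
        'Store': ['A', 'B', 'A', 'C', 'B'],
        'PurchaseAmount': [250, 120, 340, 400, 200],
        'Discount': [10, 15, 20, 25, 5]}
df = pd.DataFrame(data)

# Keep only transactions above $200 (strictly greater, so 200 is excluded)
filtered_df = df[df['PurchaseAmount'] > 200]

# Aggregate the gross purchase amount per store on the filtered rows only
# (filtered_summary is a hypothetical name used for illustration)
filtered_summary = filtered_df.groupby('Store').agg(
    TotalPurchase=('PurchaseAmount', 'sum'),
    AvgPurchase=('PurchaseAmount', 'mean'),
)

print(filtered_summary)
```

Store B does not appear in this summary because neither of its transactions exceeds $200.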

Exercise 2: Applying a Logarithmic Transformation with NumPy

You have a dataset of product sales with the following values: [100, 200, 50, 400, 300].

  1. Use NumPy to calculate the logarithmic transformation of the sales values.
  2. Print the transformed values.
import numpy as np

# Sales data
sales = [100, 200, 50, 400, 300]

# Solution
# Step 1: Apply the natural logarithm (base e) to each value
log_sales = np.log(sales)

# Step 2: Print the transformed values
print(log_sales)
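
As a quick check on the transformation, the sketch below shows that np.log is the natural logarithm and that np.exp reverses it. It also uses np.log1p, which may be a safer choice when the data can contain zeros. The sales_with_zero array is made up for illustration and does not come from the exercise.

```python
import numpy as np

sales = np.array([100, 200, 50, 400, 300], dtype=float)

# np.log is the natural logarithm (base e)
log_sales = np.log(sales)

# np.exp inverts the transformation and recovers the original values
recovered = np.exp(log_sales)
print(np.allclose(recovered, sales))

# np.log1p computes log(1 + x), so a value of 0 maps to 0 instead of -inf
# (sales_with_zero is hypothetical data used only for this illustration)
sales_with_zero = np.array([0.0, 100.0])
print(np.log1p(sales_with_zero))
```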

Exercise 3: Standardizing Sales Data with NumPy

Given the same sales data from Exercise 2, standardize the values by calculating the Z-score for each sales amount.

  1. Calculate the mean and standard deviation of the sales data.
  2. Use NumPy to compute the Z-score for each sale.
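import numpy as np

# Sales data (as a NumPy array so arithmetic is applied element-wise)
sales = np.array([100, 200, 50, 400, 300])

# Solution
# Step 1: Calculate mean and standard deviation
# (np.std defaults to the population standard deviation, ddof=0)
mean_sales = np.mean(sales)
std_sales = np.std(sales)

# Step 2: Calculate Z-scores
z_scores = (sales - mean_sales) / std_sales

print(z_scores)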
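
One way to sanity-check a standardization is to confirm that the result has mean 0 and standard deviation 1. The sketch below does this, and also contrasts NumPy's default population standard deviation (ddof=0) with the sample standard deviation (ddof=1). The exercise does not say which one is intended, so the second variant is shown only as an alternative.

```python
import numpy as np

sales = np.array([100, 200, 50, 400, 300], dtype=float)

# Population standard deviation (ddof=0) is NumPy's default
z_population = (sales - sales.mean()) / sales.std()

# Sample standard deviation (ddof=1) divides by n - 1 instead of n
z_sample = (sales - sales.mean()) / sales.std(ddof=1)

# Standardized values have mean 0 and, measured with the same ddof, std 1
print(z_population.mean(), z_population.std())
print(z_sample.mean(), z_sample.std(ddof=1))
```

Because the sample standard deviation is slightly larger, the z_sample values are slightly smaller in magnitude than the z_population values.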

Exercise 4: Building a Classification Model with Scikit-learn

You have a dataset of transactions where each transaction has a sales amount and a discount. Your goal is to build a simple classification model to predict whether a transaction has a high sales value (above $250) or not.

  1. Create a target variable (HighSales) where a sale is classified as 1 if the sales amount is above 250, otherwise 0.
  2. Use Scikit-learn to build a Random Forest model that predicts HighSales based on SalesAmount and Discount.
  3. Split the dataset into training and testing sets.
  4. Train the model and display the predictions for the test set.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Sample data
data = {'TransactionID': [101, 102, 103, 104, 105],
        'SalesAmount': [250, np.nan, 340, 400, 200],
        'Discount': [10, 15, 20, np.nan, 5],
        'Store': ['A', 'B', 'A', 'C', 'B']}

df = pd.DataFrame(data)

# Solution
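# Step 1: Handle missing values by filling them with the column mean.
# Assign the result back to the column; calling fillna(inplace=True) on a
# column selection is deprecated in recent pandas versions and may not update df.
df['SalesAmount'] = df['SalesAmount'].fillna(df['SalesAmount'].mean())
df['Discount'] = df['Discount'].fillna(df['Discount'].mean())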

# Step 2: Create target variable 'HighSales'
df['HighSales'] = (df['SalesAmount'] > 250).astype(int)

# Step 3: Define features and target
X = df[['SalesAmount', 'Discount']]
y = df['HighSales']

# Step 4: Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 5: Train a Random Forest model
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 6: Predict on test set
y_pred = clf.predict(X_test)

print("Predictions:", y_pred)
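
Printing predictions alone does not show how well they match the true labels. A minimal sketch of one way to check this, using accuracy_score from sklearn.metrics, follows. With only five rows the resulting number is illustrative rather than meaningful, and the mean imputation here is done in a single df.fillna(df.mean()) call as a compact alternative to filling each column separately.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'SalesAmount': [250, np.nan, 340, 400, 200],
                   'Discount': [10, 15, 20, np.nan, 5]})

# Impute missing values with each column's mean
df = df.fillna(df.mean())

# Target: 1 if the (imputed) sales amount is above 250, otherwise 0
df['HighSales'] = (df['SalesAmount'] > 250).astype(int)
X = df[['SalesAmount', 'Discount']]
y = df['HighSales']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Fraction of test rows whose predicted label matches the true label
accuracy = accuracy_score(y_test, y_pred)
print("Test rows:", len(y_test), "Accuracy:", accuracy)
```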

Exercise 5: Combining Pandas, NumPy, and Scikit-learn in a Workflow

You are working with a dataset of customer transactions. Your task is to:

  1. Handle missing values in the SalesAmount and Discount columns.
  2. Apply a logarithmic transformation to the SalesAmount using NumPy.
  3. Build a classification model with Scikit-learn to predict if a transaction is high value (HighSales).
  4. Split the dataset into training and testing sets.
  5. Train the model and make predictions on the test set.
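from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Sample data
data = {'TransactionID': [101, 102, 103, 104, 105],
        'SalesAmount': [250, np.nan, 340, 400, 200],
        'Discount': [10, 15, 20, np.nan, 5],
        'Store': ['A', 'B', 'A', 'C', 'B']}

df = pd.DataFrame(data)

# Solution
# Step 1: Handle missing values using Pandas (fill with the column mean).
# Assign the result back to the column; calling fillna(inplace=True) on a
# column selection is deprecated in recent pandas versions and may not update df.
df['SalesAmount'] = df['SalesAmount'].fillna(df['SalesAmount'].mean())
df['Discount'] = df['Discount'].fillna(df['Discount'].mean())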

# Step 2: Apply logarithmic transformation to SalesAmount
df['LogSales'] = np.log(df['SalesAmount'])

# Step 3: Create target variable 'HighSales'
df['HighSales'] = (df['SalesAmount'] > 250).astype(int)

# Step 4: Define features and target
X = df[['SalesAmount', 'Discount', 'LogSales']]
y = df['HighSales']

# Step 5: Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 6: Train a Random Forest model
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 7: Predict on test set
y_pred = clf.predict(X_test)

print("Predictions:", y_pred)
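
The preparation steps in this workflow (imputation, log transformation, target creation) can also be grouped into a single function so they are easy to reuse on another dataset. The sketch below is one possible arrangement. The function name prepare_features and its threshold parameter are hypothetical and do not appear in the original solution.

```python
import numpy as np
import pandas as pd


def prepare_features(raw: pd.DataFrame, threshold: float = 250) -> pd.DataFrame:
    """Return a cleaned copy of `raw` with LogSales and HighSales columns added.

    Hypothetical helper for illustration; the input frame is not modified.
    """
    out = raw.copy()
    # Fill missing values with each column's mean
    for col in ['SalesAmount', 'Discount']:
        out[col] = out[col].fillna(out[col].mean())
    # Natural-log transform of the imputed sales amount
    out['LogSales'] = np.log(out['SalesAmount'])
    # Target: 1 if the sales amount is above the threshold, otherwise 0
    out['HighSales'] = (out['SalesAmount'] > threshold).astype(int)
    return out


raw = pd.DataFrame({'SalesAmount': [250, np.nan, 340, 400, 200],
                    'Discount': [10, 15, 20, np.nan, 5]})
prepared = prepare_features(raw)
print(prepared)
```

The returned frame can then be split and passed to the classifier exactly as in Steps 4 to 7 above.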

These practical exercises cover the essential concepts discussed in Chapter 1, giving you the opportunity to practice filtering data, transforming features, and building machine learning models. The provided solutions help reinforce your understanding and ensure that you’re on the right track. Keep practicing, and feel free to explore different datasets and variations of these tasks!
