Feature Engineering for Modern Machine Learning with Scikit-Learn

Chapter 4: Feature Engineering for Model Improvement

4.3 Practical Exercises for Chapter 4

These exercises will help you practice feature selection and model tuning using Recursive Feature Elimination (RFE) and feature importance. Each exercise includes a solution with code for guidance.

Exercise 1: Identify Important Features with Random Forests

Use a Random Forest Classifier to identify the most important features in a dataset. Focus on understanding the feature importance scores and select the top features based on these scores.

  1. Load the dataset and split it into training and testing sets.
  2. Train a Random Forest Classifier and display the feature importance scores.
  3. Select the top 5 features based on their importance and re-evaluate the model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Generate a sample dataset
X, y = make_classification(n_samples=200, n_features=10, n_informative=6, random_state=42)
df = pd.DataFrame(X, columns=[f'Feature_{i}' for i in range(1, 11)])
df['Target'] = y

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['Target']), df['Target'], test_size=0.3, random_state=42)

# Solution: Train Random Forest model and calculate feature importance
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
feature_importances = pd.DataFrame({'Feature': X_train.columns, 'Importance': rf_model.feature_importances_})
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

print("Feature Importance Ranking:")
print(feature_importances)

# Select top 5 features
top_features = feature_importances['Feature'].head(5).values
X_train_top = X_train[top_features]
X_test_top = X_test[top_features]

# Train Random Forest with top 5 features and evaluate
rf_model.fit(X_train_top, y_train)
y_pred = rf_model.predict(X_test_top)
print("Accuracy with Top 5 Features:", accuracy_score(y_test, y_pred))

In this solution:

  • We calculate the importance of each feature and select the top 5 based on their scores.
  • The model’s accuracy is evaluated using only the top features to assess the impact on performance.
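
To make that assessment concrete, you also need a baseline on all ten features. Here is a small optional sketch (reusing the variables above; rf_full is a name introduced here, not part of the original solution):

# Optional baseline: train a fresh forest on all 10 features and compare test accuracy
rf_full = RandomForestClassifier(random_state=42)
rf_full.fit(X_train, y_train)
print("Accuracy with All Features:", accuracy_score(y_test, rf_full.predict(X_test)))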

Exercise 2: Apply Recursive Feature Elimination (RFE) with Logistic Regression

Use RFE with a Logistic Regression model to identify the best features for a classification task. Select the top 6 features and evaluate model accuracy.

  1. Train RFE with Logistic Regression on the training data.
  2. Select the top 6 features and retrain the model using only these features.
  3. Compare the model’s performance with the full feature set.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

# Solution: Initialize Logistic Regression and RFE
log_reg = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=log_reg, n_features_to_select=6)

# Fit RFE and select top features
rfe.fit(X_train, y_train)

# Display selected features by masking the training columns with RFE's support_
selected_features = X_train.columns[rfe.support_].tolist()
print("Selected Features with RFE:", selected_features)

# Evaluate accuracy with selected features
X_train_rfe = X_train[selected_features]
X_test_rfe = X_test[selected_features]
log_reg.fit(X_train_rfe, y_train)
y_pred_rfe = log_reg.predict(X_test_rfe)
print("Model Accuracy with RFE-Selected Features:", accuracy_score(y_test, y_pred_rfe))

# Baseline for step 3: accuracy with the full feature set
log_reg.fit(X_train, y_train)
y_pred_full = log_reg.predict(X_test)
print("Model Accuracy with All Features:", accuracy_score(y_test, y_pred_full))

In this solution:

RFE identifies the top 6 features, and we retrain Logistic Regression with only those features, comparing accuracy to the full feature set.
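
If you would rather let cross-validation pick the number of features instead of fixing it at 6, scikit-learn's RFECV automates that search. A minimal sketch, reusing log_reg and the training split above:

from sklearn.feature_selection import RFECV

# Cross-validated RFE: keeps the feature count that maximizes CV accuracy
rfecv = RFECV(estimator=log_reg, cv=5, scoring='accuracy')
rfecv.fit(X_train, y_train)
print("Optimal number of features:", rfecv.n_features_)
print("Selected Features:", list(X_train.columns[rfecv.support_]))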

Exercise 3: Perform Hyperparameter Tuning with RFE and Random Forest

Combine RFE and GridSearchCV to perform feature selection and model tuning for a Random Forest Classifier. Tune parameters such as the number of selected features, n_estimators, and max_depth.

  1. Define a parameter grid for RFE and the Random Forest model.
  2. Use GridSearchCV to find the best combination of features and model parameters.
  3. Display the optimal number of features and model parameters, along with the accuracy.

from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Solution: Initialize RFE with Random Forest
rf = RandomForestClassifier(random_state=42)
rfe = RFE(estimator=rf)

# Define parameter grid
param_grid = {
    'n_features_to_select': [5, 7, 9],
    'estimator__n_estimators': [50, 100],
    'estimator__max_depth': [None, 10]
}

# GridSearchCV for tuning RFE and Random Forest parameters
grid_search = GridSearchCV(estimator=rfe, param_grid=param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Display best parameters and the mean cross-validation accuracy
print("Best Parameters from GridSearch:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)

In this solution:

We perform a grid search to tune the number of features selected by RFE and the Random Forest parameters simultaneously, identifying the best configuration for accuracy.
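
Keep in mind that best_score_ is an average over the cross-validation folds, not a held-out result. As a quick sanity check (a sketch reusing the fitted grid_search above), you can score the refit best estimator on the test set; GridSearchCV refits it on the full training set by default:

# Evaluate the refit best RFE + Random Forest combination on the held-out test set
y_pred_best = grid_search.best_estimator_.predict(X_test)
print("Held-Out Test Accuracy:", accuracy_score(y_test, y_pred_best))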

Exercise 4: Engineering Features Based on Feature Importance

Using feature importance scores, create an interaction term between two highly important features. Then, retrain the model and evaluate the accuracy to see if the interaction term improves performance.

  1. Identify two high-importance features from the previous Random Forest model.
  2. Create a new interaction feature by multiplying these two features.
  3. Train a model with the new feature and compare performance to the original feature set.

# Identify top two features from feature importance
top_two_features = feature_importances['Feature'].head(2).values
print("Top Two Features for Interaction:", top_two_features)

# Create interaction term (work on copies to avoid pandas SettingWithCopyWarning)
X_train = X_train.copy()
X_test = X_test.copy()
X_train['Interaction_Term'] = X_train[top_two_features[0]] * X_train[top_two_features[1]]
X_test['Interaction_Term'] = X_test[top_two_features[0]] * X_test[top_two_features[1]]

# Train Random Forest with interaction feature
rf_model.fit(X_train, y_train)
y_pred_interaction = rf_model.predict(X_test)
print("Accuracy with Interaction Feature:", accuracy_score(y_test, y_pred_interaction))

In this solution:

  • We create an interaction term between the top two features and add it to the dataset before training the model.
  • The new accuracy is then compared to the model without the interaction term to evaluate improvement.
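
One caveat: the impurity-based importances used to pick the pair can overstate correlated features. As an optional cross-check (a sketch using scikit-learn's permutation_importance, reusing the fitted rf_model and test split above), you can measure how much test accuracy drops when each column is shuffled:

from sklearn.inspection import permutation_importance

# Shuffle each column on the test set and record the mean accuracy drop
perm = permutation_importance(rf_model, X_test, y_test, n_repeats=10, random_state=42)
for name, drop in sorted(zip(X_test.columns, perm.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {drop:.4f}")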

These exercises cover practical applications of feature importance, RFE, and hyperparameter tuning. By mastering these techniques, you’ll enhance your ability to select and engineer features that improve model accuracy and generalizability. 
