Feature Engineering for Modern Machine Learning with Scikit-Learn

Chapter 8: AutoML and Automated Feature Engineering

8.3 Practical Exercises: Chapter 8

This practical exercise section provides hands-on experience with AutoML tools, focusing on automated feature engineering, model selection, and pipeline optimization. By working through these exercises, you'll gain familiarity with libraries like Featuretools, Auto-sklearn, TPOT, and MLBox for streamlining feature engineering and model building.

Exercise 1: Using Featuretools for Deep Feature Synthesis

Objective: Create new features from relational data using Featuretools’ deep feature synthesis.

Instructions:

  1. Define a set of related dataframes, including a customers table and a transactions table.
  2. Use Featuretools to generate features that aggregate transaction details at the customer level.
  3. Display the feature matrix to verify the generated features.

Solution:

import pandas as pd
import featuretools as ft

# Sample data
customers_df = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'signup_date': pd.to_datetime(['2022-01-01', '2022-02-01', '2022-03-01'])
})

transactions_df = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4, 5],
    'customer_id': [1, 2, 1, 3, 2],
    'amount': [100, 200, 50, 300, 120],
    'transaction_date': pd.to_datetime(['2022-01-10', '2022-02-15', '2022-01-20', '2022-03-10', '2022-02-25'])
})

# Create an EntitySet and add dataframes
es = ft.EntitySet(id="customer_data")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers_df, index="customer_id")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions_df, index="transaction_id",
                      time_index="transaction_date")

# Define relationship and generate features
es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers", agg_primitives=["mean", "sum", "count"])

# Display the feature matrix
print(feature_matrix.head())

In this exercise:

  • We define two tables and establish a relationship between them.
  • Featuretools automatically generates customer-level features, like average and total transaction amounts, using deep feature synthesis.
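
If you want to verify exactly which features DFS produced, the feature definitions returned alongside the matrix can be listed by name. A minimal sketch, assuming the feature_defs object from the solution above:

# List the generated feature definitions (Featuretools 1.x API)
for feature in feature_defs:
    print(feature.get_name())

# Expect customer-level aggregates with names such as
# MEAN(transactions.amount), SUM(transactions.amount), and COUNT(transactions)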

Exercise 2: Running Auto-sklearn for Automated Model Selection and Feature Engineering

Objective: Use Auto-sklearn to automate model selection, feature engineering, and hyperparameter tuning.

Instructions:

  1. Load a sample dataset and split it into training and test sets.
  2. Initialize an Auto-sklearn classifier with a limited time budget and fit it on the training data.
  3. Evaluate the model accuracy on the test set.

Solution:

import autosklearn.classification
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load sample dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Initialize Auto-sklearn classifier with time constraints
automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=300, per_run_time_limit=30)
automl.fit(X_train, y_train)

# Predict and evaluate
y_pred = automl.predict(X_test)
print("Auto-sklearn Accuracy:", accuracy_score(y_test, y_pred))

In this exercise:

  • Auto-sklearn automates data preprocessing, model selection, and hyperparameter tuning, returning the best-performing ensemble it finds within the specified time budget.
  • The 300-second task limit and 30-second per-run limit keep the search tractable on a small dataset; loosen them for larger problems.
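
Beyond the accuracy score, it is often worth inspecting what the search actually tried. A minimal sketch, assuming the fitted automl object above (method availability varies across auto-sklearn versions):

# Summary of the search run: models evaluated, time budget used, and so on
print(automl.sprint_statistics())

# Ranked table of the models in the final ensemble (auto-sklearn >= 0.12)
print(automl.leaderboard())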

Exercise 3: Optimizing a Machine Learning Pipeline with TPOT

Objective: Use TPOT to build and optimize a complete machine learning pipeline, including feature transformations and model selection.

Instructions:

  1. Load a dataset and split it into training and test sets.
  2. Use TPOT to automatically search for the best feature transformations, model selection, and hyperparameters.
  3. Evaluate TPOT’s recommended model on the test set.

Solution:

from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score

# Load sample dataset
data = load_digits()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Initialize TPOT classifier and fit the pipeline
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)

# Predict and evaluate
y_pred = tpot.predict(X_test)
print("TPOT Accuracy:", accuracy_score(y_test, y_pred))

# Export optimized pipeline code
tpot.export("optimized_pipeline.py")

In this exercise:

  • TPOT performs pipeline optimization, including feature selection and model choice, using genetic programming.
  • The best-performing pipeline is saved as Python code, allowing you to reuse it in future projects.
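
Because TPOT's winner is a standard scikit-learn estimator, you can also inspect or persist it directly instead of re-running the search. A minimal sketch, assuming the fitted tpot object above and that joblib is installed:

import joblib

# The winning pipeline is exposed as a fitted scikit-learn Pipeline
print(tpot.fitted_pipeline_)

# Persist it so future projects can load it without repeating the genetic search
joblib.dump(tpot.fitted_pipeline_, "tpot_best_pipeline.joblib")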

Exercise 4: Using MLBox for Data Cleaning and Model Building

Objective: Use MLBox to automatically clean data, perform feature selection, and build an optimized model.

Instructions:

  1. Load a dataset with missing values or imbalanced classes.
  2. Use MLBox to preprocess, clean, and transform the data.
  3. Build and evaluate an optimized model on the preprocessed data.

Solution:

# Install MLBox first if it is not already available (run in a shell):
#   pip install mlbox
from mlbox.preprocessing import Reader, Drift_thresholder
from mlbox.optimisation import Optimiser
from mlbox.prediction import Predictor

# Load data (MLBox requires the dataset in CSV format)
paths = ["train.csv", "test.csv"]
target_name = "target"

# Step 1: Read and preprocess data
reader = Reader(sep=",")
df = reader.train_test_split(paths, target_name)

# Step 2: Remove features with data drift
drift_thresholder = Drift_thresholder()
df = drift_thresholder.fit_transform(df)

# Step 3: Optimize model and hyperparameters
optimiser = Optimiser()
space = {
    "ne__numerical_strategy": {"search": "choice", "space": ["mean", "median"]},
    "fs__threshold": {"search": "uniform", "space": [0.01, 0.3]},
    "est__strategy": {"search": "choice", "space": ["RandomForest"]}
}
best_params = optimiser.optimise(space, df)

# Step 4: Train and evaluate the model
predictor = Predictor()
predictor.fit_predict(best_params, df)

Sample data files for this exercise:

train.csv: https://cdn.prod.website-files.com/661b9e736a74273c4f628d5f/67d1a0c20a9ec127e908b552_train.csv

test.csv: https://cdn.prod.website-files.com/661b9e736a74273c4f628d5f/67d1a0c2fa55b9c51c5ec4cd_test.csv

In this exercise:

  • MLBox handles data cleaning, feature selection, and model optimization with minimal manual input.
  • The Drift_thresholder removes features that show data drift, improving generalization on unseen data.
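
The search space dictionary is the main lever for steering MLBox. Below is a minimal sketch of a slightly wider space, assuming the optimiser and df objects from the solution above; the "LightGBM" strategy and the est__max_depth key are assumptions that depend on the estimators bundled with your MLBox install:

# Widened search space (keys follow MLBox's "<step>__<parameter>" convention)
space = {
    "ne__numerical_strategy": {"search": "choice", "space": ["mean", "median"]},
    "fs__threshold": {"search": "uniform", "space": [0.01, 0.3]},
    # "LightGBM" and est__max_depth are assumptions; check your MLBox version
    "est__strategy": {"search": "choice", "space": ["RandomForest", "LightGBM"]},
    "est__max_depth": {"search": "choice", "space": [5, 10]}
}
best_params = optimiser.optimise(space, df)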

These exercises showcase a range of AutoML and automated feature engineering tools, from Featuretools for relational feature synthesis to Auto-sklearn, TPOT, and MLBox for automated pipeline optimization. By leveraging these tools, you can streamline feature engineering and improve model performance with minimal manual effort.
