Chapter 8: AutoML and Automated Feature Engineering
8.3 Practical Exercises: Chapter 8
This practical exercise section provides hands-on experience with AutoML tools, focusing on automated feature engineering, model selection, and pipeline optimization. By working through these exercises, you’ll develop familiarity with using libraries like Featuretools, Auto-sklearn, TPOT, and MLBox to streamline feature engineering and model building.
Exercise 1: Using Featuretools for Deep Feature Synthesis
Objective: Create new features from relational data using Featuretools’ deep feature synthesis.
Instructions:
- Define a set of related dataframes, including a customers table and a transactions table.
- Use Featuretools to generate features that aggregate transaction details at the customer level.
- Display the feature matrix to verify the generated features.
Solution:
import pandas as pd
import featuretools as ft
# Sample data
customers_df = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'signup_date': pd.to_datetime(['2022-01-01', '2022-02-01', '2022-03-01'])
})
transactions_df = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4, 5],
    'customer_id': [1, 2, 1, 3, 2],
    'amount': [100, 200, 50, 300, 120],
    'transaction_date': pd.to_datetime(['2022-01-10', '2022-02-15', '2022-01-20', '2022-03-10', '2022-02-25'])
})
# Create an EntitySet and add dataframes
es = ft.EntitySet(id="customer_data")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers_df, index="customer_id")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions_df, index="transaction_id",
time_index="transaction_date")
# Define relationship and generate features
es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers",
                                      agg_primitives=["mean", "sum", "count"])
# Display the feature matrix
print(feature_matrix.head())
In this exercise:
- We define two tables and establish a relationship between them.
- Featuretools automatically generates customer-level features, like average and total transaction amounts, using deep feature synthesis.
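If you want richer features, ft.dfs can also apply transform primitives before aggregating and can stack primitives to a given depth. Below is a minimal sketch reusing the es EntitySet from the solution above; month and weekday are standard Featuretools transform primitives, and max_depth is the dfs stacking parameter:

# Deeper synthesis: transform primitives derive new columns (e.g. the month
# of each transaction) before the aggregation primitives roll them up
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count"],
    trans_primitives=["month", "weekday"],
    max_depth=2
)
# List the generated feature definitions by name
for feature in feature_defs:
    print(feature.get_name())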
Exercise 2: Running Auto-sklearn for Automated Model Selection and Feature Engineering
Objective: Use Auto-sklearn to automate model selection, feature engineering, and hyperparameter tuning.
Instructions:
- Load a sample dataset and split it into training and test sets.
- Initialize an Auto-sklearn classifier with a limited time budget and fit it on the training data.
- Evaluate the model accuracy on the test set.
Solution:
import autosklearn.classification
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
# Load sample dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
# Initialize Auto-sklearn classifier with time constraints
automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=300, per_run_time_limit=30)
automl.fit(X_train, y_train)
# Predict and evaluate
y_pred = automl.predict(X_test)
print("Auto-sklearn Accuracy:", accuracy_score(y_test, y_pred))
In this example:
- Auto-sklearn automates data preprocessing, model selection, and hyperparameter tuning, and returns an ensemble of the best-performing models found within the specified time budget.
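After fitting, it is worth inspecting what the search actually tried before trusting the result. A brief sketch, assuming a recent auto-sklearn release where leaderboard() is available (sprint_statistics() has been in the API for much longer):

# Summary of the search: runs attempted, succeeded, crashed, or timed out
print(automl.sprint_statistics())
# Ranked view of the models that made it into the final ensemble
# (leaderboard() is only available in recent auto-sklearn versions)
print(automl.leaderboard())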
Exercise 3: Optimizing a Machine Learning Pipeline with TPOT
Objective: Use TPOT to build and optimize a complete machine learning pipeline, including feature transformations and model selection.
Instructions:
- Load a dataset and split it into training and test sets.
- Use TPOT to automatically search for the best feature transformations, model selection, and hyperparameters.
- Evaluate TPOT’s recommended model on the test set.
Solution:
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
# Load sample dataset
data = load_digits()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
# Initialize TPOT classifier and fit the pipeline
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
# Predict and evaluate
y_pred = tpot.predict(X_test)
print("TPOT Accuracy:", accuracy_score(y_test, y_pred))
# Export optimized pipeline code
tpot.export("optimized_pipeline.py")
In this exercise:
- TPOT performs pipeline optimization, including feature selection and model choice, using genetic programming.
- The best-performing pipeline is saved as Python code, allowing you to reuse it in future projects.
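TPOT's default search space is large, so runs on bigger datasets can be slow. One option is the built-in reduced configuration; the sketch below assumes classic TPOT, where config_dict accepts the string "TPOT light" and max_time_mins caps the total search time:

# Restrict the search to fast, simple operators and cap the runtime
tpot_light = TPOTClassifier(
    generations=5,
    population_size=20,
    config_dict="TPOT light",  # built-in reduced search space
    max_time_mins=10,          # stop the whole search after 10 minutes
    verbosity=2,
    random_state=42
)
tpot_light.fit(X_train, y_train)
print("TPOT light Accuracy:", tpot_light.score(X_test, y_test))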
Exercise 4: Using MLBox for Data Cleaning and Model Building
Objective: Use MLBox to automatically clean data, perform feature selection, and build an optimized model.
Instructions:
- Load a dataset with missing values or imbalanced classes.
- Use MLBox to preprocess, clean, and transform the data.
- Build and evaluate an optimized model on the preprocessed data.
Solution:
# Install MLBox if not already installed:
# pip install mlbox
from mlbox.preprocessing import Reader, Drift_thresholder
from mlbox.optimisation import Optimiser
from mlbox.prediction import Predictor
# Load data (MLBox requires the dataset in CSV format)
paths = ["train.csv", "test.csv"]
target_name = "target"
# Step 1: Read and preprocess data
reader = Reader(sep=",")
df = reader.train_test_split(paths, target_name)
# Step 2: Remove features with data drift
drift_thresholder = Drift_thresholder()
df = drift_thresholder.fit_transform(df)
# Step 3: Optimize model and hyperparameters
optimiser = Optimiser()
space = {
    "ne__numerical_strategy": {"search": "choice", "space": ["mean", "median"]},
    "fs__threshold": {"search": "uniform", "space": [0.01, 0.3]},
    "est__strategy": {"search": "choice", "space": ["RandomForest"]}
}
best_params = optimiser.optimise(space, df)
# Step 4: Train and evaluate the model
predictor = Predictor()
predictor.fit_predict(best_params, df)
Sample data files for this exercise:
train.csv: https://cdn.prod.website-files.com/661b9e736a74273c4f628d5f/67d1a0c20a9ec127e908b552_train.csv
test.csv: https://cdn.prod.website-files.com/661b9e736a74273c4f628d5f/67d1a0c2fa55b9c51c5ec4cd_test.csv
In this exercise:
- MLBox handles data cleaning, feature selection, and model optimization with minimal manual input.
- The Drift_thresholder removes features whose distributions differ noticeably between the training and test sets (drifting features), which improves generalization on unseen data.
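If you want a cross-validated estimate of a candidate configuration before (or after) the full search, MLBox's Optimiser also exposes an evaluate method. A minimal sketch under that assumption; passing None instead of a parameter dictionary scores MLBox's default pipeline:

# Cross-validated score of the tuned parameters found by optimise() above
score = optimiser.evaluate(best_params, df)
print("CV score for best_params:", score)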
These exercises showcase various AutoML and automated feature engineering tools, from Featuretools for complex feature synthesis to Auto-sklearn, TPOT, and MLBox for automated pipeline optimization. By leveraging these tools, you can streamline the feature engineering process and improve model performance with minimal manual effort.