Feature Engineering for Modern Machine Learning with Scikit-Learn

Chapter 5: Advanced Model Evaluation Techniques

5.3 Practical Exercises for Chapter 5

These exercises give you practice handling imbalanced data with class weighting and SMOTE, and with choosing appropriate cross-validation methods. Each exercise includes a worked solution with code for guidance.

Exercise 1: Evaluating a Model with Class Weighting

Train a Logistic Regression model on an imbalanced dataset using class weighting to improve the model’s sensitivity to the minority class. Use Stratified K-Folds Cross-Validation to ensure balanced class representation across each fold.

  1. Create an imbalanced dataset with a roughly 90/10 class split.
  2. Apply class weighting to a Logistic Regression model.
  3. Evaluate the model using Stratified K-Folds cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report

# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Initialize Logistic Regression with class weighting
model = LogisticRegression(class_weight='balanced', random_state=42)

# Initialize Stratified K-Folds with 5 splits
strat_kfold = StratifiedKFold(n_splits=5)

# Solution: Evaluate model using Stratified K-Folds cross-validation
scores = cross_val_score(model, X, y, cv=strat_kfold, scoring='f1')
print("Stratified K-Folds F1 Scores:", scores)
print("Mean F1 Score:", scores.mean())

# Fit on the full dataset and inspect the classification report
# (note: this report is computed on the training data itself, so it is optimistic;
# the cross-validated F1 scores above are the more reliable estimate)
model.fit(X, y)
y_pred = model.predict(X)
print("\nClassification Report with Class Weighting:")
print(classification_report(y, y_pred))

In this solution:

  • The model’s F1 score is calculated with Stratified K-Folds so that every fold preserves the original class proportions, giving a more reliable estimate of minority-class performance.
  • Class weighting (class_weight='balanced') weights each class inversely to its frequency, so errors on the minority class cost more during training; the snippet after this list shows how those weights are computed.
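
To see concretely what class_weight='balanced' does, the short sketch below recomputes the per-class weights with scikit-learn's compute_class_weight utility (the variable name weights is illustrative); these are the same weights Logistic Regression applies internally when class_weight='balanced' is set.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils.class_weight import compute_class_weight

# Same imbalanced dataset as in the exercise
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, weights=[0.9, 0.1], random_state=42)

# 'balanced' weights follow n_samples / (n_classes * np.bincount(y))
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
print("Class counts:", np.bincount(y))        # roughly a 9:1 split
print("Balanced class weights:", weights)     # the minority class receives the larger weight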

Exercise 2: Balancing Classes with SMOTE

Use SMOTE to create a balanced dataset for a Random Forest Classifier. Compare the performance on the original imbalanced dataset and the SMOTE-resampled dataset.

  1. Train a Random Forest Classifier on the original imbalanced dataset.
  2. Apply SMOTE to balance the dataset, and retrain the model.
  3. Compare the F1 scores of both models to see the impact of SMOTE.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Solution: Train model on original imbalanced data
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Classification Report on Original Data:")
print(classification_report(y_test, y_pred))

# Apply SMOTE to create balanced data
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Train model on SMOTE-resampled data
model.fit(X_resampled, y_resampled)
y_pred_resampled = model.predict(X_test)
print("\\nClassification Report with SMOTE:")
print(classification_report(y_test, y_pred_resampled))

In this solution:

  • SMOTE creates synthetic minority-class samples to balance the training set, which typically improves recall and F1 on the minority class (the sanity check after this list prints the class counts before and after resampling).
  • The two classification reports show F1 and the other metrics before and after SMOTE, making the effect of balancing directly visible.
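
As a quick sanity check on the resampling step, the minimal sketch below (reusing the exercise's setup) prints the class counts before and after fit_resample; after SMOTE the two classes should appear in equal numbers.

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same setup as the exercise
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

print("Before SMOTE:", Counter(y_train))       # heavily skewed toward the majority class
print("After SMOTE:", Counter(y_resampled))    # both classes now have the same count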

Exercise 3: Combining SMOTE with Stratified K-Folds Cross-Validation

Use SMOTE with Stratified K-Folds Cross-Validation to evaluate model performance across multiple folds, balancing classes in each fold for more robust evaluation.

  1. Apply SMOTE within each cross-validation fold.
  2. Train a Random Forest model on each fold and report F1 scores.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline  # imblearn's pipeline lets SMOTE run inside cross-validation
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Initialize SMOTE and Random Forest in a pipeline
pipeline = make_pipeline(SMOTE(random_state=42), RandomForestClassifier(random_state=42))

# Initialize Stratified K-Folds Cross-Validation with 5 splits
strat_kfold = StratifiedKFold(n_splits=5)

# Solution: Cross-validation with SMOTE in each fold
scores = cross_val_score(pipeline, X, y, cv=strat_kfold, scoring='f1')
print("Stratified K-Folds with SMOTE F1 Scores:", scores)
print("Mean F1 Score:", scores.mean())

In this solution:

  • SMOTE is applied inside each fold via the imblearn pipeline, so only the training portion of every fold is resampled while the validation portion keeps its original, imbalanced distribution; a manual equivalent is sketched after this list.
  • The F1 scores across folds provide a comprehensive measure of model performance on the minority class.
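
To make explicit what the imblearn pipeline does inside cross_val_score, here is a rough manual equivalent (a sketch of the idea, not the library's exact internals): SMOTE is fit and applied only to each training fold, and the held-out fold is scored as-is.

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, weights=[0.9, 0.1], random_state=42)

strat_kfold = StratifiedKFold(n_splits=5)
fold_scores = []
for train_idx, test_idx in strat_kfold.split(X, y):
    # Resample only the training fold; the validation fold is left untouched
    X_res, y_res = SMOTE(random_state=42).fit_resample(X[train_idx], y[train_idx])
    model = RandomForestClassifier(random_state=42)
    model.fit(X_res, y_res)
    fold_scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))

print("Manual per-fold F1 scores:", np.round(fold_scores, 3))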

Exercise 4: Compare Class Weighting vs. SMOTE on an Imbalanced Dataset

Train two Logistic Regression models on an imbalanced dataset: one with class weighting and the other with SMOTE. Compare the performance of each approach using F1 score.

  1. Train one Logistic Regression model with class_weight='balanced'.
  2. Train another model using SMOTE to balance the dataset.
  3. Compare F1 scores of both models on a test set.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Solution 1: Logistic Regression with class weighting
model_weighted = LogisticRegression(class_weight='balanced', random_state=42)
model_weighted.fit(X_train, y_train)
y_pred_weighted = model_weighted.predict(X_test)
print("Classification Report with Class Weighting:")
print(classification_report(y_test, y_pred_weighted))

# Solution 2: Logistic Regression with SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
model_smote = LogisticRegression(random_state=42)
model_smote.fit(X_resampled, y_resampled)
y_pred_smote = model_smote.predict(X_test)
print("\\nClassification Report with SMOTE:")
print(classification_report(y_test, y_pred_smote))

In this solution:

  • Both class weighting and SMOTE are applied to address imbalanced data, allowing for direct comparison.
  • The classification reports show precision, recall, and F1 for both methods, helping determine which technique works better for this dataset; see the snippet after this list for pulling out just the minority-class F1 scores.
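
If you want a single headline number for the comparison rather than two full reports, the short follow-up below (assuming the variables from the exercise code are still in scope; the names f1_weighted_model and f1_smote_model are just illustrative) extracts the minority-class F1 score for each approach.

from sklearn.metrics import f1_score

f1_weighted_model = f1_score(y_test, y_pred_weighted)   # class-weighting approach
f1_smote_model = f1_score(y_test, y_pred_smote)          # SMOTE approach
print(f"F1 with class weighting: {f1_weighted_model:.3f}")
print(f"F1 with SMOTE: {f1_smote_model:.3f}")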

These exercises provide hands-on experience with handling imbalanced data, including the use of class weighting, SMOTE, and cross-validation strategies. By mastering these techniques, you’ll enhance your ability to evaluate and improve model performance on real-world, imbalanced datasets.
