Chapter 5: Advanced Model Evaluation Techniques
5.3 Practical Exercises for Chapter 5
These exercises will help you practice handling imbalanced data with class weighting and SMOTE, and choosing appropriate cross-validation methods. Each exercise includes a solution with code for guidance.
Exercise 1: Evaluating a Model with Class Weighting
Train a Logistic Regression model on an imbalanced dataset using class weighting to improve the model’s sensitivity to the minority class. Use Stratified K-Folds Cross-Validation so that each fold preserves the original class proportions.
- Create an imbalanced dataset with a 90/10 class split.
- Apply class weighting to a Logistic Regression model.
- Evaluate the model using Stratified K-Folds cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report
# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, weights=[0.9, 0.1], random_state=42)
# Initialize Logistic Regression with class weighting
model = LogisticRegression(class_weight='balanced', random_state=42)
# Initialize Stratified K-Folds with 5 splits
strat_kfold = StratifiedKFold(n_splits=5)
# Solution: Evaluate model using Stratified K-Folds cross-validation
scores = cross_val_score(model, X, y, cv=strat_kfold, scoring='f1')
print("Stratified K-Folds F1 Scores:", scores)
print("Mean F1 Score:", scores.mean())
# Fit on the full dataset and report in-sample metrics as a quick sanity check
model.fit(X, y)
y_pred = model.predict(X)
print("\nClassification Report with Class Weighting:")
print(classification_report(y, y_pred))
In this solution:
- The model’s F1 score is calculated using Stratified K-Folds to provide a balanced view of performance on each class.
- Class weighting (class_weight='balanced') weights errors on the minority class more heavily, so the model pays more attention to it (the snippet below shows how these weights are computed).
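If you want to see what 'balanced' actually does, the following sketch reproduces the weights scikit-learn assigns internally, using its documented formula n_samples / (n_classes * class_count). It assumes the X and y generated above are still in scope.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
# Weights scikit-learn derives for class_weight='balanced'
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
print("Balanced class weights:", dict(zip(np.unique(y), weights)))
# Equivalent manual computation: n_samples / (n_classes * per-class count)
counts = np.bincount(y)
print("Manual weights:", len(y) / (len(counts) * counts))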
Exercise 2: Balancing Classes with SMOTE
Use SMOTE to create a balanced dataset for a Random Forest Classifier. Compare the performance on the original imbalanced dataset and the SMOTE-resampled dataset.
- Train a Random Forest Classifier on the original imbalanced dataset.
- Apply SMOTE to balance the dataset, and retrain the model.
- Compare the F1 scores of both models to see the impact of SMOTE.
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, weights=[0.9, 0.1], random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
# Solution: Train model on original imbalanced data
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Classification Report on Original Data:")
print(classification_report(y_test, y_pred))
# Apply SMOTE to create balanced data
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
# Train model on SMOTE-resampled data
model.fit(X_resampled, y_resampled)
y_pred_resampled = model.predict(X_test)
print("\\nClassification Report with SMOTE:")
print(classification_report(y_test, y_pred_resampled))
In this solution:
- SMOTE creates synthetic minority-class samples to balance the training data, which typically improves recall and F1 on the minority class (the quick check below confirms the resulting class counts).
- The classification report compares the model’s F1 score and other metrics before and after using SMOTE, showing the effectiveness of balancing.
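To verify that SMOTE actually balanced the training data, a short check of the class counts before and after resampling (assuming y_train and y_resampled from the code above) makes the effect concrete:
from collections import Counter
print("Before SMOTE:", Counter(y_train))      # roughly 90/10
print("After SMOTE: ", Counter(y_resampled))  # both classes equal in size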
Exercise 3: Combining SMOTE with Stratified K-Folds Cross-Validation
Use SMOTE with Stratified K-Folds Cross-Validation to evaluate model performance across multiple folds, balancing classes in each fold for more robust evaluation.
- Apply SMOTE within each cross-validation fold.
- Train a Random Forest model on each fold and report F1 scores.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline  # imblearn's pipeline applies SMOTE only during fitting, inside each CV fold
# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, weights=[0.9, 0.1], random_state=42)
# Initialize SMOTE and Random Forest in a pipeline
pipeline = make_pipeline(SMOTE(random_state=42), RandomForestClassifier(random_state=42))
# Initialize Stratified K-Folds Cross-Validation with 5 splits
strat_kfold = StratifiedKFold(n_splits=5)
# Solution: Cross-validation with SMOTE in each fold
scores = cross_val_score(pipeline, X, y, cv=strat_kfold, scoring='f1')
print("Stratified K-Folds with SMOTE F1 Scores:", scores)
print("Mean F1 Score:", scores.mean())
In this solution:
- SMOTE is applied inside the pipeline, so only the training portion of each fold is resampled; the validation portion is never touched, which prevents synthetic samples from leaking into the evaluation data (the sketch after this list makes this explicit).
- The F1 scores across folds provide a comprehensive measure of model performance on the minority class.
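The pipeline hides what happens inside each fold. The hand-rolled loop below is a sketch of the equivalent procedure, reusing the X, y, strat_kfold, SMOTE, and RandomForestClassifier names from the code above: SMOTE is fit only on the training indices of each fold, never on the validation indices.
from sklearn.metrics import f1_score

manual_scores = []
for train_idx, test_idx in strat_kfold.split(X, y):
    # Resample only the training fold; leave the validation fold untouched
    X_res, y_res = SMOTE(random_state=42).fit_resample(X[train_idx], y[train_idx])
    clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
    manual_scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))
print("Manual per-fold F1 scores:", manual_scores)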
Exercise 4: Compare Class Weighting vs. SMOTE on an Imbalanced Dataset
Train two Logistic Regression models on an imbalanced dataset: one with class weighting and the other with SMOTE. Compare the performance of each approach using F1 score.
- Train one Logistic Regression model with class_weight='balanced'.
- Train another model using SMOTE to balance the dataset.
- Compare F1 scores of both models on a test set.
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, weights=[0.9, 0.1], random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
# Solution 1: Logistic Regression with class weighting
model_weighted = LogisticRegression(class_weight='balanced', random_state=42)
model_weighted.fit(X_train, y_train)
y_pred_weighted = model_weighted.predict(X_test)
print("Classification Report with Class Weighting:")
print(classification_report(y_test, y_pred_weighted))
# Solution 2: Logistic Regression with SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
model_smote = LogisticRegression(random_state=42)
model_smote.fit(X_resampled, y_resampled)
y_pred_smote = model_smote.predict(X_test)
print("\\nClassification Report with SMOTE:")
print(classification_report(y_test, y_pred_smote))
In this solution:
- Both class weighting and SMOTE are applied to address imbalanced data, allowing for direct comparison.
- The classification reports show precision, recall, and F1 for both methods, helping determine which technique works better for this dataset (the snippet below pulls out the minority-class F1 for a direct comparison).
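Since the exercise asks for an F1 comparison, it can help to extract the minority-class F1 for each approach directly rather than reading it off the full report. This sketch assumes y_test, y_pred_weighted, and y_pred_smote from the code above:
from sklearn.metrics import f1_score

f1_weighted_model = f1_score(y_test, y_pred_weighted)  # F1 of the positive (minority) class
f1_smote_model = f1_score(y_test, y_pred_smote)
print(f"Minority-class F1 with class weighting: {f1_weighted_model:.3f}")
print(f"Minority-class F1 with SMOTE:           {f1_smote_model:.3f}")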
These exercises provide hands-on experience with handling imbalanced data, including the use of class weighting, SMOTE, and cross-validation strategies. By mastering these techniques, you’ll enhance your ability to evaluate and improve model performance on real-world, imbalanced datasets.