Data Engineering Foundations

Chapter 10: Dimensionality Reduction

10.3 Practical Exercises for Chapter 10

These exercises provide hands-on practice with feature selection and dimensionality reduction techniques, helping you understand how to apply them effectively to optimize your datasets. Solutions with code are included to guide you through each step.

Exercise 1: Variance Thresholding

You are given a dataset with multiple features. Apply variance thresholding to remove features with variance below 0.1.

from sklearn.feature_selection import VarianceThreshold
import pandas as pd

# Sample dataset with low-variance features
data = {'Feature1': [1, 1, 1, 1, 1],
        'Feature2': [0.5, 0.5, 0.5, 0.5, 0.5],
        'Feature3': [0, 1, 0, 1, 0],
        'Feature4': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# Solution: Apply variance thresholding
selector = VarianceThreshold(threshold=0.1)
reduced_data = selector.fit_transform(df)

print("Reduced dataset with high-variance features:")
print(reduced_data)

In this solution:

Variance thresholding removes features whose variance falls below the 0.1 threshold. Here, Feature1 and Feature2 are constant (zero variance) and are dropped, leaving a reduced dataset with higher information density.
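Note that fit_transform returns a plain NumPy array, so the column names are lost. If you want to keep track of which columns survived, one option (a minimal sketch continuing the example above, using the selector's get_support mask) is:

# Boolean mask of the columns retained by the selector, aligned with df.columns
kept_columns = df.columns[selector.get_support()]
df_reduced = df[kept_columns]

print("Retained features:", list(kept_columns))
print(df_reduced)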

Exercise 2: Correlation Thresholding

Given a dataset, identify pairs of features with a correlation above 0.8 and remove one from each highly correlated pair.

import pandas as pd

# Sample dataset with correlated features
data = {'Feature1': [1, 2, 3, 4, 5],
        'Feature2': [2, 4, 6, 8, 10],  # Perfectly correlated with Feature1
        'Feature3': [5, 3, 6, 2, 1],
        'Feature4': [10, 12, 15, 20, 25]}
df = pd.DataFrame(data)

# Solution: Correlation thresholding
correlation_matrix = df.corr()
threshold = 0.8
corr_features = set()

# Scan the lower triangle of the correlation matrix so each pair is checked once
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            # Keep the earlier column (j) and mark the later one (i) for removal
            colname = correlation_matrix.columns[i]
            corr_features.add(colname)

# Remove correlated features
df_reduced = df.drop(columns=corr_features)

print("Reduced dataset with correlated features removed:")
print(df_reduced)

In this solution:

We calculate pairwise correlations and remove one feature from each highly correlated pair. With this data, both Feature2 (correlation 1.0 with Feature1) and Feature4 (correlation ≈ 0.98 with Feature1) exceed the 0.8 threshold, so they are dropped and only Feature1 and Feature3 remain.
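The nested loop works well for small feature sets. An equivalent, more idiomatic pandas/NumPy version (a sketch using the same sample data and the same 0.8 threshold) masks the upper triangle of the absolute correlation matrix and drops any column that correlates too strongly with an earlier one:

import numpy as np
import pandas as pd

data = {'Feature1': [1, 2, 3, 4, 5],
        'Feature2': [2, 4, 6, 8, 10],
        'Feature3': [5, 3, 6, 2, 1],
        'Feature4': [10, 12, 15, 20, 25]}
df = pd.DataFrame(data)

corr = df.corr().abs()
# Keep only the strictly upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]

print("Columns to drop:", to_drop)
print(df.drop(columns=to_drop))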

Exercise 3: Recursive Feature Elimination (RFE)

Using the Iris dataset, apply Recursive Feature Elimination (RFE) with a logistic regression model to select the top 2 features.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Solution: Apply RFE with logistic regression
model = LogisticRegression(max_iter=200)
rfe = RFE(model, n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)

print("Selected features after RFE:", rfe.support_)
print("Feature ranking:", rfe.ranking_)

In this solution:

RFE repeatedly fits the logistic regression model and eliminates the least important feature (judged by coefficient magnitude) until only 2 remain. rfe.support_ marks the retained features, and rfe.ranking_ assigns rank 1 to the selected features and higher ranks to those eliminated earlier.
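rfe.support_ is a boolean mask over the original columns. To see which Iris measurements were actually selected, you can index the dataset's feature names with that mask; a short, self-contained sketch of the idea:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()
rfe = RFE(LogisticRegression(max_iter=200), n_features_to_select=2)
rfe.fit(iris.data, iris.target)

# Map the boolean support mask back to human-readable feature names
selected = np.array(iris.feature_names)[rfe.support_]
print("Selected features:", list(selected))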

Exercise 4: Feature Selection with Lasso Regression

Apply Lasso regression for feature selection on a housing-price dataset and print the indices of the features with non-zero coefficients. Note that the Boston housing dataset was removed from scikit-learn in version 1.2, so the solution below uses the California housing dataset instead.

from sklearn.linear_model import Lasso
from sklearn.datasets import fetch_california_housing
import numpy as np

# Load the California housing dataset
X, y = fetch_california_housing(return_X_y=True)

# Solution: Apply Lasso regression for feature selection
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Identify non-zero coefficients
selected_features = np.where(lasso.coef_ != 0)[0]

print("Selected features with Lasso:", selected_features)

In this solution:

Lasso regression performs feature selection by shrinking coefficients of less important features to zero, retaining only the most impactful features.
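Because the L1 penalty acts on raw coefficient magnitudes, feature scaling strongly influences which features survive, and the strength of the penalty is controlled by alpha. A minimal sketch (assuming the same California housing data) that standardizes the features and shows how larger alpha values zero out more coefficients:

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
import numpy as np

X, y = fetch_california_housing(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # put features on a common scale

# Larger alpha -> stronger penalty -> more coefficients driven to exactly zero
for alpha in [0.001, 0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=alpha).fit(X_scaled, y)
    n_kept = int(np.sum(lasso.coef_ != 0))
    print(f"alpha={alpha}: {n_kept} features with non-zero coefficients")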

Exercise 5: Implementing PCA for Dimensionality Reduction

Using the Iris dataset, apply Principal Component Analysis (PCA) to reduce the dataset to two dimensions for visualization.

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Solution: Apply PCA with 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Convert PCA results to DataFrame
df_pca = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
df_pca['target'] = y

# Plot the PCA result
plt.figure(figsize=(8, 6))
for label in df_pca['target'].unique():
    subset = df_pca[df_pca['target'] == label]
    plt.scatter(subset['PC1'], subset['PC2'], label=iris.target_names[label])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset')
plt.legend()
plt.show()

In this solution:

PCA reduces the Iris dataset to two dimensions, allowing us to visualize the dataset’s structure in a lower-dimensional space.
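To check how much information the two-dimensional projection actually retains, you can inspect the explained variance ratio of a fitted PCA object. A self-contained sketch of the idea:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())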

These exercises guide you through practical applications of variance thresholding, correlation thresholding, RFE, Lasso regression, and PCA, giving you a comprehensive understanding of feature selection and dimensionality reduction techniques. These skills are essential for handling complex datasets, simplifying models, and enhancing interpretability.
