Click here to view the next lesson.

Chapter 8: Advanced Data Cleaning Techniques

8.3 Practical Exercises for Chapter 8

These exercises will give you hands-on practice with identifying and handling data anomalies using the techniques covered in this chapter. Each exercise addresses a different aspect of data cleaning, and solutions with code are provided where necessary.

Exercise 1: Standardizing Date Formats

You are given a dataset where dates are represented in different formats. Your task is to:

Convert all dates to the YYYY-MM-DD format.
Identify any invalid dates.

import pandas as pd

# Sample data with inconsistent date formats
data = {'OrderDate': ['2022-01-15', '01/20/2022', 'February 5, 2022', '2022/02/10', '2022-31-12']}
df = pd.DataFrame(data)

# Solution: Convert all dates to a consistent format
df['OrderDate'] = pd.to_datetime(df['OrderDate'], errors='coerce')

# Identify invalid dates (converted to NaT)
invalid_dates = df[df['OrderDate'].isna()]

print("Dataset with standardized dates:")
print(df)
print("\\nInvalid dates:")
print(invalid_dates)

Exercise 2: Removing Duplicate Rows

Given a dataset with duplicate entries, your task is to:

Identify duplicate rows.
Remove duplicates and keep only unique records.

# Sample data with duplicate rows
data = {'CustomerID': [101, 102, 103, 101],
        'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
        'PurchaseAmount': [150, 200, 300, 150]}
df = pd.DataFrame(data)

# Solution: Identify duplicate rows
duplicates = df[df.duplicated()]
print("Duplicate rows:")
print(duplicates)

# Solution: Remove duplicate rows
df = df.drop_duplicates()

print("\\nDataset after removing duplicates:")
print(df)

Exercise 3: Standardizing Text in Categorical Data

You are provided with a dataset that contains inconsistent capitalization in a categorical column. Your task is to:

Standardize all entries in the Category column to lowercase.

# Sample data with inconsistent text entries
data = {'Category': ['Electronics', 'electronics', 'ELECTronics', 'Furniture', 'furniture']}
df = pd.DataFrame(data)

# Solution: Standardize text to lowercase
df['Category'] = df['Category'].str.lower()

print("Dataset with standardized categories:")
print(df)

Exercise 4: Removing Out-of-Range Values

Given a dataset with an Age column, where valid ages range between 0 and 120, your task is to:

Identify any values in the Age column that fall outside this range.
Remove the rows with out-of-range values.

# Sample data with an out-of-range value
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
        'Age': [25, 132, 30, -5]}
df = pd.DataFrame(data)

# Solution: Identify out-of-range values
out_of_range = df[(df['Age'] < 0) | (df['Age'] > 120)]
print("Out-of-range values:")
print(out_of_range)

# Solution: Remove out-of-range values
df = df[(df['Age'] >= 0) & (df['Age'] <= 120)]

print("\\nDataset after removing out-of-range values:")
print(df)

Exercise 5: Imputing Missing Values After Anomaly Correction

You have a dataset where some dates are missing after correcting anomalies in the OrderDate column. Your task is to:

Forward-fill the missing dates to maintain continuity in the data.

# Sample data with missing values
data = {'OrderDate': ['2022-01-15', pd.NaT, '2022-02-05', pd.NaT]}
df = pd.DataFrame(data)

# Solution: Fill missing dates using forward fill
df['OrderDate'] = df['OrderDate'].fillna(method='ffill')

print("Dataset after forward-filling missing dates:")
print(df)

These exercises cover essential data cleaning techniques using Pandas, allowing you to handle inconsistent formats, duplicates, categorical inconsistencies, out-of-range values, and missing values. By mastering these techniques, you can prepare high-quality datasets for analysis and modeling.

8.3 Practical Exercises for Chapter 8

Exercise 1: Standardizing Date Formats

You are given a dataset where dates are represented in different formats. Your task is to:

Convert all dates to the YYYY-MM-DD format.
Identify any invalid dates.

import pandas as pd

# Sample data with inconsistent date formats
data = {'OrderDate': ['2022-01-15', '01/20/2022', 'February 5, 2022', '2022/02/10', '2022-31-12']}
df = pd.DataFrame(data)

# Solution: Convert all dates to a consistent format
df['OrderDate'] = pd.to_datetime(df['OrderDate'], errors='coerce')

# Identify invalid dates (converted to NaT)
invalid_dates = df[df['OrderDate'].isna()]

print("Dataset with standardized dates:")
print(df)
print("\\nInvalid dates:")
print(invalid_dates)

Exercise 2: Removing Duplicate Rows

Given a dataset with duplicate entries, your task is to:

Identify duplicate rows.
Remove duplicates and keep only unique records.

# Sample data with duplicate rows
data = {'CustomerID': [101, 102, 103, 101],
        'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
        'PurchaseAmount': [150, 200, 300, 150]}
df = pd.DataFrame(data)

# Solution: Identify duplicate rows
duplicates = df[df.duplicated()]
print("Duplicate rows:")
print(duplicates)

# Solution: Remove duplicate rows
df = df.drop_duplicates()

print("\\nDataset after removing duplicates:")
print(df)

Exercise 3: Standardizing Text in Categorical Data

You are provided with a dataset that contains inconsistent capitalization in a categorical column. Your task is to:

Standardize all entries in the Category column to lowercase.

# Sample data with inconsistent text entries
data = {'Category': ['Electronics', 'electronics', 'ELECTronics', 'Furniture', 'furniture']}
df = pd.DataFrame(data)

# Solution: Standardize text to lowercase
df['Category'] = df['Category'].str.lower()

print("Dataset with standardized categories:")
print(df)

Exercise 4: Removing Out-of-Range Values

Given a dataset with an Age column, where valid ages range between 0 and 120, your task is to:

Identify any values in the Age column that fall outside this range.
Remove the rows with out-of-range values.

# Sample data with an out-of-range value
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
        'Age': [25, 132, 30, -5]}
df = pd.DataFrame(data)

# Solution: Identify out-of-range values
out_of_range = df[(df['Age'] < 0) | (df['Age'] > 120)]
print("Out-of-range values:")
print(out_of_range)

# Solution: Remove out-of-range values
df = df[(df['Age'] >= 0) & (df['Age'] <= 120)]

print("\\nDataset after removing out-of-range values:")
print(df)

Exercise 5: Imputing Missing Values After Anomaly Correction

You have a dataset where some dates are missing after correcting anomalies in the OrderDate column. Your task is to:

Forward-fill the missing dates to maintain continuity in the data.

# Sample data with missing values
data = {'OrderDate': ['2022-01-15', pd.NaT, '2022-02-05', pd.NaT]}
df = pd.DataFrame(data)

# Solution: Fill missing dates using forward fill
df['OrderDate'] = df['OrderDate'].fillna(method='ffill')

print("Dataset after forward-filling missing dates:")
print(df)

8.3 Practical Exercises for Chapter 8

Exercise 1: Standardizing Date Formats

You are given a dataset where dates are represented in different formats. Your task is to:

Convert all dates to the YYYY-MM-DD format.
Identify any invalid dates.

import pandas as pd

# Sample data with inconsistent date formats
data = {'OrderDate': ['2022-01-15', '01/20/2022', 'February 5, 2022', '2022/02/10', '2022-31-12']}
df = pd.DataFrame(data)

# Solution: Convert all dates to a consistent format
df['OrderDate'] = pd.to_datetime(df['OrderDate'], errors='coerce')

# Identify invalid dates (converted to NaT)
invalid_dates = df[df['OrderDate'].isna()]

print("Dataset with standardized dates:")
print(df)
print("\\nInvalid dates:")
print(invalid_dates)

Exercise 2: Removing Duplicate Rows

Given a dataset with duplicate entries, your task is to:

Identify duplicate rows.
Remove duplicates and keep only unique records.

# Sample data with duplicate rows
data = {'CustomerID': [101, 102, 103, 101],
        'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
        'PurchaseAmount': [150, 200, 300, 150]}
df = pd.DataFrame(data)

# Solution: Identify duplicate rows
duplicates = df[df.duplicated()]
print("Duplicate rows:")
print(duplicates)

# Solution: Remove duplicate rows
df = df.drop_duplicates()

print("\\nDataset after removing duplicates:")
print(df)

Exercise 3: Standardizing Text in Categorical Data

You are provided with a dataset that contains inconsistent capitalization in a categorical column. Your task is to:

Standardize all entries in the Category column to lowercase.

# Sample data with inconsistent text entries
data = {'Category': ['Electronics', 'electronics', 'ELECTronics', 'Furniture', 'furniture']}
df = pd.DataFrame(data)

# Solution: Standardize text to lowercase
df['Category'] = df['Category'].str.lower()

print("Dataset with standardized categories:")
print(df)

Exercise 4: Removing Out-of-Range Values

Given a dataset with an Age column, where valid ages range between 0 and 120, your task is to:

Identify any values in the Age column that fall outside this range.
Remove the rows with out-of-range values.

# Sample data with an out-of-range value
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
        'Age': [25, 132, 30, -5]}
df = pd.DataFrame(data)

# Solution: Identify out-of-range values
out_of_range = df[(df['Age'] < 0) | (df['Age'] > 120)]
print("Out-of-range values:")
print(out_of_range)

# Solution: Remove out-of-range values
df = df[(df['Age'] >= 0) & (df['Age'] <= 120)]

print("\\nDataset after removing out-of-range values:")
print(df)

Exercise 5: Imputing Missing Values After Anomaly Correction

You have a dataset where some dates are missing after correcting anomalies in the OrderDate column. Your task is to:

Forward-fill the missing dates to maintain continuity in the data.

# Sample data with missing values
data = {'OrderDate': ['2022-01-15', pd.NaT, '2022-02-05', pd.NaT]}
df = pd.DataFrame(data)

# Solution: Fill missing dates using forward fill
df['OrderDate'] = df['OrderDate'].fillna(method='ffill')

print("Dataset after forward-filling missing dates:")
print(df)

8.3 Practical Exercises for Chapter 8

Exercise 1: Standardizing Date Formats

You are given a dataset where dates are represented in different formats. Your task is to:

Convert all dates to the YYYY-MM-DD format.
Identify any invalid dates.

import pandas as pd

# Sample data with inconsistent date formats
data = {'OrderDate': ['2022-01-15', '01/20/2022', 'February 5, 2022', '2022/02/10', '2022-31-12']}
df = pd.DataFrame(data)

# Solution: Convert all dates to a consistent format
df['OrderDate'] = pd.to_datetime(df['OrderDate'], errors='coerce')

# Identify invalid dates (converted to NaT)
invalid_dates = df[df['OrderDate'].isna()]

print("Dataset with standardized dates:")
print(df)
print("\\nInvalid dates:")
print(invalid_dates)

Exercise 2: Removing Duplicate Rows

Given a dataset with duplicate entries, your task is to:

Identify duplicate rows.
Remove duplicates and keep only unique records.

# Sample data with duplicate rows
data = {'CustomerID': [101, 102, 103, 101],
        'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
        'PurchaseAmount': [150, 200, 300, 150]}
df = pd.DataFrame(data)

# Solution: Identify duplicate rows
duplicates = df[df.duplicated()]
print("Duplicate rows:")
print(duplicates)

# Solution: Remove duplicate rows
df = df.drop_duplicates()

print("\\nDataset after removing duplicates:")
print(df)

Exercise 3: Standardizing Text in Categorical Data

You are provided with a dataset that contains inconsistent capitalization in a categorical column. Your task is to:

Standardize all entries in the Category column to lowercase.

# Sample data with inconsistent text entries
data = {'Category': ['Electronics', 'electronics', 'ELECTronics', 'Furniture', 'furniture']}
df = pd.DataFrame(data)

# Solution: Standardize text to lowercase
df['Category'] = df['Category'].str.lower()

print("Dataset with standardized categories:")
print(df)

Exercise 4: Removing Out-of-Range Values

Given a dataset with an Age column, where valid ages range between 0 and 120, your task is to:

Identify any values in the Age column that fall outside this range.
Remove the rows with out-of-range values.

# Sample data with an out-of-range value
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
        'Age': [25, 132, 30, -5]}
df = pd.DataFrame(data)

# Solution: Identify out-of-range values
out_of_range = df[(df['Age'] < 0) | (df['Age'] > 120)]
print("Out-of-range values:")
print(out_of_range)

# Solution: Remove out-of-range values
df = df[(df['Age'] >= 0) & (df['Age'] <= 120)]

print("\\nDataset after removing out-of-range values:")
print(df)

Exercise 5: Imputing Missing Values After Anomaly Correction

You have a dataset where some dates are missing after correcting anomalies in the OrderDate column. Your task is to:

Forward-fill the missing dates to maintain continuity in the data.

# Sample data with missing values
data = {'OrderDate': ['2022-01-15', pd.NaT, '2022-02-05', pd.NaT]}
df = pd.DataFrame(data)

# Solution: Fill missing dates using forward fill
df['OrderDate'] = df['OrderDate'].fillna(method='ffill')

print("Dataset after forward-filling missing dates:")
print(df)

Purchase this book

The App is Under a Quick Maintenance

We apologize for the inconvenience. Please come back later

Chapter 8: Advanced Data Cleaning Techniques

8.3 Practical Exercises for Chapter 8

Exercise 1: Standardizing Date Formats

Exercise 2: Removing Duplicate Rows

Exercise 3: Standardizing Text in Categorical Data

Exercise 4: Removing Out-of-Range Values

Exercise 5: Imputing Missing Values After Anomaly Correction

8.3 Practical Exercises for Chapter 8

Exercise 1: Standardizing Date Formats

Exercise 2: Removing Duplicate Rows

Exercise 3: Standardizing Text in Categorical Data

Exercise 4: Removing Out-of-Range Values

Exercise 5: Imputing Missing Values After Anomaly Correction

8.3 Practical Exercises for Chapter 8

Exercise 1: Standardizing Date Formats

Exercise 2: Removing Duplicate Rows

Exercise 3: Standardizing Text in Categorical Data

Exercise 4: Removing Out-of-Range Values

Exercise 5: Imputing Missing Values After Anomaly Correction

8.3 Practical Exercises for Chapter 8

Exercise 1: Standardizing Date Formats

Exercise 2: Removing Duplicate Rows

Exercise 3: Standardizing Text in Categorical Data

Exercise 4: Removing Out-of-Range Values

Exercise 5: Imputing Missing Values After Anomaly Correction