Chapter 8: Advanced Data Cleaning Techniques
8.3 Practical Exercises for Chapter 8
These exercises will give you hands-on practice with identifying and handling data anomalies using the techniques covered in this chapter. Each exercise addresses a different aspect of data cleaning, and solutions with code are provided where necessary.
Exercise 1: Standardizing Date Formats
You are given a dataset where dates are represented in different formats. Your task is to:
- Convert all dates to the
YYYY-MM-DD
format. - Identify any invalid dates.
import pandas as pd
# Sample data with inconsistent date formats
data = {'OrderDate': ['2022-01-15', '01/20/2022', 'February 5, 2022', '2022/02/10', '2022-31-12']}
df = pd.DataFrame(data)
# Solution: Convert all dates to a consistent format
df['OrderDate'] = pd.to_datetime(df['OrderDate'], errors='coerce')
# Identify invalid dates (converted to NaT)
invalid_dates = df[df['OrderDate'].isna()]
print("Dataset with standardized dates:")
print(df)
print("\\nInvalid dates:")
print(invalid_dates)
Exercise 2: Removing Duplicate Rows
Given a dataset with duplicate entries, your task is to:
- Identify duplicate rows.
- Remove duplicates and keep only unique records.
# Sample data with duplicate rows
data = {'CustomerID': [101, 102, 103, 101],
'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
'PurchaseAmount': [150, 200, 300, 150]}
df = pd.DataFrame(data)
# Solution: Identify duplicate rows
duplicates = df[df.duplicated()]
print("Duplicate rows:")
print(duplicates)
# Solution: Remove duplicate rows
df = df.drop_duplicates()
print("\\nDataset after removing duplicates:")
print(df)
Exercise 3: Standardizing Text in Categorical Data
You are provided with a dataset that contains inconsistent capitalization in a categorical column. Your task is to:
- Standardize all entries in the Category column to lowercase.
# Sample data with inconsistent text entries
data = {'Category': ['Electronics', 'electronics', 'ELECTronics', 'Furniture', 'furniture']}
df = pd.DataFrame(data)
# Solution: Standardize text to lowercase
df['Category'] = df['Category'].str.lower()
print("Dataset with standardized categories:")
print(df)
Exercise 4: Removing Out-of-Range Values
Given a dataset with an Age column, where valid ages range between 0 and 120, your task is to:
- Identify any values in the Age column that fall outside this range.
- Remove the rows with out-of-range values.
# Sample data with an out-of-range value
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 132, 30, -5]}
df = pd.DataFrame(data)
# Solution: Identify out-of-range values
out_of_range = df[(df['Age'] < 0) | (df['Age'] > 120)]
print("Out-of-range values:")
print(out_of_range)
# Solution: Remove out-of-range values
df = df[(df['Age'] >= 0) & (df['Age'] <= 120)]
print("\\nDataset after removing out-of-range values:")
print(df)
Exercise 5: Imputing Missing Values After Anomaly Correction
You have a dataset where some dates are missing after correcting anomalies in the OrderDate column. Your task is to:
Forward-fill the missing dates to maintain continuity in the data.
# Sample data with missing values
data = {'OrderDate': ['2022-01-15', pd.NaT, '2022-02-05', pd.NaT]}
df = pd.DataFrame(data)
# Solution: Fill missing dates using forward fill
df['OrderDate'] = df['OrderDate'].fillna(method='ffill')
print("Dataset after forward-filling missing dates:")
print(df)
These exercises cover essential data cleaning techniques using Pandas, allowing you to handle inconsistent formats, duplicates, categorical inconsistencies, out-of-range values, and missing values. By mastering these techniques, you can prepare high-quality datasets for analysis and modeling.
8.3 Practical Exercises for Chapter 8
These exercises will give you hands-on practice with identifying and handling data anomalies using the techniques covered in this chapter. Each exercise addresses a different aspect of data cleaning, and solutions with code are provided where necessary.
Exercise 1: Standardizing Date Formats
You are given a dataset where dates are represented in different formats. Your task is to:
- Convert all dates to the
YYYY-MM-DD
format. - Identify any invalid dates.
import pandas as pd
# Sample data with inconsistent date formats
data = {'OrderDate': ['2022-01-15', '01/20/2022', 'February 5, 2022', '2022/02/10', '2022-31-12']}
df = pd.DataFrame(data)
# Solution: Convert all dates to a consistent format
df['OrderDate'] = pd.to_datetime(df['OrderDate'], errors='coerce')
# Identify invalid dates (converted to NaT)
invalid_dates = df[df['OrderDate'].isna()]
print("Dataset with standardized dates:")
print(df)
print("\\nInvalid dates:")
print(invalid_dates)
Exercise 2: Removing Duplicate Rows
Given a dataset with duplicate entries, your task is to:
- Identify duplicate rows.
- Remove duplicates and keep only unique records.
# Sample data with duplicate rows
data = {'CustomerID': [101, 102, 103, 101],
'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
'PurchaseAmount': [150, 200, 300, 150]}
df = pd.DataFrame(data)
# Solution: Identify duplicate rows
duplicates = df[df.duplicated()]
print("Duplicate rows:")
print(duplicates)
# Solution: Remove duplicate rows
df = df.drop_duplicates()
print("\\nDataset after removing duplicates:")
print(df)
Exercise 3: Standardizing Text in Categorical Data
You are provided with a dataset that contains inconsistent capitalization in a categorical column. Your task is to:
- Standardize all entries in the Category column to lowercase.
# Sample data with inconsistent text entries
data = {'Category': ['Electronics', 'electronics', 'ELECTronics', 'Furniture', 'furniture']}
df = pd.DataFrame(data)
# Solution: Standardize text to lowercase
df['Category'] = df['Category'].str.lower()
print("Dataset with standardized categories:")
print(df)
Exercise 4: Removing Out-of-Range Values
Given a dataset with an Age column, where valid ages range between 0 and 120, your task is to:
- Identify any values in the Age column that fall outside this range.
- Remove the rows with out-of-range values.
# Sample data with an out-of-range value
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 132, 30, -5]}
df = pd.DataFrame(data)
# Solution: Identify out-of-range values
out_of_range = df[(df['Age'] < 0) | (df['Age'] > 120)]
print("Out-of-range values:")
print(out_of_range)
# Solution: Remove out-of-range values
df = df[(df['Age'] >= 0) & (df['Age'] <= 120)]
print("\\nDataset after removing out-of-range values:")
print(df)
Exercise 5: Imputing Missing Values After Anomaly Correction
You have a dataset where some dates are missing after correcting anomalies in the OrderDate column. Your task is to:
Forward-fill the missing dates to maintain continuity in the data.
# Sample data with missing values
data = {'OrderDate': ['2022-01-15', pd.NaT, '2022-02-05', pd.NaT]}
df = pd.DataFrame(data)
# Solution: Fill missing dates using forward fill
df['OrderDate'] = df['OrderDate'].fillna(method='ffill')
print("Dataset after forward-filling missing dates:")
print(df)
These exercises cover essential data cleaning techniques using Pandas, allowing you to handle inconsistent formats, duplicates, categorical inconsistencies, out-of-range values, and missing values. By mastering these techniques, you can prepare high-quality datasets for analysis and modeling.
8.3 Practical Exercises for Chapter 8
These exercises will give you hands-on practice with identifying and handling data anomalies using the techniques covered in this chapter. Each exercise addresses a different aspect of data cleaning, and solutions with code are provided where necessary.
Exercise 1: Standardizing Date Formats
You are given a dataset where dates are represented in different formats. Your task is to:
- Convert all dates to the
YYYY-MM-DD
format. - Identify any invalid dates.
import pandas as pd
# Sample data with inconsistent date formats
data = {'OrderDate': ['2022-01-15', '01/20/2022', 'February 5, 2022', '2022/02/10', '2022-31-12']}
df = pd.DataFrame(data)
# Solution: Convert all dates to a consistent format
df['OrderDate'] = pd.to_datetime(df['OrderDate'], errors='coerce')
# Identify invalid dates (converted to NaT)
invalid_dates = df[df['OrderDate'].isna()]
print("Dataset with standardized dates:")
print(df)
print("\\nInvalid dates:")
print(invalid_dates)
Exercise 2: Removing Duplicate Rows
Given a dataset with duplicate entries, your task is to:
- Identify duplicate rows.
- Remove duplicates and keep only unique records.
# Sample data with duplicate rows
data = {'CustomerID': [101, 102, 103, 101],
'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
'PurchaseAmount': [150, 200, 300, 150]}
df = pd.DataFrame(data)
# Solution: Identify duplicate rows
duplicates = df[df.duplicated()]
print("Duplicate rows:")
print(duplicates)
# Solution: Remove duplicate rows
df = df.drop_duplicates()
print("\\nDataset after removing duplicates:")
print(df)
Exercise 3: Standardizing Text in Categorical Data
You are provided with a dataset that contains inconsistent capitalization in a categorical column. Your task is to:
- Standardize all entries in the Category column to lowercase.
# Sample data with inconsistent text entries
data = {'Category': ['Electronics', 'electronics', 'ELECTronics', 'Furniture', 'furniture']}
df = pd.DataFrame(data)
# Solution: Standardize text to lowercase
df['Category'] = df['Category'].str.lower()
print("Dataset with standardized categories:")
print(df)
Exercise 4: Removing Out-of-Range Values
Given a dataset with an Age column, where valid ages range between 0 and 120, your task is to:
- Identify any values in the Age column that fall outside this range.
- Remove the rows with out-of-range values.
# Sample data with an out-of-range value
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 132, 30, -5]}
df = pd.DataFrame(data)
# Solution: Identify out-of-range values
out_of_range = df[(df['Age'] < 0) | (df['Age'] > 120)]
print("Out-of-range values:")
print(out_of_range)
# Solution: Remove out-of-range values
df = df[(df['Age'] >= 0) & (df['Age'] <= 120)]
print("\\nDataset after removing out-of-range values:")
print(df)
Exercise 5: Imputing Missing Values After Anomaly Correction
You have a dataset where some dates are missing after correcting anomalies in the OrderDate column. Your task is to:
Forward-fill the missing dates to maintain continuity in the data.
# Sample data with missing values
data = {'OrderDate': ['2022-01-15', pd.NaT, '2022-02-05', pd.NaT]}
df = pd.DataFrame(data)
# Solution: Fill missing dates using forward fill
df['OrderDate'] = df['OrderDate'].fillna(method='ffill')
print("Dataset after forward-filling missing dates:")
print(df)
These exercises cover essential data cleaning techniques using Pandas, allowing you to handle inconsistent formats, duplicates, categorical inconsistencies, out-of-range values, and missing values. By mastering these techniques, you can prepare high-quality datasets for analysis and modeling.
8.3 Practical Exercises for Chapter 8
These exercises will give you hands-on practice with identifying and handling data anomalies using the techniques covered in this chapter. Each exercise addresses a different aspect of data cleaning, and solutions with code are provided where necessary.
Exercise 1: Standardizing Date Formats
You are given a dataset where dates are represented in different formats. Your task is to:
- Convert all dates to the
YYYY-MM-DD
format. - Identify any invalid dates.
import pandas as pd
# Sample data with inconsistent date formats
data = {'OrderDate': ['2022-01-15', '01/20/2022', 'February 5, 2022', '2022/02/10', '2022-31-12']}
df = pd.DataFrame(data)
# Solution: Convert all dates to a consistent format
df['OrderDate'] = pd.to_datetime(df['OrderDate'], errors='coerce')
# Identify invalid dates (converted to NaT)
invalid_dates = df[df['OrderDate'].isna()]
print("Dataset with standardized dates:")
print(df)
print("\\nInvalid dates:")
print(invalid_dates)
Exercise 2: Removing Duplicate Rows
Given a dataset with duplicate entries, your task is to:
- Identify duplicate rows.
- Remove duplicates and keep only unique records.
# Sample data with duplicate rows
data = {'CustomerID': [101, 102, 103, 101],
'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
'PurchaseAmount': [150, 200, 300, 150]}
df = pd.DataFrame(data)
# Solution: Identify duplicate rows
duplicates = df[df.duplicated()]
print("Duplicate rows:")
print(duplicates)
# Solution: Remove duplicate rows
df = df.drop_duplicates()
print("\\nDataset after removing duplicates:")
print(df)
Exercise 3: Standardizing Text in Categorical Data
You are provided with a dataset that contains inconsistent capitalization in a categorical column. Your task is to:
- Standardize all entries in the Category column to lowercase.
# Sample data with inconsistent text entries
data = {'Category': ['Electronics', 'electronics', 'ELECTronics', 'Furniture', 'furniture']}
df = pd.DataFrame(data)
# Solution: Standardize text to lowercase
df['Category'] = df['Category'].str.lower()
print("Dataset with standardized categories:")
print(df)
Exercise 4: Removing Out-of-Range Values
Given a dataset with an Age column, where valid ages range between 0 and 120, your task is to:
- Identify any values in the Age column that fall outside this range.
- Remove the rows with out-of-range values.
# Sample data with an out-of-range value
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 132, 30, -5]}
df = pd.DataFrame(data)
# Solution: Identify out-of-range values
out_of_range = df[(df['Age'] < 0) | (df['Age'] > 120)]
print("Out-of-range values:")
print(out_of_range)
# Solution: Remove out-of-range values
df = df[(df['Age'] >= 0) & (df['Age'] <= 120)]
print("\\nDataset after removing out-of-range values:")
print(df)
Exercise 5: Imputing Missing Values After Anomaly Correction
You have a dataset where some dates are missing after correcting anomalies in the OrderDate column. Your task is to:
Forward-fill the missing dates to maintain continuity in the data.
# Sample data with missing values
data = {'OrderDate': ['2022-01-15', pd.NaT, '2022-02-05', pd.NaT]}
df = pd.DataFrame(data)
# Solution: Fill missing dates using forward fill
df['OrderDate'] = df['OrderDate'].fillna(method='ffill')
print("Dataset after forward-filling missing dates:")
print(df)
These exercises cover essential data cleaning techniques using Pandas, allowing you to handle inconsistent formats, duplicates, categorical inconsistencies, out-of-range values, and missing values. By mastering these techniques, you can prepare high-quality datasets for analysis and modeling.