Machine Learning Hero

Chapter 6: Practical Machine Learning Projects

6.2 Project 2: Predicting Car Prices Using Linear Regression

In this project, we will develop a predictive model to estimate the prices of used cars based on various features such as mileage, year, make, model, and other relevant factors. This project has significant real-world applications in the automotive industry, particularly for car dealerships, insurance companies, and online marketplaces dealing with used vehicles.

Linear regression is well-suited for this task as our goal is to predict a continuous value (car price) based on multiple input features. Throughout this project, we will:

  1. Explore and preprocess a comprehensive car dataset
  2. Apply linear regression to predict car prices
  3. Evaluate the model's performance using various metrics
  4. Optimize the model through feature engineering and selection
  5. Compare our linear regression model with other algorithms
  6. Analyze feature importance and model interpretability

6.2.1 Load and Explore the Dataset

We'll begin by loading and exploring our comprehensive used car dataset. This crucial step forms the foundation of our analysis, allowing us to gain deep insights into the structure and characteristics of our data.

Through careful examination, we can identify potential issues, such as missing values or outliers, and uncover meaningful patterns that may influence our model's performance.

This initial exploration not only helps us understand the nature of our dataset but also guides our subsequent preprocessing and feature engineering decisions, ultimately leading to a more robust and accurate car price prediction model.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Load the dataset (download link below; place the CSV in your working directory)
car_df = pd.read_csv('used_car_data.csv')

# Display basic information about the dataset
print(car_df.info())
print(car_df.describe())

# Encode categorical columns so they can be included in the correlation heatmap
label_encoders = {}
for col in ['make', 'model', 'fuel_type']:
    le = LabelEncoder()
    car_df[col] = le.fit_transform(car_df[col])
    label_encoders[col] = le

# Visualize the distribution of car prices
plt.figure(figsize=(12, 6))
sns.histplot(car_df['price'], bins=50, kde=True)
plt.title('Distribution of Car Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(car_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

# Scatter plot of price vs. mileage
plt.figure(figsize=(10, 6))
sns.scatterplot(x='mileage', y='price', data=car_df)
plt.title('Price vs. Mileage')
plt.xlabel('Mileage')
plt.ylabel('Price')
plt.show()

Download the CSV file here: https://files.cuantum.tech/csv/used_car_data.csv

Here's a detailed breakdown:

1. Library Imports:

  • Core data analysis libraries (pandas, numpy)
  • Visualization libraries (matplotlib, seaborn)
  • Machine learning components from scikit-learn for model building and preprocessing

2. Data Loading and Initial Analysis:

  • Loads a car dataset from a CSV file
  • Displays basic information and statistical summaries using info() and describe()

3. Categorical Data Encoding:

  • Uses LabelEncoder to convert categorical variables (make, model, fuel_type) into numerical format
  • Stores the encoders in a dictionary for potential later use

4. Data Visualization:

  • Creates a histogram showing the distribution of car prices
  • Generates a correlation heatmap to show relationships between numerical features
  • Plots a scatter plot comparing mileage vs. price
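
Before moving on, it's worth quantifying the issues mentioned above. The snippet below is a minimal sketch, assuming car_df is the DataFrame loaded in the code above; the 1.5 × IQR rule it uses is one common outlier heuristic, not part of the original pipeline.

# Count missing values per column
print(car_df.isnull().sum())

# Flag potential price outliers using the 1.5 * IQR rule (a common heuristic)
q1, q3 = car_df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = car_df[(car_df['price'] < q1 - 1.5 * iqr) | (car_df['price'] > q3 + 1.5 * iqr)]
print(f"Potential price outliers: {len(outliers)}")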

6.2.2 Data Preprocessing

Before we can construct our regression model, it's crucial to preprocess the data to ensure its quality and suitability for analysis.

This essential step involves several key processes:

  1. Handling missing values: We need to address any gaps in our dataset, either by imputing values or removing incomplete records.
  2. Encoding categorical variables: Since our model works with numerical data, we must convert categorical information (like car makes and models) into a format the algorithm can process.
  3. Scaling numerical features: To ensure all features contribute equally to the model, we'll standardize or normalize numerical variables to a common scale.
  4. Feature engineering: We may create new features or transform existing ones to capture important relationships in the data.

These preprocessing steps are vital for building a robust and accurate car price prediction model.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Load the dataset (download link above; place the CSV in your working directory)
car_df = pd.read_csv('used_car_data.csv')

# Display basic information about the dataset
print(car_df.info())
print(car_df.describe())

# Handle missing values first, so the encoders never see NaNs
car_df.dropna(subset=['price'], inplace=True)
car_df['mileage'] = car_df['mileage'].fillna(car_df['mileage'].median())
car_df['year'] = car_df['year'].fillna(car_df['year'].mode()[0])

# Encode categorical variables: label-encode the fuel type, and one-hot
# encode make and model directly (label-encoding them first would be redundant)
le = LabelEncoder()
car_df['fuel_type'] = le.fit_transform(car_df['fuel_type'])
car_df = pd.get_dummies(car_df, columns=['make', 'model'], drop_first=True)

# Feature engineering
car_df['age'] = 2023 - car_df['year']  # Assuming current year is 2023
car_df['age'] = car_df['age'].clip(lower=1)  # Avoid division by zero for current-year cars
car_df['miles_per_year'] = car_df['mileage'] / car_df['age']

# Scale numerical features
scaler = StandardScaler()
numerical_features = ['mileage', 'year', 'age', 'miles_per_year']
car_df[numerical_features] = scaler.fit_transform(car_df[numerical_features])

# Display the updated dataset
print(car_df.head())

# Visualize the distribution of car prices
plt.figure(figsize=(12, 6))
sns.histplot(car_df['price'], bins=50, kde=True)
plt.title('Distribution of Car Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(car_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

# Scatter plot of price vs. mileage
plt.figure(figsize=(10, 6))
sns.scatterplot(x='mileage', y='price', data=car_df)
plt.title('Price vs. Mileage')
plt.xlabel('Mileage')
plt.ylabel('Price')
plt.show()

Here's a breakdown of its main components:

1. Initial Setup and Data Loading

  • Imports necessary libraries for data analysis, visualization, and machine learning
  • Loads a car dataset from a CSV file and displays basic information about it

2. Data Preprocessing

  • Handles missing values first by:
    • Removing rows with missing prices
    • Filling missing mileage values with the median
    • Filling missing year values with the mode
  • Encodes categorical variables:
    • Uses LabelEncoder for fuel_type
    • One-hot encodes make and model into dummy variables with get_dummies

3. Feature Engineering

  • Creates two new features:
    • 'age': calculated as (2023 - car's year), clipped to a minimum of 1 to avoid division by zero
    • 'miles_per_year': calculated as (mileage / age)

4. Data Scaling

  • Uses StandardScaler to standardize the numerical features (mileage, year, age, miles_per_year)

5. Visualization

  • Creates three visualizations:
    • Histogram of car prices
    • Correlation heatmap of numerical features
    • Scatter plot comparing price vs. mileage

This preprocessing pipeline is essential for preparing the data for machine learning modeling and for understanding the relationships between different features.
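
One caveat worth noting: the pipeline above fits the StandardScaler on the full dataset before any train/test split, so information from the test set leaks into the scaling statistics. A leakage-free alternative, sketched below on the assumption that it replaces the scaling step above and that the 80/20 split matches section 6.2.4, fits the scaler on the training portion only:

# Split first, then fit the scaler on the training data only
X_all = car_df.drop('price', axis=1)
y_all = car_df['price']
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

numerical_features = ['mileage', 'year', 'age', 'miles_per_year']
scaler = StandardScaler()
X_tr = X_tr.copy()
X_te = X_te.copy()
X_tr[numerical_features] = scaler.fit_transform(X_tr[numerical_features])  # Learn mean/std from training data
X_te[numerical_features] = scaler.transform(X_te[numerical_features])      # Reuse the training statistics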

6.2.3 Feature Selection

In this crucial step of our model development process, we will employ Recursive Feature Elimination (RFE) to identify and select the most influential features for our car price prediction model.

RFE is an advanced feature selection technique that recursively removes less important features while building the model, allowing us to focus on the variables that have the strongest impact on our target variable.

By implementing RFE, we can streamline our model, improve its performance, and gain valuable insights into which factors are most significant in determining used car prices.

# Prepare features and target
X = car_df.drop('price', axis=1)
y = car_df['price']

# Perform RFE
rfe = RFE(estimator=LinearRegression(), n_features_to_select=10)
rfe = rfe.fit(X, y)

# Get selected features
selected_features = X.columns[rfe.support_]
print("Selected features:", selected_features)

# Update X with selected features
X = X[selected_features]

Here's a breakdown of what the code does:

  • First, it separates the features (X) and the target variable (y) from the dataset. The 'price' column is set as the target variable, while all other columns are considered as features.
  • Next, it initializes the RFE object with a LinearRegression estimator and sets the number of features to select to 10.
  • The RFE is then fitted to the data, which performs the recursive feature elimination process.
  • After fitting, the code retrieves the selected features using rfe.support_ and prints them.
  • Finally, it updates the feature set X to include only the selected features.

This process helps identify the most important features for predicting car prices, potentially improving the model's performance and interpretability.
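
The choice of exactly 10 features above is somewhat arbitrary. If you would rather let the data decide, scikit-learn's RFECV runs the same elimination under cross-validation and keeps the feature count that scores best. A minimal sketch, assuming X and y as defined above:

from sklearn.feature_selection import RFECV

# Let 5-fold cross-validation choose how many features to keep
rfecv = RFECV(estimator=LinearRegression(), step=1, cv=5, scoring='neg_mean_squared_error')
rfecv.fit(X, y)
print(f"Optimal number of features: {rfecv.n_features_}")
print("Selected features:", list(X.columns[rfecv.support_]))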

6.2.4 Split the Data and Build the Model

With our data preprocessed and features selected, we're now ready to move forward with model development. In this crucial step, we'll divide our dataset into training and testing subsets, a practice that allows us to build our linear regression model on one portion of the data and evaluate its performance on another. This approach helps ensure that our model can generalize well to new, unseen data.

By splitting our data, we create a robust framework for assessing our model's predictive capabilities. The training set will be used to teach our linear regression algorithm the underlying patterns in car prices, while the testing set will serve as a proxy for real-world data, allowing us to gauge how well our model performs on previously unseen examples.

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"R-squared Score: {r2}")

Here's a breakdown of what each part does:

  • Data Splitting: The data is split into training and testing sets using train_test_split(). 80% of the data is used for training (test_size=0.2) and 20% for testing.
  • Model Creation and Training: A LinearRegression model is instantiated and trained on the training data using the fit() method.
  • Prediction: The trained model is used to make predictions on the test data.
  • Model Evaluation: The model's performance is evaluated using three metrics: 
    • Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values.
    • Root Mean Squared Error (RMSE): The square root of MSE, which provides an error measure in the same unit as the target variable.
    • R-squared Score: Indicates the proportion of variance in the dependent variable that is predictable from the independent variable(s).

These metrics help assess how well the model is performing in predicting car prices based on the selected features.
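
Because a single train/test split can be sensitive to how the data happens to fall, k-fold cross-validation is a useful sanity check: it averages performance over several different splits. A short sketch, assuming the same X and y:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated RMSE (scores are negated by scikit-learn convention)
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='neg_root_mean_squared_error')
print(f"Cross-validated RMSE: {-cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")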

6.2.5 Model Interpretation

Now, let's delve into the coefficients of our linear regression model to gain a comprehensive understanding of how each feature influences car prices. By examining these coefficients, we can discern which factors have the most significant impact on determining a vehicle's value, providing valuable insights for both buyers and sellers in the used car market.

# Display feature coefficients
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})
coefficients = coefficients.sort_values(by='Coefficient', key=abs, ascending=False)
print(coefficients)

# Visualize feature importance
plt.figure(figsize=(12, 6))
sns.barplot(x='Coefficient', y='Feature', data=coefficients)
plt.title('Feature Importance in Linear Regression Model')
plt.show()

Let's break it down:

  1. Display feature coefficients:
    • It creates a DataFrame 'coefficients' with two columns: 'Feature' (from X.columns) and 'Coefficient' (from model.coef_)
    • The coefficients are then sorted by their absolute values in descending order
    • This sorted DataFrame is printed, showing which features have the largest impact on the prediction
  2. Visualize feature importance:
    • It creates a bar plot using seaborn (sns.barplot)
    • The x-axis represents the coefficient values, and the y-axis shows the feature names
    • This visualization helps to quickly identify which features have the most significant positive or negative impact on car prices

This code is crucial for understanding how each feature in the model contributes to the prediction of car prices, allowing for better interpretation of the model's decision-making process.
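
One subtlety: because the numerical features were standardized, each of their coefficients measures the price change per one standard deviation, not per raw unit. You can recover a per-unit effect by dividing by the standard deviation the scaler stored. A sketch, assuming 'mileage' survived feature selection and scaler is the StandardScaler fitted during preprocessing:

# scaler.scale_ holds the standard deviation of each column it was fitted on
mileage_std = scaler.scale_[numerical_features.index('mileage')]
row = coefficients[coefficients['Feature'] == 'mileage']
if not row.empty:
    per_mile = row['Coefficient'].iloc[0] / mileage_std
    print(f"Estimated price change per additional mile: {per_mile:.4f}")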

6.2.6 Error Analysis

To gain deeper insights into our model's performance and identify potential areas for improvement, let's conduct a thorough analysis of its errors. This crucial step will help us uncover any systematic patterns or notable outliers in our predictions, allowing us to refine our approach and enhance the accuracy of our car price estimates.

By examining the discrepancies between predicted and actual prices, we can pinpoint specific scenarios where our model excels or struggles, ultimately leading to a more robust and reliable prediction system.

# Calculate residuals
residuals = y_test - y_pred

# Plot residuals
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residual Plot')
plt.xlabel('Actual Price')
plt.ylabel('Residuals')
plt.show()

# Plot actual vs predicted prices
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.title('Actual vs Predicted Prices')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.show()

This code performs error analysis for a linear regression model used to predict car prices. It consists of two main parts:

  1. Residual Plot:
  • Calculates residuals (differences between actual and predicted prices)
  • Creates a scatter plot of actual prices vs. residuals
  • Adds a horizontal red dashed line at y=0 to highlight the baseline
  • This plot helps identify any patterns or heteroscedasticity in the errors
  2. Actual vs. Predicted Prices Plot:
  • Creates a scatter plot of actual prices vs. predicted prices
  • Adds a red dashed diagonal line representing perfect predictions
  • This plot helps visualize how well the model's predictions align with actual prices

These visualizations are crucial for understanding the model's performance and identifying potential areas for improvement in the car price prediction model.
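
Beyond the plots, it often pays to inspect the individual cars the model gets most wrong, since they may share a pattern (rare makes, unusually low mileage, and so on). A minimal sketch using the residuals computed above:

# Rank test-set cars by absolute prediction error
errors = pd.DataFrame({'actual': y_test, 'predicted': y_pred, 'residual': residuals})
errors['abs_error'] = errors['residual'].abs()
print(errors.sort_values('abs_error', ascending=False).head(10))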

6.2.7 Model Comparison

To enhance our predictive capabilities and gain a deeper understanding of the factors influencing car prices, we will now compare our linear regression model with a more complex machine learning algorithm: the Random Forest Regressor.

This comparison will allow us to assess whether we can achieve improved accuracy in our predictions and potentially uncover non-linear relationships within our data that the linear model might have missed.

By implementing this additional model, we'll be able to evaluate the strengths and weaknesses of both approaches, providing valuable insights into the most effective method for estimating used car prices in various scenarios.

# Create and train a Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions with Random Forest
rf_pred = rf_model.predict(X_test)

# Evaluate Random Forest model
rf_mse = mean_squared_error(y_test, rf_pred)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, rf_pred)

print("Random Forest Performance:")
print(f"Mean Squared Error: {rf_mse}")
print(f"Root Mean Squared Error: {rf_rmse}")
print(f"R-squared Score: {rf_r2}")

# Compare feature importance
rf_importance = pd.DataFrame({'Feature': X.columns, 'Importance': rf_model.feature_importances_})
rf_importance = rf_importance.sort_values('Importance', ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(x='Importance', y='Feature', data=rf_importance)
plt.title('Feature Importance in Random Forest Model')
plt.show()

Here's a breakdown of what the code does:

  1. Creates and trains a Random Forest model with 100 trees
  2. Uses the trained model to make predictions on the test data
  3. Evaluates the Random Forest model's performance using three metrics:
    • Mean Squared Error (MSE)
    • Root Mean Squared Error (RMSE)
    • R-squared Score
  4. Prints the performance metrics for easy comparison with the linear regression model
  5. Analyzes feature importance in the Random Forest model:
    • Creates a DataFrame with features and their importance scores
    • Sorts features by importance
    • Visualizes feature importance using a bar plot

This code allows for a comprehensive comparison between the linear regression and Random Forest models, helping to identify which approach might be more effective for predicting car prices in this specific scenario.
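
To make that comparison concrete, it helps to put both models' metrics side by side in a single table; a small sketch reusing the values computed earlier:

# Side-by-side comparison of both models on the same test set
comparison = pd.DataFrame({
    'Linear Regression': [mse, rmse, r2],
    'Random Forest': [rf_mse, rf_rmse, rf_r2]
}, index=['MSE', 'RMSE', 'R-squared'])
print(comparison)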

6.2.8 Conclusion

In this project, we've built a comprehensive car price prediction model using linear regression. We've incorporated advanced data exploration techniques, feature engineering, and model interpretation. By comparing our linear regression model with a Random Forest model, we've gained insights into the strengths and limitations of different approaches.

Key takeaways from this project include:

  • The importance of thorough data exploration and visualization
  • The impact of feature engineering on model performance
  • The value of interpretable models like linear regression in understanding feature importance
  • The potential for ensemble methods like Random Forest to capture non-linear relationships and improve predictions

This project demonstrates the power of machine learning in solving real-world problems and provides a solid foundation for further exploration in the field of predictive modeling.

6.2 Project 2: Predicting Car Prices Using Linear Regression

In this project, we will develop a predictive model to estimate the prices of used cars based on various features such as mileage, year, make, model, and other relevant factors. This project has significant real-world applications in the automotive industry, particularly for car dealerships, insurance companies, and online marketplaces dealing with used vehicles.

Linear regression is well-suited for this task as our goal is to predict a continuous value (car price) based on multiple input features. Throughout this project, we will:

  1. Explore and preprocess a comprehensive car dataset
  2. Apply linear regression to predict car prices
  3. Evaluate the model's performance using various metrics
  4. Optimize the model through feature engineering and selection
  5. Compare our linear regression model with other algorithms
  6. Analyze feature importance and model interpretability

6.2.1 Load and Explore the Dataset

We'll begin by loading and exploring our comprehensive used car dataset. This crucial step forms the foundation of our analysis, allowing us to gain deep insights into the structure and characteristics of our data.

Through careful examination, we can identify potential issues, such as missing values or outliers, and uncover meaningful patterns that may influence our model's performance.

This initial exploration not only helps us understand the nature of our dataset but also guides our subsequent preprocessing and feature engineering decisions, ultimately leading to a more robust and accurate car price prediction model.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Load the dataset
car_df = pd.read_csv('/mnt/data/used_car_data.csv')

# Display basic information about the dataset
print(car_df.info())
print(car_df.describe())

# Encode categorical columns
label_encoders = {}
for col in ['make', 'model', 'fuel_type']:
    le = LabelEncoder()
    car_df[col] = le.fit_transform(car_df[col])
    label_encoders[col] = le

# Visualize the distribution of car prices
plt.figure(figsize=(12, 6))
sns.histplot(car_df['price'], bins=50, kde=True)
plt.title('Distribution of Car Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(car_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

# Scatter plot of price vs. mileage
plt.figure(figsize=(10, 6))
sns.scatterplot(x='mileage', y='price', data=car_df)
plt.title('Price vs. Mileage')
plt.xlabel('Mileage')
plt.ylabel('Price')
plt.show()

Download the CSV File here: https://files.cuantum.tech/csv/used_car_data.csv

Here's a detailed breakdown:

1. Library Imports:

  • Core data analysis libraries (pandas, numpy)
  • Visualization libraries (matplotlib, seaborn)
  • Machine learning components from scikit-learn for model building and preprocessing

2. Data Loading and Initial Analysis:

  • Loads a car dataset from a CSV file
  • Displays basic information and statistical summaries using info() and describe()

3. Categorical Data Encoding:

  • Uses LabelEncoder to convert categorical variables (make, model, fuel_type) into numerical format
  • Stores the encoders in a dictionary for potential later use

4. Data Visualization:

  • Creates a histogram showing the distribution of car prices
  • Generates a correlation heatmap to show relationships between numerical features
  • Plots a scatter plot comparing mileage vs. price

6.2.2 Data Preprocessing

Before we can construct our regression model, it's crucial to preprocess the data to ensure its quality and suitability for analysis.

This essential step involves several key processes:

  1. Handling missing values: We need to address any gaps in our dataset, either by imputing values or removing incomplete records.
  2. Encoding categorical variables: Since our model works with numerical data, we must convert categorical information (like car makes and models) into a format the algorithm can process.
  3. Scaling numerical features: To ensure all features contribute equally to the model, we'll standardize or normalize numerical variables to a common scale.
  4. Feature engineering: We may create new features or transform existing ones to capture important relationships in the data.

These preprocessing steps are vital for building a robust and accurate car price prediction model.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Load the dataset
car_df = pd.read_csv('/mnt/data/used_car_data.csv')

# Display basic information about the dataset
print(car_df.info())
print(car_df.describe())

# Encode categorical columns
label_encoders = {}
for col in ['make', 'model', 'fuel_type']:
    le = LabelEncoder()
    car_df[col] = le.fit_transform(car_df[col])
    label_encoders[col] = le

# Handle missing values
car_df.dropna(subset=['price'], inplace=True)
car_df['mileage'].fillna(car_df['mileage'].median(), inplace=True)
car_df['year'].fillna(car_df['year'].mode()[0], inplace=True)

# Encode categorical variables
car_df = pd.get_dummies(car_df, columns=['make', 'model'], drop_first=True)

# Feature engineering
car_df['age'] = 2023 - car_df['year']  # Assuming current year is 2023
car_df['miles_per_year'] = car_df['mileage'] / car_df['age']

# Scale numerical features
scaler = StandardScaler()
numerical_features = ['mileage', 'year', 'age', 'miles_per_year']
car_df[numerical_features] = scaler.fit_transform(car_df[numerical_features])

# Display the updated dataset
print(car_df.head())

# Visualize the distribution of car prices
plt.figure(figsize=(12, 6))
sns.histplot(car_df['price'], bins=50, kde=True)
plt.title('Distribution of Car Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(car_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

# Scatter plot of price vs. mileage
plt.figure(figsize=(10, 6))
sns.scatterplot(x='mileage', y='price', data=car_df)
plt.title('Price vs. Mileage')
plt.xlabel('Mileage')
plt.ylabel('Price')
plt.show()

Here's a breakdown of its main components:

1. Initial Setup and Data Loading

  • Imports necessary libraries for data analysis, visualization, and machine learning
  • Loads a car dataset from a CSV file and displays basic information about it

2. Data Preprocessing

  • Handles missing values by:
    • Removing rows with missing prices
    • Filling missing mileage values with median
    • Filling missing year values with mode
  • Performs categorical encoding in two steps:
    • Initially uses LabelEncoder for make, model, and fuel type
    • Later converts make and model to dummy variables

3. Feature Engineering

  • Creates two new features:
    • 'age': calculated as (2023 - car's year)
    • 'miles_per_year': calculated as (mileage/age)

4. Data Scaling

  • Uses StandardScaler to normalize numerical features (mileage, year, age, miles_per_year)

5. Visualization

  • Creates three visualizations:
    • Histogram of car prices
    • Correlation heatmap of numerical features
    • Scatter plot comparing price vs. mileage

This preprocessing pipeline is essential for preparing the data for machine learning modeling and understanding the relationships between different features

6.2.3 Feature Selection

In this crucial step of our model development process, we will employ Recursive Feature Elimination (RFE) to identify and select the most influential features for our car price prediction model.

RFE is an advanced feature selection technique that recursively removes less important features while building the model, allowing us to focus on the variables that have the strongest impact on our target variable.

By implementing RFE, we can streamline our model, improve its performance, and gain valuable insights into which factors are most significant in determining used car prices.

# Prepare features and target
X = car_df.drop('price', axis=1)
y = car_df['price']

# Perform RFE
rfe = RFE(estimator=LinearRegression(), n_features_to_select=10)
rfe = rfe.fit(X, y)

# Get selected features
selected_features = X.columns[rfe.support_]
print("Selected features:", selected_features)

# Update X with selected features
X = X[selected_features]

Here's a breakdown of what the code does:

  • First, it separates the features (X) and the target variable (y) from the dataset. The 'price' column is set as the target variable, while all other columns are considered as features.
  • Next, it initializes the RFE object with a LinearRegression estimator and sets the number of features to select to 10.
  • The RFE is then fitted to the data, which performs the recursive feature elimination process.
  • After fitting, the code retrieves the selected features using rfe.support_ and prints them.
  • Finally, it updates the feature set X to include only the selected features.

This process helps identify the most important features for predicting car prices, potentially improving the model's performance and interpretability.

6.2.4 Split the Data and Build the Model

With our data preprocessed and features selected, we're now ready to move forward with model development. In this crucial step, we'll divide our dataset into training and testing subsets, a practice that allows us to build our linear regression model on one portion of the data and evaluate its performance on another. This approach helps ensure that our model can generalize well to new, unseen data.

By splitting our data, we create a robust framework for assessing our model's predictive capabilities. The training set will be used to teach our linear regression algorithm the underlying patterns in car prices, while the testing set will serve as a proxy for real-world data, allowing us to gauge how well our model performs on previously unseen examples.

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"R-squared Score: {r2}")

Here's a breakdown of what each part does:

  • Data Splitting: The data is split into training and testing sets using train_test_split(). 80% of the data is used for training (test_size=0.2) and 20% for testing.
  • Model Creation and Training: A LinearRegression model is instantiated and trained on the training data using the fit() method.
  • Prediction: The trained model is used to make predictions on the test data.
  • Model Evaluation: The model's performance is evaluated using three metrics: 
    • Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values.
    • Root Mean Squared Error (RMSE): The square root of MSE, which provides an error measure in the same unit as the target variable.
    • R-squared Score: Indicates the proportion of variance in the dependent variable that is predictable from the independent variable(s).

These metrics help assess how well the model is performing in predicting car prices based on the selected features.

6.2.5 Model Interpretation

Now, let's delve into the coefficients of our linear regression model to gain a comprehensive understanding of how each feature influences car prices. By examining these coefficients, we can discern which factors have the most significant impact on determining a vehicle's value, providing valuable insights for both buyers and sellers in the used car market.

# Display feature coefficients
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})
coefficients = coefficients.sort_values(by='Coefficient', key=abs, ascending=False)
print(coefficients)

# Visualize feature importance
plt.figure(figsize=(12, 6))
sns.barplot(x='Coefficient', y='Feature', data=coefficients)
plt.title('Feature Importance in Linear Regression Model')
plt.show()

Let's break it down:

  1. Display feature coefficients:
    • It creates a DataFrame 'coefficients' with two columns: 'Feature' (from X.columns) and 'Coefficient' (from model.coef_)
    • The coefficients are then sorted by their absolute values in descending order
    • This sorted DataFrame is printed, showing which features have the largest impact on the prediction
  2. Visualize feature importance:
    • It creates a bar plot using seaborn (sns.barplot)
    • The x-axis represents the coefficient values, and the y-axis shows the feature names
    • This visualization helps to quickly identify which features have the most significant positive or negative impact on car prices

This code is crucial for understanding how each feature in the model contributes to the prediction of car prices, allowing for better interpretation of the model's decision-making process.

6.2.6 Error Analysis

To gain deeper insights into our model's performance and identify potential areas for improvement, let's conduct a thorough analysis of its errors. This crucial step will help us uncover any systematic patterns or notable outliers in our predictions, allowing us to refine our approach and enhance the accuracy of our car price estimates.

By examining the discrepancies between predicted and actual prices, we can pinpoint specific scenarios where our model excels or struggles, ultimately leading to a more robust and reliable prediction system.

# Calculate residuals
residuals = y_test - y_pred

# Plot residuals
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residual Plot')
plt.xlabel('Actual Price')
plt.ylabel('Residuals')
plt.show()

# Plot actual vs predicted prices
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.title('Actual vs Predicted Prices')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.show()

This code performs error analysis for a linear regression model used to predict car prices. It consists of two main parts:

  1. Residual Plot:
  • Calculates residuals (differences between actual and predicted prices)
  • Creates a scatter plot of actual prices vs. residuals
  • Adds a horizontal red dashed line at y=0 to highlight the baseline
  • This plot helps identify any patterns or heteroscedasticity in the errors
  1. Actual vs. Predicted Prices Plot:
  • Creates a scatter plot of actual prices vs. predicted prices
  • Adds a red dashed diagonal line representing perfect predictions
  • This plot helps visualize how well the model's predictions align with actual prices

These visualizations are crucial for understanding the model's performance and identifying potential areas for improvement in the car price prediction model.

6.2.7 Model Comparison

To enhance our predictive capabilities and gain a deeper understanding of the factors influencing car prices, we will now compare our linear regression model with a more complex machine learning algorithm: the Random Forest Regressor.

This comparison will allow us to assess whether we can achieve improved accuracy in our predictions and potentially uncover non-linear relationships within our data that the linear model might have missed.

By implementing this additional model, we'll be able to evaluate the strengths and weaknesses of both approaches, providing valuable insights into the most effective method for estimating used car prices in various scenarios.

# Create and train a Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions with Random Forest
rf_pred = rf_model.predict(X_test)

# Evaluate Random Forest model
rf_mse = mean_squared_error(y_test, rf_pred)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, rf_pred)

print("Random Forest Performance:")
print(f"Mean Squared Error: {rf_mse}")
print(f"Root Mean Squared Error: {rf_rmse}")
print(f"R-squared Score: {rf_r2}")

# Compare feature importance
rf_importance = pd.DataFrame({'Feature': X.columns, 'Importance': rf_model.feature_importances_})
rf_importance = rf_importance.sort_values('Importance', ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(x='Importance', y='Feature', data=rf_importance)
plt.title('Feature Importance in Random Forest Model')
plt.show()

Here's a breakdown of what the code does:

  1. Creates and trains a Random Forest model with 100 trees
  2. Uses the trained model to make predictions on the test data
  3. Evaluates the Random Forest model's performance using three metrics:
    • - Mean Squared Error (MSE)
    • - Root Mean Squared Error (RMSE)
    • - R-squared Score
  4. Prints the performance metrics for easy comparison with the linear regression model
  5. Analyzes feature importance in the Random Forest model:
    • - Creates a DataFrame with features and their importance scores
    • - Sorts features by importance
    • - Visualizes feature importance using a bar plot

This code allows for a comprehensive comparison between the linear regression and Random Forest models, helping to identify which approach might be more effective for predicting car prices in this specific scenario.

6.2.8 Conclusion

In this project, we've built a comprehensive car price prediction model using linear regression. We've incorporated advanced data exploration techniques, feature engineering, and model interpretation. By comparing our linear regression model with a Random Forest model, we've gained insights into the strengths and limitations of different approaches.

Key takeaways from this project include:

  • The importance of thorough data exploration and visualization
  • The impact of feature engineering on model performance
  • The value of interpretable models like linear regression in understanding feature importance
  • The potential for ensemble methods like Random Forest to capture non-linear relationships and improve predictions

This project demonstrates the power of machine learning in solving real-world problems and provides a solid foundation for further exploration in the field of predictive modeling.

6.2 Project 2: Predicting Car Prices Using Linear Regression

In this project, we will develop a predictive model to estimate the prices of used cars based on various features such as mileage, year, make, model, and other relevant factors. This project has significant real-world applications in the automotive industry, particularly for car dealerships, insurance companies, and online marketplaces dealing with used vehicles.

Linear regression is well-suited for this task as our goal is to predict a continuous value (car price) based on multiple input features. Throughout this project, we will:

  1. Explore and preprocess a comprehensive car dataset
  2. Apply linear regression to predict car prices
  3. Evaluate the model's performance using various metrics
  4. Optimize the model through feature engineering and selection
  5. Compare our linear regression model with other algorithms
  6. Analyze feature importance and model interpretability

6.2.1 Load and Explore the Dataset

We'll begin by loading and exploring our comprehensive used car dataset. This crucial step forms the foundation of our analysis, allowing us to gain deep insights into the structure and characteristics of our data.

Through careful examination, we can identify potential issues, such as missing values or outliers, and uncover meaningful patterns that may influence our model's performance.

This initial exploration not only helps us understand the nature of our dataset but also guides our subsequent preprocessing and feature engineering decisions, ultimately leading to a more robust and accurate car price prediction model.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Load the dataset
car_df = pd.read_csv('/mnt/data/used_car_data.csv')

# Display basic information about the dataset
print(car_df.info())
print(car_df.describe())

# Encode categorical columns
label_encoders = {}
for col in ['make', 'model', 'fuel_type']:
    le = LabelEncoder()
    car_df[col] = le.fit_transform(car_df[col])
    label_encoders[col] = le

# Visualize the distribution of car prices
plt.figure(figsize=(12, 6))
sns.histplot(car_df['price'], bins=50, kde=True)
plt.title('Distribution of Car Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(car_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

# Scatter plot of price vs. mileage
plt.figure(figsize=(10, 6))
sns.scatterplot(x='mileage', y='price', data=car_df)
plt.title('Price vs. Mileage')
plt.xlabel('Mileage')
plt.ylabel('Price')
plt.show()

Download the CSV File here: https://files.cuantum.tech/csv/used_car_data.csv

Here's a detailed breakdown:

1. Library Imports:

  • Core data analysis libraries (pandas, numpy)
  • Visualization libraries (matplotlib, seaborn)
  • Machine learning components from scikit-learn for model building and preprocessing

2. Data Loading and Initial Analysis:

  • Loads a car dataset from a CSV file
  • Displays basic information and statistical summaries using info() and describe()

3. Categorical Data Encoding:

  • Uses LabelEncoder to convert categorical variables (make, model, fuel_type) into numerical format
  • Stores the encoders in a dictionary for potential later use

4. Data Visualization:

  • Creates a histogram showing the distribution of car prices
  • Generates a correlation heatmap to show relationships between numerical features
  • Plots a scatter plot comparing mileage vs. price

6.2.2 Data Preprocessing

Before we can construct our regression model, it's crucial to preprocess the data to ensure its quality and suitability for analysis.

This essential step involves several key processes:

  1. Handling missing values: We need to address any gaps in our dataset, either by imputing values or removing incomplete records.
  2. Encoding categorical variables: Since our model works with numerical data, we must convert categorical information (like car makes and models) into a format the algorithm can process.
  3. Scaling numerical features: To ensure all features contribute equally to the model, we'll standardize or normalize numerical variables to a common scale.
  4. Feature engineering: We may create new features or transform existing ones to capture important relationships in the data.

These preprocessing steps are vital for building a robust and accurate car price prediction model.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Load the dataset
car_df = pd.read_csv('/mnt/data/used_car_data.csv')

# Display basic information about the dataset
print(car_df.info())
print(car_df.describe())

# Encode categorical columns
label_encoders = {}
for col in ['make', 'model', 'fuel_type']:
    le = LabelEncoder()
    car_df[col] = le.fit_transform(car_df[col])
    label_encoders[col] = le

# Handle missing values
car_df.dropna(subset=['price'], inplace=True)
car_df['mileage'].fillna(car_df['mileage'].median(), inplace=True)
car_df['year'].fillna(car_df['year'].mode()[0], inplace=True)

# Encode categorical variables
car_df = pd.get_dummies(car_df, columns=['make', 'model'], drop_first=True)

# Feature engineering
car_df['age'] = 2023 - car_df['year']  # Assuming current year is 2023
car_df['miles_per_year'] = car_df['mileage'] / car_df['age']

# Scale numerical features
scaler = StandardScaler()
numerical_features = ['mileage', 'year', 'age', 'miles_per_year']
car_df[numerical_features] = scaler.fit_transform(car_df[numerical_features])

# Display the updated dataset
print(car_df.head())

# Visualize the distribution of car prices
plt.figure(figsize=(12, 6))
sns.histplot(car_df['price'], bins=50, kde=True)
plt.title('Distribution of Car Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(car_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

# Scatter plot of price vs. mileage
plt.figure(figsize=(10, 6))
sns.scatterplot(x='mileage', y='price', data=car_df)
plt.title('Price vs. Mileage')
plt.xlabel('Mileage')
plt.ylabel('Price')
plt.show()

Here's a breakdown of its main components:

1. Initial Setup and Data Loading

  • Imports necessary libraries for data analysis, visualization, and machine learning
  • Loads a car dataset from a CSV file and displays basic information about it

2. Data Preprocessing

  • Handles missing values by:
    • Removing rows with missing prices
    • Filling missing mileage values with median
    • Filling missing year values with mode
  • Performs categorical encoding in two steps:
    • Initially uses LabelEncoder for make, model, and fuel type
    • Later converts make and model to dummy variables

3. Feature Engineering

  • Creates two new features:
    • 'age': calculated as (2023 - car's year)
    • 'miles_per_year': calculated as (mileage/age)

4. Data Scaling

  • Uses StandardScaler to normalize numerical features (mileage, year, age, miles_per_year)

5. Visualization

  • Creates three visualizations:
    • Histogram of car prices
    • Correlation heatmap of numerical features
    • Scatter plot comparing price vs. mileage

This preprocessing pipeline is essential for preparing the data for machine learning modeling and understanding the relationships between different features

6.2.3 Feature Selection

In this crucial step of our model development process, we will employ Recursive Feature Elimination (RFE) to identify and select the most influential features for our car price prediction model.

RFE is an advanced feature selection technique that recursively removes less important features while building the model, allowing us to focus on the variables that have the strongest impact on our target variable.

By implementing RFE, we can streamline our model, improve its performance, and gain valuable insights into which factors are most significant in determining used car prices.

# Prepare features and target
X = car_df.drop('price', axis=1)
y = car_df['price']

# Perform RFE
rfe = RFE(estimator=LinearRegression(), n_features_to_select=10)
rfe = rfe.fit(X, y)

# Get selected features
selected_features = X.columns[rfe.support_]
print("Selected features:", selected_features)

# Update X with selected features
X = X[selected_features]

Here's a breakdown of what the code does:

  • First, it separates the features (X) and the target variable (y) from the dataset. The 'price' column is set as the target variable, while all other columns are considered as features.
  • Next, it initializes the RFE object with a LinearRegression estimator and sets the number of features to select to 10.
  • The RFE is then fitted to the data, which performs the recursive feature elimination process.
  • After fitting, the code retrieves the selected features using rfe.support_ and prints them.
  • Finally, it updates the feature set X to include only the selected features.

This process helps identify the most important features for predicting car prices, potentially improving the model's performance and interpretability.

6.2.4 Split the Data and Build the Model

With our data preprocessed and features selected, we're now ready to move forward with model development. In this crucial step, we'll divide our dataset into training and testing subsets, a practice that allows us to build our linear regression model on one portion of the data and evaluate its performance on another. This approach helps ensure that our model can generalize well to new, unseen data.

By splitting our data, we create a robust framework for assessing our model's predictive capabilities. The training set will be used to teach our linear regression algorithm the underlying patterns in car prices, while the testing set will serve as a proxy for real-world data, allowing us to gauge how well our model performs on previously unseen examples.

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"R-squared Score: {r2}")

Here's a breakdown of what each part does:

  • Data Splitting: The data is split into training and testing sets using train_test_split(). 80% of the data is used for training (test_size=0.2) and 20% for testing.
  • Model Creation and Training: A LinearRegression model is instantiated and trained on the training data using the fit() method.
  • Prediction: The trained model is used to make predictions on the test data.
  • Model Evaluation: The model's performance is evaluated using three metrics: 
    • Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values.
    • Root Mean Squared Error (RMSE): The square root of MSE, which provides an error measure in the same unit as the target variable.
    • R-squared Score: Indicates the proportion of variance in the dependent variable that is predictable from the independent variable(s).

These metrics help assess how well the model is performing in predicting car prices based on the selected features.

6.2.5 Model Interpretation

Now, let's delve into the coefficients of our linear regression model to gain a comprehensive understanding of how each feature influences car prices. By examining these coefficients, we can discern which factors have the most significant impact on determining a vehicle's value, providing valuable insights for both buyers and sellers in the used car market.

# Display feature coefficients
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})
coefficients = coefficients.sort_values(by='Coefficient', key=abs, ascending=False)
print(coefficients)

# Visualize feature importance
plt.figure(figsize=(12, 6))
sns.barplot(x='Coefficient', y='Feature', data=coefficients)
plt.title('Feature Importance in Linear Regression Model')
plt.show()

Let's break it down:

  1. Display feature coefficients:
    • It creates a DataFrame 'coefficients' with two columns: 'Feature' (from X.columns) and 'Coefficient' (from model.coef_)
    • The coefficients are then sorted by their absolute values in descending order
    • This sorted DataFrame is printed, showing which features have the largest impact on the prediction
  2. Visualize feature importance:
    • It creates a bar plot using seaborn (sns.barplot)
    • The x-axis represents the coefficient values, and the y-axis shows the feature names
    • This visualization helps to quickly identify which features have the most significant positive or negative impact on car prices

This code is crucial for understanding how each feature in the model contributes to the prediction of car prices, allowing for better interpretation of the model's decision-making process.

6.2.6 Error Analysis

To gain deeper insights into our model's performance and identify potential areas for improvement, let's conduct a thorough analysis of its errors. This crucial step will help us uncover any systematic patterns or notable outliers in our predictions, allowing us to refine our approach and enhance the accuracy of our car price estimates.

By examining the discrepancies between predicted and actual prices, we can pinpoint specific scenarios where our model excels or struggles, ultimately leading to a more robust and reliable prediction system.

# Calculate residuals
residuals = y_test - y_pred

# Plot residuals
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residual Plot')
plt.xlabel('Actual Price')
plt.ylabel('Residuals')
plt.show()

# Plot actual vs predicted prices
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.title('Actual vs Predicted Prices')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.show()

This code performs error analysis for a linear regression model used to predict car prices. It consists of two main parts:

  1. Residual Plot:
  • Calculates residuals (differences between actual and predicted prices)
  • Creates a scatter plot of actual prices vs. residuals
  • Adds a horizontal red dashed line at y=0 to highlight the baseline
  • This plot helps identify any patterns or heteroscedasticity in the errors
  1. Actual vs. Predicted Prices Plot:
  • Creates a scatter plot of actual prices vs. predicted prices
  • Adds a red dashed diagonal line representing perfect predictions
  • This plot helps visualize how well the model's predictions align with actual prices

These visualizations are crucial for understanding the model's performance and identifying potential areas for improvement in the car price prediction model.

6.2.7 Model Comparison

To enhance our predictive capabilities and gain a deeper understanding of the factors influencing car prices, we will now compare our linear regression model with a more complex machine learning algorithm: the Random Forest Regressor.

This comparison will allow us to assess whether we can achieve improved accuracy in our predictions and potentially uncover non-linear relationships within our data that the linear model might have missed.

By implementing this additional model, we'll be able to evaluate the strengths and weaknesses of both approaches, providing valuable insights into the most effective method for estimating used car prices in various scenarios.

# Create and train a Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions with Random Forest
rf_pred = rf_model.predict(X_test)

# Evaluate Random Forest model
rf_mse = mean_squared_error(y_test, rf_pred)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, rf_pred)

print("Random Forest Performance:")
print(f"Mean Squared Error: {rf_mse}")
print(f"Root Mean Squared Error: {rf_rmse}")
print(f"R-squared Score: {rf_r2}")

# Compare feature importance
rf_importance = pd.DataFrame({'Feature': X.columns, 'Importance': rf_model.feature_importances_})
rf_importance = rf_importance.sort_values('Importance', ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(x='Importance', y='Feature', data=rf_importance)
plt.title('Feature Importance in Random Forest Model')
plt.show()

Here's a breakdown of what the code does:

  1. Creates and trains a Random Forest model with 100 trees
  2. Uses the trained model to make predictions on the test data
  3. Evaluates the Random Forest model's performance using three metrics:
    • - Mean Squared Error (MSE)
    • - Root Mean Squared Error (RMSE)
    • - R-squared Score
  4. Prints the performance metrics for easy comparison with the linear regression model
  5. Analyzes feature importance in the Random Forest model:
    • - Creates a DataFrame with features and their importance scores
    • - Sorts features by importance
    • - Visualizes feature importance using a bar plot

This code allows for a comprehensive comparison between the linear regression and Random Forest models, helping to identify which approach might be more effective for predicting car prices in this specific scenario.

6.2.8 Conclusion

In this project, we've built a comprehensive car price prediction model using linear regression. We've incorporated advanced data exploration techniques, feature engineering, and model interpretation. By comparing our linear regression model with a Random Forest model, we've gained insights into the strengths and limitations of different approaches.

Key takeaways from this project include:

  • The importance of thorough data exploration and visualization
  • The impact of feature engineering on model performance
  • The value of interpretable models like linear regression in understanding feature importance
  • The potential for ensemble methods like Random Forest to capture non-linear relationships and improve predictions

This project demonstrates the power of machine learning in solving real-world problems and provides a solid foundation for further exploration in the field of predictive modeling.
