Multiple Linear Regression

Nerd Cafe

Module 1: Introduction to Multiple Linear Regression

What is Multiple Linear Regression?

  • A supervised learning algorithm.

  • Predicts a dependent variable (target) using multiple independent variables (predictors).

  • Extension of simple linear regression.

Mathematical Representation:

$$\hat{y} = \omega_{0} + \omega_{1}x_{1} + \omega_{2}x_{2} + \dots + \omega_{p}x_{p}$$

Where:

  • $\hat{y}$: predicted value

  • $\omega_{0}$: intercept

  • $\omega_{1}, \dots, \omega_{p}$: coefficients

  • $x_{1}, \dots, x_{p}$: independent variables
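To make the formula concrete, here is a small worked example with three predictors. The intercept, coefficients, and feature values are made up purely for illustration; they do not come from any fitted model.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula.
w0 = 2.0                          # intercept (omega_0)
w = np.array([0.5, -1.0, 3.0])    # coefficients omega_1 .. omega_3
x = np.array([4.0, 2.0, 1.0])     # one observation x_1 .. x_3

# y_hat = omega_0 + omega_1*x_1 + omega_2*x_2 + omega_3*x_3
y_hat = w0 + np.dot(w, x)
print(y_hat)  # 2.0 + 2.0 - 2.0 + 3.0 = 5.0
```

The dot product handles any number of predictors, so the same line works for p = 3 or p = 300.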

Module 2: Assumptions of Multiple Linear Regression

  • Linearity: The relationship between the target and the predictors is linear.

  • Independence: Observations are independent.

  • Homoscedasticity: Equal variance of residuals.

  • Normality: Residuals follow a normal distribution.

  • No Multicollinearity: Predictors are not highly correlated.

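  • No Autocorrelation: Residuals are not correlated with one another, for example across time in sequential data.

  • Fixed Independent Variables: Predictors are treated as fixed, known values rather than random quantities measured with error.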
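As a rough illustration of how the multicollinearity assumption can be checked, the sketch below computes a variance inflation factor (VIF) for each predictor using only NumPy. The helper name vif and the synthetic columns a, b, and c are assumptions introduced here, not part of the tutorial's dataset. A common rule of thumb treats values above roughly 5 to 10 as a warning sign.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X.

    Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
    regressing column j on all other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - residuals.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic predictors (made up): c is almost a copy of a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)

vif_independent = vif(np.column_stack([a, b]))
vif_collinear = vif(np.column_stack([a, b, c]))
print(vif_independent)  # both values close to 1
print(vif_collinear)    # a and c get large values, b stays close to 1
```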

Module 3: Python Implementation (Step-by-Step)

Step 1: Data Preparation

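This tutorial uses a dataset stored in db_mlr.csv, with the columns 'R&D Spend', 'Administration', 'Marketing Spend', and 'Profit'.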

1. Importing Libraries

  • numpy (np): Used for numerical operations (e.g., arrays, linear algebra).

  • pandas (pd): Used for data manipulation and analysis using DataFrames.

  • matplotlib.pyplot (plt): Used for plotting graphs and visualizations.

2. Load Dataset

  • pd.read_csv loads the CSV file db_mlr.csv into a DataFrame called dataset.

  • dataset now holds tabular data, with rows and columns like an Excel sheet.

3. Feature Matrix (X) and Target Vector (y)

  • X is a DataFrame containing the input features:

    • 'R&D Spend'

    • 'Administration'

    • 'Marketing Spend'

  • y is a Series containing the target variable (what you want to predict), which is 'Profit'.
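The data preparation step described above might look like the following sketch. Because db_mlr.csv is not included here, the snippet reads a tiny in-memory CSV instead; those four rows are made-up placeholder values, not the real dataset, and only the column names are taken from the text.

```python
import io

import pandas as pd

# With the real file in the working directory, the load would be:
#     dataset = pd.read_csv("db_mlr.csv")
# The rows below are made-up placeholders (NOT the real data), used only
# so that this snippet runs without the file.
csv_text = """R&D Spend,Administration,Marketing Spend,Profit
1000.0,500.0,2000.0,3000.0
1500.0,600.0,2500.0,4100.0
800.0,550.0,1200.0,2300.0
2000.0,700.0,3000.0,5200.0
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# Feature matrix: the three predictor columns, selected by name.
X = dataset[["R&D Spend", "Administration", "Marketing Spend"]]

# Target vector: the column to predict.
y = dataset["Profit"]

print(X.shape, y.shape)  # (4, 3) (4,)
```

Selecting columns by name rather than by position keeps the code correct even if the column order in the file changes.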

Step 2: Splitting the Dataset

1. Importing the Function

  • Import the train_test_split function from Scikit-learn, a popular machine learning library in Python.

  • train_test_split() is used to split your dataset into two parts:

    • Training set (to train the model)

    • Testing set (to evaluate how well the model performs)

2. Splitting the Data

The call train_test_split(X, y, test_size=0.2) takes these inputs:

  • X : Your features (independent variables) — e.g., R&D Spend, Administration, Marketing Spend

  • y : Your target/output — e.g., Profit

It returns four variables:

  1. X_train — 80% of the data for training (features)

  2. X_test — 20% of the data for testing (features)

  3. y_train — 80% of the target values for training

  4. y_test — 20% of the target values for testing

Parameter:

  • test_size=0.2 means:

    • 20% of the data will go into the test set

    • 80% will go into the training set
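A minimal sketch of the split, using synthetic stand-in data so that it runs on its own. The random values and the random_state argument are assumptions added here; random_state only makes the shuffle reproducible and is not mentioned in the text above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tutorial's data (made-up values).
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.uniform(0, 100, size=(50, 3)),
    columns=["R&D Spend", "Administration", "Marketing Spend"],
)
y = pd.Series(rng.uniform(0, 100, size=50), name="Profit")

# test_size=0.2 sends 20% of the rows to the test set.
# random_state=0 is an assumption, added for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (40, 3) (10, 3)
```

The rows are shuffled before splitting, and the same shuffle is applied to X and y, so each feature row stays paired with its target value.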

Step 3: Model Training

  • Import the LinearRegression class from the sklearn.linear_model module.

  • LinearRegression is a machine learning model that assumes a linear relationship between the input features (X) and the output (y).

  • Create an instance of the LinearRegression class named regressor.

  • At this stage, the model is initialized but not yet trained.

  • Train (fit) the model on the training data by calling regressor.fit(X_train, y_train).

  • X_train is the matrix of input features for training.

  • y_train holds the corresponding target values.

  • The .fit() method uses ordinary least squares to find the best-fit hyperplane, that is, the coefficients that minimize the sum of squared differences (and therefore the mean squared error) between the predicted and actual y values on the training set.
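The training step can be sketched as follows. The training data here is synthetic and generated from known, made-up coefficients (10, 2, -1, 0.5), so you can see that the fit recovers them approximately; the real tutorial would pass the X_train and y_train produced by the split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data (made up):
# y = 10 + 2*x1 - 1*x2 + 0.5*x3 + small random noise
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 100, size=(40, 3))
true_coefs = np.array([2.0, -1.0, 0.5])
y_train = 10 + X_train @ true_coefs + rng.normal(0, 1, size=40)

regressor = LinearRegression()     # initialized, not yet trained
regressor.fit(X_train, y_train)    # ordinary least squares fit

# The estimates should land close to the values used to generate the data.
print(regressor.intercept_, regressor.coef_)
```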

Step 4: Model Testing and Prediction

  • Use the trained linear regression model (regressor) to make predictions on the test data (X_test) by calling regressor.predict(X_test).

  • y_pred contains the predicted output values corresponding to the input features in X_test.

  • Build a pandas DataFrame (this assumes import pandas as pd has been done) to compare the results.

  • The DataFrame df shows a side-by-side comparison of:

    • 'Real Values': The actual output values (y_test) from the test set.

    • 'Predicted Values': The predicted values (y_pred) from the model.

Printing the DataFrame to the console lets you visually compare the actual and predicted results.
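Put together, prediction and comparison might look like the sketch below. It repeats the split and fit on synthetic, made-up data so that it is self-contained; random_state is again an assumption added for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (made up): a linear signal plus small noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50, 3))
y = 10 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, size=50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = LinearRegression().fit(X_train, y_train)

# Predict the target for rows the model has not seen during training.
y_pred = regressor.predict(X_test)

# Side-by-side comparison of actual and predicted values.
df = pd.DataFrame({"Real Values": y_test, "Predicted Values": y_pred})
print(df)
```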

Step 5: Model Evaluation

1. Imports:

  • mean_squared_error: Calculates the average of squared differences between actual (y_test) and predicted (y_pred) values.

  • mean_absolute_error: Calculates the average of absolute differences.

  • r2_score: Calculates the R-squared (coefficient of determination), which indicates how well the model explains the variance in the data.

  • sqrt: From Python’s math module, used to compute the square root (for RMSE).

2. Metric Calculations:

  • MSE (Mean Squared Error): Measures the average of the squared errors.

    • A lower MSE indicates better model performance.

  • RMSE (Root Mean Squared Error): The square root of MSE, which brings the error back to the original unit of the target variable.

    • Easier to interpret than MSE.

  • MAE (Mean Absolute Error): Measures the average magnitude of the errors (without squaring).

    • It is less sensitive to outliers than MSE, because large errors are not squared.

  • R² Score: Indicates the proportion of variance in the target variable that is explained by the features.

    • Reference values:

      • 1.0 (perfect prediction)

      • 0 (no better than always predicting the mean of the target)

      • < 0 (worse than that mean-only baseline)
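The four metrics can be computed as in this sketch. The actual and predicted values are made up and kept tiny so the arithmetic can be checked by hand: the errors are 1, 0, -1, and -2.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only.
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

mse = mean_squared_error(y_test, y_pred)    # (1 + 0 + 1 + 4) / 4
rmse = sqrt(mse)                            # back in the target's units
mae = mean_absolute_error(y_test, y_pred)   # (1 + 0 + 1 + 2) / 4
r2 = r2_score(y_test, y_pred)               # 1 - 6 / 20

print(f"MSE:  {mse:.3f}")   # MSE:  1.500
print(f"RMSE: {rmse:.3f}")  # RMSE: 1.225
print(f"MAE:  {mae:.3f}")   # MAE:  1.000
print(f"R2:   {r2:.3f}")    # R2:   0.700
```

Note how the single error of 2 contributes 4 of the 6 squared-error units but only 2 of the 4 absolute-error units, which is why MSE reacts more strongly to outliers than MAE.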


Step 6: Making Predictions for New Data
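The original code for this step is not shown, so the following is only a sketch of one way to predict for a new observation. The training data is synthetic and noise-free, and the new input values (50, 20, 30) are hypothetical. The new row must have the same features, in the same order, as the training data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["R&D Spend", "Administration", "Marketing Spend"]

# Synthetic, noise-free training data (made up):
# Profit = 10 + 2*R&D - 1*Administration + 0.5*Marketing
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.uniform(0, 100, size=(40, 3)), columns=features)
y_train = 10 + X_train.to_numpy() @ np.array([2.0, -1.0, 0.5])

regressor = LinearRegression().fit(X_train, y_train)

# One new, hypothetical observation with the same columns as X_train.
new_data = pd.DataFrame([[50.0, 20.0, 30.0]], columns=features)

# predict() expects a 2-D input, so a single row still goes in as a table.
prediction = regressor.predict(new_data)
print(prediction)  # approximately [105.], i.e. 10 + 100 - 20 + 15
```

Passing a DataFrame with the training column names, rather than a bare list, lets scikit-learn match the features by name and avoids a feature-name warning.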


Step 7: Interpreting Model Coefficients
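The original code for this step is not shown, so this is a sketch under the same assumptions as before (synthetic, noise-free data generated from made-up coefficients). After fitting, the learned values are available as regressor.intercept_ and regressor.coef_. Each coefficient is the expected change in the target for a one-unit increase in that predictor, with the other predictors held constant.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["R&D Spend", "Administration", "Marketing Spend"]

# Synthetic, noise-free data (made up):
# Profit = 10 + 2*R&D - 1*Administration + 0.5*Marketing
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.uniform(0, 100, size=(40, 3)), columns=features)
y_train = 10 + X_train.to_numpy() @ np.array([2.0, -1.0, 0.5])

regressor = LinearRegression().fit(X_train, y_train)

# coef_ is ordered like the columns of X_train, so pair them up by position.
for name, coef in zip(features, regressor.coef_):
    print(f"{name}: {coef:.3f}")
print(f"Intercept: {regressor.intercept_:.3f}")
```

In this toy setup, a coefficient of 2 on 'R&D Spend' means that one extra unit of R&D spend raises the predicted profit by 2 units, provided the other two inputs stay the same. Coefficients on differently scaled predictors are not directly comparable in size.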


Module 4: Applications

| Field | Application Example |
| --- | --- |
| Finance | Stock price prediction |
| Marketing | Campaign effectiveness |
| Real Estate | House price prediction |
| Healthcare | Treatment outcome prediction |
| Economics | GDP or inflation forecasting |
| Social Science | Election result modeling |

Module 5: Common Challenges

| Challenge | Description |
| --- | --- |
| Multicollinearity | Highly correlated predictors affect model stability |
| Overfitting | Model too closely fits training data |
| Underfitting | Model too simple to capture patterns |
| Outliers | Affect predictions drastically |
| Missing Data | Leads to biased results |
| Non-Linearity | Linear model may not fit non-linear relationships |

Module 6: Simple vs Multiple Linear Regression

| Feature | Simple Linear Regression | Multiple Linear Regression |
| --- | --- | --- |
| Independent Variables | One | Two or more |
| Complexity | Low | Higher |
| Use Case Example | Sales vs Ad Spend | Sales vs Ad, Price, Competition |

Keywords

multiple linear regression, machine learning, regression model, independent variables, dependent variable, R&D spend, administration, marketing spend, profit prediction, data preprocessing, train test split, sklearn, linear regression, model evaluation, mean squared error, root mean squared error, R-squared score, model coefficients, multicollinearity, residual error
