Multiple Linear Regression

Nerd Cafe

Module 1: Introduction to Multiple Linear Regression

What is Multiple Linear Regression?

A supervised learning algorithm.
Predicts a dependent variable (target) using multiple independent variables (predictors).
Extension of simple linear regression.

Mathematical Representation:

$\hat{y}=\omega_{0}+\omega_{1}x_{1}+\omega_{2}x_{2}+\dots+\omega_{p}x_{p}$

Where:

$\hat{y}$ : predicted value
$\omega_{0}$ : intercept
$\omega_{1} , \dots , \omega_{p}$ : coefficients
$x_{1} , ... , x_{p}$ : independent variables
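As a minimal sketch of the formula above, the prediction is the intercept plus a dot product of coefficients and feature values. The function name predict_linear and all numbers below are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
import numpy as np

def predict_linear(intercept, coefficients, features):
    """Return y_hat = w0 + w1*x1 + ... + wp*xp for one observation."""
    return intercept + float(np.dot(coefficients, features))

# Hypothetical values, not taken from the tutorial's dataset.
w0 = 2.0                          # intercept (omega_0)
w = np.array([0.5, -1.0, 3.0])    # coefficients (omega_1 .. omega_3)
x = np.array([4.0, 2.0, 1.0])     # independent variables (x_1 .. x_3)

y_hat = predict_linear(w0, w, x)
print(y_hat)  # 2.0 + 0.5*4 - 1.0*2 + 3.0*1 = 5.0
```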

Module 2: Assumptions of Multiple Linear Regression

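Linearity: The relationship between the target and each predictor is linear.
Independence: Observations are independent of one another.
Homoscedasticity: Residuals have constant variance across all levels of the predictors.
Normality: Residuals follow a normal distribution.
No Multicollinearity: Predictors are not highly correlated with each other.
No Autocorrelation: Residuals are not correlated with one another, which matters most for time-ordered data.
Fixed Independent Variables: Predictor values are treated as fixed (non-random) across repeated samples.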
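One common way to check the multicollinearity assumption is the variance inflation factor, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the other predictors. The sketch below is one possible implementation with NumPy; the helper name vif and the synthetic columns a, b, c are assumptions made for illustration, not part of the tutorial's dataset.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        residuals = target - A @ coef
        r2 = 1.0 - residuals.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic predictors: c is nearly a copy of a, b is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)

factors = vif(np.column_stack([a, b, c]))
print(factors)  # large values for a and c, a value near 1 for b
```

A widely used rule of thumb treats VIF values above roughly 5 to 10 as a sign that a predictor is largely explained by the others.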

Module 3: Python Implementation (Step-by-Step)

Step 1: Data Preparation

The dataset used here includes the columns R&D Spend, Administration, Marketing Spend, and Profit.

1. Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

numpy (np): Used for numerical operations (e.g., arrays, linear algebra).
pandas (pd): Used for data manipulation and analysis using DataFrames.
matplotlib.pyplot (plt): Used for plotting graphs and visualizations.

2. Load Dataset

# Load data
dataset = pd.read_csv('Dataset/datamlr.csv')

This line loads the CSV file Dataset/datamlr.csv into a DataFrame called dataset.
dataset now holds tabular data: rows and columns, like an Excel sheet.
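Before modelling, it usually helps to inspect what was loaded. The sketch below reads a small in-memory CSV so it runs without the real file; the rows are hypothetical stand-ins and do not reflect the actual contents of Dataset/datamlr.csv.

```python
import io
import pandas as pd

# Hypothetical rows standing in for the real CSV file; the values are made up.
csv_text = """R&D Spend,Administration,Marketing Spend,Profit
1000.0,500.0,2000.0,3000.0
1500.0,,2500.0,3600.0
2000.0,700.0,3000.0,4200.0
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())        # first rows of the table
print(sample.shape)         # (number of rows, number of columns)
print(sample.isna().sum())  # count of missing values per column
```

The same three calls can be applied to dataset once the real file is loaded. Missing values found this way need to be handled before fitting, since LinearRegression does not accept NaN inputs.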

3. Feature Matrix (X) and Target Vector (y)

X = dataset[['R&D Spend', 'Administration', 'Marketing Spend']]
y = dataset['Profit']

X is a DataFrame containing the input features:
- 'R&D Spend'
- 'Administration'
- 'Marketing Spend'
y is a Series containing the target variable (what you want to predict), which is 'Profit'.

Step 2: Splitting the Dataset

1. Importing the Function

from sklearn.model_selection import train_test_split

This imports the train_test_split function from Scikit-learn, a popular machine learning library in Python.
train_test_split() is used to split your dataset into two parts:
- Training set (to train the model)
- Testing set (to evaluate how well the model performs)

2. Splitting the Data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Here's what it does:

X : Your features (independent variables) — e.g., R&D Spend, Administration, Marketing Spend
y : Your target/output — e.g., Profit

It returns four variables:

X_train — 80% of the data for training (features)
X_test — 20% of the data for testing (features)
y_train — 80% of the target values for training
y_test — 20% of the target values for testing

Parameter:

test_size=0.2 means:
- 20% of the data will go into the test set
- 80% will go into the training set
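Because train_test_split shuffles the rows randomly, the call above yields a different split on each run, so metrics and printed rows will vary. A minimal sketch of making the split repeatable with the random_state parameter follows; the arrays X_demo and y_demo and the seed 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small synthetic arrays: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Using the same random_state twice produces the same shuffle.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)          # 8 training rows, 2 test rows
print(np.array_equal(split_a[1], split_b[1]))   # identical test rows
```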

Step 3: Model Training

from sklearn.linear_model import LinearRegression

This line imports the LinearRegression class from the sklearn.linear_model module.
LinearRegression is a machine learning model that assumes a linear relationship between input features (X) and the output (y).

regressor = LinearRegression()

This line creates an instance of the LinearRegression class named regressor.
At this stage, the model is initialized but not trained yet.

regressor.fit(X_train, y_train)

This line trains (fits) the linear regression model using training data.
X_train is the matrix of input features for training.
y_train is the corresponding target/output values.
The .fit() method estimates the intercept and coefficients of the best-fit line (or hyperplane) by ordinary least squares, that is, by minimizing the sum of squared differences between the predicted and actual y values on the training data.
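To make the least-squares idea concrete, the sketch below solves the same kind of problem directly with NumPy on synthetic, noise-free data where the true weights are known. It illustrates the optimization problem only and is not a description of scikit-learn's internal code; the names X_demo, y_demo, and true_w are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = 4.0 + X_demo @ true_w   # noise-free, so the exact answer is known

# Prepend a column of ones so the first solved value is the intercept.
A = np.column_stack([np.ones(len(X_demo)), X_demo])
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
intercept, weights = solution[0], solution[1:]

print(intercept)  # recovers the intercept used to build y_demo
print(weights)    # recovers true_w
```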

Step 4: Model Testing and Prediction

y_pred = regressor.predict(X_test)

This line uses the trained linear regression model (regressor) to make predictions on the test data (X_test).
y_pred will contain the predicted output values corresponding to the input features in X_test.

# Compare actual vs predicted
df = pd.DataFrame({'Real Values': y_test, 'Predicted Values': y_pred})

This line creates a DataFrame using pandas (assumes import pandas as pd has been done).
The DataFrame df shows a side-by-side comparison of:
- 'Real Values': The actual output values (y_test) from the test set.
- 'Predicted Values': The predicted values (y_pred) from the model.

print(df)

This line simply prints the DataFrame to the console, allowing you to visually compare the actual and predicted results.

    Real Values  Predicted Values
27     103282.4     111763.191880
18     124266.9     128057.177230
11     144259.4     136245.503634
16     126992.9     115683.954593
8      152211.8     149899.284176
26     105733.5     109892.305745

Step 5: Model Evaluation

1. Imports:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from math import sqrt

mean_squared_error: Calculates the average of squared differences between actual (y_test) and predicted (y_pred) values.
mean_absolute_error: Calculates the average of absolute differences.
r2_score: Calculates the R-squared (coefficient of determination), which indicates how well the model explains the variance in the data.
sqrt: From Python’s math module, used to compute the square root (for RMSE).

2. Metric Calculations:

mse = mean_squared_error(y_test, y_pred)

MSE (Mean Squared Error): Measures the average of the squared errors.
A lower MSE indicates better model performance.

rmse = sqrt(mse)

RMSE (Root Mean Squared Error): The square root of MSE, which brings the error back to the original unit of the target variable.
Easier to interpret than MSE.

mae = mean_absolute_error(y_test, y_pred)

MAE (Mean Absolute Error): Measures the average magnitude of the errors (without squaring).
It is less sensitive to outliers than MSE because large errors are not squared.

r2 = r2_score(y_test, y_pred)

R² Score: Indicates the proportion of variance in the target variable that is predictable from the features.
Reference values:
- 1.0 (perfect prediction)
- 0 (no better than always predicting the mean of y)
- < 0 (worse than that mean-only baseline)

3. Output

MSE: 50174701.37246796
RMSE: 7083.410292540449
MAE: 6344.205408672026
R²: 0.8457097502712363
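The reported metrics can be cross-checked by hand from the six rows printed in Step 4. The sketch below recomputes them in plain Python from the definitions; the numbers are copied from that comparison table, and the results should agree with the output above up to display rounding.

```python
from math import sqrt

# Rows copied from the comparison table printed in Step 4.
actual = [103282.4, 124266.9, 144259.4, 126992.9, 152211.8, 105733.5]
predicted = [111763.191880, 128057.177230, 136245.503634,
             115683.954593, 149899.284176, 109892.305745]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

mse = sum(e ** 2 for e in errors) / n    # mean of squared errors
rmse = sqrt(mse)                         # back in the units of Profit
mae = sum(abs(e) for e in errors) / n    # mean of absolute errors

# R² compares the model's squared error with a mean-only baseline.
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("RMSE:", rmse)
print("MAE:", mae)
print("R²:", r2)
```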

Step 6: Making Predictions for New Data

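# Use the training column names so predict() does not warn about missing feature names
new_data = pd.DataFrame([[166343.2, 136787.8, 461724.1]],
                        columns=['R&D Spend', 'Administration', 'Marketing Spend'])
profit = regressor.predict(new_data)
print(profit)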

Output

[191057.03223215]

Step 7: Interpreting Model Coefficients

print("Coefficients:", regressor.coef_)
print("Intercept:", regressor.intercept_)

Output

Coefficients: [ 0.82965119 -0.06684259  0.01573258]
Intercept: 54929.336292630556
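Each coefficient estimates how much the predicted Profit changes for a one-unit increase in that feature while the other features are held fixed. Here, one more unit of R&D Spend adds about 0.83 to the predicted Profit. As a quick check, the sketch below rebuilds the Step 6 prediction by hand from the printed intercept and coefficients; the values are copied from the outputs above.

```python
# Values copied from the printed output above (coefficients as displayed).
coefficients = [0.82965119, -0.06684259, 0.01573258]
intercept = 54929.336292630556

# Same inputs as new_data in Step 6: R&D Spend, Administration, Marketing Spend.
features = [166343.2, 136787.8, 461724.1]

# y_hat = intercept + sum of coefficient * feature value
manual_profit = intercept + sum(w * x for w, x in zip(coefficients, features))
print(manual_profit)  # close to the 191057.03 returned by regressor.predict
```

The small remaining difference comes from the coefficients being printed with only eight decimal places.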

Module 4: Applications

| Field | Application Example |
| --- | --- |
| Finance | Stock price prediction |
| Marketing | Campaign effectiveness |
| Real Estate | House price prediction |
| Healthcare | Treatment outcome prediction |
| Economics | GDP or inflation forecasting |
| Social Sciences | Election result modeling |

Module 5: Common Challenges

| Challenge | Description |
| --- | --- |
| Multicollinearity | Highly correlated predictors affect model stability |
| Overfitting | Model too closely fits training data |
| Underfitting | Model too simple to capture patterns |
| Outliers | Affect predictions drastically |
| Missing Data | Leads to biased results |
| Non-Linearity | Linear model may not fit non-linear relationships |

Module 6: Simple vs Multiple Linear Regression

| Feature | Simple Linear Regression | Multiple Linear Regression |
| --- | --- | --- |
| Independent Variables | One | Two or more |
| Complexity | Low | Higher |
| Use Case Example | Sales vs Ad Spend | Sales vs Ad, Price, Competition |

Keywords

multiple linear regression, machine learning, regression model, independent variables, dependent variable, R&D spend, administration, marketing spend, profit prediction, data preprocessing, train test split, sklearn, linear regression, model evaluation, mean squared error, root mean squared error, R-squared score, model coefficients, multicollinearity, residual error


