Multiple Linear Regression
Nerd Cafe
Module 1: Introduction to Multiple Linear Regression
What is Multiple Linear Regression?
A supervised learning algorithm.
Predicts a dependent variable (target) using multiple independent variables (predictors).
Extension of simple linear regression.
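Mathematical Representation:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
Where:
$\hat{y}$: predicted value
$\beta_0$: intercept
$\beta_1, \dots, \beta_n$: coefficients
$x_1, \dots, x_n$: independent variables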
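To make the representation concrete, here is a minimal sketch that evaluates the prediction for a single observation. The intercept, coefficients and feature values are hypothetical numbers chosen only for illustration; they do not come from the dataset used later.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula.
beta_0 = 2.0                        # intercept
beta = np.array([0.5, -1.0, 3.0])   # coefficients beta_1..beta_3
x = np.array([4.0, 2.0, 1.0])       # one observation x_1..x_3

# y_hat = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3
y_hat = beta_0 + np.dot(beta, x)
print(y_hat)  # 2.0 + (2.0 - 2.0 + 3.0) = 5.0
```

The dot product is just the sum of each coefficient times its feature, so adding more predictors only lengthens the two arrays.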
Module 2: Assumptions of Multiple Linear Regression
Linearity: The relationship between the target and each predictor is linear.
Independence: Observations are independent.
Homoscedasticity: Equal variance of residuals.
Normality: Residuals follow a normal distribution.
No Multicollinearity: Predictors are not highly correlated.
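No Autocorrelation: Residuals are not correlated with one another (for example, across consecutive time points).
Fixed Independent Variables: Predictor values are treated as fixed, non-random quantities across repeated samples.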
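Some of these assumptions can be screened numerically. As one example, a common check for multicollinearity (not part of the original walkthrough) is the variance inflation factor (VIF), which equals the diagonal of the inverse of the predictors' correlation matrix. The sketch below uses synthetic predictors, with `x2` deliberately built as a near copy of `x1`; the threshold of 10 mentioned in the comment is a widely used rule of thumb, not a hard limit.

```python
import numpy as np

# Synthetic predictors, generated only for illustration.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)                     # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])

# Pairwise correlations between the predictor columns.
corr = np.corrcoef(X, rowvar=False)

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
# of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["x1", "x2", "x3"], vif):
    # Rule of thumb: values above roughly 10 suggest strong multicollinearity.
    print(f"{name}: VIF = {value:.2f}")
```

Here `x1` and `x2` should show large VIF values while `x3` stays close to 1.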
Module 3: Python Implementation (Step-by-Step)
Step 1: Data Preparation
I am using the dataset db_mlr.csv, which contains the columns 'R&D Spend', 'Administration', 'Marketing Spend' and 'Profit'.
1. Importing Libraries
numpy (np): Used for numerical operations (e.g., arrays, linear algebra).
pandas (pd): Used for data manipulation and analysis using DataFrames.
matplotlib.pyplot (plt): Used for plotting graphs and visualizations.
2. Load Dataset
pd.read_csv loads the CSV file named db_mlr.csv into a DataFrame called dataset.
dataset now holds tabular data: rows and columns, like an Excel sheet.
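A minimal sketch of the import and load steps follows. Because db_mlr.csv is not available here, the block reads a few placeholder rows from an in-memory string instead; the numbers are invented for illustration and are not the real data. The plotting import is left out because nothing is plotted in this sketch.

```python
import io

import numpy as np
import pandas as pd

# Placeholder rows, invented for illustration; NOT the real db_mlr.csv.
csv_text = """R&D Spend,Administration,Marketing Spend,Profit
100.0,50.0,200.0,150.0
120.0,55.0,210.0,170.0
90.0,48.0,190.0,140.0
"""

# With the real file on disk this would be:
#     dataset = pd.read_csv('db_mlr.csv')
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.head())   # first rows of the table
print(dataset.shape)    # (number of rows, number of columns)
```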
3. Feature Matrix (X) and Target Vector (y)
X is a DataFrame containing the input features:
'R&D Spend'
'Administration'
'Marketing Spend'
y is a Series containing the target variable (what you want to predict), which is 'Profit'.
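Selecting the features and the target could look like the sketch below. The small DataFrame stands in for the loaded dataset and holds invented placeholder values; `feature_cols` is a helper name introduced here for readability.

```python
import pandas as pd

# Stand-in for the loaded dataset; values are placeholders, not real data.
dataset = pd.DataFrame({
    'R&D Spend': [100.0, 120.0, 90.0, 110.0],
    'Administration': [50.0, 55.0, 48.0, 52.0],
    'Marketing Spend': [200.0, 210.0, 190.0, 205.0],
    'Profit': [150.0, 170.0, 140.0, 160.0],
})

# Hypothetical helper list naming the input columns.
feature_cols = ['R&D Spend', 'Administration', 'Marketing Spend']

X = dataset[feature_cols]   # DataFrame: one column per feature
y = dataset['Profit']       # Series: the value to predict

print(X.shape, y.shape)
```

Indexing with a list of column names returns a DataFrame, while indexing with a single name returns a Series.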
Step 2: Splitting the Dataset
1. Importing the Function
The train_test_split function is imported from sklearn.model_selection, part of Scikit-learn, a popular machine learning library in Python.
train_test_split() is used to split your dataset into two parts:
Training set (to train the model)
Testing set (to evaluate how well the model performs)
2. Splitting the Data
The split takes two inputs:
X: Your features (independent variables), e.g., R&D Spend, Administration, Marketing Spend
y: Your target/output, e.g., Profit
It returns four variables:
X_train: 80% of the data for training (features)
X_test: 20% of the data for testing (features)
y_train: 80% of the target values for training
y_test: 20% of the target values for testing
Parameter:
test_size=0.2 means:
20% of the data will go into the test set
80% will go into the training set
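The split described above can be sketched as follows. The data is synthetic (50 randomly generated rows), and `random_state=0` is an assumption added here so the split is reproducible; the text above only specifies `test_size=0.2`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, generated only for illustration.
rng = np.random.default_rng(42)
n = 50
X = pd.DataFrame({
    'R&D Spend': rng.uniform(0, 100, n),
    'Administration': rng.uniform(0, 100, n),
    'Marketing Spend': rng.uniform(0, 100, n),
})
y = (10 + 0.8 * X['R&D Spend'] + 0.1 * X['Administration']
     + 0.3 * X['Marketing Spend'] + rng.normal(0, 1, n)).rename('Profit')

# random_state is an assumption: it fixes the shuffle so results repeat.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))  # 40 training rows, 10 test rows
```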
Step 3: Model Training
The LinearRegression class is imported from the sklearn.linear_model module.
LinearRegression is a machine learning model that assumes a linear relationship between input features (X) and the output (y).
An instance of the LinearRegression class is then created and named regressor.
At this stage, the model is initialized but not trained yet.
Calling regressor.fit(X_train, y_train) trains (fits) the linear regression model using training data.
X_train is the matrix of input features for training.
y_train is the corresponding target/output values.
The .fit() method calculates the best-fit line (or hyperplane) that minimizes the mean squared error (equivalently, the sum of squared residuals) between the predicted and actual y values.
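A minimal sketch of the training step is shown below. The data is synthetic and deliberately noise-free, generated from known coefficients (an intercept of 10 and slopes of 0.8, 0.1 and 0.3), so the fit should recover those values. These numbers are assumptions for illustration, as is `random_state=0`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, noise-free data built from known (hypothetical) coefficients.
rng = np.random.default_rng(42)
n = 50
X = pd.DataFrame({
    'R&D Spend': rng.uniform(0, 100, n),
    'Administration': rng.uniform(0, 100, n),
    'Marketing Spend': rng.uniform(0, 100, n),
})
y = (10 + 0.8 * X['R&D Spend'] + 0.1 * X['Administration']
     + 0.3 * X['Marketing Spend']).rename('Profit')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = LinearRegression()    # initialized, not yet trained
regressor.fit(X_train, y_train)   # ordinary least squares fit

print(regressor.intercept_)       # close to 10
print(regressor.coef_)            # close to [0.8, 0.1, 0.3]
```

With real, noisy data the recovered coefficients would only approximate the underlying relationship.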
Step 4: Model Testing and Prediction
The trained linear regression model (regressor) is used to make predictions on the test data (X_test) by calling regressor.predict(X_test).
y_pred will contain the predicted output values corresponding to the input features in X_test.
A DataFrame is then created using pandas (assumes import pandas as pd has been done).
The DataFrame df shows a side-by-side comparison of:
'Real Values': The actual output values (y_test) from the test set.
'Predicted Values': The predicted values (y_pred) from the model.
Printing the DataFrame to the console lets you visually compare the actual and predicted results.
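The prediction and comparison steps might look like the sketch below. It again uses synthetic data with a small amount of random noise, so the predicted values are close to, but not identical to, the real ones. The generating coefficients and `random_state=0` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with mild noise, for illustration only.
rng = np.random.default_rng(42)
n = 50
X = pd.DataFrame({
    'R&D Spend': rng.uniform(0, 100, n),
    'Administration': rng.uniform(0, 100, n),
    'Marketing Spend': rng.uniform(0, 100, n),
})
y = (10 + 0.8 * X['R&D Spend'] + 0.1 * X['Administration']
     + 0.3 * X['Marketing Spend'] + rng.normal(0, 1, n)).rename('Profit')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
regressor = LinearRegression().fit(X_train, y_train)

# Predict the target for the held-out test features.
y_pred = regressor.predict(X_test)

# Side-by-side comparison; .values drops the shuffled index of y_test.
df = pd.DataFrame({'Real Values': y_test.values, 'Predicted Values': y_pred})
print(df)
```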
Step 5: Model Evaluation
1. Imports:
mean_squared_error: Calculates the average of squared differences between actual (y_test) and predicted (y_pred) values.
mean_absolute_error: Calculates the average of absolute differences.
r2_score: Calculates the R-squared (coefficient of determination), which indicates how well the model explains the variance in the data.
sqrt: From Python’s math module, used to compute the square root (for RMSE).
2. Metric Calculations:
MSE (Mean Squared Error): Measures the average of the squared errors.
A lower MSE indicates better model performance.
RMSE (Root Mean Squared Error): The square root of MSE, which brings the error back to the original unit of the target variable.
Easier to interpret than MSE.
MAE (Mean Absolute Error): Measures the average magnitude of the errors (without squaring).
It’s more robust to outliers than MSE.
R² Score: Indicates the proportion of variance in the target variable that is predictable from the features.
Reference values:
1.0 (perfect prediction)
0 (no explanatory power; no better than always predicting the mean)
< 0 (worse than a baseline model that always predicts the mean)
3. Output
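The four metrics can be computed and printed as in the sketch below. To keep the arithmetic easy to verify by hand, it uses four made-up actual and predicted values rather than model output.

```python
from math import sqrt

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values, small enough to check by hand.
y_test = [3.0, 5.0, 7.0, 9.0]    # actual values
y_pred = [2.0, 5.0, 8.0, 11.0]   # predicted values

# Errors are 1, 0, -1, -2, so the squared errors are 1, 0, 1, 4.
mse = mean_squared_error(y_test, y_pred)    # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = sqrt(mse)                            # sqrt(1.5), about 1.2247
mae = mean_absolute_error(y_test, y_pred)   # (1 + 0 + 1 + 2) / 4 = 1.0
r2 = r2_score(y_test, y_pred)               # 1 - 6 / 20 = 0.7

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")
print(f"R2:   {r2:.4f}")
```

The R² calculation uses the mean of the actual values (6.0) as its baseline: the total sum of squares is 20 and the residual sum of squares is 6.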
Step 6: Making Predictions for New Data
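Once trained, the model can predict Profit for a new observation by passing its 'R&D Spend', 'Administration' and 'Marketing Spend' values to regressor.predict().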
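A minimal sketch of predicting for one new observation follows. The training data is synthetic and noise-free, built from hypothetical coefficients (intercept 10, slopes 0.8, 0.1 and 0.3), and the new input values are likewise invented. Passing the new row as a DataFrame with the same column names keeps it consistent with the features the model was fitted on.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

feature_cols = ['R&D Spend', 'Administration', 'Marketing Spend']

# Synthetic, noise-free training data from known (hypothetical) coefficients.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.uniform(0, 100, size=(50, 3)), columns=feature_cols)
y = (10 + 0.8 * X['R&D Spend'] + 0.1 * X['Administration']
     + 0.3 * X['Marketing Spend'])

regressor = LinearRegression().fit(X, y)

# One new, invented observation with the same columns as the training data.
new_data = pd.DataFrame([[50.0, 20.0, 30.0]], columns=feature_cols)
new_pred = regressor.predict(new_data)

# Expected from the generating formula: 10 + 40 + 2 + 9 = 61
print(new_pred[0])
```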
Step 7: Interpreting Model Coefficients
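The fitted model exposes the intercept as regressor.intercept_ and one coefficient per feature in regressor.coef_. Each coefficient is the expected change in Profit for a one-unit increase in that feature, with the other features held constant.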
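Reading the coefficients off a fitted model could be sketched as below. The data is synthetic and noise-free, generated from hypothetical coefficients, so the printed values should match the numbers used to build it. Pairing `coef_` with the column names in a Series is a convenience introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

feature_cols = ['R&D Spend', 'Administration', 'Marketing Spend']

# Synthetic, noise-free data from known (hypothetical) coefficients.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.uniform(0, 100, size=(50, 3)), columns=feature_cols)
y = (10 + 0.8 * X['R&D Spend'] + 0.1 * X['Administration']
     + 0.3 * X['Marketing Spend'])

regressor = LinearRegression().fit(X, y)

# coef_ follows the column order of X, so label each value by its feature.
coefs = pd.Series(regressor.coef_, index=feature_cols)

print("Intercept:", regressor.intercept_)
for name, value in coefs.items():
    # Change in the target per one-unit increase in this feature,
    # holding the other features constant.
    print(f"{name}: {value:.3f}")
```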
Module 4: Applications
Finance: Stock price prediction
Marketing: Campaign effectiveness
Real Estate: House price prediction
Healthcare: Treatment outcome prediction
Economics: GDP or inflation forecasting
Social Science: Election result modeling
Module 5: Common Challenges
Multicollinearity: Highly correlated predictors affect model stability
Overfitting: Model too closely fits training data
Underfitting: Model too simple to capture patterns
Outliers: Affect predictions drastically
Missing Data: Leads to biased results
Non-Linearity: Linear model may not fit non-linear relationships
Module 6: Simple vs Multiple Linear Regression
Independent Variables: One (simple) vs. two or more (multiple)
Complexity: Low (simple) vs. higher (multiple)
Use Case Example: Sales vs Ad Spend (simple); Sales vs Ad, Price, Competition (multiple)
Keywords
multiple linear regression, machine learning, regression model, independent variables, dependent variable, R&D spend, administration, marketing spend, profit prediction, data preprocessing, train test split, sklearn, linear regression, model evaluation, mean squared error, root mean squared error, R-squared score, model coefficients, multicollinearity, residual error