Module 1: Introduction to Multiple Linear Regression
What is Multiple Linear Regression?
A supervised learning algorithm.
Predicts a dependent variable (target) using multiple independent variables (predictors).
Extension of simple linear regression.
Mathematical Representation:
$$\hat{y} = \omega_0 + \omega_1 x_1 + \omega_2 x_2 + \dots + \omega_p x_p$$
Where:
$\hat{y}$: predicted value
$\omega_0$: intercept
$\omega_1, \dots, \omega_p$: coefficients
$x_1, \dots, x_p$: independent variables
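To make the formula concrete, here is a minimal sketch that evaluates it for two predictors. The intercept, coefficients, and input values are hypothetical numbers picked only for illustration.

```python
import numpy as np

# Hypothetical values chosen only to illustrate the formula.
intercept = 1.0                      # omega_0
coefficients = np.array([2.0, 3.0])  # omega_1, omega_2
x = np.array([4.0, 5.0])             # x_1, x_2

# y_hat = omega_0 + omega_1 * x_1 + omega_2 * x_2
y_hat = intercept + coefficients @ x
print(y_hat)  # 24.0
```

The dot product `coefficients @ x` is the weighted sum of the predictors, so the same line works unchanged for any number of predictors.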
Module 2: Assumptions of Multiple Linear Regression
Linearity: The relationship between the target and each predictor is linear.
Independence: Observations are independent.
Homoscedasticity: Equal variance of residuals.
Normality: Residuals follow a normal distribution.
No Multicollinearity: Predictors are not highly correlated.
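No Autocorrelation: Residuals are not correlated with one another, for example across consecutive observations in time-ordered data.
Fixed Independent Variables: Predictor values are treated as fixed (non-random) and stay the same across repeated samples.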
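Some of these assumptions can be checked numerically. As one example, the sketch below estimates the variance inflation factor (VIF) of each predictor to look for multicollinearity. The data, the column names x1 to x3, and the helper `variance_inflation_factors` are all made up for this illustration and are not part of the tutorial's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up predictors: x2 is deliberately almost a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
predictors = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=200),
    "x3": rng.normal(size=200),
})


def variance_inflation_factors(frame):
    """Return VIF = 1 / (1 - R^2) for each column regressed on the others."""
    vifs = {}
    for column in frame.columns:
        others = frame.drop(columns=column)
        target = frame[column]
        r2 = LinearRegression().fit(others, target).score(others, target)
        vifs[column] = 1.0 / (1.0 - r2)
    return pd.Series(vifs)


vif = variance_inflation_factors(predictors)
print(vif.round(1))
```

A VIF close to 1 means a predictor is nearly uncorrelated with the others. Values above roughly 5 to 10 are a common rule-of-thumb warning sign, which is what x1 and x2 should show here.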
Module 3: Python Implementation (Step-by-Step)
Step 1: Data Preparation
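The examples use a dataset stored in db_mlr.csv, with feature columns such as R&D Spend, Administration, and Marketing Spend, and a target column, Profit.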
1. Importing Libraries
numpy (np): Used for numerical operations (e.g., arrays, linear algebra).
pandas (pd): Used for data manipulation and analysis using DataFrames.
matplotlib.pyplot (plt): Used for plotting graphs and visualizations.
2. Load Dataset
This line loads a CSV file named db_mlr.csv into a DataFrame called dataset.
dataset now holds tabular data—rows and columns like an Excel sheet.
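The loading step can be sketched as follows. Because db_mlr.csv itself is not available here, the sketch reads a few made-up rows from an in-memory string. The column names are an assumption based on the features named in this tutorial, and the real file may contain other columns.

```python
import io

import pandas as pd

# The tutorial also imports numpy (np) and matplotlib.pyplot (plt);
# they are left out here because this sketch does not use them.

# Made-up rows standing in for db_mlr.csv; the numbers are illustrative only.
csv_text = """R&D Spend,Administration,Marketing Spend,Profit
120000.0,95000.0,300000.0,150000.0
80000.0,110000.0,200000.0,115000.0
40000.0,130000.0,90000.0,80000.0
"""

# With the real file on disk this would be: dataset = pd.read_csv("db_mlr.csv")
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)   # (rows, columns)
print(dataset.head())  # first few rows
```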
3. Feature Matrix (X) and Target Vector (y)
X is a DataFrame containing the input features, e.g., R&D Spend, Administration, and Marketing Spend.
y is a Series containing the target variable (what you want to predict), which is 'Profit'.
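A minimal sketch of selecting X and y is shown below. The dataset is a synthetic stand-in generated with made-up numbers, and the column names are assumed from the features mentioned in this tutorial.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for db_mlr.csv; the values are illustrative only.
rng = np.random.default_rng(0)
n = 50
dataset = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 150_000, n),
    "Administration": rng.uniform(50_000, 180_000, n),
    "Marketing Spend": rng.uniform(0, 450_000, n),
})
dataset["Profit"] = (
    50_000
    + 0.8 * dataset["R&D Spend"]
    + 0.05 * dataset["Marketing Spend"]
    + rng.normal(0, 5_000, n)
)

feature_columns = ["R&D Spend", "Administration", "Marketing Spend"]
X = dataset[feature_columns]  # DataFrame of input features
y = dataset["Profit"]         # Series holding the target

print(X.shape, y.shape)  # (50, 3) (50,)
```

Selecting columns by name rather than by position keeps the code valid if the column order in the file changes.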
Step 2: Splitting the Dataset
1. Importing the Function
This imports the train_test_split function from Scikit-learn, a popular machine learning library in Python.
train_test_split() is used to split your dataset into two parts:
Training set (to train the model)
Testing set (to evaluate how well the model performs)
2. Splitting the Data
Here's what it does:
X : Your features (independent variables) — e.g., R&D Spend, Administration, Marketing Spend
y : Your target/output — e.g., Profit
It returns four variables:
X_train — 80% of the data for training (features)
X_test — 20% of the data for testing (features)
y_train — 80% of the target values for training
y_test — 20% of the target values for testing
Parameter:
test_size=0.2 means:
20% of the data will go into the test set
80% will go into the training set
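The split described above might look like the sketch below. The data is a synthetic stand-in with made-up values, and `random_state=0` is an added assumption that only makes the shuffle reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for db_mlr.csv; the values are illustrative only.
rng = np.random.default_rng(0)
n = 50
dataset = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 150_000, n),
    "Administration": rng.uniform(50_000, 180_000, n),
    "Marketing Spend": rng.uniform(0, 450_000, n),
})
dataset["Profit"] = (
    50_000
    + 0.8 * dataset["R&D Spend"]
    + 0.05 * dataset["Marketing Spend"]
    + rng.normal(0, 5_000, n)
)
X = dataset[["R&D Spend", "Administration", "Marketing Spend"]]
y = dataset["Profit"]

# Hold out 20% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (40, 3) (10, 3)
```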
Step 3: Model Training
This line imports the LinearRegression class from the sklearn.linear_model module.
LinearRegression is a machine learning model that assumes a linear relationship between input features (X) and the output (y).
This line creates an instance of the LinearRegression class named regressor.
At this stage, the model is initialized but not trained yet.
This line trains (fits) the linear regression model using training data.
X_train is the matrix of input features for training.
y_train is the corresponding target/output values.
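The .fit() method calculates the best-fit line (or, with several features, hyperplane) by ordinary least squares: it chooses the intercept and coefficients that minimize the sum of squared differences between the predicted and actual y values, which is equivalent to minimizing the mean squared error on the training data.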
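The three lines described in this step (import, instantiate, fit) can be sketched as follows. The training data is a synthetic stand-in with made-up values, and `random_state=0` is an assumption added for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up stand-in for db_mlr.csv; the values are illustrative only.
rng = np.random.default_rng(0)
n = 50
dataset = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 150_000, n),
    "Administration": rng.uniform(50_000, 180_000, n),
    "Marketing Spend": rng.uniform(0, 450_000, n),
})
dataset["Profit"] = (
    50_000
    + 0.8 * dataset["R&D Spend"]
    + 0.05 * dataset["Marketing Spend"]
    + rng.normal(0, 5_000, n)
)
X = dataset[["R&D Spend", "Administration", "Marketing Spend"]]
y = dataset["Profit"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = LinearRegression()    # initialized, not yet trained
regressor.fit(X_train, y_train)   # estimates intercept and coefficients

print(regressor.intercept_)
print(regressor.coef_)
```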
Step 4: Model Testing and Prediction
This line uses the trained linear regression model (regressor) to make predictions on the test data (X_test).
y_pred will contain the predicted output values corresponding to the input features in X_test.
This line creates a DataFrame using pandas (assumes import pandas as pd has been done).
The DataFrame df shows a side-by-side comparison of:
'Real Values': The actual output values (y_test) from the test set.
'Predicted Values': The predicted values (y_pred) from the model.
This line simply prints the DataFrame to the console, allowing you to visually compare the actual and predicted results.
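Prediction and the side-by-side comparison can be sketched like this. The data is a synthetic stand-in with made-up values, so the printed numbers are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up stand-in for db_mlr.csv; the values are illustrative only.
rng = np.random.default_rng(0)
n = 50
dataset = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 150_000, n),
    "Administration": rng.uniform(50_000, 180_000, n),
    "Marketing Spend": rng.uniform(0, 450_000, n),
})
dataset["Profit"] = (
    50_000
    + 0.8 * dataset["R&D Spend"]
    + 0.05 * dataset["Marketing Spend"]
    + rng.normal(0, 5_000, n)
)
X = dataset[["R&D Spend", "Administration", "Marketing Spend"]]
y = dataset["Profit"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
regressor = LinearRegression().fit(X_train, y_train)

# Predict the target for the held-out rows.
y_pred = regressor.predict(X_test)

# Actual and predicted values side by side, one row per test observation.
df = pd.DataFrame({"Real Values": y_test, "Predicted Values": y_pred})
print(df)
```

The DataFrame keeps the original row labels from y_test, which makes it easy to trace a poorly predicted row back to the source data.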
Step 5: Model Evaluation
1. Importing Metric Functions
mean_squared_error: Calculates the average of squared differences between actual (y_test) and predicted (y_pred) values.
mean_absolute_error: Calculates the average of absolute differences.
r2_score: Calculates the R-squared (coefficient of determination), which indicates how well the model explains the variance in the data.
sqrt: From Python’s math module, used to compute the square root (for RMSE).
2. Metric Calculations:
MSE (Mean Squared Error): Measures the average of the squared errors.
A lower MSE indicates better model performance.
RMSE (Root Mean Squared Error): The square root of MSE, which brings the error back to the original unit of the target variable.
Easier to interpret than MSE.
MAE (Mean Absolute Error): Measures the average magnitude of the errors (without squaring).
It is less sensitive to outliers than MSE because the errors are not squared.
R² Score: Indicates the proportion of variance in the target variable that is predictable from the features.
Ranges from:
1 (perfect fit)
0 (no better than a baseline that always predicts the mean)
< 0 (worse than baseline model)
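The metrics above can be computed as in the sketch below. The data is a synthetic stand-in with made-up values, so the scores say nothing about the real dataset.

```python
from math import sqrt

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Made-up stand-in for db_mlr.csv; the values are illustrative only.
rng = np.random.default_rng(0)
n = 50
dataset = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 150_000, n),
    "Administration": rng.uniform(50_000, 180_000, n),
    "Marketing Spend": rng.uniform(0, 450_000, n),
})
dataset["Profit"] = (
    50_000
    + 0.8 * dataset["R&D Spend"]
    + 0.05 * dataset["Marketing Spend"]
    + rng.normal(0, 5_000, n)
)
X = dataset[["R&D Spend", "Administration", "Marketing Spend"]]
y = dataset["Profit"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)   # average squared error
rmse = sqrt(mse)                           # same unit as the target
mae = mean_absolute_error(y_test, y_pred)  # average absolute error
r2 = r2_score(y_test, y_pred)              # share of variance explained

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
print(f"R2:   {r2:.4f}")
```

MAE is never larger than RMSE on the same predictions, so a large gap between the two suggests a few large errors dominate.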
Step 6: Making Predictions for New Data
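A minimal sketch of predicting for a new observation is shown below. The training data is a synthetic stand-in, and the values in `new_data` are hypothetical. The new rows need the same feature columns, in the same order, as the data used for training.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up stand-in for db_mlr.csv; the values are illustrative only.
rng = np.random.default_rng(0)
n = 50
dataset = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 150_000, n),
    "Administration": rng.uniform(50_000, 180_000, n),
    "Marketing Spend": rng.uniform(0, 450_000, n),
})
dataset["Profit"] = (
    50_000
    + 0.8 * dataset["R&D Spend"]
    + 0.05 * dataset["Marketing Spend"]
    + rng.normal(0, 5_000, n)
)
X = dataset[["R&D Spend", "Administration", "Marketing Spend"]]
y = dataset["Profit"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
regressor = LinearRegression().fit(X_train, y_train)

# Hypothetical new observation with the same columns as the training data.
new_data = pd.DataFrame({
    "R&D Spend": [100_000.0],
    "Administration": [120_000.0],
    "Marketing Spend": [250_000.0],
})

predicted_profit = regressor.predict(new_data)
print(predicted_profit[0])
```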
Step 7: Interpreting Model Coefficients
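After fitting, the intercept and coefficients are available as `intercept_` and `coef_` on the model. The sketch below pairs each coefficient with its feature name. The data is a synthetic stand-in, generated so that Profit rises by about 0.8 per unit of R&D Spend, and the fitted values should roughly recover that.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up stand-in for db_mlr.csv; the values are illustrative only.
rng = np.random.default_rng(0)
n = 50
dataset = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 150_000, n),
    "Administration": rng.uniform(50_000, 180_000, n),
    "Marketing Spend": rng.uniform(0, 450_000, n),
})
dataset["Profit"] = (
    50_000
    + 0.8 * dataset["R&D Spend"]
    + 0.05 * dataset["Marketing Spend"]
    + rng.normal(0, 5_000, n)
)
X = dataset[["R&D Spend", "Administration", "Marketing Spend"]]
y = dataset["Profit"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
regressor = LinearRegression().fit(X_train, y_train)

# Label each coefficient with the feature it belongs to.
coefficients = pd.Series(regressor.coef_, index=X.columns)

print("Intercept:", regressor.intercept_)
print(coefficients)
```

Each coefficient is the expected change in the target for a one-unit increase in that feature while the other features are held constant. The intercept is the predicted value when all features are zero.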
Module 4: Applications
Treatment outcome prediction
GDP or inflation forecasting
Module 5: Common Challenges
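Multicollinearity: Highly correlated predictors affect model stability.
Overfitting: The model fits the training data too closely.
Underfitting: The model is too simple to capture patterns.
Outliers: Extreme values can affect predictions drastically.
Non-linearity: A linear model may not fit non-linear relationships.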
Module 6: Simple vs Multiple Linear Regression
Feature
Simple Linear Regression
Multiple Linear Regression
Example (Multiple Linear Regression): Sales vs Ad, Price, Competition