Python Encyclopedia for Academics

Multiple Linear Regression

Nerd Cafe

Module 1: Introduction to Multiple Linear Regression

What is Multiple Linear Regression?

  • A supervised learning algorithm.

  • Predicts a dependent variable (target) using multiple independent variables (predictors).

  • Extension of simple linear regression.

Mathematical Representation:

$$\hat{y} = \omega_{0} + \omega_{1}x_{1} + \omega_{2}x_{2} + \dots + \omega_{p}x_{p}$$

Where:

  • $\hat{y}$: predicted value

  • $\omega_{0}$: intercept

  • $\omega_{1}, \dots, \omega_{p}$: coefficients

  • $x_{1}, \dots, x_{p}$: independent variables
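As a minimal sketch of how this formula is evaluated, the snippet below computes the prediction for a single observation with NumPy. The intercept, coefficients, and feature values are made-up numbers chosen for illustration; they are not taken from the dataset used later on this page.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula
w0 = 1.0                       # intercept (omega_0)
w = np.array([2.0, 3.0])       # coefficients (omega_1, omega_2)
x = np.array([4.0, 5.0])       # one observation (x_1, x_2)

# y_hat = omega_0 + omega_1 * x_1 + omega_2 * x_2
y_hat = w0 + np.dot(w, x)
print(y_hat)  # 1 + 2*4 + 3*5 = 24.0
```

The dot product handles any number of predictors, so the same line works unchanged for p features.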

Module 2: Assumptions of Multiple Linear Regression

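  • Linearity: The target is a linear function of the predictors (plus noise).

  • Independence: Observations are independent of one another.

  • Homoscedasticity: Residuals have constant variance across all levels of the predictors.

  • Normality: Residuals follow a normal distribution.

  • No Multicollinearity: Predictors are not highly correlated with each other.

  • No Autocorrelation: Residuals are not correlated with one another (most relevant for time-ordered data).

  • Fixed Independent Variables: Predictor values are treated as fixed (non-random) across repeated samples.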
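One of these assumptions, no multicollinearity, can be screened with a pairwise correlation matrix. The sketch below uses synthetic predictors (the names x1, x2, x3 are hypothetical) where x2 is deliberately built as a near-multiple of x1. A correlation matrix only catches pairwise relationships, so treat it as a first check rather than a complete diagnostic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic predictors: x2 is almost a multiple of x1, x3 is unrelated
x1 = rng.normal(size=200)
x2 = 2 * x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
features = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Pairwise Pearson correlations; values near +1 or -1 flag multicollinearity
corr = features.corr()
print(corr.round(2))
```

In the printed matrix, the x1/x2 entry is close to 1, which is the pattern to look for in real feature sets.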

Module 3: Python Implementation (Step-by-Step)

Step 1: Data Preparation

This walkthrough uses a dataset stored in a CSV file, loaded from Dataset/datamlr.csv in the steps below.

1. Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
  • numpy (np): Used for numerical operations (e.g., arrays, linear algebra).

  • pandas (pd): Used for data manipulation and analysis using DataFrames.

  • matplotlib.pyplot (plt): Used for plotting graphs and visualizations.

2. Load Dataset

# Load data
dataset = pd.read_csv('Dataset/datamlr.csv')
  • This line loads the CSV file Dataset/datamlr.csv into a DataFrame called dataset.

  • dataset now holds tabular data—rows and columns like an Excel sheet.
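After loading, it helps to inspect the DataFrame before modeling. The sketch below reads a small in-memory CSV so it runs without the real file. The column names match the ones used in the next step, but the three rows are made-up placeholders, and the assumption that the file contains exactly these four columns is not confirmed by the source.

```python
import io
import pandas as pd

# Made-up rows in the assumed layout of the CSV (not the real data)
csv_text = """R&D Spend,Administration,Marketing Spend,Profit
1000.0,2000.0,3000.0,4000.0
1500.0,2500.0,3500.0,4500.0
2000.0,3000.0,4000.0,5000.0
"""

# pd.read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())    # first rows
print(sample.shape)     # (rows, columns)
print(sample.dtypes)    # column types
```

The same three calls applied to dataset give a quick check that the file was parsed as expected.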

3. Feature Matrix (X) and Target Vector (y)

X = dataset[['R&D Spend', 'Administration', 'Marketing Spend']]
y = dataset['Profit']
  • X is a DataFrame containing the input features:

    • 'R&D Spend'

    • 'Administration'

    • 'Marketing Spend'

  • y is a Series containing the target variable (what you want to predict), which is 'Profit'.

Step 2: Splitting the Dataset

1. Importing the Function

from sklearn.model_selection import train_test_split
  • This imports the train_test_split function from Scikit-learn, a popular machine learning library in Python.

  • train_test_split() is used to split your dataset into two parts:

    • Training set (to train the model)

    • Testing set (to evaluate how well the model performs)

2. Splitting the Data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Here's what it does:

  • X : Your features (independent variables) — e.g., R&D Spend, Administration, Marketing Spend

  • y : Your target/output — e.g., Profit

It returns four variables:

  1. X_train — 80% of the data for training (features)

  2. X_test — 20% of the data for testing (features)

  3. y_train — 80% of the target values for training

  4. y_test — 20% of the target values for testing

Parameter:

  • test_size=0.2 means:

    • 20% of the data will go into the test set

    • 80% will go into the training set
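Because train_test_split shuffles the rows randomly, the call above produces a different split on every run, so the numbers reported later on this page will not reproduce exactly. A common remedy is to pass random_state. The sketch below shows this on synthetic stand-in data; the names X_demo and y_demo and the seed 42 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 50 rows, 3 features
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(rng.normal(size=50), name="target")

# Fixing random_state makes the shuffle repeatable
split_1 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_1
print(X_train_demo.shape, X_test_demo.shape)        # (40, 3) (10, 3)
print(X_test_demo.index.equals(split_2[1].index))   # True: same rows both times
```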

Step 3: Model Training

from sklearn.linear_model import LinearRegression
  • This line imports the LinearRegression class from the sklearn.linear_model module.

  • LinearRegression is a machine learning model that assumes a linear relationship between input features (X) and the output (y).

regressor = LinearRegression()
  • This line creates an instance of the LinearRegression class named regressor.

  • At this stage, the model is initialized but not trained yet.

regressor.fit(X_train, y_train)
  • This line trains (fits) the linear regression model using training data.

  • X_train is the matrix of input features for training.

  • y_train is the corresponding target/output values.

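  • The .fit() method finds the best-fit line (or hyperplane, when there are several features) by ordinary least squares: it chooses the intercept and coefficients that minimize the sum of squared differences between the predicted and actual y values.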
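To make the fitting step less of a black box, the sketch below solves the same least-squares problem directly with NumPy and compares the result with LinearRegression. The data is synthetic with known weights (an assumption for illustration), so the recovered values can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known weights: y = 5 + 1*x1 + 2*x2 + 3*x3 + noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 3))
y_syn = 5.0 + X_syn @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# scikit-learn fit
model = LinearRegression().fit(X_syn, y_syn)

# Same least-squares problem solved directly: prepend a column of ones
# so the first solved weight plays the role of the intercept
A = np.column_stack([np.ones(len(X_syn)), X_syn])
weights, *_ = np.linalg.lstsq(A, y_syn, rcond=None)

print(model.intercept_, model.coef_)
print(weights)  # [intercept, w1, w2, w3]
```

Both approaches should agree to numerical precision, and both should land close to the true weights 5, 1, 2, 3.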

Step 4: Model Testing and Prediction

y_pred = regressor.predict(X_test)
  • This line uses the trained linear regression model (regressor) to make predictions on the test data (X_test).

  • y_pred will contain the predicted output values corresponding to the input features in X_test.

# Compare actual vs predicted
df = pd.DataFrame({'Real Values': y_test, 'Predicted Values': y_pred})
  • This line creates a DataFrame using pandas (assumes import pandas as pd has been done).

  • The DataFrame df shows a side-by-side comparison of:

    • 'Real Values': The actual output values (y_test) from the test set.

    • 'Predicted Values': The predicted values (y_pred) from the model.

print(df)

This line simply prints the DataFrame to the console, allowing you to visually compare the actual and predicted results.

    Real Values  Predicted Values
27     103282.4     111763.191880
18     124266.9     128057.177230
11     144259.4     136245.503634
16     126992.9     115683.954593
8      152211.8     149899.284176
26     105733.5     109892.305745

Step 5: Model Evaluation

1. Imports:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from math import sqrt
  • mean_squared_error: Calculates the average of squared differences between actual (y_test) and predicted (y_pred) values.

  • mean_absolute_error: Calculates the average of absolute differences.

  • r2_score: Calculates the R-squared (coefficient of determination), which indicates how well the model explains the variance in the data.

  • sqrt: From Python’s math module, used to compute the square root (for RMSE).

2. Metric Calculations:

mse = mean_squared_error(y_test, y_pred)
  • MSE (Mean Squared Error): Measures the average of the squared errors.

  • A lower MSE indicates better model performance.

rmse = sqrt(mse)
  • RMSE (Root Mean Squared Error): The square root of MSE, which brings the error back to the original unit of the target variable.

  • Easier to interpret than MSE.

mae = mean_absolute_error(y_test, y_pred)
  • MAE (Mean Absolute Error): Measures the average magnitude of the errors (without squaring).

  • It’s more robust to outliers than MSE.

r2 = r2_score(y_test, y_pred)
  • R² Score: Indicates the proportion of variance in the target variable that is predictable from the features.

  • Ranges from:

    • 1.0 (perfect prediction)

    • 0 (no better than a baseline that always predicts the mean of the actual values)

    • < 0 (worse than that mean-only baseline)

3. Output

MSE: 50174701.37246796
RMSE: 7083.410292540449
MAE: 6344.205408672026
R²: 0.8457097502712363

Step 6: Making Predictions for New Data

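# Wrap the new sample in a DataFrame with the training column names
# (R&D Spend, Administration, Marketing Spend), so scikit-learn does not
# warn about missing feature names
new_data = pd.DataFrame([[166343.2, 136787.8, 461724.1]], columns=X.columns)
profit = regressor.predict(new_data)
print(profit)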

Output

[191057.03223215]

Step 7: Interpreting Model Coefficients

print("Coefficients:", regressor.coef_)
print("Intercept:", regressor.intercept_)

Output

Coefficients: [ 0.82965119 -0.06684259  0.01573258]
Intercept: 54929.336292630556
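Each coefficient is the expected change in Profit for a one-unit increase in that feature while the other features are held fixed. Here, one extra unit of R&D Spend is associated with about 0.83 more Profit, and the intercept is the predicted Profit when all three features are zero. As a rough check, the sketch below rebuilds the Step 6 prediction by hand from the printed values. The coefficients are copied as displayed (rounded to eight decimals), so the result should agree with the model output only approximately.

```python
import numpy as np

# Values copied from the outputs above (coefficients as printed, i.e. rounded)
intercept = 54929.336292630556
coef = np.array([0.82965119, -0.06684259, 0.01573258])
new_sample = np.array([166343.2, 136787.8, 461724.1])  # R&D, Administration, Marketing

# y_hat = intercept + sum(coef_i * x_i)
manual_profit = intercept + np.dot(coef, new_sample)
print(manual_profit)

# Contribution of each feature to the prediction
names = ["R&D Spend", "Administration", "Marketing Spend"]
for name, c, v in zip(names, coef, new_sample):
    print(f"{name}: {c} * {v} = {c * v:.2f}")
```

The per-feature lines show that R&D Spend dominates this prediction, while Administration pulls it down slightly.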

Module 4: Applications

| Field | Application Example |
| --- | --- |
| Finance | Stock price prediction |
| Marketing | Campaign effectiveness |
| Real Estate | House price prediction |
| Healthcare | Treatment outcome prediction |
| Economics | GDP or inflation forecasting |
| Social Science | Election result modeling |

Module 5: Common Challenges

| Challenge | Description |
| --- | --- |
| Multicollinearity | Highly correlated predictors affect model stability |
| Overfitting | Model too closely fits training data |
| Underfitting | Model too simple to capture patterns |
| Outliers | Affect predictions drastically |
| Missing Data | Leads to biased results |
| Non-Linearity | Linear model may not fit non-linear relationships |

Module 6: Simple vs Multiple Linear Regression

| Feature | Simple Linear Regression | Multiple Linear Regression |
| --- | --- | --- |
| Independent Variables | One | Two or more |
| Complexity | Low | Higher |
| Use Case Example | Sales vs Ad Spend | Sales vs Ad, Price, Competition |

Keywords

multiple linear regression, machine learning, regression model, independent variables, dependent variable, R&D spend, administration, marketing spend, profit prediction, data preprocessing, train test split, sklearn, linear regression, model evaluation, mean squared error, root mean squared error, R-squared score, model coefficients, multicollinearity, residual error


Last updated 24 days ago

Attached dataset file: datamlr.csv.csv (1KB)