Python Encyclopedia for Academics

Simple Linear Regression Implementation

Nerd Cafe


Last updated 26 days ago

The primary purpose of implementing a Simple Linear Regression model on the salary dataset used on this page is to model and quantify the linear relationship between years of experience (the independent variable) and salary (the dependent variable), so that salary can be predicted from experience. This serves the following specific goals:

1. Understanding Relationships

  • You want to determine how salary changes with years of experience.

  • A linear regression line provides an interpretable model: its slope tells you how much salary is expected to increase for each additional year of experience.

2. Predict Future Outcomes

  • Once trained, the model can be used to predict salary for unseen or future inputs (e.g., "What will be the salary of a person with 7.5 years of experience?").

3. Evaluate Model Accuracy

  • By comparing predicted values with actual values on the test set, you can evaluate how well the model generalizes to new data.

  • Comparing y_test with y_pred lets you calculate regression metrics such as:

    • Mean Squared Error (MSE)

    • R² Score (coefficient of determination)

    • Mean Absolute Error (MAE)
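
The three metrics listed above can be computed directly from their definitions. The following is a minimal sketch; the function name regression_metrics and the demo values are hypothetical and chosen only to make the arithmetic easy to follow, they are not taken from the salary dataset.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R²) computed from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mse = np.mean(errors ** 2)          # mean of squared errors
    mae = np.mean(np.abs(errors))       # mean of absolute errors

    ss_res = np.sum(errors ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2

# Hypothetical values for illustration only
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 5.0, 8.0, 9.0]

mse, mae, r2 = regression_metrics(y_true_demo, y_pred_demo)
print(f"MSE={mse:.2f}, MAE={mae:.2f}, R²={r2:.2f}")  # MSE=0.50, MAE=0.50, R²=0.90
```

In practice, sklearn.metrics provides mean_squared_error, mean_absolute_error, and r2_score for the same quantities.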

4. Learn Model Parameters

  • After fitting the model, the values of w₀ (intercept) and w₁ (slope) are learned from the training data.

  • These parameters define the best-fit line:

ŷ = w₀ + w₁ · x

where x is the years of experience and ŷ is the predicted salary. This line explains the trend in your dataset.
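
As a rough illustration of how w₀ and w₁ can be obtained, the sketch below applies the standard ordinary least-squares closed form (slope = covariance of x and y divided by variance of x) to the five rows shown in the dataset preview further down this page. This is an assumption about the method, not a description of what scikit-learn does internally, and because only five rows are used the resulting values will differ from those of the full model.

```python
import numpy as np

# First five rows from the dataset preview on this page
x = np.array([1.1, 1.3, 1.5, 2.0, 2.2])                          # years of experience
y = np.array([39343, 46205, 37731, 43525, 39891], dtype=float)   # salary

x_mean, y_mean = x.mean(), y.mean()

# Closed-form ordinary least squares for one feature
w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
w0 = y_mean - w1 * x_mean                                             # intercept

print(f"w0 (intercept) = {w0:.2f}, w1 (slope) = {w1:.2f}")

# Cross-check against NumPy's degree-1 polynomial fit, which returns [slope, intercept]
slope_check, intercept_check = np.polyfit(x, y, 1)
print(np.allclose([w1, w0], [slope_check, intercept_check]))
```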

5. Validate Assumptions

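  • By plotting the residuals against the fitted values or checking a histogram of the residuals, you can verify the assumptions of linear regression (linearity, normally distributed errors, independent errors, and homoskedasticity, i.e., constant error variance).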
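
One possible way to run these checks is sketched below. The data here is synthetic and hypothetical (a straight line plus random noise generated with a fixed seed), since the point is only to show the two diagnostic plots: residuals against fitted values, and a histogram of residuals.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, hypothetical data: a linear trend plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 30)
y = 25000 + 9000 * x + rng.normal(0, 5000, size=x.size)

# Fit a straight line and compute residuals (actual minus fitted)
w1, w0 = np.polyfit(x, y, 1)
fitted = w0 + w1 * x
residuals = y - fitted

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs fitted: look for no visible pattern and a roughly constant spread
ax_scatter.scatter(fitted, residuals)
ax_scatter.axhline(0, color='red', linewidth=1)
ax_scatter.set_xlabel('Fitted value')
ax_scatter.set_ylabel('Residual')

# Histogram: look for a roughly symmetric, bell-shaped distribution
ax_hist.hist(residuals, bins=8)
ax_hist.set_xlabel('Residual')
ax_hist.set_ylabel('Count')

fig.tight_layout()
```

Call plt.show() afterwards to display the figure. A curved pattern in the left plot would suggest non-linearity, and a funnel shape would suggest non-constant variance.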

6. Foundational Learning

  • This is a baseline model in machine learning. It introduces important concepts such as:

    • Supervised learning

    • Cost/loss functions

    • Optimization (e.g., gradient descent)

    • Model evaluation and visualization
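
To make the cost-function and optimization ideas concrete, here is a minimal gradient-descent sketch that minimizes the mean squared error for a single feature. The function name, learning rate, iteration count, and data are all hypothetical choices for illustration. Note that scikit-learn's LinearRegression, used later on this page, solves the least-squares problem directly rather than by gradient descent.

```python
import numpy as np

def gradient_descent(x, y, lr=0.05, n_iters=5000):
    """Fit y ≈ w0 + w1 * x by gradient descent on the mean squared error."""
    w0, w1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        error = (w0 + w1 * x) - y                   # prediction minus target
        grad_w0 = (2.0 / n) * error.sum()           # dMSE/dw0
        grad_w1 = (2.0 / n) * (error * x).sum()     # dMSE/dw1
        w0 -= lr * grad_w0
        w1 -= lr * grad_w1
    return w0, w1

# Hypothetical noise-free data lying exactly on y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

w0, w1 = gradient_descent(x, y)
print(f"w0 ≈ {w0:.3f}, w1 ≈ {w1:.3f}")  # w0 ≈ 2.000, w1 ≈ 3.000
```

With unscaled features such as raw salaries, a learning rate this large may diverge, so in practice the inputs are usually standardized first or the learning rate is reduced.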

Python code

The dataset contains two columns:

  • Years of Experience: Number of years a person has worked.

  • Salary: The corresponding salary.

Here's a preview of the first few rows:

import pandas as pd

# Load the CSV file
file_path = 'Dataset/Salary_Data.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
data.head()

The output is:

   Years of Experience  Salary
0                  1.1   39343
1                  1.3   46205
2                  1.5   37731
3                  2.0   43525
4                  2.2   39891

Let's build and visualize a simple linear regression model to predict salary based on years of experience.

1. Import Required Libraries

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
  • pandas is used to load and manage the dataset.

  • matplotlib.pyplot is for plotting the graph.

  • sklearn.linear_model.LinearRegression is the core linear regression model.

  • train_test_split splits the data into training and testing sets.

  • r2_score and mean_squared_error are evaluation metrics.

2. Load the CSV File

df = pd.read_csv('Dataset/Salary_Data.csv')
  • This reads the CSV file into a DataFrame called df.

3. Separate Features and Labels

X = df[['Years of Experience']]
y = df['Salary']
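  • X contains the input feature, Years of Experience. The double brackets keep it as a two-dimensional DataFrame, which is the shape scikit-learn expects for features.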

  • y contains the target variable: Salary.
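
The difference between the two selections can be seen in a small sketch built from the five preview rows shown earlier. The names df_demo, X_demo, and y_demo are hypothetical and used only here.

```python
import pandas as pd

# First five rows from the dataset preview on this page
df_demo = pd.DataFrame({
    'Years of Experience': [1.1, 1.3, 1.5, 2.0, 2.2],
    'Salary': [39343, 46205, 37731, 43525, 39891],
})

X_demo = df_demo[['Years of Experience']]  # double brackets -> 2D DataFrame
y_demo = df_demo['Salary']                 # single brackets -> 1D Series

print(type(X_demo).__name__, X_demo.shape)  # DataFrame (5, 1)
print(type(y_demo).__name__, y_demo.shape)  # Series (5,)
```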

4. Split into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  • 80% of the data goes into training (X_train, y_train).

  • 20% goes into testing (X_test, y_test).

  • random_state=42 fixes the seed used to shuffle the rows, so the same split is produced on every run.
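
The effect of test_size and random_state can be checked on a small stand-in array. The sketch below assumes 30 hypothetical rows; the row count and the names X_demo, y_demo, X_tr, and X_te are illustrative and not taken from the salary dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 30 hypothetical rows standing in for a small dataset
X_demo = np.arange(30).reshape(-1, 1)
y_demo = np.arange(30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))  # 24 6

# Repeating the call with the same seed reproduces the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))  # True
```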

5. Train the Linear Regression Model

model = LinearRegression()
model.fit(X_train, y_train)
  • We create a linear regression model.

  • .fit() trains the model using training data.

6. Make Predictions on Test Set

y_pred = model.predict(X_test)
  • This uses the trained model to predict salaries for X_test.

7. Print the Model Equation

slope = model.coef_[0]
intercept = model.intercept_
print(f"Model Equation: Salary = {slope:.2f} * YearsExperience + {intercept:.2f}")
  • slope is the learned coefficient: the expected increase in salary for each additional year of experience.

  • intercept is the predicted salary when experience is 0.

  • You get the equation in the form: Salary = a × YearsExperience + b

Output

Model Equation: Salary = 9423.82 * YearsExperience + 25321.58

8. Evaluate Accuracy

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R² Score: {r2:.4f}")
print(f"Mean Squared Error: {mse:.2f}")
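  • R² Score: the proportion of the variance in the test salaries that the model explains (1 = perfect fit, 0 = no better than always predicting the mean).

  • MSE: the average of the squared prediction errors, expressed in squared salary units (lower is better).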

Output

R² Score: 0.9024
Mean Squared Error: 49830096.86

9. Predict Salary for a Specific Experience

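specific_years = 5
# Pass a DataFrame with the training column name so scikit-learn
# does not warn about missing feature names
new_X = pd.DataFrame({'Years of Experience': [specific_years]})
predicted_salary = model.predict(new_X)[0]
print(f"Predicted Salary for {specific_years} years of experience: ${predicted_salary:.2f}")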

We predict the salary for 5 years of experience using our model.

Output

Predicted Salary for 5 years of experience: $72440.66
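
As a sanity check, the same prediction can be reproduced by hand from the printed model equation. The sketch below plugs in the rounded slope and intercept shown in the output of step 7; the variable name manual_prediction is hypothetical.

```python
# Rounded coefficients copied from the printed model equation
slope = 9423.82
intercept = 25321.58
years = 5

manual_prediction = slope * years + intercept
print(f"{manual_prediction:.2f}")  # 72440.68
```

The result differs from the model's 72440.66 by two cents because the printed coefficients are rounded to two decimals, while model.predict uses the full-precision values.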

10. Plot the Data and the Regression Line

plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, model.predict(X), color='red', linewidth=2, label='Regression Line')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Salary vs Years of Experience')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
  • Blue points are the actual data.

  • The red line is the fitted regression line.

  • The plot shows how closely the line follows the data.

Dataset file: Salary_Data.csv (356 B)