Simple Linear Regression Implementation
Nerd Cafe
Last updated
Nerd Cafe
Last updated
The primary purpose of implementing a Simple Linear Regression model on the below dataset
is to model and quantify the linear relationship between years of experience (independent variable) and salary (dependent variable) in order to make predictions. This serves the following specific goals:
You want to determine how salary changes with years of experience.
A linear regression line provides an interpretable model that tells you how much increase in salary is expected for each additional year of experience.
Once trained, the model can be used to predict salary for unseen or future inputs (e.g., "What will be the salary of a person with 7.5 years of experience?").
By comparing predicted values with actual values on the test set, you can evaluate how well the model generalizes to new data.
The y_test
vs y_pred
comparison helps calculate accuracy metrics like:
Mean Squared Error (MSE)
R² Score (coefficient of determination)
Mean Absolute Error (MAE)
After fitting the model, the values of w₀
(intercept) and w₁
(slope) are learned from the training data.
These parameters define the best-fit line:
which explains the trend in your dataset.
By plotting residuals or checking histograms, you can verify the assumptions of linear regression (linearity, normality, independence, homoskedasticity).
This is a baseline model in machine learning. It introduces important concepts such as:
Supervised learning
Cost/loss functions
Optimization (e.g., gradient descent)
Model evaluation and visualization
The dataset contains two columns:
Years of Experience: Number of years a person has worked.
Salary: The corresponding salary.
Here's a preview of the first few rows:
The output is:
Let's build and visualize a simple linear regression model to predict salary based on years of experience.
pandas
is used to load and manage the dataset.
matplotlib.pyplot
is for plotting the graph.
sklearn.linear_model.LinearRegression
is the core linear regression model.
train_test_split
splits the data into training and testing sets.
r2_score
and mean_squared_error
are evaluation metrics.
This reads the CSV file into a DataFrame called df
.
X
contains the input feature: Years of Experience (2D array).
y
contains the target variable: Salary.
80% of the data goes into training (X_train
, y_train
).
20% goes into testing (X_test
, y_test
).
random_state=42
ensures reproducibility
We create a linear regression model.
.fit()
trains the model using training data.
This uses the trained model to predict salaries for X_test
.
slope
is the coefficient (how much salary increases per year).
intercept
is the base salary when experience is 0.
You get the equation in the form: Salary = a × YearsExperience + b
Output
R² Score
: how well the model fits the data (1 = perfect).
MSE
: average of squared prediction errors (lower is better).
Output
We predict the salary for 5 years of experience using our model.
Output
Blue points are actual data.
Red line is the predicted regression line.
Shows how well the line fits the data.