Simple Linear Regression Implementation

Nerd Cafe

Last updated 26 days ago

The primary purpose of implementing a Simple Linear Regression model on the dataset below is to model and quantify the linear relationship between years of experience (the independent variable) and salary (the dependent variable) in order to make predictions. This serves the following specific goals:

1. Understanding Relationships

You want to determine how salary changes with years of experience.
A linear regression line provides an interpretable model that tells you how much increase in salary is expected for each additional year of experience.

2. Predict Future Outcomes

Once trained, the model can be used to predict salary for unseen or future inputs (e.g., "What will be the salary of a person with 7.5 years of experience?").

3. Evaluate Model Accuracy

By comparing predicted values with actual values on the test set, you can evaluate how well the model generalizes to new data.
Comparing y_test with y_pred lets you calculate error metrics such as:
- Mean Squared Error (MSE)
- R² Score (coefficient of determination)
- Mean Absolute Error (MAE)
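
To make the three metrics concrete, here is a minimal sketch that computes them by hand in plain Python. The salary values are hypothetical and chosen only to keep the arithmetic easy to follow; they are not taken from the dataset.

```python
# Hypothetical actual and predicted salaries, used only to show the arithmetic.
y_true = [40000.0, 50000.0, 60000.0, 70000.0]
y_hat = [42000.0, 49000.0, 63000.0, 68000.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_hat)]

# MSE: mean of the squared errors.
mse = sum(e ** 2 for e in errors) / n

# MAE: mean of the absolute errors.
mae = sum(abs(e) for e in errors) / n

# R²: 1 minus (residual sum of squares / total sum of squares).
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse:.2f}")  # MSE: 4500000.00
print(f"MAE: {mae:.2f}")  # MAE: 2000.00
print(f"R²: {r2:.4f}")    # R²: 0.9640
```

MSE is in squared salary units, while MAE stays in the same units as salary, which usually makes MAE easier to read at a glance.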

4. Learn Model Parameters

After fitting the model, the values of w₀ (intercept) and w₁ (slope) are learned from the training data.
These parameters define the best-fit line:

ŷ = w₀ + w₁x

where x is the years of experience and ŷ is the predicted salary. This line explains the trend in your dataset.
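
As a rough illustration of how w₀ and w₁ come out of the data, the sketch below applies the standard closed-form least-squares formulas to the five preview rows of the dataset shown on this page. Because it uses only five rows, its result will differ from the model fitted on the full training set; the variable names are illustrative.

```python
# First five rows of the dataset preview shown on this page.
years = [1.1, 1.3, 1.5, 2.0, 2.2]
salary = [39343.0, 46205.0, 37731.0, 43525.0, 39891.0]

n = len(years)
x_mean = sum(years) / n
y_mean = sum(salary) / n

# Slope: w1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(years, salary))
s_xx = sum((x - x_mean) ** 2 for x in years)
w1 = s_xy / s_xx

# Intercept: w0 = y_mean - w1 * x_mean, so the line passes through (x_mean, y_mean).
w0 = y_mean - w1 * x_mean

# With an intercept in the model, the least-squares residuals sum to (almost) zero.
residuals = [y - (w0 + w1 * x) for x, y in zip(years, salary)]

print(f"w0 (intercept) = {w0:.2f}, w1 (slope) = {w1:.2f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```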

5. Validate Assumptions

By plotting the residuals or checking their histogram, you can check the assumptions of linear regression (linearity, normality of the residuals, independence, homoskedasticity).
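
One possible way to run these checks is sketched below with NumPy and matplotlib. The data is synthetic (a linear trend plus Gaussian noise, generated with a fixed seed), and the output file name is hypothetical. With the real dataset you would compute the residuals from your own fitted model instead.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data (an assumption for illustration): linear trend plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 30)
y = 25000 + 9500 * x + rng.normal(0, 5000, size=x.size)

# Least-squares straight line; for deg=1, np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. the input: look for no curve (linearity) and an even spread
# around zero across the whole range (homoskedasticity).
ax_scatter.scatter(x, residuals)
ax_scatter.axhline(0, color="red", linewidth=1)
ax_scatter.set_xlabel("Years of Experience")
ax_scatter.set_ylabel("Residual")
ax_scatter.set_title("Residuals vs. input")

# Histogram of residuals: a roughly bell-shaped histogram is consistent with normality.
ax_hist.hist(residuals, bins=10)
ax_hist.set_xlabel("Residual")
ax_hist.set_ylabel("Count")
ax_hist.set_title("Residual histogram")

fig.tight_layout()
fig.savefig("residual_diagnostics.png")  # hypothetical output file name
```

These plots are visual checks only; they can reveal obvious violations but do not prove that the assumptions hold.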

6. Foundational Learning

This is a baseline model in machine learning. It introduces important concepts such as:
- Supervised learning
- Cost/loss functions
- Optimization (e.g., gradient descent)
- Model evaluation and visualization
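
To connect the loss function and optimization ideas, here is a minimal gradient descent sketch for a single-feature linear model, written in plain Python. The toy data, learning rate, and iteration count are assumptions chosen so the loop converges. Note that scikit-learn's LinearRegression, used later on this page, solves the least-squares problem directly rather than by gradient descent.

```python
# Hypothetical toy data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w0, w1 = 0.0, 0.0      # start from a flat line through the origin
learning_rate = 0.05   # assumed step size, small enough to converge on this data
n = len(xs)

for _ in range(5000):
    preds = [w0 + w1 * x for x in xs]
    # Gradients of MSE = (1/n) * sum((pred - y)^2) with respect to w0 and w1.
    grad_w0 = (2 / n) * sum(p - y for p, y in zip(preds, ys))
    grad_w1 = (2 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
    # Move each parameter a small step against its gradient.
    w0 -= learning_rate * grad_w0
    w1 -= learning_rate * grad_w1

print(f"w0 = {w0:.4f}, w1 = {w1:.4f}")  # w0 = 1.0000, w1 = 2.0000
```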

Python code

The dataset contains two columns:

Years of Experience: Number of years a person has worked.
Salary: The corresponding salary.

Here's a preview of the first few rows:

import pandas as pd

# Load the CSV file
file_path = 'Dataset/Salary_Data.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
data.head()

The output is:

   Years of Experience  Salary
0                  1.1   39343
1                  1.3   46205
2                  1.5   37731
3                  2.0   43525
4                  2.2   39891

Let's build and visualize a simple linear regression model to predict salary based on years of experience.

1. Import Required Libraries

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

pandas is used to load and manage the dataset.
matplotlib.pyplot is for plotting the graph.
sklearn.linear_model.LinearRegression is the core linear regression model.
train_test_split splits the data into training and testing sets.
r2_score and mean_squared_error are evaluation metrics.

2. Load the CSV File

df = pd.read_csv('Dataset/Salary_Data.csv')

This reads the CSV file into a DataFrame called df.

3. Separate Features and Labels

X = df[['Years of Experience']]
y = df['Salary']

X contains the input feature, Years of Experience, as a one-column DataFrame (2D), which is the shape scikit-learn expects for features.
y contains the target variable, Salary, as a 1D Series.
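
The difference between double and single brackets is easy to miss, so here is a small sketch on a hypothetical three-row frame that uses the same column names as the dataset:

```python
import pandas as pd

# A tiny hypothetical frame with the same column names as the dataset.
df = pd.DataFrame({
    "Years of Experience": [1.1, 1.3, 1.5],
    "Salary": [39343, 46205, 37731],
})

X = df[["Years of Experience"]]  # double brackets -> DataFrame (2D)
y = df["Salary"]                 # single brackets -> Series (1D)

print(type(X).__name__, X.shape)  # DataFrame (3, 1)
print(type(y).__name__, y.shape)  # Series (3,)
```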

4. Split into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

80% of the data goes into training (X_train, y_train).
20% goes into testing (X_test, y_test).
random_state=42 fixes the shuffle seed, so the same split is produced on every run.
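
A quick way to see both effects is to split a small hypothetical array of ten rows and repeat the call with the same seed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical rows, used only to show the split sizes.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2

# Repeating the call with the same random_state gives the same test rows.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test_again))  # True
```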

5. Train the Linear Regression Model

model = LinearRegression()
model.fit(X_train, y_train)

We create a linear regression model.
.fit() trains the model using training data.

6. Make Predictions on Test Set

y_pred = model.predict(X_test)

This uses the trained model to predict salaries for X_test.

7. Print the Model Equation

slope = model.coef_[0]
intercept = model.intercept_
print(f"Model Equation: Salary = {slope:.2f} * YearsExperience + {intercept:.2f}")

slope is the coefficient: the expected increase in salary for each additional year of experience.
intercept is the predicted salary when experience is 0.
You get the equation in the form Salary = a × YearsExperience + b, where a is the slope and b is the intercept.

Output

Model Equation: Salary = 9423.82 * YearsExperience + 25321.58
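
As a sanity check on what coef_ and intercept_ mean, the following sketch fits a model on a few hypothetical points and confirms that applying the equation by hand gives the same numbers as predict():

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical experience/salary pairs, used only for this check.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([35000.0, 44000.0, 56000.0, 63000.0, 75000.0])

model = LinearRegression().fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_

# Apply the equation manually and compare with the model's own predictions.
manual = slope * X[:, 0] + intercept
print(np.allclose(manual, model.predict(X)))  # True
```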

8. Evaluate Accuracy

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R² Score: {r2:.4f}")
print(f"Mean Squared Error: {mse:.2f}")

R² Score: the proportion of the variance in the test salaries that the model explains (1 = perfect fit).
MSE: the average of the squared prediction errors, in squared salary units (lower is better).

Output

R² Score: 0.9024
Mean Squared Error: 49830096.86
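
Because MSE is in squared units, it can help to take its square root (RMSE), which brings the error back to the same units as salary. A small sketch using the MSE value reported above:

```python
import math

mse = 49830096.86  # test-set MSE reported above
rmse = math.sqrt(mse)

print(f"RMSE: {rmse:.2f}")  # RMSE: 7059.04
```

Read this way, the model's typical test error is on the order of 7,000 in salary units.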

9. Predict Salary for a Specific Experience

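specific_years = 5
# Wrap the value in a DataFrame with the training column name so that
# scikit-learn does not warn about missing feature names.
query = pd.DataFrame({'Years of Experience': [specific_years]})
predicted_salary = model.predict(query)[0]
print(f"Predicted Salary for {specific_years} years of experience: ${predicted_salary:.2f}")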

We predict the salary for 5 years of experience using our model.

Output

Predicted Salary for 5 years of experience: $72440.66
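
You can reproduce this figure by hand from the equation printed in step 7. The sketch below uses the rounded slope and intercept, so it differs from the model's output by a couple of cents:

```python
slope = 9423.82       # rounded slope printed in step 7
intercept = 25321.58  # rounded intercept printed in step 7
years = 5

manual = slope * years + intercept
print(f"{manual:.2f}")  # 72440.68
```

The small gap to 72440.66 comes from rounding the coefficients to two decimals before multiplying.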

10. Plot the Data and the Regression Line

plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, model.predict(X), color='red', linewidth=2, label='Regression Line')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Salary vs Years of Experience')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

Blue points are actual data.
Red line is the predicted regression line.
Together, they show how well the line fits the data.
