is to model and quantify the linear relationship between years of experience (independent variable) and salary (dependent variable) in order to make predictions. This serves the following specific goals:
1. Understanding Relationships
You want to determine how salary changes with years of experience.
A linear regression line provides an interpretable model: its slope tells you how much salary is expected to increase for each additional year of experience.
2. Predict Future Outcomes
Once trained, the model can be used to predict salary for unseen or future inputs (e.g., "What will be the salary of a person with 7.5 years of experience?").
3. Evaluate Model Accuracy
By comparing predicted values with actual values on the test set, you can evaluate how well the model generalizes to new data.
Comparing y_test with y_pred lets you calculate error and goodness-of-fit metrics such as:
Mean Squared Error (MSE)
R² Score (coefficient of determination)
Mean Absolute Error (MAE)
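As a rough illustration of what these three metrics measure, the sketch below computes them from their definitions in plain Python. The lists y_true and y_hat are toy values invented for this example and are not taken from the salary dataset.

```python
# Toy values, invented purely for illustration (not from the salary dataset).
y_true = [3.0, 5.0, 7.0, 9.0]   # hypothetical actual values
y_hat = [2.5, 5.0, 7.5, 10.0]   # hypothetical predictions

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_hat)]

# Mean Squared Error: average of the squared errors.
mse = sum(e ** 2 for e in errors) / n

# Mean Absolute Error: average of the absolute errors.
mae = sum(abs(e) for e in errors) / n

# R²: 1 minus (residual sum of squares / total sum of squares).
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse:.4f}, MAE: {mae:.4f}, R²: {r2:.4f}")
```

In practice the same numbers would normally come from sklearn.metrics; the hand-written version is only meant to make the definitions concrete.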
4. Learn Model Parameters
After fitting the model, the values of w₀ (intercept) and w₁ (slope) are learned from the training data.
These parameters define the best-fit line:
y = w₀ + w₁ · x
where x is years of experience and y is the predicted salary. This line explains the trend in your dataset.
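For a single input feature, the least-squares intercept and slope have a closed-form solution. The sketch below applies it to a small toy dataset that was made up for this example (it lies exactly on y = 1 + 2x); the names x, y, w0 and w1 are illustrative only.

```python
# Toy data invented for illustration; it lies exactly on y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Slope: covariance of x and y divided by the variance of x.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
w1 = s_xy / s_xx

# Intercept: forces the line through the point (x_mean, y_mean).
w0 = y_mean - w1 * x_mean

print(f"w0 (intercept) = {w0}, w1 (slope) = {w1}")
```

On real, noisy data the fitted line will not pass through every point, but the same two formulas give the ordinary least-squares estimates.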
5. Validate Assumptions
By plotting residuals or checking histograms of the residuals, you can verify the assumptions of linear regression (linearity, normality of residuals, independence, homoskedasticity).
6. Foundational Learning
This is a baseline model in machine learning. It introduces important concepts such as:
Supervised learning
Cost/loss functions
Optimization (e.g., gradient descent)
Model evaluation and visualization
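To make the loss-function and gradient-descent ideas concrete, here is a minimal sketch that fits a line by repeatedly stepping against the gradient of the mean squared error. The toy data, the learning rate and the iteration count are assumptions chosen for this example. Note that scikit-learn's LinearRegression solves the least-squares problem directly rather than by gradient descent.

```python
# Toy data invented for illustration; it lies exactly on y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
n = len(x)

w0, w1 = 0.0, 0.0       # start from an arbitrary guess
learning_rate = 0.05    # assumed step size, small enough to converge here

for _ in range(5000):
    # Residuals of the current line: prediction minus target.
    residuals = [(w0 + w1 * xi) - yi for xi, yi in zip(x, y)]
    # Gradients of the mean squared error with respect to w0 and w1.
    grad_w0 = (2 / n) * sum(residuals)
    grad_w1 = (2 / n) * sum(r * xi for r, xi in zip(residuals, x))
    # Move both parameters a small step downhill.
    w0 -= learning_rate * grad_w0
    w1 -= learning_rate * grad_w1

print(f"w0 ≈ {w0:.4f}, w1 ≈ {w1:.4f}")
```

A learning rate that is too large makes the updates diverge, which is one reason closed-form solvers are preferred when the problem is this small.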
Python code
The dataset contains two columns:
Years of Experience: Number of years a person has worked.
Salary: The corresponding salary.
Here's a preview of the first few rows:
The output is:
Let's build and visualize a simple linear regression model to predict salary based on years of experience.
1. Import Required Libraries
pandas is used to load and manage the dataset.
matplotlib.pyplot is for plotting the graph.
sklearn.linear_model.LinearRegression is the core linear regression model.
train_test_split splits the data into training and testing sets.
r2_score and mean_squared_error are evaluation metrics.
2. Load the CSV File
This reads the CSV file into a DataFrame called df.
3. Separate Features and Labels
X contains the input feature: Years of Experience (2D array).
y contains the target variable: Salary.
4. Split into Training and Testing Sets
80% of the data goes into training (X_train, y_train).
20% goes into testing (X_test, y_test).
random_state=42 ensures reproducibility: the same rows land in the same split on every run.
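The effect of test_size and random_state can be checked on a small example. The sketch below uses a hypothetical 10-row DataFrame (invented here, not the real salary file) and splits it twice with the same seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row dataset, invented for illustration only.
years = list(range(1, 11))
toy = pd.DataFrame({
    "Years of Experience": years,
    "Salary": [30000 + 5000 * yr for yr in years],
})

X = toy[["Years of Experience"]]  # 2D feature matrix
y = toy["Salary"]                 # 1D target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Repeat the split with the same seed to check that it is reproducible.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
print(X_test.index.tolist() == X_test_again.index.tolist())
```

With 10 rows and test_size=0.2, 8 rows go to training and 2 to testing, and both splits select the same test rows.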
5. Train the Linear Regression Model
We create a linear regression model.
.fit() trains the model using training data.
6. Make Predictions on Test Set
This uses the trained model to predict salaries for X_test.
7. Print the Model Equation
slope is the coefficient (how much the predicted salary increases per additional year).
intercept is the predicted salary when experience is 0.
You get the equation in the form: Salary = a × YearsExperience + b, where a is the slope (w₁) and b is the intercept (w₀).
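One way this step might look is sketched below. The variable names slope and intercept follow the text above; the toy DataFrame is an assumption made up for this example (it follows Salary = 5000 × years + 30000 exactly), so the printed numbers are not the ones for the real dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data, invented for illustration: Salary = 5000 * years + 30000.
years = list(range(1, 11))
toy = pd.DataFrame({
    "Years of Experience": years,
    "Salary": [30000 + 5000 * yr for yr in years],
})

model = LinearRegression()
model.fit(toy[["Years of Experience"]], toy["Salary"])

# coef_ holds one entry per feature; there is a single feature here.
slope = model.coef_[0]
intercept = model.intercept_

print(f"Salary = {slope:.2f} × YearsExperience + {intercept:.2f}")
```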
Output
8. Evaluate Accuracy
R² Score: the proportion of the variance in salary that the model explains on the test set (1 = perfect).
MSE: average of squared prediction errors (lower is better).
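A possible version of the prediction and evaluation steps is sketched below. It uses a hypothetical 20-row dataset with a fixed alternating offset in place of real noise, so the scores it prints say nothing about the actual salary data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data, invented for illustration: a linear trend plus a
# fixed +/-1000 offset so that the fit is not perfect.
years = list(range(1, 21))
salary = [30000 + 5000 * yr + (1000 if yr % 2 else -1000) for yr in years]
toy = pd.DataFrame({"Years of Experience": years, "Salary": salary})

X = toy[["Years of Experience"]]
y = toy["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the held-out rows, then compare with the true values.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f"R² Score: {r2:.4f}")
print(f"MSE: {mse:.2f}")
```

Both metrics take the true values first and the predictions second; for R² the argument order matters, since swapping them changes the result.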
Output
9. Predict Salary for a Specific Experience
We predict the salary for 5 years of experience using our model.
import pandas as pd
# Load the CSV file
file_path = 'Dataset/Salary_Data.csv'
data = pd.read_csv(file_path)
# Display the first few rows of the dataset
data.head()
   Years of Experience  Salary
0                  1.1   39343
1                  1.3   46205
2                  1.5   37731
3                  2.0   43525
4                  2.2   39891
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
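# Load the dataset into a DataFrame
df = pd.read_csv('Dataset/Salary_Data.csv')
# Feature matrix (2D) and target vector (1D)
X = df[['Years of Experience']]
y = df['Salary']
# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit ordinary least squares on the training split
model = LinearRegression()
model.fit(X_train, y_train)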
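specific_years = 5
# Pass a DataFrame with the training column name so scikit-learn does not warn about missing feature names
new_data = pd.DataFrame({'Years of Experience': [specific_years]})
predicted_salary = model.predict(new_data)[0]
print(f"Predicted Salary for {specific_years} years of experience: ${predicted_salary:.2f}")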
Predicted Salary for 5 years of experience: $72440.66
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, model.predict(X), color='red', linewidth=2, label='Regression Line')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Salary vs Years of Experience')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()