Simple Linear Regression Implementation

Nerd Cafe

The primary purpose of implementing a Simple Linear Regression model on the below dataset

is to model and quantify the linear relationship between years of experience (independent variable) and salary (dependent variable) in order to make predictions. This serves the following specific goals:

1. Understanding Relationships

  • You want to determine how salary changes with years of experience.

  • A linear regression line provides an interpretable model that tells you how much increase in salary is expected for each additional year of experience.

2. Predict Future Outcomes

  • Once trained, the model can be used to predict salary for unseen or future inputs (e.g., "What will be the salary of a person with 7.5 years of experience?").

3. Evaluate Model Accuracy

  • By comparing predicted values with actual values on the test set, you can evaluate how well the model generalizes to new data.

  • The y_test vs y_pred comparison helps calculate accuracy metrics like:

    • Mean Squared Error (MSE)

    • R² Score (coefficient of determination)

    • Mean Absolute Error (MAE)

4. Learn Model Parameters

  • After fitting the model, the values of w₀ (intercept) and w₁ (slope) are learned from the training data.

  • These parameters define the best-fit line:

which explains the trend in your dataset.

5. Validate Assumptions

  • By plotting residuals or checking histograms, you can verify the assumptions of linear regression (linearity, normality, independence, homoskedasticity).

6. Foundational Learning

  • This is a baseline model in machine learning. It introduces important concepts such as:

    • Supervised learning

    • Cost/loss functions

    • Optimization (e.g., gradient descent)

    • Model evaluation and visualization

Python code

The dataset contains two columns:

  • Years of Experience: Number of years a person has worked.

  • Salary: The corresponding salary.

Here's a preview of the first few rows:

The output is:

Let's build and visualize a simple linear regression model to predict salary based on years of experience.

1. Import Required Libraries

  • pandas is used to load and manage the dataset.

  • matplotlib.pyplot is for plotting the graph.

  • sklearn.linear_model.LinearRegression is the core linear regression model.

  • train_test_split splits the data into training and testing sets.

  • r2_score and mean_squared_error are evaluation metrics.

2. Load the CSV File

  • This reads the CSV file into a DataFrame called df.

3. Separate Features and Labels python Copy Edit

  • X contains the input feature: Years of Experience (2D array).

  • y contains the target variable: Salary.

4. Split into Training and Testing Sets

  • 80% of the data goes into training (X_train, y_train).

  • 20% goes into testing (X_test, y_test).

  • random_state=42 ensures reproducibility

5. Train the Linear Regression Model

  • We create a linear regression model.

  • .fit() trains the model using training data.

6. Make Predictions on Test Set

  • This uses the trained model to predict salaries for X_test.

7. Print the Model Equation

  • slope is the coefficient (how much salary increases per year).

  • intercept is the base salary when experience is 0.

  • You get the equation in the form: Salary = a × YearsExperience + b

Output

8. Evaluate Accuracy

  • R² Score: how well the model fits the data (1 = perfect).

  • MSE: average of squared prediction errors (lower is better).

Output

9. Predict Salary for a Specific Experience

We predict the salary for 5 years of experience using our model.

Output

10. Plot the Data and the Regression Line

  • Blue points are actual data.

  • Red line is the predicted regression line.

  • Shows how well the line fits the data.

Last updated