Posted by: shijesh | April 26, 2020

Simple Linear Regression – Python

After a brief introduction to Regression Analysis and Simple Linear Regression, let’s get our hands dirty with “Simple Linear Regression Analysis” using Python.

For this we will use a salary dataset (loaded below as Salary_Data.csv), which has two columns. The first column is the years of experience and the second column is the salary. I have used Jupyter Notebook to run the Python code, but you can use any editor of your choice.

Step 1

  1. Understand the data for which you are going to build the model.
  2. Determine which column will be the independent variable (X) and which will be the dependent variable (y).

Step 2

  1. Import the data → For this we will use the pandas library.
import pandas as pd
# Raw string (r"...") so the backslash in the Windows path is not treated as an escape
dataset = pd.read_csv(r"E:\Salary_Data.csv")
dataset.head()  # View the first few rows of the data we just imported

slr_step2

Step 3

  1. Create a variable in Python, say X, and assign the independent variable to it. In this case the independent variable is Years of Experience.
  2. Create another variable in Python, say y, and assign the dependent variable to it. In this case the dependent variable is Salary.

X = dataset.iloc[:, :-1].values  # every column except the last (Years of Experience)
y = dataset.iloc[:, 1].values    # the column at position 1 (Salary)

slr_step3_1

slr_step3_2

slr_step3_3
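
To make the positional slicing in this step concrete, here is a small sketch on a made-up table. The column names and values are assumptions for illustration only, not the contents of the real file. It shows that `iloc[:, :-1]` keeps a 2-D shape (one column of features), while `iloc[:, 1]` returns a 1-D array of targets. Note that `iloc[:, :1]`, with a colon, would select the first column rather than the second.

```python
import pandas as pd

# Hypothetical stand-in for the salary data; values are invented for illustration.
demo = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0],
    "Salary": [40000.0, 45000.0, 52000.0],
})

X_demo = demo.iloc[:, :-1].values  # every column except the last -> 2-D array
y_demo = demo.iloc[:, 1].values    # the column at position 1 -> 1-D array

print(X_demo.shape)  # (3, 1)
print(y_demo.shape)  # (3,)
```

scikit-learn estimators expect the features as a 2-D array, which is why X is sliced with a range and not a single index.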

Step 4

  1. Splitting the data into a training set and a test set → For this we will use the train_test_split function from the sklearn.model_selection module.
  2. In this step the sample data is split into two sets: a training set and a test set.
  3. The size of the training set and the test set depends on the test_size parameter. In this case test_size is 1/3, which means the training set will have 2/3 of the rows and the test set will have 1/3 of the rows from the sample data.
  4. random_state parameter → If random_state is set to None, we get a different split each time we execute train_test_split. If we set random_state to a fixed value such as 1, we are guaranteed to get the same split every time. During development we generally use a fixed value for random_state (0 in the code below) so that the split stays consistent across runs and results can be compared.

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

slr_step4
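
The effect of a fixed seed can be sketched with the standard library alone. The helper below, `simple_split`, is a hypothetical illustration of the idea (shuffle with a seeded generator, then cut off a test portion); it is not how scikit-learn implements train_test_split and will not reproduce its exact rows.

```python
import random

def simple_split(rows, test_size, seed):
    """Shuffle a copy of rows with a seeded generator, then cut off the test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(30))  # stand-in for 30 data rows

train_a, test_a = simple_split(rows, test_size=1/3, seed=0)
train_b, test_b = simple_split(rows, test_size=1/3, seed=0)

print(len(train_a), len(test_a))  # 20 10
print(test_a == test_b)           # True: same seed, same split
```

Because both calls use the same seed, the two test sets are identical, which is the behaviour a fixed random_state gives you.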

Step 5

  1. Fitting the simple linear model to the training set (i.e. training the algorithm).
  2. For this we use the LinearRegression class from the sklearn.linear_model module.

# Fitting Simple Linear Regression to the Training set

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)
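
For a single feature, fitting by ordinary least squares comes down to two numbers: a slope and an intercept. As a rough sketch of what the fit step computes, the hypothetical helper `fit_line` below applies the textbook formulas, slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope · x̄, to invented data that lies exactly on a line.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data: salary = 25000 + 5000 * years, with no noise
years = [1.0, 2.0, 3.0, 4.0]
salary = [30000.0, 35000.0, 40000.0, 45000.0]

slope, intercept = fit_line(years, salary)
print(slope, intercept)  # 5000.0 25000.0
```

After fitting, the corresponding values in scikit-learn are available as regressor.coef_ and regressor.intercept_.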

 

Step 6

  1. Predicting the test set → Once the model is trained, we test it by providing the test data (X_test, i.e. Years of Experience). The result (y_pred, i.e. Predicted Salary) is then compared with y_test (Actual Salary). If the predicted salary is close enough to the actual salary, we can assume that our model is sufficiently trained and can be used to predict salary based on the number of years of experience.

[Note: To confirm satisfactorily whether the model is appropriate, certain assumptions are made and some checks are performed. These topics will be covered in later sections.]


# Predicting the Test set results.
# For each record in X_test (i.e. for each number of years of experience),
# we will predict the salary, i.e. the value of y

y_pred = regressor.predict(X_test)

slr_step6

As you can see above, except for the 3rd, 8th and 9th salaries, the predicted salaries are very close to the actual ones. So the model was able to predict salary reasonably well.
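
Comparing the numbers by eye works for a handful of rows, but a summary metric makes "close enough" more concrete. The sketch below computes two common ones by hand, mean absolute error and R², on invented values (not the actual y_test and y_pred from this post). The helper names are hypothetical; scikit-learn offers equivalents as mean_absolute_error and r2_score in sklearn.metrics.

```python
def mean_abs_error(actual, predicted):
    """Average size of the prediction error, in the same units as the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Share of the variance in the actual values explained by the predictions."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Invented values for illustration only
actual = [40000.0, 50000.0, 60000.0]
predicted = [42000.0, 49000.0, 61000.0]

mae = mean_abs_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(round(mae, 2))  # 1333.33
print(round(r2, 2))   # 0.97
```

An R² close to 1 and a mean absolute error that is small relative to the salaries would both support the visual impression that the predictions are close.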

Step 7

  1. Compare the predicted salary with the actual salary → In this step we will use the matplotlib.pyplot module to compare actual salary versus predicted salary.
  2. The actual salary (y_test) will be plotted against years of experience (X_test). In the same graph, the regression line representing the predicted salary (regressor.predict(X)) against years of experience (X) will be drawn.

#Visualize the actual salary versus predicted salary

import matplotlib.pyplot as plt
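plt.scatter(X_test, y_test, color='red')  # actual salaries in the test set
plt.plot(X, regressor.predict(X))         # regression line: predicted salary for every X
plt.title('Salary vs Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()  # the parentheses are required; plt.show alone does not call the function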

slr_step7

