Linear Regression basics - course note

A second notebook test with some course notes on Linear Regression

This notebook is based on the exercises in the Anaconda training Getting started with AI and ML.

I’m copying my notes here as a more in-depth test of the ability to publish directly from Jupyter notebooks, and also to put my notes somewhere I can access them later!

Linear Regression

One of the most commonly used supervised machine learning algorithms.

This module covered:

  • Fit a line to data
  • Measure loss with residuals and sum of squares
  • Use `scikit-learn` to fit a linear regression
  • Evaluate a linear regression using $R^2$ and train-test splits

Advantages

  • simple to understand and interpret
  • less prone to over-fitting than more flexible models (low variance)

When is Linear Regression suitable?

  1. variables are continuous, not binary or categorical (use logistic regression for the latter)
  2. input variables follow a Gaussian (bell curve) distribution
  3. input variables are relevant to the output variable and are not highly correlated with each other (collinearity; a quick check is sketched after this list)
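As a quick illustration of point 3, a pandas correlation matrix can flag input variables that are highly correlated with each other. This is only a sketch using made-up columns x1, x2 and x3, not the course data:

import pandas as pd

# Hypothetical input data, for illustration only: x2 is deliberately collinear with x1
inputs = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],   # exactly 2 * x1, so perfectly correlated
    "x3": [5, 3, 6, 2, 7],
})

# Off-diagonal values close to +/-1 indicate collinearity between inputs
print(inputs.corr())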

Simple Linear Regression

ML often splits into two tasks - regression (predict quantity) and classification (predict a category)

E.g. $y = mx + b$

The challenge is to find the $m$ and $b$ that give the "best fit".
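As a rough sketch (with made-up data, not the course dataset), the least-squares values of $m$ and $b$ can be computed directly from the data:

import numpy as np

# Made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Closed-form least-squares estimates for slope and intercept
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()
print(f"y = {m:.3f}x + {b:.3f}")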

Multiple linear regression

With multiple independent variables

e.g. $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon$

$\epsilon$ is the error due to noise

Models with multiple variables can get complex, so it is important to use tools that help select only the input variables that are correlated with the output variable (a short ridge/lasso sketch follows the list below).

e.g.:

  • Pearson correlation and $R^2$
  • adjusted $R^2$
  • Akaike Information Criterion
  • Ridge and lasso regression
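As a rough sketch of the last point (using synthetic data, not the course dataset), scikit-learn's Ridge and Lasso estimators can be dropped in where LinearRegression would otherwise be used; lasso in particular tends to shrink the coefficients of redundant inputs towards zero:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: two inputs where the second is almost a copy of the first (collinear)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=50)
y = 3 * X[:, 0] + rng.normal(size=50)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)  # lasso tends to zero out the redundant input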

#furtherlearning

Residuals

Residuals are the differences between the observed data points and the corresponding points on the regression line. Linear regression models aim to minimise the residuals by optimising a loss function such as the sum of squares.
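A minimal sketch of the idea, using made-up data and a guessed line $y = 2x + 0.1$ rather than a fitted one:

import numpy as np

# Made-up data and a candidate line y = m*x + b
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
m, b = 2.0, 0.1

predictions = m * x + b
residuals = y - predictions             # difference between the data and the line
sum_of_squares = np.sum(residuals ** 2)
print("residuals:", residuals)
print("sum of squares:", sum_of_squares)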

Overfitting

Overfitting occurs when an ML model works well with training data but fails to predict correctly on new, unseen data. Linear regression tends to show low variance and high bias, so it is less likely to overfit. (define terms variance and bias)

Train/Test Splits

A common technique to mitigate overfitting is the use of train/test splits. The training data is used to fit the model, and the test data is then used to evaluate it on previously unseen data; if necessary, the model can then be tweaked.

Evaluating the model with $R^2$

$R^2$ (the coefficient of determination) compares the sum of squared residuals to the total sum of squares around the mean y-value: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$.

It measures how well the independent variables explain the dependent variable, with 0.0 meaning no connection and 1.0 meaning a perfect explanation.
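A minimal sketch of the $R^2$ calculation, using made-up observed and predicted values:

import numpy as np

# Made-up observations and predictions, for illustration only
y_observed = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.8, 5.3, 6.9, 9.1])

ss_res = np.sum((y_observed - y_predicted) ** 2)         # residual sum of squares
ss_tot = np.sum((y_observed - y_observed.mean()) ** 2)   # total sum of squares around the mean
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")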

Example using scikit-learn

The package scikit-learn contains many tools to support Machine Learning techniques such as Linear Regression.

This worked example demonstrates some of them.

First we import the packages we are going to use, making use of two key utilities from scikit-learn:

  • train_test_split makes it easy to split a set of data into training and test subsets.
  • LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

We need to import our data into a pandas DataFrame. For convenience I am using one of the datasets provided by the course author.

# Load the data
df = pd.read_csv('https://bit.ly/3pBKSuN', delimiter=",")
df

     x          y
0    1 -13.115843
1    2  25.806547
2    3  -5.017285
3    4  20.256415
4    5   4.075003
5    6  -3.530260
6    7  24.045999
7    8  22.112566
8    9   5.968591
9   10  43.392339
10  11  32.224643
11  12  14.666142
12  13  17.966141
13  14  -2.754718
14  15  25.156840
15  16  20.182870
16  17  22.281929
17  18  16.757447
18  19  54.219575
19  20  60.564151

We need to split our data into inputs and the associated outputs.

# Extract input variables (all rows, all columns but last column)
X = df.values[:, :-1]

# Extract output column (all rows, last column)
Y = df.values[:, -1]

We then need to create separate training and testing data so we can evaluate performance and reduce overfitting. Here we make use of the train_test_split utility.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=10)

Next:

  • we train the standard LinearRegression model provided by scikit-learn against our training data
  • then we evaluate the trained model against our test data

The model's score method lets us easily evaluate the fit using $R^2$.

model = LinearRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("R^2: %.3f" % result)
R^2: 0.182

Using matplotlib, we can visualise the model output against the whole input data set.

import matplotlib.pyplot as plt

plt.plot(X, Y, 'o') # scatterplot
plt.plot(X, model.coef_.flatten()*X+model.intercept_.flatten()) # line
plt.show()

(Figure: scatter plot of the data with the fitted regression line.)
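As an aside, an equivalent way to draw the fitted line (assuming the model and data from the cells above) is to use the model's predict method, which avoids reconstructing the equation from coef_ and intercept_:

plt.plot(X, Y, 'o')            # scatterplot of the raw data
plt.plot(X, model.predict(X))  # fitted regression line via predict()
plt.show()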
