Linear Regression basics - course note

A second notebook test with some course notes on Linear Regression

This notebook is based on the exercises in the Anaconda training Getting started with AI and ML.

I’m copying my notes here as a more in-depth test of the ability to publish directly from Jupyter notebooks, and also to put my notes somewhere I can access them later!

Linear Regression

One of the most commonly used supervised machine learning algorithms.

This module covered:

  • Fit a line to data
  • Measure loss with residuals and sum of squares
  • Use `scikit-learn` to fit a linear regression
  • Evaluate a linear regression using $R^2$ and train-test splits

Advantages

  • simple to understand and interpret
  • less prone to over-fitting than more flexible models (low variance)

When is Linear Regression suitable?

  1. variables are continuous, not binary or categorical (use logistic regression for the latter)
  2. input variables follow a Gaussian (bell curve) distribution
  3. input variables are relevant to the output variable and are not highly correlated with each other (collinearity; a quick check is sketched after this list)
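As a quick illustration of point 3, a pandas correlation matrix can flag input variables that are highly correlated with each other. This is only a sketch using made-up columns x1, x2 and x3, not the course data:

import pandas as pd

# Hypothetical input data, for illustration only: x2 is deliberately collinear with x1
inputs = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],   # exactly 2 * x1, so perfectly correlated
    "x3": [5, 3, 6, 2, 7],
})

# Off-diagonal values close to +/-1 indicate collinearity between inputs
print(inputs.corr())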

Simple Linear Regression

ML often splits into two tasks - regression (predict quantity) and classification (predict a category)

E.g. $y = mx + b$

The challenge is to find the $m$ and $b$ that give the "best fit".
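As a rough sketch (with made-up data, not the course dataset), the least-squares values of $m$ and $b$ can be computed directly from the data:

import numpy as np

# Made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Closed-form least-squares estimates for slope and intercept
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()
print(f"y = {m:.3f}x + {b:.3f}")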

Multiple linear regression

With multiple independent variables

e.g. $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon$

$\epsilon$ is the error due to noise

Models with multiple variables can get complex, so it is important to use tools that help select only the input variables that are correlated with the output variable (a short ridge/lasso sketch follows the list below).

e.g.:

  • Pearson correlation and $R^2$
  • adjusted $R^2$
  • Akaike Information Criterion
  • Ridge and lasso regression
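As a rough sketch of the last point (using synthetic data, not the course dataset), scikit-learn's Ridge and Lasso estimators can be dropped in where LinearRegression would otherwise be used; lasso in particular tends to shrink the coefficients of redundant inputs towards zero:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: two inputs where the second is almost a copy of the first (collinear)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=50)
y = 3 * X[:, 0] + rng.normal(size=50)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)  # lasso tends to zero out the redundant input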

#furtherlearning

Residuals

Residuals are the differences between the observed data points and the corresponding points on the regression line. Linear regression models aim to minimise the residuals by optimising a loss function such as the sum of squares.
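A minimal sketch of the idea, using made-up data and a guessed line $y = 2x + 0.1$ rather than a fitted one:

import numpy as np

# Made-up data and a candidate line y = m*x + b
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
m, b = 2.0, 0.1

predictions = m * x + b
residuals = y - predictions             # difference between the data and the line
sum_of_squares = np.sum(residuals ** 2)
print("residuals:", residuals)
print("sum of squares:", sum_of_squares)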

Overfitting

Overfitting occurs when an ML model works well with training data but fails to predict correctly on new, unseen data. Linear regression tends to show low variance and high bias, so it is less likely to overfit. (define terms variance and bias)

Train/Test Splits

A common technique to mitigate overfitting is the use of train/test splits. The training data is used to fit the model, and the test data is then used to evaluate it on previously unseen data; if necessary, the model can then be tweaked.

Evaluating the model with $R^2$

$R^2$ (the coefficient of determination) compares the sum of squared residuals to the total sum of squares around the mean y-value: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$.

It measures how well the independent variables explain the dependent variable, with 0.0 meaning no connection and 1.0 meaning a perfect explanation.
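A minimal sketch of the $R^2$ calculation, using made-up observed and predicted values:

import numpy as np

# Made-up observations and predictions, for illustration only
y_observed = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.8, 5.3, 6.9, 9.1])

ss_res = np.sum((y_observed - y_predicted) ** 2)         # residual sum of squares
ss_tot = np.sum((y_observed - y_observed.mean()) ** 2)   # total sum of squares around the mean
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")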

Example using scikit-learn

The package scikit-learn contains many tools to support Machine Learning techniques such as Linear Regression.

This worked example demonstrates some of them.

First we import the packages we are going to use, making use of two key utilities from scikit-learn:

  • train_test_split makes it easy to split a set of data into training and test subsets.
  • LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

We need to import our data into a pandas DataFrame. For convenience I am using one of the datasets provided by the course author.

# Load the data
df = pd.read_csv('https://bit.ly/3pBKSuN', delimiter=",")
df

     x          y
0    1 -13.115843
1    2  25.806547
2    3  -5.017285
3    4  20.256415
4    5   4.075003
5    6  -3.530260
6    7  24.045999
7    8  22.112566
8    9   5.968591
9   10  43.392339
10  11  32.224643
11  12  14.666142
12  13  17.966141
13  14  -2.754718
14  15  25.156840
15  16  20.182870
16  17  22.281929
17  18  16.757447
18  19  54.219575
19  20  60.564151

We need to split our data into inputs and the associated outputs.

# Extract input variables (all rows, all columns but last column)
X = df.values[:, :-1]

# Extract output column (all rows, last column)
Y = df.values[:, -1]

We then need to create separate training and testing data so we can evaluate performance and reduce overfitting. Here we make use of the train_test_split utility.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=10)

Next:

  • we train the standard LinearRegression model provided by scikit-learn against our training data
  • then we evaluate the trained model against our test data

The model's score method lets us easily evaluate the fit using $R^2$.

model = LinearRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("R^2: %.3f" % result)
R^2: 0.182

Using matplotlib, we can visualise the model output against the whole input data set.

import matplotlib.pyplot as plt

plt.plot(X, Y, 'o') # scatterplot
plt.plot(X, model.coef_.flatten()*X+model.intercept_.flatten()) # line
plt.show()

(Figure: scatter plot of the data with the fitted regression line.)
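As an aside, an equivalent way to draw the fitted line (assuming the model and data from the cells above) is to use the model's predict method, which avoids reconstructing the equation from coef_ and intercept_:

plt.plot(X, Y, 'o')            # scatterplot of the raw data
plt.plot(X, model.predict(X))  # fitted regression line via predict()
plt.show()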
