BMR 617: Statistical Techniques for the Biomedical Sciences

Linear Regression: Introduction

Introduction

The last specific hypothesis test we'll look at is linear regression. Linear regression can be used to test the hypothesis that there is a linear relationship between two quantitative variables. In other words, this addresses the case where both the explanatory and response variables are quantitative.

A linear regression is more than a hypothesis test, however. It also provides the linear function which best describes the relationship between the response and explanatory variables. This function can be used to predict values of the response variable, for a given value of the explanatory variable.

It is important to note that the prediction is based on the data provided, so it is only really valid for values of the explanatory variable that are within, or at least close to, values of the explanatory variable in the data set. Extrapolating far beyond the range of the data will, in almost every case, give inaccurate and even unrealistic results.
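To make the danger of extrapolation concrete, here is a small Python sketch. The data are made up for illustration: they follow \(y=\sqrt{x}\) exactly for \(x\) from 1 to 10, which looks roughly linear over that range. The helper name `fit_line` is hypothetical, and it uses the standard least-squares formulas for the best-fit line (how that line is chosen is described later on this page).

```python
import math

# Made-up data for illustration: y = sqrt(x) for x = 1..10.
xs = [float(x) for x in range(1, 11)]
ys = [math.sqrt(x) for x in xs]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


slope, intercept = fit_line(xs, ys)

# Compare the line's prediction with the true value inside and far outside
# the range of x values (1 to 10) used to fit the line.
for x_new in (5.5, 400.0):
    predicted = slope * x_new + intercept
    actual = math.sqrt(x_new)
    print(f"x = {x_new}: predicted {predicted:.2f}, actual {actual:.2f}")
```

Inside the range of the data, the prediction is close to the true value. At \(x=400\) the line overshoots badly, because the true relationship flattens out in a region where we have no data.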

It is also important to understand what assumptions are being made in a linear regression. The first, and most important, is that we are examining a linear relationship. Mathematically, this means that for each unit increase in the explanatory variable, we get the same increase (or decrease) in the response variable. Graphically, this means that the data lie along a straight line, with random deviations away from the line. The other main assumptions are statistical ones. The first of these says that the random deviations from the straight line are normally distributed with mean zero. In particular, points are equally likely to be above the line as below it, and should be randomly scattered above and below the line. And finally, there is an assumption that the standard deviation of the random deviations is constant; i.e. points don't generally get closer to or further away from the line as you move along the line.

How Linear Regression Works

For a linear regression with one explanatory variable, \(x\), and a response variable, \(y\), any linear relationship can be expressed as \[ y = ax + b \] where \(a\) is the slope of the line, and \(b\) is the y-intercept (i.e. the value of \(y\) when \(x=0\)).

Given our data set, which consists of pairs of values \((x_i, y_i)\), for any choice of \(a\) and \(b\), our predicted value of \(y_i\) would be \[\hat{y}_i=ax_i+b\] This means we can compute the difference between the actual value in the data set and the predicted value from the linear regression as \[\epsilon_i=y_i-(ax_i+b)\] The values \(\epsilon_i\) are called the residuals, and the statistical assumptions above are that the residuals are normally distributed with a standard deviation that is independent of \(x\).
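As a minimal sketch of these definitions, the Python below computes the residuals, and the sum of their squares, for a candidate line. The data values and the function names `residuals` and `sum_of_squares` are made up for illustration.

```python
# Illustrative data only: these (x, y) pairs are made up for this sketch.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def residuals(a, b, xs, ys):
    """Return the residuals y_i - (a*x_i + b) for the line y = a*x + b."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]


def sum_of_squares(a, b, xs, ys):
    """Return the sum of the squared residuals for the line y = a*x + b."""
    return sum(e ** 2 for e in residuals(a, b, xs, ys))


# Two candidate lines with the same intercept but different slopes.
print(sum_of_squares(1.0, 0.0, xs, ys))
print(sum_of_squares(2.0, 0.0, xs, ys))
```

For this data the line with slope 2 gives a much smaller sum of squares than the line with slope 1, which matches what a plot of the points would suggest.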

Linear regression works by choosing the line (that is, the values of \(a\) and \(b\)) that gives the smallest possible value of the sum of the squares of the residuals. This is actually not too hard to calculate, with a little bit of calculus (and quite a lot of algebra), but we won't do it here.
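For reference, the calculation leads to a standard closed-form result. Writing \(\bar{x}\) and \(\bar{y}\) for the means of the \(x_i\) and \(y_i\) (notation introduced here for convenience), the sum of squares \[S(a,b)=\sum_i \left(y_i-(ax_i+b)\right)^2\] is smallest when \[a=\frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}, \qquad b=\bar{y}-a\bar{x}\] These values come from setting the partial derivatives of \(S\) with respect to \(a\) and \(b\) to zero. The second formula also shows that the best-fit line always passes through the point \((\bar{x},\bar{y})\).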

To get a feel for how this works, experiment with the graph below. This shows some random data points, and a straight line. You can change the slope and intercept of the line by dragging the two "handles" (gray circles) on the line. The residuals are represented by the vertical lines between the line and the data points, and the text below shows the sum of the squares of the residuals. Try to move the line to make the sum of squares of residuals as small as possible. Press the "Find Best Fit" button to find the best fit from the Linear Regression algorithm.

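If the interactive graph is not available, the same experiment can be approximated in code. The sketch below tries many candidate lines on a coarse grid of slopes and intercepts and keeps the one with the smallest sum of squares. The data values, the grid ranges and the step size of 0.05 are all arbitrary choices made for illustration. This brute-force search is not how linear regression is actually computed; it only mimics moving the line by hand.

```python
# Illustrative data only: these (x, y) pairs are made up for this sketch.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def sum_of_squares(a, b):
    """Sum of squared residuals for the candidate line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Candidate slopes from 0 to 4 and intercepts from -2 to 2, in steps of 0.05.
slopes = [i * 0.05 for i in range(0, 81)]
intercepts = [j * 0.05 for j in range(-40, 41)]

# Start from the first candidate and keep whichever line does better.
best_a, best_b = slopes[0], intercepts[0]
best_ss = sum_of_squares(best_a, best_b)
for a in slopes:
    for b in intercepts:
        ss = sum_of_squares(a, b)
        if ss < best_ss:
            best_a, best_b, best_ss = a, b, ss

print(f"slope {best_a:.2f}, intercept {best_b:.2f}, sum of squares {best_ss:.4f}")
```

A finer grid gets closer to the true best fit, but only at the cost of many more trial lines, which is why the exact calculation is preferred in practice.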