BMR 617: Statistical Techniques for the Biomedical Sciences

Multiple Linear Regression

Introduction

Last time we looked at Linear Regression. This addressed the Q→Q case, where we had a quantitative response variable, and a single quantitative explanatory variable. We work under the assumption that there is a linear relationship between the two variables, i.e. that a unit increase in the explanatory variable causes a fixed change in the response variable, and we use our data set to estimate the values of \(a\) and \(b\), for \[ y = ax + b \] where \(x\) is the explanatory variable and \(y\) is the response variable.

We can view linear regression in the same way we have viewed other statistical topics so far. We assume that a linear relationship between our explanatory variable \(x\) and our response variable \(y\). This linear relationship is fixed but unknown, and is defined by the slope \(a\) and intercept \(b\) above. So \(a\) and \(b\) are fixed, unknown values that are properties of the population.

As usual, we collect data; that is, we collect pairs of values \((x_i, y_i)\). Because the values in this sample are subject to random effects, they won't exactly match the equation. We use the sample data to make a best estimate of \(a\) and \(b\). We can compute confidence intervals for both, given our data. As usual, if for example we compute a 95% confidence interval for \(a\) and \(b\), we are 95% confident that each interval contains the "true value" of those parameters. If we repeated the experiment or study over and over again, on average 95% of the intervals we computed would contain the "true value" of the parameter.

Linear regression also acts as a hypothesis test (or actually two hypothesis tests). It tests the null hypothesis that \(b\) is zero (i.e. the value of the response variable is zero when the explanatory variable is zero, which is rarely interesting) and the null hypothesis that \(a\) is zero. The latter is equivalent to saying "the response variable is independent of the explanatory variable", so rejecting the null hypothesis means we are providing evidence that there is a relationship between the two variables.

In many experiments, and in most observational studies, we have more than one explanatory variable. Under the assumption that each explanatory variable has a linear relationship with the response variable, we can use multiple linear regression to estimate the magnitude of those linear relationships give a data set. We can also test the null hypotheses that each of the linear factors is equal to zero (i.e. we can test if each explanatory variable is related to the response variable).

Mathematically, if \(y\) is the (quantitative) response variable, and \(x^{(1)},...,x^{(m)}\) are the explanatory variables, then the mathematical set up is \[y=\beta_0+\beta_1 x^{(1)} + \beta_2 x^{(1)} + ... + \beta_m x^{(m)}\] Here \(\beta_0, \beta_1, ... \beta_m\) are the unknown, fixed, parameters that are properties of the population. Each \(\beta_j\) represents the amount that \(y\) increases for a unit increase in \(x^{(j)}\). We can collect a sample - a collection of values of the \(x^{(1)},...,x^{(m)},y\) - and use the sample to estimate the values of all the \(\beta_j\).

Conceptually, this all works the same way as for linear regression with one explanatory variable. Given a data set, \((x_i^{(1)}, x_i^{(2)}, ..., x_i^{(m)}, y_i)\) for \(i=1...n\), for any given choice of the \(\beta_j\), the residuals \(\epsilon_i\) are the difference between the observed \(y_i\) and the value predicted by the \(\beta_j\): \[\epsilon_i=y_i-\left(\beta_0 + \beta_1 x_i^{(1)} + \beta_2 x_i^{(2)} + ... + \beta_m x_i^{(m)}\right)\] The algorithm finds the values of \(\beta_j\) than minimize the sum of all the \(\epsilon^2\).

Multiple Linear Regression Example

We'll borrow examples from the excellent book "Introduction to Statistical Learning with Applications in R" by James, Witten, Hastie, and Tibshirani. This book includes a number of data sets, which are available in R. To load the data for the second edition of the book, first install the package (you only need to do this once):


install.packages("ISLR2")
	
and then load the library:

library(ISLR2)
	
The library includes a data set called Boston, which includes the median house price for 506 suburbs of Boston in 1970, along with 12 other variables for those suburbs. To see what those variables are, search in the help pane for "Boston" (or type ?Boston in the console).

Which variables do you think would have an effect on median home value, and what do you think that effect would be?

If we wanted to test if the median home value were associated with crime rate, we could perform a simple linear regression:


fit.crim <- lm(medv ~ crim, data=Boston)
summary(fit.crim)
	
Similarly, to test if the median home value is associated with the proportion of the population with lower socio-ecomonic status, we can do

fit.lstat <- lm(medv ~ lstat, data=Boston)
summary(fit.lstat)
	
If we want to test both together, we can perform a multiple linear regression with:

fit.lstat.crim <- lm(medv ~ crim + lstat, data=Boston)
summary(fit.lstat.crim)
	

Let's think about the results of these.


> fit.crim <- lm(medv ~ crim, data=Boston)
> summary(fit.crim)

Call:
lm(formula = medv ~ crim, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.957  -5.449  -2.007   2.512  29.800 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 24.03311    0.40914   58.74   <2e-16 ***
crim        -0.41519    0.04389   -9.46   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.484 on 504 degrees of freedom
Multiple R-squared:  0.1508,	Adjusted R-squared:  0.1491 
F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16
	
The estimated coefficient for crim here is -0.41519. This means, for every unit increase in crime per capita (i.e. for an increase of one crime committed per person in the population of the suburb), the median house value decreases by $415.19. (Remember these are 1970 prices!) The p-value for the null hypothesis that crime rate is not associated with median house price is \(2\times 10^{-16}\) (effectively zero), so we would reject that null hypothesis and conclude that crime rate decreases house value.


> fit.lstat <- lm(medv ~ lstat, data=Boston)
> summary(fit.lstat)

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,	Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
	
The coefficient for lstat is -0.95005. This means, for each increase of 1% of the population who are lower socio-economic status, the median house price is $950.05 lower. The p-value for the null hypothesis that there is no association between low socio-economic status and median house value is \(2\times 10^{-16}\) (again, effectively zero), so we would reject that null hypothesis and conclude that rate of low socio-economic status is associated with median house price.

However, if we look at the linear regression with both these variables:


> fit.lstat.crim <- lm(medv ~ crim + lstat, data=Boston)
> summary(fit.lstat.crim)

Call:
lm(formula = medv ~ crim + lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.234  -3.987  -1.513   2.138  25.017 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.31921    0.57374  59.816   <2e-16 ***
crim        -0.07045    0.03602  -1.956   0.0511 .  
lstat       -0.91139    0.04339 -21.004   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.198 on 503 degrees of freedom
Multiple R-squared:  0.5476,	Adjusted R-squared:  0.5458 
F-statistic: 304.4 on 2 and 503 DF,  p-value: < 2.2e-16
	
Here, the conclusion for lower socio-economic status is more or less the same. For each increase of 1% in the proportion of the population with lower socio-economic status, the median house value decreases by $911.39, >em>if the crime rate is unchanged.

On the other hand, if the crime rate increases by one unit (one crime per person), and the rate of the population in lower socio-economic status is unchanged, the median house value decreases by only $70.45. The p-value for the null hypothesis that crime rate has no effet on median house value is \(0.0511\), which typically is not small enough for us to consider that there is strong enough evidence to reject this null hypothesis.

Why did the linear regression with just the crime rate suggest such a strong effect?

Crime rate is correlated with proportion of the population with low socio-economic status:


cor(Boston$crim, Boston$lstat)
	
So generally speaking, if we look at suburbs with high crime rate, they tend to be suburbs with high rates of the population having low socio-economic status. The multiple linear regression on both the variables suggests that if the rate of low socio-economic status is held constant, the crime rate has little or no effect on the median home value. We say here that socio-economic status is a confounding variable when considering the effect of crime rate on house value. By including the confounding variable in the model, we have corrected for its effect.

Including all variables

It would be cumbersome to write out the names of all the variables in the data set: medv ~ crim + zn + indus + chas + .... Fortunately, there is a shorthand for this:


fit.full <- lm(medv ~ .)
summary(fit.full)
	
Examine the results of this and think about how to interpret them.

We can get confidence intervals for all the parameter estimates with


confint(fit.full)
	

Choosing a model

There are a number of ways of measuring how well a model fits the data. The \(R^2\) value, as we saw previously, is the proportion of all the variance in the response variable that is "explained by" the explanatory variables.

Unfortunately, \(R^2\) will always increase if a new variable is added to the model, even if that variable is not informative. Thus the \(R^2\) value is a bad choice for comparing models, particularly if they have different numbers of explanatory variables.

Akaike's Information Criterion (AIC) is a measure of how well a model fits the data that accounts for the number of explanatory variables. It meaasures the "loss of information" in the model, so the smaller the value of AIC, the better the model fits the data.


AIC(fit.lstat)
AIC(fit.crim)
AIC(fit.lstat.crim)
AIC(fit.full)
	
Which of these four models gives the best fit?

Automatically choosing a model

Use this with extreme caution!

We can use a "stepwise" approach to choosing a model. In a forward stepwise approach, we start with a linear model with few (or no) explanatory variables. We can then consider adding each variable on its own, and see which (if any) decreases the AIC by the most. We then add that variable, and repeat, until the AIC cannot be decreased any further.

In a backward stepwise approach, we start with all the variables, and consider removing each one. We choose to remove the one, if any, that decreases the AIC the most. We then remove that variable, and repeat, until the AIC cannot be decreased any further.

We can also combine these. Starting with some model, we can consider adding each variable that is missing, or removing each variable that is included. We choose whichever of those decreases the AIC the most. We then add or remove that variable, and repeat until the AIC cannot be decreased any further.

These methods must be treated with caution, as they can produce spurious results. Always consider from a scientific perspective if the model makes sense. Forward stepwise approaches are particularly discouraged.

Implementing a stepwise approach:


auto.fit <- step(fit.full, direction="both")
summary(auto.fit)
	
This starts with the "full model" (including all possible explanatory variables), and considers adding or removing variables one at a time, until we can no longer decrease the AIC.

To start with the empty model, we must provide a formula indicating the maximum model we want to consider:


fit.empty <- lm(medv ~ 1, data=Boston)
full.formula <- formula(fit.full)
auto.fit2 <- step(fit.empty, scope=full.formula, direction="both")
	
	
Since both these approaches yield the same model, we can have some confidence it is a good model.