Linear Regression

2/23/23

Housekeeping

Lab 01 due tonight to Canvas at 11:59pm
Lab 02 assigned tomorrow (and has to do with baseball)

Linear regression

A simple, widely used approach in supervised learning
Assumes that the dependence of \(Y\) on the predictors \(X_{1}, \ldots, X_{p}\) is linear

Possible questions of interest

Is there a relationship between the abundance of LRUG mites and substrate density or water content in the soil where they are found?
- If so, how are strong are these relationships?
Is the relationship linear?
How accurately can we predict the abundance of these mites?

Simple Linear Regression

Simple linear regression

Simple linear regression (SLR)
- Assumes a linear model for a quantitative response \(Y\) using a single predictor \(X\)
  
  \[Y = \beta_{0} + \beta_{1} X + \epsilon,\]
  
  where \(\beta_{0}, \beta_{1}\) are unknown coefficients (parameters) and \(\epsilon\) is the error
\(\beta_{0}\) is commonly referred to as the intercept, and \(\beta_{1}\) is the slope
For example: abundance = \(\beta_{0}\) + \(\beta_{1}\) watercont + \(\epsilon\)

Parameter estimation

Assuming \(n\) observations, we have data of the form \((x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{n}, y_{n})\)
An SLR model says \[\begin{align*} y_{i} &= \beta_{0} + \beta_{1}x_{i}+ \epsilon\\ &\approx \beta_{0} + \beta_{1}x_{i}\ , \qquad \text{ for all } i = 1,\ldots, n \end{align*}\]
- In this model, \(f(x_{i}) = \beta_{0} + \beta_{1} x_{i}\)
- Notice that the relationship between \(x_{i}\) and \(y_{i}\) is the same for all \(i\)
In practice, \(\beta_{0}\) and \(\beta_{1}\) are unknown, so we must estimate them
Goal: obtain (good) estimates \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) that are as close to the true values as possible, such that \(y_{i} \approx \hat{\beta}_{0} + \hat{\beta}_{1} x_{i}\)
- How? Minimize the least squares criterion

Least squares

Let \(\hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1}x_{i}\) be predicted response for \(i\)-th observation with predictor \(x_{i}\)
The \(i\)-th residual \(e_{i}\) is defined as \[e_{i} = y_{i} - \hat{y}_{i}\]
Define residual sum of squares (RSS) as \[\text{RSS} = e_{1}^{2} + e_{2}^{2} + \ldots + e_{n}^{2} = \sum_{i=1}^{n} e_{i}^2\]

Estimation by least squares

\[\text{RSS} = \sum_{i=1}^{n} e_{i}^2 = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^2 = \sum_{i=1}^{n} (y_{i} - (\hat{\beta}_{0} + \hat{\beta}_{1}x_{i}))^2\]

Least squares approach selects the pair \((\hat{\beta}_{0}, \hat{\beta}_{1})\) that minimize the RSS. Can be shown that the minimizing values are: \[\begin{align*} \hat{\beta}_{1} &= \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{n}(x_{i} - \bar{x})^2}\\ \hat{\beta}_{0} &= \bar{y} - \hat{\beta}_{1} \bar{x} \end{align*}\]

where \(\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_{i}\) and \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}\)

Mite data

Least squares fit for abundance regressed on WaterCont, with residuals in orange.

m1 <- lm(abundance ~ WatrCont, data = mite_dat)

Let’s interpret this plot! Do you see anything strange or any patterns?

Interpreting Coefficients and Estimates

\[Y = \beta_{0} + \beta_{1} X + \epsilon\]

\(\beta_{0}\) is the expected value of \(Y\) when \(X = 0\)
\(\beta_{1}\) is the average increase in \(Y\) for one-unit increase in \(X\)
\(\epsilon\) is error
This equation is the population regression line
When using the least squares estimates for the coefficients, \(\hat{Y} = \hat{\beta}_{0} + \hat{\beta}_{1} X\) is the least squares line

Interpreting Coefficients and Estimates (cont.)

summary(m1)


Call:
lm(formula = abundance ~ WatrCont, data = mite_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.525  -8.033  -4.088   4.493  47.937 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.63410    4.51171   0.141   0.8886  
WatrCont     0.02385    0.01039   2.296   0.0248 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.29 on 68 degrees of freedom
Multiple R-squared:  0.07194,   Adjusted R-squared:  0.05829 
F-statistic: 5.271 on 1 and 68 DF,  p-value: 0.02477

\[\widehat{\text{LRUG}} = 0.634 + 0.024 \text{WatrCont}\]

How do I interpret \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) for this specific example?
Note: the estimates \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) will depend on the observed data! If I took a different sample of LRUG mites, I would probably have different estimated values.
- Isn’t that problematic? Can we asses how accurate are our estimates \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\)?

Assessing Accuracy of Coefficient Estimates

Standard error (SE) of an estimator reflects how it varies under repeated sampling.
For simple linear regression: \[\text{SE}(\hat{\beta}_{0}) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_{i} - \bar{x})^2}\right] \qquad \text{SE}(\hat{\beta}_{1}) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_{i} - \bar{x})^2}\]

where \(\sigma^2 = \text{Var}(\epsilon)\)
Typically \(\sigma^2\) is not known, but can be estimated from the data.
Estimate \(\hat{\sigma}\) is residual standard error (RSE), given by: \[\hat{\sigma}= \text{RSE} = \sqrt{\frac{1}{n-2}\text{RSS}}\]
We use this estimate to calculate \(\text{SE}(\hat{\beta}_{0})\) and \(\text{SE}(\hat{\beta}_{1})\)

Hypothesis Testing

Hypothesis testing is a method of statistical inference to determine whether the data at hand sufficiently support a particular hypothesis
- Helps test the results of an experiment or survey to see if you have meaningful results
- Helps draw conclusions about a population parameter
Standard errors can be used to perform hypothesis tests on the coefficients

Hypothesis Testing

Notion of “null” versus “alternate” hypothesis
- Null hypothesis \(H_{0}\): there is no relationship between \(X\) and \(Y\)
- Alternative hypothesis \(H_{A}\): there is some relationship between \(X\) and \(Y\)
Mathematically, corresponds to testing \[H_{0}: \beta_{1} = 0 \quad \text{ vs. } \quad H_{A}: \beta_{1} \neq 0\]

because if \(H_{0}\) true, then the model reduces to \(Y = \beta_{0} + \epsilon\) so there is no relationship
To test this null hypothesis, want to determine if \(\hat{\beta}_{1}\) is sufficiently far from zero
- How much is ‘sufficiently far’? Depends on \(\text{SE}(\hat{\beta}_{1})\).

p-value

With lots of hand-waving: can calculate a p-value, which is a probability that we observed the data we did, given that \(H_{0}\) is true. If the p-value is small, the observed data don’t seem to support \(H_{0}\)
- Compare \(p\)-value to a pre-determined rejection level \(\alpha\) (often 0.05).
- If \(p\)-value \(< \alpha\), reject \(H_{0}\). Otherwise, fail to reject \(H_{0}\).

summary(m1)


Call:
lm(formula = abundance ~ WatrCont, data = mite_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.525  -8.033  -4.088   4.493  47.937 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.63410    4.51171   0.141   0.8886  
WatrCont     0.02385    0.01039   2.296   0.0248 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.29 on 68 degrees of freedom
Multiple R-squared:  0.07194,   Adjusted R-squared:  0.05829 
F-statistic: 5.271 on 1 and 68 DF,  p-value: 0.02477

\(H_{0}: \beta_{1} = 0\) (there is no relationship between LRUG abundance and WatrCont)
- What is the p-value for this hypothesis?
- What is our decision in the mite example?

Why hypothesis testing?

Hypothesis testing can help us determine if there is a relationship between the predictor and the response variable!
- This is the inference part of statistical learning
If there is a relationship, then it makes sense to interpret the strength of the relationship (i.e. interpret the value \(\hat{\beta}_{1}\))

Multiple Linear Regression

Multiple linear regression

In practice, we often have more than one predictor
With \(p\) predictors, the model is \[Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \ldots + \beta_{p}X_{p} +\epsilon\]
Interpret \(\beta_{j}\) as the average effect on \(Y\) for a one-unit increase in \(X_{j}\), holding all other predictors fixed/constant

Mite data

Regressing abundance on both WatrCont and SubsDens:

m2 <- lm(abundance ~ WatrCont + SubsDens, data = mite_dat)
summary(m2)


Call:
lm(formula = abundance ~ WatrCont + SubsDens, data = mite_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.192  -8.633  -1.385   6.866  44.245 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 10.30549    5.48833   1.878  0.06477 . 
WatrCont     0.03444    0.01057   3.257  0.00177 **
SubsDens    -0.35682    0.12604  -2.831  0.00612 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.7 on 67 degrees of freedom
Multiple R-squared:  0.1711,    Adjusted R-squared:  0.1464 
F-statistic: 6.915 on 2 and 67 DF,  p-value: 0.001861

\[\widehat{abundance} = 10.306 + 0.034 \text{WatrCont} -0.357 \text{SubsDens}\]

How do we interpret the estimated coefficients?
How do we interpret the p-values?
- The p-value for \(\beta_{j}\) corresponds to the test of \(H_{0}: \beta_{j} = 0\) and \(\beta_{k} (k \neq j)\) unrestricted

Model assessment

Model fit

How well does our linear regression model “fit” the data it was trained on? How accurate is our model?
Residual standard error (RSE) \[\text{RSE} = \sqrt{\frac{1}{n-p-1} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^2},\]

where \(p\) is the number of predictors, and \(i\) indexes the observations used to fit the model
RSE is considered a measure of the lack of fit of the model
- Measured in the units of \(Y\)

Predictions

How well does our linear regression model predict new responses for a given set of covariates?
For example, suppose I want to use our model to predict the abundance of LRUG mites at a new sampling location where the WatrCont is 400 g/L and the SubsDens is 30 g/L
I will plug these values into our fitted model m2:

\[\widehat{abundance} = 10.306 + 0.034 \times 400 -0.357 \times 30 = 13.3747829\]

new_data <- data.frame(WatrCont = 400, SubsDens = 30)
pred_mite <- predict(m2, newdata = new_data)

Okay…great! But should I trust these predictions?

Prediction performance

We could get a better sense of a model’s prediction performance by comparing the predicted responses to the true values
We should always compare prediction performance for “previously unseen” data (i.e. test data)
- The model already “knows” the data used to fit it (i.e. the training data)
Discuss: what are some important criteria for the testing data?

Qualitative predictors

Thus far, we have assumed that all predictors in our linear model are quantitative. In practice, we often have categorical predictors
Our mite data has the following categorical variables: Shrub, Substrate, and Topo
Let’s begin with the simplest case: a categorical predictor with two categories/levels
- e.g. the Topo variable takes on the values “Blanket” or “Hummock” only

2-level qualitative predictor

We will create an indicator or dummy variable as follows: \[\text{TopoBlanket}_{i} = \begin{cases} 1 & \text{ if } \color{blue}{\text{Topo}_{i}} = \text{Blanket} \\ 0 & \text{ if } \color{blue}{\text{Topo}_{i}} = \text{Hummock} \end{cases}\]
Simple linear regression model for LRUG regressed on Topo: \[\text{LRUG}_{i} = \beta_{0} + \beta_{1}\text{TopoBlanket}_{i} + \epsilon_{i} = \begin{cases} \beta_{0} + \epsilon_{i} & \text{ if } \color{blue}{\text{Topo}_{i}} = \text{Blanket} \\ \beta_{0} + \beta_{1} + \epsilon_{i} & \text{ if } \color{blue}{\text{Topo}_{i}} = \text{Hummock} \end{cases}\]
How to interpret?

Mite data

m3 <- lm(abundance ~ Topo, data = mite_dat)
summary(m3)


Call:
lm(formula = abundance ~ Topo, data = mite_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.318  -4.318  -2.154   4.473  41.682 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   15.318      1.658   9.238 1.26e-13 ***
TopoHummock  -13.164      2.721  -4.838 7.85e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11 on 68 degrees of freedom
Multiple R-squared:  0.2561,    Adjusted R-squared:  0.2452 
F-statistic: 23.41 on 1 and 68 DF,  p-value: 7.851e-06

Fitted model is: \[\widehat{\text{LRUG}} = 15.318 - 13.164 \text{TopoHummock}\]
Interpret!

More than two levels

With more than two levels, we simply create additional dummy variables: \[\begin{align*}\text{Shrub}_{i,Few} &= \begin{cases} 1 & \text{ if } \color{blue}{\text{Shrub}_{i}} = \text{Few} \\ 0 & \text{ if } \color{blue}{\text{Shrub}_{i}} = \text{not Few} \end{cases} \\ \text{Shrub}_{i, Many} &= \begin{cases} 1 & \text{ if } \color{blue}{\text{Shrub}_{i}} = \text{Many} \\ 0 & \text{ if } \color{blue}{\text{Shrub}_{i}} = \text{not Many} \end{cases} \end{align*}\]
Resulting regression model for LRUG with only Shrub as predictor: \[\begin{align*} \text{LRUG}_{i} &= \beta_{0} + \beta_{1} \text{Shrub}_{i, Few} + \beta_{2} \text{Shrub}_{i, Many} + \epsilon_{i}\\ &\approx \begin{cases} \beta_{0} + \beta_{1} & \text{ if } \color{blue}{\text{Shrub}_{i}} = \text{Few} \\ \beta_{0} + \beta_{2} & \text{ if } \color{blue}{\text{Shrub}_{i}} = \text{Many} \\ \beta_{0} & \text{ if } \color{blue}{\text{Shrub}_{i}} = \text{None} \end{cases}\end{align*}\]

m4 <- lm(abundance ~ Shrub, data = mite_dat)
summary(m4)


Call:
lm(formula = abundance ~ Shrub, data = mite_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.895  -8.060  -2.760   6.635  40.105 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   10.923      1.405   7.775 6.12e-11 ***
Shrub.L       -9.288      2.505  -3.707 0.000427 ***
Shrub.Q       -1.460      2.359  -0.619 0.538164    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.64 on 67 degrees of freedom
Multiple R-squared:  0.1791,    Adjusted R-squared:  0.1545 
F-statistic: 7.306 on 2 and 67 DF,  p-value: 0.001348

Remarks

For a given categorical variable, there will always be one fewer dummy variables than levels
- Level with no dummy variable is known as baseline. For model m4 on previous slide, the “None” category was the baseline level
Can have multiple categorical variables in a single model:

m5 <- lm(abundance ~ Topo + Shrub, data = mite_dat)
summary(m5)


Call:
lm(formula = abundance ~ Topo + Shrub, data = mite_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.895  -6.056   0.691   4.756  40.105 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   14.755      1.641   8.993 4.48e-13 ***
TopoHummock  -11.253      2.997  -3.755 0.000369 ***
Shrub.L       -4.832      2.580  -1.873 0.065552 .  
Shrub.Q       -3.128      2.203  -1.420 0.160385    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.65 on 66 degrees of freedom
Multiple R-squared:  0.3236,    Adjusted R-squared:  0.2928 
F-statistic: 10.52 on 3 and 66 DF,  p-value: 9.593e-06

Let’s write out the estimated regression model together
- What does \(\beta_{0}\) represent?
- How do we interpret the coefficients?

Extensions of linear model (optional)

Linear model is restrictive

Linear model is widely used and works quite well, but has several highly restrictive assumptions
- Relationship between \(X\) and \(Y\) is additive
- Relationship between \(X\) and \(Y\) is linear
There are common approaches to loosen these assumptions. We will only discuss the first restriction here. Take a regression class for more!

Interactions

Additive assumption: the association between a predictor \(X_{j}\) and the response \(Y\) does not depend on the value of any other predictors

\[Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \epsilon\]

\[Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3}\color{orange}{X_{1}X_{2}} + \epsilon\]

This third predictor \(\color{orange}{X_{1}X_{2}}\) is known as an interaction term
The total effect of \(X_{1}\) on \(Y\) also depends on the value of \(X_{2}\) through the interaction
In the above equation, \(\beta_{1}\) and \(\beta_{2}\) are called the “main effects” of \(X_{1}\) and \(X_{2}\) respectively

Mite data

Suppose I want to regress LRUG abundance on SubsDens and WatrCont and their interaction:

mod_int <- lm(abundance ~ SubsDens + WatrCont + 
                SubsDens * WatrCont,
              data = mite_dat)
summary(mod_int)


Call:
lm(formula = abundance ~ SubsDens + WatrCont + SubsDens * WatrCont, 
    data = mite_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.012  -7.155  -1.896   5.011  44.539 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)  
(Intercept)       -1.012e+01  1.369e+01  -0.740   0.4621  
SubsDens           1.323e-01  3.256e-01   0.406   0.6859  
WatrCont           8.667e-02  3.379e-02   2.565   0.0126 *
SubsDens:WatrCont -1.210e-03  7.443e-04  -1.625   0.1088  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.56 on 66 degrees of freedom
Multiple R-squared:  0.203, Adjusted R-squared:  0.1668 
F-statistic: 5.604 on 3 and 66 DF,  p-value: 0.001744

Mite data

Fitted model: \(\widehat{LRUG} = -10.12 + 0.132 \text{SubsDens} + 0.087 \text{WatrCont} -0.001 \text{SubsDens} \times \text{WatrCont}\)
Interpretations?

Estimates suggest that a 1 g/L increase in the \(\color{blue}{\text{WaterCont}}\) of the soil is associated with an increased abundance of (0.087 + -0.001 \(\times \color{blue}{\text{SubsDens}}\)) LRUG mites

Interactions with categorical variable

Can also have interactions involving categorical variables!
In particular, the interaction between a quantitative and a categorical variable has nice interpretation
Consider the effects of SubsDens and Topo and their interaction on the abundances: \[\begin{align*} \text{LRUG}_{i} &\approx \beta_{0} + \beta_{1} \text{SubsDens}_{i} + \beta_{2}\text{TopoBlanket}_{i} + \beta_{3} \text{SubsDens}_{i} \times \text{TopoBlanket}_{i} \\ & = \begin{cases} \beta_{0} + \beta_{1}\text{SubsDens}_{i} & \text{ if } \color{blue}{\text{Topo}_{i}} = \text{Blanket} \\ (\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{3}) \text{SubsDens}_{i} & \text{ if } \color{blue}{\text{Topo}_{i}} = \text{Hummock} \end{cases} \end{align*}\]


Call:
lm(formula = abundance ~ SubsDens + Topo + SubsDens * Topo, data = mite_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.059  -4.764  -1.186   3.229  39.191 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           34.7881     5.3697   6.479 1.35e-08 ***
SubsDens              -0.4695     0.1242  -3.781 0.000338 ***
TopoHummock          -29.4584     9.0331  -3.261 0.001757 ** 
SubsDens:TopoHummock   0.3803     0.2323   1.637 0.106406    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.11 on 66 degrees of freedom
Multiple R-squared:  0.3901,    Adjusted R-squared:  0.3624 
F-statistic: 14.07 on 3 and 66 DF,  p-value: 3.425e-07

Linear Regression

Housekeeping

Linear regression