2/16/23
2/21/23: Reminder that Lab 01 is due to Canvas this Thursday at 11:59pm
Image on title slide: https://www.chaosofdelight.org/all-about-mites-oribatida
Set of tools used to understand data
Use data and build appropriate functions (models) to try and perform inference and make predictions
Data-centered approach
Categories of statistical learning problems
Notation: let \(i = 1,\ldots, n\) index the observations
For each observation \(i\), we have a response \(y_{i}\) and a set of predictors \(x_{i}\)
Regression: the \(y_{i}\) are quantitative (e.g. height, price)
Classification: the \(y_{i}\) are categorical (e.g. education level, diagnosis)
Goal: relate response \(y_{i}\) to the various predictors
Substrate density (quantitative)
Water content (quantitative)
Microtopography (binary categorical)
Shrub density (ordinal categorical, three levels)
Substrate type (nominal categorical, seven levels)
[1] "Brachy" "PHTH" "HPAV" "RARD" "SSTR" "Protopl"
[7] "MEGR" "MPRO" "TVIE" "HMIN" "HMIN2" "NPRA"
[13] "TVEL" "ONOV" "SUCT" "LCIL" "Oribatl1" "Ceratoz1"
[19] "PWIL" "Galumna1" "Stgncrs2" "HRUF" "Trhypch1" "PPEL"
[25] "NCOR" "SLAT" "FSET" "Lepidzts" "Eupelops" "Miniglmn"
[31] "LRUG" "PLAG2" "Ceratoz3" "Oppiminu" "Trimalc2"
# Focus on just the LRUG mite abundances
mite_dat <- mite.env %>%
add_column(abundance = mite$LRUG)
head(mite_dat)
SubsDens WatrCont Substrate Shrub Topo abundance
1 39.18 350.15 Sphagn1 Few Hummock 0
2 54.99 434.81 Litter Few Hummock 0
3 46.07 371.72 Interface Few Hummock 0
4 48.19 360.50 Sphagn1 Few Hummock 0
5 23.55 204.13 Sphagn1 Few Hummock 0
6 57.32 311.55 Sphagn1 Few Hummock 0
Goal: predict LRUG abundance using these variables
Maybe LRUG \(\approx f(\)SubsDens + WatrCont\()\)?
If so, how would we represent these variables using our notation? i.e., what are \(y_{i}\) and \(x_{i}\)?
Then our model can be written as \(y_{i} = f(x_{i}) + \epsilon_{i}\) where \(\epsilon_{i}\) represents random measurement error
What does this equation mean?
Model (dropping the indices): \(Y = f(X) + \epsilon\)
The function \(f(X)\) represents the systematic information that \(X\) tells us about \(Y\).
If \(f\) is “good”, then we can make reliable predictions of \(Y\) at new points \(X = x\)
If \(f\) is “good”, then we can identify which components of \(X\) are important for explaining \(Y\)
We have a set of inputs or predictors \(x_{i}\), and we want to predict a corresponding \(y_{i}\). Assume the true model is \(y_{i} = f(x_{i}) + \epsilon_{i}\), but we don't know \(f\)
Assuming the error \(\epsilon_{i}\) is 0 on average, we can obtain predictions of \(y_{i}\) as \[\hat{y}_{i} = \hat{f}(x_{i})\]
Generally, \(y_{i} \neq \hat{y}_{i}\). Why?
Model: \(y_{i} = f(x_{i}) + \epsilon_{i}\)
Irreducible error: \(\epsilon_{i}\)
Reducible error: how far \(\hat{f}\) is from the true \(f\)
Given \(\hat{f}\) and \(x_{i}\), we can obtain a prediction \(\hat{y}_{i} = \hat{f}(x_{i})\) for \(y_{i}\)
Mean-squared prediction error: \[\begin{align*} \mathsf{E}[(y_{i} - \hat{y}_{i})^2] &= \mathsf{E}[( f(x_{i}) + \epsilon_{i} - \hat{f}(x_{i}))^2] \\ &= \underbrace{[f(x_{i}) - \hat{f}(x_{i})]^2}_\text{reducible} + \underbrace{\text{Var}(\epsilon_{i})}_\text{irreducible} \end{align*}\]
We cannot do much to decrease the irreducible error
But we can potentially minimize the reducible error by choosing better \(\hat{f}\)!
Prediction: estimate \(\hat{f}\) in order to predict \(Y\) at new values of \(X\); here we mainly care how close \(\hat{Y} = \hat{f}(X)\) is to \(Y\)
Inference: estimate \(\hat{f}\) in order to understand the relationship between \(X\) and \(Y\), i.e. how \(Y\) changes as the predictors change
Some problems will call for prediction, inference, or both
To what extent is LRUG abundance associated with microtopography?
Given a specific land profile, how many LRUG mites would we expect there to be?
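As an illustration only (not necessarily the model we will ultimately use), one candidate \(\hat{f}\) is a linear model fit to mite_dat: summary() speaks to the inference question, predict() to the prediction question. Reusing the first site's predictor values as the "new" land profile is just a convenience here.
# One candidate f-hat: a linear model in all five environmental predictors
fit <- lm(abundance ~ SubsDens + WatrCont + Substrate + Shrub + Topo,
          data = mite_dat)

# Inference: the Topo coefficient summarizes the association between
# microtopography and LRUG abundance, holding the other predictors fixed
summary(fit)

# Prediction: expected LRUG abundance at a given land profile
# (here we simply reuse the predictor values from the first site)
new_profile <- mite_dat[1, c("SubsDens", "WatrCont", "Substrate", "Shrub", "Topo")]
predict(fit, newdata = new_profile)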
No single method or choice of \(\hat{f}\) is superior over all possible data sets
Prediction accuracy vs. interpretability
More restrictive models may be easier to interpret (better for inference)
Good fit vs. over-fit (or under-fit)
A simpler model is often preferred over a very complex one
How can we know how well a chosen \(\hat{f}\) is performing?
In the regression setting, we often use mean squared error (MSE) or root MSE (RMSE)
\(\text{MSE}=\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{f}(x_{i}))^2\)
\(\text{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{f}(x_{i}))^2}\)
MSE (and RMSE) will be small if predictions \(\hat{y}_{i} = \hat{f}(x_{i})\) are very close to true \(y_{i}\)
Question: why might we prefer reporting RMSE over MSE?
In practice, we split our data into training and test sets
We are often most interested in accuracy of our predictions when applying the method to previously unseen data. Why?
We can compute the MSE for the training and test data respectively…but we typically focus more attention on the test MSE
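A minimal sketch of such a split for mite_dat; the 70/30 proportion, the seed, and the simple linear \(\hat{f}\) below are arbitrary choices for illustration.
set.seed(1)  # so the random split is reproducible

# Randomly assign about 70% of the sites to training, the rest to test
n <- nrow(mite_dat)
train_idx <- sample(n, size = floor(0.7 * n))
train_dat <- mite_dat[train_idx, ]
test_dat  <- mite_dat[-train_idx, ]

# Fit f-hat using the training data only
fit <- lm(abundance ~ SubsDens + WatrCont, data = train_dat)

# Training vs. test RMSE
rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))
rmse(train_dat$abundance, predict(fit, newdata = train_dat))  # training RMSE
rmse(test_dat$abundance,  predict(fit, newdata = test_dat))   # test RMSE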
I generated some fake data and fit three models that differ in flexibility. In this example, the generated data (points) follow a curved shape.
In this example, the generated data (points) look more linear.
As model flexibility increases, the training MSE will decrease but test MSE may not.
Flexible models may overfit the data, which leads to low train MSE and high test MSE
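A sketch of the kind of simulation behind those plots; the sine curve, the noise level, and the three polynomial degrees are my own choices, not necessarily the ones used for the figures.
set.seed(42)

# Simulate curved data: y = sin(x) + noise, then split in half for train/test
n <- 200
dat <- data.frame(x = runif(n, 0, 6))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)
train_idx <- sample(n, n / 2)
train_dat <- dat[train_idx, ]
test_dat  <- dat[-train_idx, ]

mse <- function(y, y_hat) mean((y - y_hat)^2)

# Polynomials of increasing degree = increasing flexibility
for (degree in c(1, 3, 10)) {
  fit <- lm(y ~ poly(x, degree), data = train_dat)
  cat("degree", degree,
      "- train MSE:", round(mse(train_dat$y, predict(fit, newdata = train_dat)), 3),
      "- test MSE:",  round(mse(test_dat$y,  predict(fit, newdata = test_dat)), 3), "\n")
}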
Let us consider a test observation \((x_{0}, y_{0})\).
The expected test MSE for given \(x_{0}\) can be decomposed as follows:
\(\mathsf{E}[(y_{0} - \hat{f}(x_{0}))^2] = \text{Var}(\hat{f}(x_{0})) + [\text{Bias}(\hat{f}(x_{0}))]^2 + \text{Var}(\epsilon)\)
\(\text{Bias}(\hat{f}(x_{0})) = \mathsf{E}[\hat{f}(x_{0})] - f(x_{0})\)
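For reference, one way to see where this decomposition comes from, treating \(x_{0}\) as fixed, assuming \(\mathsf{E}[\epsilon] = 0\) and \(\epsilon\) independent of \(\hat{f}(x_{0})\), with the expectation taken over both the training data used to build \(\hat{f}\) and the new error \(\epsilon\):
\[\begin{align*} \mathsf{E}[(y_{0} - \hat{f}(x_{0}))^2] &= \mathsf{E}[(f(x_{0}) + \epsilon - \hat{f}(x_{0}))^2] \\ &= \mathsf{E}[(f(x_{0}) - \hat{f}(x_{0}))^2] + \text{Var}(\epsilon) \\ &= \underbrace{\mathsf{E}\big[(\hat{f}(x_{0}) - \mathsf{E}[\hat{f}(x_{0})])^2\big]}_{\text{Var}(\hat{f}(x_{0}))} + \underbrace{\big(\mathsf{E}[\hat{f}(x_{0})] - f(x_{0})\big)^2}_{[\text{Bias}(\hat{f}(x_{0}))]^2} + \text{Var}(\epsilon) \end{align*}\]
The cross term vanishes in the second line because \(\mathsf{E}[\epsilon] = 0\), and the third line adds and subtracts \(\mathsf{E}[\hat{f}(x_{0})]\) inside the square.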