3/9/23
Second deliverable for KNN regression due on Canvas tonight at 11:59pm!
No class next Friday
Resampling methods economically use a collected dataset by repeatedly drawing samples from the same training data and refitting the model of interest on each sample
Two methods: cross-validation and the bootstrap
These slides focus on cross-validation, covering the following topics:
Recall the distinction between the training and test datasets
These two datasets result in two types of error:
Training error is often very different from test error
We have been using a validation set approach: randomly divide the available data (e.g. 50/50) into two parts, a training set and a test/validation/hold-out set
The resulting validation-set error provides an estimate of the test error (e.g. RMSE)
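As a minimal sketch of this approach in R (assuming the data live in a data frame `dat` with response `y` and predictor `x`; these names are placeholders, not from the course materials):

```r
set.seed(1)                                  # make the random split reproducible
n <- nrow(dat)
train_idx <- sample(n, size = floor(n / 2))  # randomly pick half of the rows

train <- dat[train_idx, ]                    # training set
valid <- dat[-train_idx, ]                   # validation / hold-out set

fit  <- lm(y ~ x, data = train)              # fit on the training set only
pred <- predict(fit, newdata = valid)        # predict the held-out responses
rmse <- sqrt(mean((valid$y - pred)^2))       # validation-set estimate of test RMSE
```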
Our estimate of the test error will depend on which observations are included in the training and validation sets
Only a subset of the available data are used to fit the model, so the validation-set error tends to overestimate the test error of a model fit on the full dataset
Leave-one-out cross-validation (LOOCV) attempts to address the drawbacks of the validation set approach
We still split all observations into two sets: training and validation
Key difference: instead of splitting just once, we split many times, each time using a single observation as the validation set and the remaining \(n-1\) observations as the training set
Start by choosing the first observation \((x_{1}, y_{1})\) as the validation set, and fit the model on the remaining \(\{(x_{2}, y_{2}), (x_{3}, y_{3}), \ldots, (x_{n}, y_{n})\}\) as the training set
Obtain \(\text{RMSE}_{1} = \sqrt{(y_{1} - \hat{y}_{1})^2}\), an approximately unbiased estimate of the test error
Repeat procedure by selecting the second observation to be validation set, then third, etc.
Will end up with \(n\) errors: \(\text{RMSE}_{1}, \text{RMSE}_{2}, \ldots, \text{RMSE}_{n}\). Then LOOCV estimate for test RMSE is the average:
\[\text{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \text{RMSE}_{i}\]
Suppose I am fitting a simple linear regression model \(Y = \beta_{0} + \beta_{1}X + \epsilon\).
I want to obtain an estimate of the test error using LOOCV
Discuss exactly how you would implement this in code. Specific things to mention:
What “actions”/functions you would use, and in what order
What values you would compute
What values you would store
Live code
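One possible implementation, as a sketch in R (again assuming a hypothetical data frame `dat` with columns `x` and `y`): loop over the observations, hold each one out in turn, refit the model, and store each \(\text{RMSE}_{i}\).

```r
n    <- nrow(dat)
rmse <- numeric(n)                       # storage for RMSE_1, ..., RMSE_n

for (i in seq_len(n)) {
  fit     <- lm(y ~ x, data = dat[-i, ])                    # fit on all but obs. i
  yhat    <- predict(fit, newdata = dat[i, , drop = FALSE]) # predict obs. i
  rmse[i] <- sqrt((dat$y[i] - yhat)^2)                      # RMSE_i = |y_i - yhat_i|
}

cv_n <- mean(rmse)                       # LOOCV estimate CV_(n)
```

Note there is no `set.seed()` here: LOOCV involves no random splitting, so it always returns the same estimate.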
Pros: each model is fit on \(n-1\) observations, so the estimate has far less bias than the validation set approach, and there is no randomness in the splits, so LOOCV always yields the same result
Cons:
LOOCV can be expensive to implement – must fit the model \(n\) times
Estimates for each validation set \(i\) are highly correlated, so the average can have high variance
In k-fold CV, the observations are randomly divided into \(k\) partitions (or folds) of approximately equal size.
For each \(j\) in \(1, 2, \ldots, k\): hold out fold \(j\) as the validation set, fit the model on the remaining \(k-1\) folds, and compute \(\text{RMSE}_{j}\) on the held-out fold
The \(k\)-fold CV estimate of the test error is the average:
\[\text{CV}_{(k)} = \frac{1}{k} \sum_{j=1}^{k} \text{RMSE}_{j}, \qquad \text{RMSE}_{j} = \sqrt{\frac{1}{n_{j}}\sum_{i \in \mathcal{C}_{j}} (y_{i} - \hat{y}^{(j)}_{i})^2},\]
where \(\mathcal{C}_{j}\) is the set of observations in the \(j\)-th fold, \(n_{j}\) is the number of observations in \(\mathcal{C}_{j}\) (so \(i\) indexes the observations in \(\mathcal{C}_{j}\)), and \(\hat{y}^{(j)}_{i}\) is the prediction for the \(i\)-th observation, obtained from the model fit with fold \(j\) removed
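As a minimal sketch of this procedure in R (the helper name `kfold_rmse` and its arguments are my own, not from the course materials):

```r
kfold_rmse <- function(dat, formula, k) {
  n        <- nrow(dat)
  response <- all.vars(formula)[1]              # name of the response variable
  fold     <- sample(rep(1:k, length.out = n))  # random fold labels, sizes ~ n/k
  rmse     <- numeric(k)
  for (j in 1:k) {
    test    <- fold == j
    fit     <- lm(formula, data = dat[!test, ])     # fit with fold j held out
    yhat    <- predict(fit, newdata = dat[test, ])  # predict fold j
    rmse[j] <- sqrt(mean((dat[[response]][test] - yhat)^2))  # RMSE_j
  }
  mean(rmse)                                    # CV_(k): average over the k folds
}
```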
I fit the linear model
\[\text{abundance} = \beta_{0} + \beta_{1}\,\text{WatrCont} + \beta_{2}\,\text{SubsDens} + \epsilon\]
and obtain estimates of the test RMSE using \(k\)-fold CV for varying \(k\):
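The course code for this example isn't shown here, but a sketch using the `kfold_rmse` helper above might look like the following, assuming the mite data (e.g. the oribatid mite data shipped with R's vegan package) have been assembled into a data frame `mite_dat` with columns `abundance`, `WatrCont`, and `SubsDens` (the data frame name is a placeholder):

```r
set.seed(42)                            # fold assignment is random, so fix the seed

ks <- c(2, 5, 10, 20, nrow(mite_dat))   # the last value, k = n, corresponds to LOOCV
cv_rmse <- sapply(ks, function(k) {
  kfold_rmse(mite_dat, abundance ~ WatrCont + SubsDens, k)
})

data.frame(k = ks, test_rmse_estimate = cv_rmse)
```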
LOOCV is a special case of \(k\)-fold CV with \(k = n\).
The \(k\)-fold CV estimate is still biased upward, since each model is fit on fewer than \(n\) observations; the bias is minimized when \(k = n\)
LOOCV and \(k\)-fold CV are widely used because of their generality: they can be applied to nearly any statistical learning method and error metric