3/9/23
Second deliverable for KNN regression due on Canvas tonight at 11:59pm!
No class next Friday
Resampling methods economically use a collected dataset by repeatedly drawing samples from the same training data and refitting the model of interest on each sample
Two methods: cross-validation and the bootstrap
These slides focus on cross-validation, covering the following topics:
Recall the distinction between the training and test datasets
These two datasets result in two types of error:
Training error is often very different from test error
We have been using a validation set approach: randomly divide the available data (e.g. 50/50) into two parts, a training set and a test/validation/hold-out set
The resulting validation-set error provides an estimate of the test error (e.g. RMSE)
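As a minimal sketch of this approach in R (assuming the data live in a data frame `dat` with response `y` and predictor `x`; these names are placeholders, not from the course materials):

```r
set.seed(1)                                  # make the random split reproducible
n <- nrow(dat)
train_idx <- sample(n, size = floor(n / 2))  # randomly pick half of the rows

train <- dat[train_idx, ]                    # training set
valid <- dat[-train_idx, ]                   # validation / hold-out set

fit  <- lm(y ~ x, data = train)              # fit on the training set only
pred <- predict(fit, newdata = valid)        # predict the held-out responses
rmse <- sqrt(mean((valid$y - pred)^2))       # validation-set estimate of test RMSE
```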
Our estimate of the test error will depend on which observations are included in the training and validation sets
Only a subset of the available data are used to fit the model, so the validation-set error tends to overestimate the test error of a model fit on the full dataset
Leave-one-out cross-validation (LOOCV) attempts to address the drawbacks of the validation set approach
We still split all observations into two sets: training and validation
Key difference: instead of splitting just once, we split many times, each time using a single observation as the validation set and the remaining \(n-1\) observations as the training set
Start by choosing the first observation \((x_{1}, y_{1})\) as the validation set, and fit the model on the remaining \(\{(x_{2}, y_{2}), (x_{3}, y_{3}), \ldots, (x_{n}, y_{n})\}\) as the training set
Obtain \(\text{RMSE}_{1} = \sqrt{(y_{1} - \hat{y}_{1})^2}\), an approximately unbiased estimate of the test error
Repeat procedure by selecting the second observation to be validation set, then third, etc.
Will end up with \(n\) errors: \(\text{RMSE}_{1}, \text{RMSE}_{2}, \ldots, \text{RMSE}_{n}\). Then LOOCV estimate for test RMSE is the average:
\[\text{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \text{RMSE}_{i}\]
Suppose I am fitting a simple linear regression model \(Y = \beta_{0} + \beta_{1}X + \epsilon\).
I want to obtain an estimate of the test error using LOOCV
Discuss exactly how you would implement this in code. Specific things to mention:
What “actions”/functions you would use, and in what order
What values you would compute
What values you would store
Live code
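One possible implementation, as a sketch in R (again assuming a hypothetical data frame `dat` with columns `x` and `y`): loop over the observations, hold each one out in turn, refit the model, and store each \(\text{RMSE}_{i}\).

```r
n    <- nrow(dat)
rmse <- numeric(n)                       # storage for RMSE_1, ..., RMSE_n

for (i in seq_len(n)) {
  fit     <- lm(y ~ x, data = dat[-i, ])                    # fit on all but obs. i
  yhat    <- predict(fit, newdata = dat[i, , drop = FALSE]) # predict obs. i
  rmse[i] <- sqrt((dat$y[i] - yhat)^2)                      # RMSE_i = |y_i - yhat_i|
}

cv_n <- mean(rmse)                       # LOOCV estimate CV_(n)
```

Note there is no `set.seed()` here: LOOCV involves no random splitting, so it always returns the same estimate.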
Pros: each model is fit on \(n-1\) observations, so the estimate has far less bias than the validation set approach, and there is no randomness in the splits, so LOOCV always yields the same result
Cons:
LOOCV can be expensive to implement – must fit the model \(n\) times
Estimates for each validation set \(i\) are highly correlated, so the average can have high variance
In k-fold CV, the observations are randomly divided into \(k\) partitions (or folds) of approximately equal size.
For each \(j\) in \(1, 2, \ldots, k\): hold out fold \(j\) as the validation set, fit the model on the remaining \(k-1\) folds, and compute \(\text{RMSE}_{j}\) on the held-out fold
The \(k\)-fold CV estimate of the test error is the average:
\[\text{CV}_{(k)} = \frac{1}{k} \sum_{j=1}^{k} \text{RMSE}_{j}, \qquad \text{RMSE}_{j} = \sqrt{\frac{1}{n_{j}}\sum_{i \in \mathcal{C}_{j}} (y_{i} - \hat{y}^{(j)}_{i})^2},\]
where \(\mathcal{C}_{j}\) is the set of observations in the \(j\)-th fold, \(n_{j}\) is the number of observations in \(\mathcal{C}_{j}\) (so \(i\) indexes the observations in \(\mathcal{C}_{j}\)), and \(\hat{y}^{(j)}_{i}\) is the prediction for the \(i\)-th observation, obtained from the model fit with fold \(j\) removed
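As a minimal sketch of this procedure in R (the helper name `kfold_rmse` and its arguments are my own, not from the course materials):

```r
kfold_rmse <- function(dat, formula, k) {
  n        <- nrow(dat)
  response <- all.vars(formula)[1]              # name of the response variable
  fold     <- sample(rep(1:k, length.out = n))  # random fold labels, sizes ~ n/k
  rmse     <- numeric(k)
  for (j in 1:k) {
    test    <- fold == j
    fit     <- lm(formula, data = dat[!test, ])     # fit with fold j held out
    yhat    <- predict(fit, newdata = dat[test, ])  # predict fold j
    rmse[j] <- sqrt(mean((dat[[response]][test] - yhat)^2))  # RMSE_j
  }
  mean(rmse)                                    # CV_(k): average over the k folds
}
```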
I fit the linear model
\[\text{abundance} = \beta_{0} + \beta_{1}\,\text{WatrCont} + \beta_{2}\,\text{SubsDens} + \epsilon\]
and obtain estimates of the test RMSE using \(k\)-fold CV for varying \(k\):
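The course code for this example isn't shown here, but a sketch using the `kfold_rmse` helper above might look like the following, assuming the mite data (e.g. the oribatid mite data shipped with R's vegan package) have been assembled into a data frame `mite_dat` with columns `abundance`, `WatrCont`, and `SubsDens` (the data frame name is a placeholder):

```r
set.seed(42)                            # fold assignment is random, so fix the seed

ks <- c(2, 5, 10, 20, nrow(mite_dat))   # the last value, k = n, corresponds to LOOCV
cv_rmse <- sapply(ks, function(k) {
  kfold_rmse(mite_dat, abundance ~ WatrCont + SubsDens, k)
})

data.frame(k = ks, test_rmse_estimate = cv_rmse)
```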
LOOCV is a special case of \(k\)-fold CV with \(k = n\).
The \(k\)-fold CV estimate is still biased upward, since each model is fit on fewer than \(n\) observations; the bias is minimized when \(k = n\)
LOOCV and \(k\)-fold CV are widely used because of their generality: they can be applied to nearly any statistical learning method and error metric