Validation

3/9/23

Housekeeping

  • Second deliverable for KNN regression due on Canvas tonight at 11:59pm!

  • No class next Friday

Resampling

  • Make economical use of a collected dataset by repeatedly drawing samples from the same training data and refitting a model of interest on each sample

    • Obtain additional information about the fitted model
  • Two methods: cross-validation and the bootstrap

  • These slides will focus on the following topics of cross-validation:

    1. Validation set
    2. LOOCV
    3. k-fold CV (another k!)

Training vs Test errors

  • Recall the distinction between the training and test datasets

    • Training data: used to fit model
    • Test data: used to test/evaluate the model
  • These two datasets result in two types of error:

    • Training error: average error resulting from using the model to predict the responses for the training data
    • Test error: average error from using the model to predict the responses on new, “unseen” observations
  • Training error is often very different from test error
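
To make the distinction concrete, here is a minimal sketch in Python (the data are simulated and purely hypothetical; the course may use a different language or dataset) showing a very flexible model whose training error is far smaller than its test error:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

def simulate(n):
    """Hypothetical data: one predictor, a smooth signal plus noise."""
    x = rng.uniform(0, 10, size=(n, 1))
    y = np.sin(x[:, 0]) + rng.normal(scale=0.5, size=n)
    return x, y

X_train, y_train = simulate(100)   # data used to fit the model
X_new, y_new = simulate(10_000)    # fresh, "unseen" observations

# A very flexible model: 1-nearest-neighbor regression
model = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Training error is computed on the data used to fit the model (here it is
# essentially zero, since each training point predicts itself); test error
# is computed on the new observations and is much larger
print(f"Training RMSE: {rmse(y_train, model.predict(X_train)):.3f}")
print(f"Test RMSE:     {rmse(y_new, model.predict(X_new)):.3f}")
```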

Validation set

Validation set approach

  • We have been using a validation set approach: randomly divide the available data (e.g. 50/50) into two parts, a training set and a test/validation/hold-out set

    • Model is fit on training set
    • Fitted model predicts responses for the observations in the validation set
  • The resulting validation-set error provides an estimate of the test error (e.g. RMSE)
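
A minimal sketch of the validation set approach in Python, assuming simulated data and scikit-learn (all names below are illustrative, not the course's own code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: 200 observations, one predictor, linear signal plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=2.0, size=200)

# Randomly divide the data 50/50 into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

# Fit the model on the training set only
model = LinearRegression().fit(X_train, y_train)

# Predict the responses for the held-out validation set
y_pred = model.predict(X_val)

# Validation-set RMSE: our estimate of the test error
rmse = np.sqrt(np.mean((y_val - y_pred) ** 2))
print(f"Validation-set RMSE: {rmse:.3f}")
```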

Validation set approach: drawbacks

  • Our estimate of the test error will depend on which observations end up in the training and validation sets

    • Validation estimate of the test error can be highly variable (see the sketch after this list)
  • Only a subset of the available data are used to fit the model

    • i.e. fitting the model on fewer observations tends to degrade its performance, which can lead to overestimating the test error
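
The variability drawback can be seen by repeating the random split several times on the same (hypothetical, simulated) data; each split gives a noticeably different estimate of the test RMSE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=2.0, size=100)

# Repeat the validation-set approach with different random 50/50 splits
for seed in range(5):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=seed)
    fit = LinearRegression().fit(X_tr, y_tr)
    rmse = np.sqrt(np.mean((y_val - fit.predict(X_val)) ** 2))
    print(f"split {seed}: validation RMSE = {rmse:.3f}")
```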

Leave-One-Out Cross-Validation

Leave-One-Out Cross-Validation

  • Leave-one-out cross-validation (LOOCV) attempts to address the drawbacks from validation set approach

  • We still split all observations into two sets: training and validation

  • Key difference: instead of splitting just once, we split many times; each time, a single observation is used as the validation set and the remaining \(n-1\) observations form the training set

LOOCV: method

  • Start by choosing the first observation \((x_{1}, y_{1})\) to be the validation set, and fit the model on the remaining \(\{(x_{2}, y_{2}), (x_{3}, y_{3}), \ldots, (x_{n}, y_{n}) \}\) as our training set

  • Obtain \(\text{RMSE}_{1} = \sqrt{(y_{1} -\hat{y}_{1})^2} = |y_{1} - \hat{y}_{1}|\), an approximately unbiased estimate of the test error

  • Repeat procedure by selecting the second observation to be validation set, then third, etc.

  • Will end up with \(n\) errors: \(\text{RMSE}_{1}, \text{RMSE}_{2}, \ldots, \text{RMSE}_{n}\). Then LOOCV estimate for test RMSE is the average:

\[\text{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \text{RMSE}_{i}\]

Discuss

  • Suppose I am fitting a simple linear regression model \(Y = \beta_{0} + \beta_{1}X + \epsilon\).

  • I want to obtain an estimate of the test error using LOOCV

  • Discuss exactly how you would implement this in code. Specific things to mention:

    • What “actions”/functions you would use, and in what order

    • What values you would compute

    • What values you would store

  • Live code
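
One possible implementation sketch in Python (simulated data; the function and variable names are illustrative, not the course's live-coded solution). It loops over the \(n\) observations, leaving each one out in turn, and averages the resulting \(\text{RMSE}_{i}\) values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data from a simple linear regression model
rng = np.random.default_rng(0)
n = 100
X = rng.uniform(0, 10, size=(n, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=2.0, size=n)

rmse_values = np.empty(n)  # store RMSE_i for each held-out observation
for i in range(n):
    # Leave out observation i; fit on the remaining n - 1 observations
    mask = np.arange(n) != i
    fit = LinearRegression().fit(X[mask], y[mask])
    # Predict the single held-out response and record RMSE_i = |y_i - yhat_i|
    y_hat = fit.predict(X[i].reshape(1, -1))[0]
    rmse_values[i] = np.abs(y[i] - y_hat)

# LOOCV estimate of the test RMSE: CV_(n) = average of the n RMSE_i values
cv_n = rmse_values.mean()
print(f"LOOCV estimate of test RMSE: {cv_n:.3f}")
```

Note that the loop refits the model \(n\) times, which foreshadows the computational cost discussed in the pros and cons below.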

LOOCV pros and cons

  • Pros

    • Each training set has \(n-1\) observations \(\rightarrow\) LOOCV tends not to overestimate the test error as much
    • There is no randomness in how the original data is split
  • Cons

    • LOOCV can be expensive to implement – must fit the model \(n\) times

    • The \(n\) fitted models are trained on nearly identical data, so the \(n\) error estimates are highly correlated and their average can have high variance

k-fold Cross-Validation

k-fold Cross-Validation

  • In k-fold CV, the observations are randomly divided into \(k\) partitions (or folds) of approximately equal size.

  • For each \(j\) in \(1, 2, \ldots, k\):

    • Leave out \(j\)-th partition/fold as validation set, and fit model on remaining \(k-1\) partitions (combined)
    • Predict for all observations in the held-out \(j\)-th fold, and obtain a corresponding \(\text{RMSE}_{j}\)
  • The \(k\)-fold CV estimate of the test error is the average:

\[\text{CV}_{(k)} = \frac{1}{k} \sum_{j=1}^{k} \text{RMSE}_{j}\]

k-fold CV (cont.)

  • Letting the \(j\)-th fold have \(n_{j}\) observations:

\[\text{RMSE}_{j} = \sqrt{\frac{1}{n_{j}}\sum_{i \in \mathcal{C}_{j}} (y_{i} - \hat{y}^{(j)}_{i})^2},\]

where \(\mathcal{C}_{j}\) is the set of observations in the \(j\)-th fold, so \(i\) indexes the observations in \(\mathcal{C}_{j}\), and \(\hat{y}^{(j)}_{i}\) is the prediction for the \(i\)-th observation obtained from the model fit with fold \(j\) removed

  • If \(n\) is a multiple of \(k\), then \(n_{j} = \frac{n}{k}\)
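
A sketch of \(k\)-fold CV that follows the formulas above directly, on hypothetical simulated data (scikit-learn's KFold is used only to generate the folds):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.5, size=120)

k = 10
kf = KFold(n_splits=k, shuffle=True, random_state=1)

rmse_folds = []
for train_idx, val_idx in kf.split(X):
    # Fit on the k - 1 training folds combined
    fit = LinearRegression().fit(X[train_idx], y[train_idx])
    # RMSE_j over the observations in the held-out fold C_j
    resid = y[val_idx] - fit.predict(X[val_idx])
    rmse_folds.append(np.sqrt(np.mean(resid ** 2)))

# CV_(k): average of the k per-fold RMSEs
print(f"{k}-fold CV estimate of test RMSE: {np.mean(rmse_folds):.3f}")
```

Setting shuffle=True randomizes which observations land in each fold; taking \(k = n\) recovers LOOCV.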

Visual

  • Important: k is the number of folds/partitions, not the number of observations within each fold!

Example: varying k

I fit the linear model \(\text{abundance} = \beta_{0} + \beta_{1}\,\text{WatrCont} + \beta_{2}\,\text{SubsDens}\) and obtain estimates of the test RMSE using k-fold CV for varying k.
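
A hedged sketch of how such a comparison could be produced in Python; the data frame below is a simulated stand-in that only mimics the column names of the course dataset (WatrCont, SubsDens, abundance), so the numbers it prints are not the course results:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Stand-in data frame with the same column names as the course dataset;
# replace with the real data when reproducing the example
rng = np.random.default_rng(0)
dat = pd.DataFrame({
    "WatrCont": rng.uniform(100, 800, size=70),
    "SubsDens": rng.uniform(20, 80, size=70),
})
dat["abundance"] = 5 + 0.02 * dat["WatrCont"] - 0.1 * dat["SubsDens"] + rng.normal(scale=3, size=70)

X = dat[["WatrCont", "SubsDens"]].to_numpy()
y = dat["abundance"].to_numpy()

def kfold_rmse(X, y, k, seed=1):
    """k-fold CV estimate of test RMSE for the linear model abundance ~ WatrCont + SubsDens."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    rmses = []
    for tr, val in kf.split(X):
        fit = LinearRegression().fit(X[tr], y[tr])
        resid = y[val] - fit.predict(X[val])
        rmses.append(np.sqrt(np.mean(resid ** 2)))
    return np.mean(rmses)

# Compare the CV estimate of test RMSE across several choices of k
for k in [2, 5, 10, 20, len(y)]:
    print(f"k = {k:>2}: CV estimate of test RMSE = {kfold_rmse(X, y, k):.3f}")
```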

Validation: remarks

  • LOOCV is a special case of \(k\)-fold CV.

    • Question: Which value of \(k\) yields LOOCV?
  • \(k\)-fold CV estimate is still biased upward; bias minimized when \(k = n\)

    • \(k = 5\) or \(k=10\) often used as a compromise for bias-variance tradeoff
  • LOOCV and \(k\)-fold CV are useful and commonly used because of their generality