Lab 03: Cross-validation

Ski resorts

Introduction

The data and .Rmd file can be found in your lab-03-ski-resorts GitHub project. Please clone it now! The data is ski_resorts.csv, and come from Kaggle.

We have data on ski resorts on the east coast. Each observation is a ski resort, and we information about the features of each resort (e.g. number of lifts, price of ticket, average elvation). A data dictionary can be found in the README.

ski_resorts <- read.csv("data/ski_resorts.csv")

We will compare two models’ prediction performance for the lift ticket price of these ski resorts using all the other quantitative variables as our predictors. We will compare the models via the estimated test RMSE obtained using k-fold cross-validation. The two models are:

A multiple linear regression model, and
A KNN regression model

This lab will also explore the effect of standardizing quantitative variables.

Now that you’ve implemented KNN regression and understand how it works, you are allowed to use the knnreg() R function provided in the caret library! This function is much faster than our own implementations.

Warning

Some people are having issues with installing the caret package. It may be safer to use your own implementation!

To use knnreg():

Install the caret package in your console
Load in the caret package at the top of your .Rmd file
The knnreg() function works slightly differently than our implementation. It only fits a model to the training data, where you pass in a train_x and a train_y. Then to obtain prediction, you use the predict() function, just as you would for linear regression.
1. By default, the knnreg() chooses a neighbor set of \(K = 5\). If you want a different choice of neighbors, you must explicitly pass that into the argument k.
2. The train_y you pass into knnreg() must be a vector, not a data frame!

# suppose my train data are stored as train_x and train_y
# suppose my test predictors are stores as test_x
knn_mod <- knnreg(x= train_x, y = train_y, k = 7)
preds <- predict(knn_mod, newdata = test_x)

This is a challenging lab assignment because there are a lot of moving parts! Please do not put it off until the last minute!

Define functions

To make our lives easier, we will write a function that standardizes data for you. Create a function called my_scale() that takes in three arguments:

A data frame (or matrix) that needs to be standardized,
A vector of means, where element \(j\) is the mean of the \(j\)-th column of (1), and
A vector of standard deviations where element \(j\) is the standard deviation of the \(j\)-th column of (1)

Your function my_scale() should return a standardized version of the data frame that was input. The R function sd() takes a vector as input and outputs the standard deviation.

You can confirm your my_scale() function is working by seeing if you get the same results as when you use the scale() function provided by R on the following data temp (you can also confirm if your mean_vec and sd_vec are correct by looking at the attributes center and scale in the following output):

temp <- data.frame(x = 1:5) %>%
  mutate(y = sqrt(x))
scale(temp)

              x          y
[1,] -1.2649111 -1.3900560
[2,] -0.6324555 -0.5388977
[3,]  0.0000000  0.1142190
[4,]  0.6324555  0.6648219
[5,]  1.2649111  1.1499128
attr(,"scaled:center")
       x        y 
3.000000 1.676466 
attr(,"scaled:scale")
        x         y 
1.5811388 0.4866469

Note

Please do not include code that tests your my_scale() function in your final submission.

Analysis

We will fit a total of four different models using k-fold cross-validation. For all the models, we will predict the price of the lift tickets using all of the remaining quantitative variables. In order to have a fair comparison of the models, each model should be fit and tested on the same folds/partitions of the original data. Therefore, we will begin by creating a set of indices that tell us which fold each observation belongs to.

I suggest you modify your data such that it only contains the variables of interest for this analysis!

Obtain the indices for each fold

We will perform 5-fold cross-validation.

Randomly split the indices of the observations into 5 folds of equal size. Because you will be randomly splitting, it is important for you to set a seed for reproducibility. Use a seed of 3. Hint: you will most likely need to use a list!

MLR: original scale

Using your folds in the previous step, run 5-fold CV to obtain an estimate of the test RMSE using MLR.

Note: suppose you are running lm() and the data you pass in only contains the response y and all of the predictors of interest. Rather than explicitly typing out the name of each predictor in lm(), you can simply type a . and R will recognize that you want to use all the other variables aside from y in the data frame as predictors:

lm(y ~ ., data)

Report your estimated test RMSE from running MLR with 5-fold CV.

KNN: original scale

Using the same folds, run 5-fold CV to obtain an estimate of the test RMSE using KNN regression, with \(K = 10\) neighbors. You may either use your own implementation of KNN, or you may use the knnreg() + predict() functions provided in R.

State the number of neighbors, and report your estimated test RMSE from running KNN regression with 5-fold CV. How does your estimated test RMSE compare to that obtained from MLR?

KNN regression: standardized data

Now, we will run KNN regression where the predictors are standardized. Using the same folds, run 5-fold CV to obtain an estimate of the test RMSE using KNN regression on the standardized data. You should use your my_scale() function, and the same number of neighbors as in the previous section!

Remember that we should first standardize on the train data, and then use the mean and standard deviations from that standardization to standardize the test data!

State the number of neighbors, and report your estimated test RMSE from running KNN regression with 5-fold CV on the standardized predictors. How does your estimated test RMSE compare to the two previous test RMSEs?

MLR: standardized data

Finally, we will run MLR regression where the predictors are standardized. Using the same folds, run 5-fold CV to obtain an estimate of the test RMSE using MLR on the standardized data. You should use your my_scale() function!

Report your estimated test RMSE from running MLR with 5-fold CV on the standardized data. How does your estimate here compare to that obtained from running MLR on the non-standardized data?

Comprehension questions

Based on your results, if you had to recommend a model for the price of lift tickets, which model would you choose and why?
If you ran this analysis again with a different choice of seed in set.seed(), what would you expect to change and why? What would you expect to stay the same and why?
If you ran this analysis again with a larger number of folds, how would you expect the estimated test RMSEs to change? Why?
I mentioned that the fair way to compare models is to use the same folds/partitions for all models. Briefly explain why that is.
Based on your results for the two linear regression models, what might be one advantage of fitting a linear regression model compared to a KNN regression model?

Submission

When you’re finished, knit + commit + push to GitHub one last time. Then submit your knitted pdf to Canvas!