Lab 03: Cross-validation
Introduction
The data and .Rmd file can be found in your lab-03-ski-resorts GitHub project. Please clone it now! The data are in ski_resorts.csv and come from Kaggle.
We have data on ski resorts on the east coast. Each observation is a ski resort, and we have information about the features of each resort (e.g. number of lifts, price of ticket, average elevation). A data dictionary can be found in the README.
ski_resorts <- read.csv("data/ski_resorts.csv")
We will compare two models’ prediction performance for the lift ticket price of these ski resorts using all the other quantitative variables as our predictors. We will compare the models via the estimated test RMSE obtained using k-fold cross-validation. The two models are:
- A multiple linear regression model, and
- A KNN regression model
This lab will also explore the effect of standardizing quantitative variables.
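As a point of reference, the lab does not spell out the formula for the estimated test RMSE. One common convention, assumed here, computes the RMSE on each held-out fold and then averages over the \(k\) folds:

\[
\text{RMSE}_j = \sqrt{\frac{1}{n_j} \sum_{i \in \text{fold } j} \left(y_i - \hat{y}_i\right)^2},
\qquad
\widehat{\text{RMSE}}_{\text{CV}} = \frac{1}{k} \sum_{j=1}^{k} \text{RMSE}_j
\]

Here \(n_j\) is the number of observations in fold \(j\), and \(\hat{y}_i\) is the prediction for observation \(i\) from the model fit on the other \(k - 1\) folds.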
Now that you’ve implemented KNN regression and understand how it works, you are allowed to use the knnreg() R function provided in the caret library! This function is much faster than our own implementations.
To use knnreg():
- Install the caret package in your console
- Load in the caret package at the top of your .Rmd file
- The knnreg() function works slightly differently than our implementation. It only fits a model to the training data, where you pass in a train_x and a train_y. Then, to obtain predictions, you use the predict() function, just as you would for linear regression.
  - By default, knnreg() uses \(K = 5\) neighbors. If you want a different number of neighbors, you must explicitly pass it to the argument k.
  - The train_y you pass into knnreg() must be a vector, not a data frame!
# suppose my train data are stored as train_x and train_y
# suppose my test predictors are stored as test_x
knn_mod <- knnreg(x = train_x, y = train_y, k = 7)
preds <- predict(knn_mod, newdata = test_x)
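Since the response must be a vector, it can help to see how a column is pulled out of a data frame in each form. The following is a small sketch on a made-up data frame (the column names price and lifts are placeholders, not necessarily those in ski_resorts.csv):

```r
# Toy data frame standing in for a training set (values are made up)
train_df <- data.frame(price = c(50, 65, 80), lifts = c(4, 6, 9))

y_df  <- train_df["price"]   # single brackets keep a one-column data frame
y_vec <- train_df$price      # $ (or [["price"]]) extracts a plain numeric vector

is.data.frame(y_df)   # TRUE
is.vector(y_vec)      # TRUE
```

The second form, y_vec, is the kind of object that fits the vector requirement described above.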
This is a challenging lab assignment because there are a lot of moving parts! Please do not put it off until the last minute!
Define functions
To make our lives easier, we will write a function that standardizes data for you. Create a function called my_scale() that takes in three arguments:
- A data frame (or matrix) that needs to be standardized,
- A vector of means, where element \(j\) is the mean of the \(j\)-th column of (1), and
- A vector of standard deviations where element \(j\) is the standard deviation of the \(j\)-th column of (1)
Your function my_scale() should return a standardized version of the data frame that was input. The R function sd() takes a vector as input and outputs the standard deviation.
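To make the three arguments concrete, here is one possible sketch of such a function, not the required solution. It assumes every column is numeric, and the toy data frame and its column names are made up for illustration:

```r
# One possible my_scale(): subtract each column's mean, then divide by its sd
my_scale <- function(df, mean_vec, sd_vec) {
  centered <- sweep(as.matrix(df), 2, mean_vec, FUN = "-")
  scaled   <- sweep(centered, 2, sd_vec, FUN = "/")
  as.data.frame(scaled)
}

# Toy input (made-up values)
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))
mean_vec <- sapply(toy, mean)  # column means
sd_vec   <- sapply(toy, sd)    # column standard deviations

toy_std <- my_scale(toy, mean_vec, sd_vec)
colMeans(toy_std)     # each column now has mean 0 (up to rounding)
sapply(toy_std, sd)   # and standard deviation 1
```

Taking the means and standard deviations as arguments, rather than computing them inside the function, lets the same function standardize one data set using statistics computed from another.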
You can confirm your my_scale() function is working by seeing if you get the same results as when you use the scale() function provided by R on the following data temp (you can also confirm if your mean_vec and sd_vec are correct by looking at the attributes center and scale in the following output):
library(dplyr)  # provides %>% and mutate()
temp <- data.frame(x = 1:5) %>%
  mutate(y = sqrt(x))
scale(temp)
              x          y
[1,] -1.2649111 -1.3900560
[2,] -0.6324555 -0.5388977
[3,]  0.0000000  0.1142190
[4,]  0.6324555  0.6648219
[5,]  1.2649111  1.1499128
attr(,"scaled:center")
       x        y
3.000000 1.676466
attr(,"scaled:scale")
        x         y
1.5811388 0.4866469
Analysis
We will fit a total of four different models using k-fold cross-validation. For all the models, we will predict the price of the lift tickets using all of the remaining quantitative variables. In order to have a fair comparison of the models, each model should be fit and tested on the same folds/partitions of the original data. Therefore, we will begin by creating a set of indices that tell us which fold each observation belongs to.
I suggest you modify your data such that it only contains the variables of interest for this analysis!
Obtain the indices for each fold
We will perform 5-fold cross-validation.
Randomly split the indices of the observations into 5 folds of equal size. Because you will be randomly splitting, it is important for you to set a seed for reproducibility. Use a seed of 3. Hint: you will most likely need to use a list!
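One way the split could look is sketched below, storing the folds in a list as the hint suggests. The value n = 20 is a made-up sample size used only for illustration; in the lab it would be the number of rows in your data.

```r
set.seed(3)   # the seed the lab asks for
n <- 20       # hypothetical number of observations
k <- 5

shuffled <- sample(n)                   # random permutation of 1..n
fold_id  <- rep(1:k, length.out = n)    # fold labels 1, 2, ..., 5, 1, 2, ...
folds    <- split(shuffled, fold_id)    # list of k index vectors

lengths(folds)   # 4 observations in each fold when n = 20
```

With this construction, folds[[j]] holds the row indices of the j-th held-out set. If n is not a multiple of k, the fold sizes differ by at most one.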
MLR: original scale
Using your folds in the previous step, run 5-fold CV to obtain an estimate of the test RMSE using MLR.
Note: suppose you are running lm() and the data you pass in only contains the response y and all of the predictors of interest. Rather than explicitly typing out the name of each predictor in lm(), you can simply type a . and R will recognize that you want to use all the other variables aside from y in the data frame as predictors:
lm(y ~ ., data)
Report your estimated test RMSE from running MLR with 5-fold CV.
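The general shape of the cross-validation loop can be sketched as follows. This is an illustration on R's built-in mtcars data rather than the ski resort data. The seed, the choice of predictors, and the convention of averaging the per-fold RMSEs are all assumptions made for the example.

```r
# Illustration on R's built-in mtcars data, predicting mpg from three predictors
dat <- mtcars[, c("mpg", "wt", "hp", "disp")]

set.seed(1)   # arbitrary seed for this toy example
k <- 5
folds <- split(sample(nrow(dat)), rep(1:k, length.out = nrow(dat)))

fold_rmse <- numeric(k)
for (j in 1:k) {
  test_idx <- folds[[j]]
  train <- dat[-test_idx, ]   # fit on the other k - 1 folds
  test  <- dat[test_idx, ]    # evaluate on the held-out fold
  mod   <- lm(mpg ~ ., data = train)
  preds <- predict(mod, newdata = test)
  fold_rmse[j] <- sqrt(mean((test$mpg - preds)^2))
}
mean(fold_rmse)   # RMSE averaged across the k folds
```

Because the folds are created once, outside the loop, the same list can be reused for every other model that needs to be compared on identical partitions.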
KNN: original scale
Using the same folds, run 5-fold CV to obtain an estimate of the test RMSE using KNN regression, with \(K = 10\) neighbors. You may either use your own implementation of KNN, or you may use the knnreg() + predict() functions provided in R.
State the number of neighbors, and report your estimated test RMSE from running KNN regression with 5-fold CV. How does your estimated test RMSE compare to that obtained from MLR?
KNN regression: standardized data
Now, we will run KNN regression where the predictors are standardized. Using the same folds, run 5-fold CV to obtain an estimate of the test RMSE using KNN regression on the standardized data. You should use your my_scale() function, and the same number of neighbors as in the previous section!
Remember that we should compute the means and standard deviations from the train data only, and then use those same values to standardize both the train data and the test data!
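The order of operations can be sketched as follows on made-up train and test predictors. This example uses base R's scale() with explicit center and scale arguments as a stand-in for my_scale(), and the column names are placeholders.

```r
# Made-up train/test predictors
train_x <- data.frame(lifts = c(2, 4, 6, 8), elevation = c(100, 300, 500, 700))
test_x  <- data.frame(lifts = c(5, 9),       elevation = c(400, 900))

# Statistics come from the TRAIN data only
mean_vec <- sapply(train_x, mean)
sd_vec   <- sapply(train_x, sd)

# scale() accepts explicit centers and scales, so the same values apply to both sets
train_std <- scale(train_x, center = mean_vec, scale = sd_vec)
test_std  <- scale(test_x,  center = mean_vec, scale = sd_vec)

colMeans(train_std)  # 0 for each column
colMeans(test_std)   # generally not 0: test data are scaled with train statistics
```

Within cross-validation, this step would be repeated inside each fold, since the train data (and therefore the means and standard deviations) change from fold to fold.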
State the number of neighbors, and report your estimated test RMSE from running KNN regression with 5-fold CV on the standardized predictors. How does your estimated test RMSE compare to the two previous test RMSEs?
MLR: standardized data
Finally, we will run MLR where the predictors are standardized. Using the same folds, run 5-fold CV to obtain an estimate of the test RMSE using MLR on the standardized data. You should use your my_scale() function!
Report your estimated test RMSE from running MLR with 5-fold CV on the standardized data. How does your estimate here compare to that obtained from running MLR on the non-standardized data?
Comprehension questions
- Based on your results, if you had to recommend a model for the price of lift tickets, which model would you choose and why?
- If you ran this analysis again with a different choice of seed in set.seed(), what would you expect to change and why? What would you expect to stay the same and why?
- If you ran this analysis again with a larger number of folds, how would you expect the estimated test RMSEs to change? Why?
- I mentioned that the fair way to compare models is to use the same folds/partitions for all models. Briefly explain why that is.
- Based on your results for the two linear regression models, what might be one advantage of fitting a linear regression model compared to a KNN regression model?
Submission
When you’re finished, knit + commit + push to GitHub one last time. Then submit your knitted pdf to Canvas!