library(tidyverse)
library(vegan)
data(mite)
data(mite.env)
<- mite.env %>%
mite_dat add_column(abundance = mite$LRUG)
KNN regression
Introduction
Now that you’ve implemented KNN regression, you will work on incorporating categorical features. Thankfully, you shouldn’t have to modify anything about your .knn()
function!
We will continue to work with the mite_dat
data, but will now extend the predictors to include Substrate
, Shrub
and Topo
.
Assignment
Recall the slides from the first week of class where we did some EDA exploring the relationship between each predictor and the abundance
of the mites. The whole class was of the opinion that the categorical predictors might be more useful than the quantitative predictors for predicting the abundance
. We will explore that here!
In the file called knn_regresssion2.Rmd
, copy and paste your .knn()
function (and any accompanying functions) into the appropriate code chunk.
Create new data frame
Create a new data frame mite_dat2
that holds the original quantitative variables and response, and the appropriately encoded versions of the three categorical predictors.
Split into train/test sets
The following code splits the original data into a train set and a test set. As a group, discuss what each line of code is doing.
Copy and paste this code into your .Rmd file in the appropriate code chunk.
Run KNN using all features
With \(K = 5\) neighbors, use your .knn()
function to obtain predictions for the test data using all the predictors. Report your RMSE for the test data.
Run KNN using quantitative features only
Now use your .knn()
function to obtain predictions for the test data using only the quantitative predictors WatrCont
and SubsDens
. Still use \(K = 5\) neighbors. Report your RMSE for the test data.
Run KNN using categorical features only
Now use your .knn()
function to obtain predictions for the test data using only the encoded categorical predictors. Still use \(K = 5\) neighbors. Report your RMSE for the test data.
Discussion
Compare and contrast the RMSEs from the three models you fit. What are you surprised or not surprised by? What do you think explains or contributes to the results you obtained?
Using the same test/train split and encodings, I also fit a fourth model where I used all the predictors but standardized the quantitative predictors. I obtained an RMSE of roughly 6.5 for the test data.
Comment on how this fourth RMSE compares to the ones you obtained. Are you surprised by this result? Why or why not? What do you think explains this result?
Submission
Once you’ve finished, push your changes to GitHub. Upload the finished PDF to Canvas.