library(tidyverse)
library(vegan)
library(tree)
data(mite)
data(mite.env)
# attach the abundance of one mite species (LRUG) as the response
mite_dat <- mite.env %>%
  add_column(abundance = mite$LRUG)
n <- nrow(mite_dat)
Bagging
Note: this implementation is not graded.
Introduction
Suppose we want to fit a bagged regression tree model using B = 10 trees. We will first implement the model by hand before learning how to use functions from the randomForest library.
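For reference, bagging is equivalent to a random forest whose mtry equals the number of predictors, so that every split considers all available variables. A minimal sketch of the eventual randomForest() fit (ntree = 10 matches B here; the data come from the setup above):

library(randomForest)

# bagging = random forest with mtry equal to the number of predictors
bag_fit <- randomForest(abundance ~ ., data = mite_dat,
                        ntree = 10,
                        mtry = ncol(mite_dat) - 1)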
Implement bagging: validation set
In this section, you are provided an 80/20 train/test split of the data. Implement a bagged regression tree model where the trees are fit on the training data, and you obtain predictions for the 20% of observations held out as test data. Assume that we want to predict abundance using all the remaining variables as predictors.
Remember, we are bagging B = 10 trees.
Discuss with someone next to you before coding:
- Will you need to iterate multiple times? If so, how many times? What steps need to take place at every iteration?
- What are you ultimately trying to obtain/output?
- What information will you need to keep track of? Will you need to create some vectors?
set.seed(18)
# sample 80% of the row indices for training
train_ids <- sample(1:n, 0.8*n)
n_train <- length(train_ids)
n_test <- n - n_train
# code here
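One way to fill this in, as a minimal sketch using tree() from the tree library loaded above (names like train_dat, preds, and bagged_preds are our own choices):

B <- 10
test_ids <- setdiff(1:n, train_ids)
train_dat <- mite_dat[train_ids, ]
test_dat <- mite_dat[test_ids, ]

# one column of predictions per tree, one row per test observation
preds <- matrix(NA, nrow = n_test, ncol = B)

for (b in 1:B) {
  # bootstrap sample of the training rows (with replacement)
  boot_ids <- sample(1:n_train, n_train, replace = TRUE)
  tree_b <- tree(abundance ~ ., data = train_dat[boot_ids, ])
  preds[, b] <- predict(tree_b, newdata = test_dat)
}

# bagged prediction = average of the B trees' predictions
bagged_preds <- rowMeans(preds)
test_mse <- mean((test_dat$abundance - bagged_preds)^2)

Averaging across columns at the end (rather than inside the loop) makes it easy to inspect how much the individual trees disagree.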
Implement bagging: OOB
Rather than splitting the data into a train/test set, here we will leverage the out-of-bag (OOB) observations. Implement a bagged regression tree model where the OOB observations are used to estimate the test error.
Hint: remember that different observations will be excluded from each bootstrap sample. So you have to be clever about how you keep track of the predictions (and how many times an observation is OOB).
Discuss with someone next to you before coding:
- What will be different about OOB predictions compared to a bagged tree where we explicitly define a test/train split? What will be the same?
# code here
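A minimal sketch, again using tree(); the running totals pred_sum and oob_count are one way to handle the bookkeeping from the hint (both names are our own choices):

B <- 10

# running sum of OOB predictions and a count of OOB appearances per row
pred_sum <- numeric(n)
oob_count <- numeric(n)

for (b in 1:B) {
  boot_ids <- sample(1:n, n, replace = TRUE)
  oob_ids <- setdiff(1:n, boot_ids)
  tree_b <- tree(abundance ~ ., data = mite_dat[boot_ids, ])
  # accumulate predictions only for rows left out of this bootstrap sample
  pred_sum[oob_ids] <- pred_sum[oob_ids] +
    predict(tree_b, newdata = mite_dat[oob_ids, ])
  oob_count[oob_ids] <- oob_count[oob_ids] + 1
}

# average each row's OOB predictions; with only B = 10 trees, some rows may
# never be OOB, so drop those before computing the error
oob_pred <- pred_sum / oob_count
has_oob <- oob_count > 0
oob_mse <- mean((mite_dat$abundance[has_oob] - oob_pred[has_oob])^2)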