Bagging

Implementations
Regression trees
Published: March 30, 2023

Note: this implementation is not graded.

Introduction

Suppose we want to fit a bagged regression tree model using B = 10 trees. We will first implement the model by hand before learning how to use functions from the randomForest library.

library(tidyverse)
library(vegan)   # provides the mite data
library(tree)    # for fitting regression trees
data(mite)       # mite species abundances
data(mite.env)   # environmental covariates for the same sites
# response: abundance of the species LRUG; predictors: the environmental variables
mite_dat <- mite.env %>%
  add_column(abundance = mite$LRUG)
n <- nrow(mite_dat)

Implement bagging: validation set

In this section, you are provided an 80/20 train/test split of the data.

Implement a bagged regression tree model where the trees are fit on the training data and predictions are obtained for the held-out 20% of the data. Assume that we want to predict abundance using all the remaining variables as predictors.

Remember, we are bagging B = 10 trees.

Discuss with someone next to you before coding:

  • Will you need to iterate multiple times? If so, how many times? What steps need to take place at every iteration?

  • What are you ultimately trying to obtain/output?

  • What information will you need to keep track of? Will you need to create some vectors?

set.seed(18)
# randomly sample 80% of the row indices to form the training set
train_ids <- sample(1:n, floor(0.8 * n))
n_train <- length(train_ids)
n_test <- n - n_train
# code here
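
If you are stuck, here is one possible sketch (not the only correct approach, and not necessarily the intended solution). It assumes the objects defined in the chunk above and grows each tree with tree() at its default settings: fit each of the B trees on a bootstrap sample of the training rows, store every tree's test-set predictions, and average them.

B <- 10
test_dat <- mite_dat[-train_ids, ]
# each column holds one tree's predictions for the test observations
preds <- matrix(NA, nrow = n_test, ncol = B)
for (b in 1:B) {
  # bootstrap sample: draw n_train training rows with replacement
  boot_ids <- sample(train_ids, n_train, replace = TRUE)
  tree_b <- tree(abundance ~ ., data = mite_dat[boot_ids, ])
  preds[, b] <- predict(tree_b, newdata = test_dat)
}
# bagged prediction: average across the B trees
bagged_preds <- rowMeans(preds)
# estimated test MSE
mean((test_dat$abundance - bagged_preds)^2)

Note that every tree is evaluated on the same fixed test set; this is the main contrast with the OOB approach below.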

Implement bagging: OOB

Rather than splitting the data into training and test sets, here we will leverage the out-of-bag (OOB) observations. Implement a bagged regression tree model where the OOB observations are used to estimate the test error.

Hint: remember that different observations will be excluded from each bootstrap sample, so you have to be clever about how you keep track of the predictions (and of how many times each observation is OOB).

Discuss with someone next to you before coding:

  • What will be different about OOB predictions compared to a bagged tree where we explicitly define a test/train split? What will be the same?

# code here
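
Again, here is one possible sketch (one approach among several), assuming the data objects defined earlier: maintain a running sum of OOB predictions and a count of how many times each observation has been OOB, then average at the end.

B <- 10
set.seed(18)
pred_sums <- rep(0, n)   # running sum of OOB predictions per observation
oob_counts <- rep(0, n)  # number of times each observation was OOB
for (b in 1:B) {
  boot_ids <- sample(1:n, n, replace = TRUE)
  oob_ids <- setdiff(1:n, boot_ids)  # observations left out of this sample
  tree_b <- tree(abundance ~ ., data = mite_dat[boot_ids, ])
  pred_sums[oob_ids] <- pred_sums[oob_ids] +
    predict(tree_b, newdata = mite_dat[oob_ids, ])
  oob_counts[oob_ids] <- oob_counts[oob_ids] + 1
}
# average each observation's OOB predictions; with only B = 10 trees an
# observation may never be OOB, so drop those from the error estimate
has_oob <- oob_counts > 0
oob_preds <- pred_sums[has_oob] / oob_counts[has_oob]
mean((mite_dat$abundance[has_oob] - oob_preds)^2)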