library(tidyverse)
library(tree)
Bagging
Introduction
Today, we will implement a bagged classification tree model using \(B\) trees. (For the final project, you are welcome to use the functions from the randomForest library.) Specifically, we will implement a bagged classification tree model where the predictions are obtained using the out-of-bag (OOB) observations for each tree \(b = 1, 2, \ldots, B\).
Starter code can be found in your bag-classification-tree-implementation GitHub repo. We will be working with the seeds.txt data, found in the data folder of this repo.
OOB predictions for bagging classification trees
Remember that different observations will be excluded from each bootstrap sample. So you have to be clever about how you keep track of the predictions (and how many times an observation is OOB). Refer back to the live code from bagging regression trees for a refresher/starter code!
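One way to set up this bookkeeping is sketched below. This is a minimal sketch, not the required solution: `n` and `classes` are placeholder values, and the idea is simply to pre-allocate a vote-count matrix and an OOB counter before the loop over the trees.

```r
# Minimal bookkeeping sketch (placeholder values; adapt to the seeds data):
# one row per observation, one column per class label.
n <- 10                           # e.g., number of rows in the data
classes <- c("A", "B", "C", "D")  # e.g., levels of the response factor

oob_votes <- matrix(0, nrow = n, ncol = length(classes),
                    dimnames = list(NULL, classes))  # OOB votes per class
oob_count <- rep(0, n)                               # times each obs. was OOB
```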
How do we obtain predictions for bagging classification trees?
Option 1
For classification trees, we typically “aggregate” the predictions by taking the majority vote approach: for a given test/OOB observation, we predict the label \(j\) that has the highest representation across the trees for which it was OOB. For example, if \(B = 10\), and observation \(y_{0}\) was OOB 6 times with the following counts of predicted labels:
  label count
1     A     3
2     B     1
3     C     0
4     D     2
we would predict label \(\hat{y}_{0} =\) "A".
A slight twist here is that we will have more than one observation to keep track of across the \(B\) trees!
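As a concrete sketch, suppose we have a hypothetical vote-count matrix `oob_votes` with one row per observation and one column per class label (the first row reproduces the worked example above). The majority vote can then be taken row-wise after the loop is finished:

```r
# Hypothetical OOB vote counts: rows = observations, columns = class labels
oob_votes <- rbind(c(3, 1, 0, 2),
                   c(0, 4, 1, 1))
colnames(oob_votes) <- c("A", "B", "C", "D")

# For each observation, predict the label with the most OOB votes
# (which.max breaks ties by taking the first class; another tie rule is fine)
pred_option1 <- colnames(oob_votes)[apply(oob_votes, 1, which.max)]
pred_option1
#> [1] "A" "B"
```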
Option 2
Another way we can make a prediction is to keep track of the estimated conditional class probabilities from each tree \(b\). Then, after iterating through all \(B\) trees, we predict the label that has the highest average conditional class probability (where the average is taken with respect to the number of times the observation was OOB). For example, if the average conditional probabilities are as follows:
  label avg_cond_prob
1     A           0.1
2     B           0.5
3     C           0.0
4     D           0.4
then we would predict \(\hat{y}_{0} =\) "B".
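A small sketch of the same calculation for a single observation is below. The running probability sums and OOB count are hypothetical, chosen so that the averages match the table above:

```r
# Hypothetical column sums of conditional class probabilities for one
# observation, accumulated over the trees for which it was OOB
prob_sum <- c(A = 0.6, B = 3.0, C = 0.0, D = 2.4)
n_oob    <- 6                       # times this observation was OOB

avg_cond_prob <- prob_sum / n_oob   # 0.1, 0.5, 0.0, 0.4
names(avg_cond_prob)[which.max(avg_cond_prob)]
#> [1] "B"
```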
Implementation
Now, implement bagged classification trees where we obtain predictions using the OOB observations. We will predict seed variety using all the remaining predictors available in the data. Set a seed of 218, and use B = 100 trees.
You should obtain two sets of OOB predictions using bagged classification trees: one set where we use Option 1 (majority vote of the per-tree class predictions), and the other where we use Option 2 (highest average conditional class probability).
We will then compare whether the predictions from the two options agree with each other. For this reason, you should only run the for loop over the \(B\) trees once in this implementation. We want to ensure that both options see the same bootstrapped datasets.
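One possible shape for that single loop is sketched below. This is a sketch under assumptions, not the required solution: it assumes the data have been read into a data frame called `seeds` with a factor response named `variety` (both names are placeholders, so use the actual column names from seeds.txt), and it relies on `predict()` for a `tree` classification fit returning class probabilities by default and class labels with `type = "class"`.

```r
set.seed(218)
B <- 100
n <- nrow(seeds)
classes <- levels(seeds$variety)

# Bookkeeping: OOB vote counts (Option 1), running probability sums (Option 2),
# and how many times each observation was OOB
oob_votes    <- matrix(0, n, length(classes), dimnames = list(NULL, classes))
oob_prob_sum <- matrix(0, n, length(classes), dimnames = list(NULL, classes))
oob_count    <- rep(0, n)

for (b in seq_len(B)) {
  boot_idx <- sample(n, n, replace = TRUE)     # bootstrap sample
  oob_idx  <- setdiff(seq_len(n), boot_idx)    # observations left out

  fit <- tree(variety ~ ., data = seeds[boot_idx, ])

  pred_class <- predict(fit, newdata = seeds[oob_idx, ], type = "class")
  pred_prob  <- predict(fit, newdata = seeds[oob_idx, ])  # class probabilities

  # Option 1: add one vote for the predicted class of each OOB observation
  oob_votes[cbind(oob_idx, as.integer(pred_class))] <-
    oob_votes[cbind(oob_idx, as.integer(pred_class))] + 1
  # Option 2: accumulate the conditional class probabilities
  oob_prob_sum[oob_idx, ] <- oob_prob_sum[oob_idx, ] + pred_prob
  oob_count[oob_idx] <- oob_count[oob_idx] + 1
}

# Final OOB predictions; ties, if any, go to the first class via which.max
# (an observation that was never OOB would give NaN; unlikely with B = 100)
pred_opt1 <- classes[apply(oob_votes, 1, which.max)]
pred_opt2 <- classes[apply(oob_prob_sum / oob_count, 1, which.max)]
```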
Results
After you’ve obtained your two sets of predictions, use the table() function to compare the predictions you obtained from Option 1 vs Option 2. Do the two methods lead to predictions that agree with each other? If there are disagreements, clearly specify what the disagreements are.
What proportion of the time was each observation in your dataset out-of-bag? Does this agree with the theoretical result we obtained a few weeks ago about bootstrapping?
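Continuing the sketch above, the comparison and the OOB proportions could look like the following (the `Option1`/`Option2` labels are just for readability):

```r
# Cross-tabulate the two sets of OOB predictions; off-diagonal cells
# are observations where the two options disagree
table(Option1 = pred_opt1, Option2 = pred_opt2)

# Proportion of the B trees for which each observation was out-of-bag;
# theory: P(OOB in one bootstrap sample) = (1 - 1/n)^n, roughly exp(-1) ~ 0.368
oob_prop <- oob_count / B
summary(oob_prop)
```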