Bagging

Implementations
Classification trees
Published

April 21, 2023

Introduction

Today, we will implement a bagged classification tree model using \(B\) trees. For the final project, you are welcome to use the functions from the randomForest library. Here, however, we will implement the bagged classification tree model ourselves, obtaining predictions using the out-of-bag (OOB) observations for each tree \(b = 1, 2, \ldots, B\).

library(tidyverse)
library(tree)

Starter code can be found in your bag-classification-tree-implementation GitHub repo. We will be working with the seeds.txt data, found in the data folder of this repo.
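For instance, loading the data might look something like the sketch below. This is only a hypothetical starting point: the exact call depends on how seeds.txt is formatted, and the column name variety is an assumption based on the prediction task described later.

seeds <- read_table("data/seeds.txt") |>   # assumes a header row; adjust to the actual file format
  mutate(variety = as.factor(variety))     # tree() expects a factor response for classification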

OOB predictions for bagging classification trees

Remember that different observations will be excluded from each bootstrap sample. So you have to be clever about how you keep track of the predictions (and how many times an observation is OOB). Refer back to the live code from bagging regression trees for a refresher/starter code!
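For instance, a minimal sketch of that bookkeeping (assuming the data live in a data frame called seeds) might look like:

n <- nrow(seeds)
B <- 100
oob_count <- rep(0, n)                                 # times each observation has been OOB

for (b in 1:B) {
  boot_ids <- sample(1:n, size = n, replace = TRUE)    # bootstrap sample for tree b
  oob_ids  <- setdiff(1:n, boot_ids)                   # observations left out of sample b
  oob_count[oob_ids] <- oob_count[oob_ids] + 1
  # fit tree b on seeds[boot_ids, ] and predict for seeds[oob_ids, ] here
}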

How do we obtain predictions for bagging classification trees?

Option 1

For classification trees, we typically “aggregate” the predictions by taking the majority vote approach: for a given test/OOB observation, we predict the label \(j\) that has the highest representation across the trees for which it was OOB. For example, if \(B = 10\), and observation \(y_{0}\) was OOB 6 times with the following counts of predicted labels:

  label count
1     A     3
2     B     1
3     C     0
4     D     2

we would predict label \(\hat{y}_{0}\) = "A".

A slight twist here is that we will have more than one observation to keep track of across the \(B\) trees!

Note

Note: it’s possible that we have ties in predicted labels in bagging. For the purposes of this lab, you don’t have to worry about tied majority votes (so you can use the which.max() function if that’s helpful to you), but know that ties are possible!
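For instance, using the toy counts above, which.max() picks out the majority-vote label:

counts <- c(A = 3, B = 1, C = 0, D = 2)
names(counts)[which.max(counts)]

[1] "A"

In the full implementation, one option is to keep a matrix of such counts, with one row per observation and one column per class, and apply this step to each row after the loop.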

Option 2

Another way we can make a prediction is to keep track of the estimated conditional class probabilities from each tree \(b\). Then, after iterating through all \(B\) trees, we predict the label that has the highest average conditional class probability (where the average is taken with respect to the number of times the observation was OOB). For example, if the average conditional probabilities are as follows:

  label avg_cond_prob
1     A           0.1
2     B           0.5
3     C           0.0
4     D           0.4

then we would predict \(\hat{y}_{0} =\) "B".

Note

Note: it’s possible that we have ties in predicted conditional class probabilities as well. For the purposes of this lab, you don’t have to worry about ties here either!
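The aggregation step works just like Option 1, only on probabilities instead of counts. Using the toy values above:

avg_cond_prob <- c(A = 0.1, B = 0.5, C = 0.0, D = 0.4)
names(avg_cond_prob)[which.max(avg_cond_prob)]

[1] "B"

With the tree package, the per-tree conditional class probabilities come from predict() with type = "vector", which for a classification tree returns a matrix with one column per class.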

Implementation

Now, implement bagged classification trees where we obtain predictions using the OOB observations. We will predict seed variety using all the remaining predictors available in the data. Set a seed of 218, and use \(B = 100\) trees.

You should obtain two sets of OOB predictions using bagged classification trees: one set where we use Option 1 (majority vote of the predicted classes), and the other set where we use Option 2 (highest average conditional class probability).

We will then compare whether the predictions from the two options agree with each other. For this reason, you should run the for loop over the \(B\) trees only once in this implementation, so that both options see the same bootstrapped datasets.

Note

I suggest you begin by focusing on a single option! Would you like to keep track of counts (Option 1) or probabilities (Option 2) first?
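To give a sense of how the pieces might fit together, here is one possible sketch of the single loop. This is not the official solution: it assumes the data frame is called seeds with a factor response variety, and it ignores ties as well as the (very unlikely, with \(B = 100\)) case that an observation is never OOB.

set.seed(218)
n <- nrow(seeds)
B <- 100
classes <- levels(seeds$variety)

oob_count   <- rep(0, n)
vote_counts <- matrix(0, n, length(classes), dimnames = list(NULL, classes))  # Option 1
prob_sums   <- matrix(0, n, length(classes), dimnames = list(NULL, classes))  # Option 2

for (b in 1:B) {
  boot_ids <- sample(1:n, size = n, replace = TRUE)
  oob_ids  <- setdiff(1:n, boot_ids)
  oob_count[oob_ids] <- oob_count[oob_ids] + 1

  tree_b <- tree(variety ~ ., data = seeds[boot_ids, ])

  # Option 1: tally the predicted class for each OOB observation
  class_b  <- predict(tree_b, newdata = seeds[oob_ids, ], type = "class")
  vote_idx <- cbind(oob_ids, match(as.character(class_b), classes))
  vote_counts[vote_idx] <- vote_counts[vote_idx] + 1

  # Option 2: accumulate the estimated conditional class probabilities
  prob_sums[oob_ids, ] <- prob_sums[oob_ids, ] +
    predict(tree_b, newdata = seeds[oob_ids, ], type = "vector")
}

# Majority vote (Option 1) and highest average probability (Option 2)
pred_option1 <- classes[apply(vote_counts, 1, which.max)]
pred_option2 <- classes[apply(prob_sums / oob_count, 1, which.max)]

Both options use the same fitted tree and the same bootstrap sample within each iteration, which is exactly what running the loop only once guarantees.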

Results

  • After you’ve obtained your two sets of predictions, use the table() function to compare the predictions you obtained from Option 1 vs Option 2 (a sketch of this comparison appears after this list). Do the two methods lead to predictions that agree with each other? If there are disagreements, clearly specify what the disagreements are.

  • What proportion of the time was each observation in your dataset out-of-bag? Does this agree with the theoretical result we obtained a few weeks ago about bootstrapping?
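As a rough sketch of these checks (using pred_option1, pred_option2, and oob_count from the sketch above):

table(option1 = pred_option1, option2 = pred_option2)   # cross-tabulate the two sets of predictions

summary(oob_count / B)   # proportion of trees for which each observation was OOB
exp(-1)                  # theoretical value: (1 - 1/n)^n is approximately exp(-1), about 0.368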