Classification Trees

4/18/23

Housekeeping

  • Lab 05 due this Thursday!
  • Project proposals due this Sunday to Canvas!

Classification Trees

Classification trees

  • Interpretation of the tree is the same as for a regression tree!

  • We will end up with paths to regions \(R_{1}, R_{2}, \ldots, R_{T}\), where \(T\) is the number of terminal nodes (leaves)

  • What do you think we will predict for \(\hat{y}\) now in a classification task?

    • Recall: in regression trees, the prediction \(\hat{y}\) for an observation that falls into a given region \(R_{m}\) is the average of the training responses in that \(R_{m}\)
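
For classification trees, the usual prediction is the most commonly occurring class among the training observations in the region, i.e. a majority vote. A minimal Python sketch of that rule (the helper name is mine, not from the lecture):

    from collections import Counter

    def predict_region(labels):
        """Predict the most common class among a region's training labels."""
        return Counter(labels).most_common(1)[0][0]

    print(predict_region(["A", "A", "B"]))  # majority vote: prints A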

Building the tree

  • Just like in regression trees, we will use recursive binary splitting to grow the tree

  • Top-down, greedy approach that makes the “best” split at that moment in time

  • In regression trees, we used the residual sum of squares (RSS) to quantify “best”, but we cannot use that here!

  • What might we use instead to quantify “best” to decide each split?

  • Consider the fraction of training observations in the region that do not belong to the most common class: \[E = 1 - \max_{k}(\hat{p}_{mk}),\] where \(\hat{p}_{mk}\) is the proportion of training observations in the region \(R_{m}\) that are from class \(k\)

  • Does smaller or larger \(E\) correspond to a “good” split?

  • Unfortunately, this classification error rate is not sufficiently sensitive for tree-growing, so in practice two other measures are preferable
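
As a quick numeric check of the error-rate criterion, here is a small Python sketch (hypothetical helper, not from the lecture) computing \(E = 1 - \max_{k}(\hat{p}_{mk})\) for one region:

    from collections import Counter

    def error_rate(labels):
        """Classification error rate E = 1 - max_k p_mk, where p_mk is the
        proportion of the region's observations from class k."""
        counts = Counter(labels)
        return 1 - max(counts.values()) / len(labels)

    # A region with 2 A's, 2 B's, and 5 C's: E = 1 - 5/9 = 4/9 ≈ 0.44
    print(error_rate(list("AABBCCCCC")))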

Gini Index

The Gini index is a measure of the total variance across the \(K\) classes

\[G_{m} = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})\]

  • \(G_{m}\) is small if all the \(\hat{p}_{mk}\)’s are close to zero or one

  • For this reason, the Gini index is referred to as a measure of node purity

  • A small \(G_{m}\) indicates that the node contains observations predominantly from a single class

  • Example: 3 classes (A, B, C) and 9 observations in each of three regions

        Region1 Region2 Region3
      1       A       A       A
      2       A       A       A
      3       A       B       A
      4       A       B       B
      5       A       C       B
      6       A       C       B
      7       A       C       C
      8       A       C       C
      9       A       C       C
  • In these three regions, what are the Gini indices \(G_{1}, G_{2}, G_{3}\)?

    • \(G_{1} = 1(1-1) + 0(1-0) + 0(1-0) = 0\)
    • \(G_{2} = \frac{2}{9}(1-\frac{2}{9}) +\frac{2}{9}(1-\frac{2}{9}) + \frac{5}{9}(1-\frac{5}{9}) = \frac{48}{81} \approx 0.59\)
    • \(G_{3} = \frac{1}{3}(1-\frac{1}{3}) + \frac{1}{3}(1-\frac{1}{3})+\frac{1}{3}(1-\frac{1}{3}) = \frac{2}{3} \approx 0.67\)
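
These values are easy to verify numerically; a minimal Python sketch (the function name is mine) computing the Gini index of each region above:

    from collections import Counter

    def gini(labels):
        """Gini index G_m = sum_k p_mk * (1 - p_mk)."""
        n = len(labels)
        return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

    regions = {"R1": list("AAAAAAAAA"),
               "R2": list("AABBCCCCC"),
               "R3": list("AAABBBCCC")}
    for name, labels in regions.items():
        print(name, round(gini(labels), 2))  # R1 0.0, R2 0.59, R3 0.67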

Entropy

An alternative to the Gini index is the cross-entropy:

\[D_{m} = -\sum_{k=1}^{K} \hat{p}_{mk} \log\hat{p}_{mk}\]

  • Numerically very similar to the Gini index, so cross-entropy is also a measure of node purity
  • In these three regions, what are the cross-entropies \(D_{1}, D_{2}, D_{3}\)?

        Region1 Region2 Region3
      1       A       A       A
      2       A       A       A
      3       A       B       A
      4       A       B       B
      5       A       C       B
      6       A       C       B
      7       A       C       C
      8       A       C       C
      9       A       C       C
    • \(D_{1} = -(1\log 1) = 0\), using the convention that terms with \(\hat{p}_{mk} = 0\) contribute \(0\)
    • \(D_{2} = -(\frac{2}{9}\log\frac{2}{9} +\frac{2}{9}\log\frac{2}{9} + \frac{5}{9}\log\frac{5}{9}) \approx 1\)
    • \(D_{3} = -(\frac{1}{3}\log\frac{1}{3} + \frac{1}{3}\log\frac{1}{3}+\frac{1}{3}\log\frac{1}{3}) = \log 3 \approx 1.1\)
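
The same numeric check works for cross-entropy, here using the natural log (the values above assume it, and terms with \(\hat{p}_{mk} = 0\) contribute 0); a small sketch along the lines of the Gini code above:

    import math
    from collections import Counter

    def entropy(labels):
        """Cross-entropy D_m = -sum_k p_mk * log(p_mk); classes absent
        from the region contribute 0 (convention 0 * log 0 = 0)."""
        n = len(labels)
        return sum((c / n) * -math.log(c / n) for c in Counter(labels).values())

    regions = {"R1": list("AAAAAAAAA"),
               "R2": list("AABBCCCCC"),
               "R3": list("AAABBBCCC")}
    for name, labels in regions.items():
        print(name, f"{entropy(labels):.3f}")  # R1 0.000, R2 0.995, R3 1.099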

Exercise

Work through building a mini classification tree!
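
If you want to check your hand-built tree afterwards, here is a minimal sketch assuming scikit-learn is available (the toy data and feature name are mine, not the lecture's); it fits a small classification tree with the Gini criterion and prints the splits:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Toy training data: one predictor, two classes (not from the lecture)
    X = [[1], [2], [3], [6], [7], [8]]
    y = ["A", "A", "A", "B", "B", "B"]

    tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["x"]))  # split lands at x <= 4.5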