Part 2: Pruning
3/16/23
Lab 03 due tonight
No TA office hours tonight
Lab 04 posted later this afternoon or tomorrow (due Friday after break)
Enjoy your spring break!
This next section is a bit technical, but bear with me!
The process of building regression trees may produce good predictions on the training set, but is likely to overfit the data. Why?
A smaller tree with fewer splits/regions might lead to lower variance and better interpretation, at the cost of a little bias
A better strategy is to grow a very large tree \(T_{0}\), and then prune it back in order to obtain a smaller subtree
Idea: remove sections that are non-critical
Cost complexity pruning or weakest link pruning: consider a sequence of trees indexed by a nonnegative tuning parameter \(\alpha\). For each value of \(\alpha\), there is a subtree \(T \subset T_{0}\) such that \[\left(\sum_{m=1}^{|T|} \sum_{i: x_{i} \in R_{m}} (y_{i} - \hat{y}_{R_{m}})^2 \right)+ \alpha |T|\] is as small as possible.
\(|T|\) = number of terminal nodes of tree \(T\)
\(R_{m}\) is the rectangle corresponding to the \(m\)-th terminal node
\(\alpha\) controls the trade-off between the subtree’s complexity and its fit to the training data
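A toy illustration of this trade-off (the leaf counts and RSS values below are invented for the example): suppose one subtree has 5 leaves with RSS 100, and a smaller subtree has 3 leaves with RSS 110. Then
\[\alpha = 2: \quad 100 + 2 \cdot 5 = 110 < 110 + 2 \cdot 3 = 116 \quad \text{(the 5-leaf tree wins)}\]
\[\alpha = 6: \quad 100 + 6 \cdot 5 = 130 > 110 + 6 \cdot 3 = 128 \quad \text{(the 3-leaf tree wins)}\]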
What is the resultant tree \(T\) when \(\alpha = 0\)?
What happens as \(\alpha\) increases?
Note: for every value of \(\alpha\), we have a different fitted tree \(\rightarrow\) need to choose a best \(\alpha\)
Select an optimal \(\alpha^{*}\) using cross-validation, then return to full data set and obtain the subtree corresponding to \(\alpha^{*}\)
Suppose I just want to build a “best” regression tree for my data, but I’m not interested in comparing its performance against a different model. The steps are below (with a code sketch after them):
Step 1: Use recursive binary splitting to grow a large tree on the data
Step 2: Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of \(\alpha\)
Step 3: Use \(K\)-fold CV to choose \(\alpha\): divide the data into \(K\) folds. For each \(k = 1,\ldots, K\):
Repeat Steps 1 and 2 on all but the \(k\)-th fold
Evaluate the RMSE on the held-out \(k\)-th fold, as a function of \(\alpha\). Average the results over the folds for each \(\alpha\)
Step 4: Choose the \(\alpha^{*}\) that minimizes the average error. Return the subtree from Step 2 that corresponds to \(\alpha^{*}\) as your “best” tree!
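Here is a minimal sketch of these steps in R using the rpart package (one common choice; the data frame df and response y are placeholders, and note that rpart reports \(\alpha\) on a rescaled grid it calls cp):

```r
library(rpart)

# Step 1: grow a large tree via recursive binary splitting.
# cp = 0 disables pruning during growth; xval = 5 asks rpart to run
# 5-fold cross-validation internally, so Steps 2-3 happen automatically.
big_tree <- rpart(y ~ ., data = df, method = "anova",
                  control = rpart.control(cp = 0, xval = 5))

# Steps 2-3: the cptable lists the sequence of subtrees, one row per value
# of cp (rpart's rescaled alpha), with cross-validated error in "xerror".
# xerror is on the (relative) MSE scale, but its minimizer is the same as
# the RMSE minimizer.
printcp(big_tree)

# Step 4: choose the cp minimizing CV error and prune back to that subtree
best_cp   <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
best_tree <- prune(big_tree, cp = best_cp)
```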
If instead I want to compare my “best” regression tree against a different model, I also need held-out test data on which to compare the two models (again, a code sketch follows the steps)
First, split the data into training and test sets
Step 1: Use recursive binary splitting to grow a large tree on the training data
Step 2: Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of \(\alpha\)
Step 3: Use \(K\)-fold CV to choose \(\alpha\): divide the training data into \(K\) folds. For each \(k = 1,\ldots, K\):
Repeat Steps 1 and 2 on all but the \(k\)-th fold
Evaluate the RMSE on the held-out \(k\)-th fold, as a function of \(\alpha\). Average the results over the folds for each \(\alpha\)
Step 4: Choose the \(\alpha^{*}\) that minimizes the average error. Return the subtree from Step 2 that corresponds to \(\alpha^{*}\) as the “best” tree, and use that tree for predictions on the test data.
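The same sketch with an initial train/test split (again with a hypothetical data frame df and response y; the 80/20 split proportion is an arbitrary choice):

```r
library(rpart)

# Split the data into training and test sets (reproducibly)
set.seed(1)
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Steps 1-4 exactly as before, but using only the training data
big_tree  <- rpart(y ~ ., data = train, method = "anova",
                   control = rpart.control(cp = 0, xval = 5))
best_cp   <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
best_tree <- prune(big_tree, cp = best_cp)

# Compare against other models on the held-out test set via RMSE
preds     <- predict(best_tree, newdata = test)
test_rmse <- sqrt(mean((test$y - preds)^2))
```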
Let’s suppose I want to obtain the best regression tree for the mite data, and I also want to compare that tree against other models
Using \(K\)-fold CV with \(K = 5\): the best CV RMSE (shown in red on the CV plot) occurs at 3 leaves, so the 3-leaf subtree would be our “best” tree
Note: while the CV error is computed as a function of \(\alpha\), we often display it as a function of \(|T|\), the number of leaves
Pruning in R: thankfully much of this is automated!
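For example, continuing the hypothetical rpart sketch from above: plotcp() draws the CV error curve (with tree size along the top axis, matching the display convention just mentioned), and the separate rpart.plot package draws the pruned tree itself.

```r
# Cross-validated error vs. cp, with the corresponding tree size on top
plotcp(big_tree)

# Draw the pruned tree
library(rpart.plot)
rpart.plot(best_tree)
```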
Trees vs. linear models: which is better? It depends on the true relationship between the response and the predictors
Advantages:
Easy to explain, and may more closely mirror human decision-making than other approaches we’ve seen
Can be displayed graphically and interpreted by non-experts
Can easily handle qualitative predictors without the need to encode or create dummy variables
Making predictions is fast: just follow the splits down to a leaf, with no equations to evaluate!
Disadvantages:
Lower levels of predictive accuracy compared to some other approaches
Can be non-robust: small changes in the data can lead to large changes in the fitted tree (high variance)
However, we may see that aggregating many trees can improve predictive performance!