Lab 04: Regression trees
Introduction
The purpose of this lab is to gain familiarity and practice with fitting and evaluating regression trees in `R`. Go ahead and clone your new `lab-04-forest-fires` GitHub project. You will work with the `.Rmd` file called `lab-04-forest-fires.Rmd`.
Data
For this assignment, you will predict the size of forest fires in the northeast region of Portugal using meteorological and other covariates. The original data were obtained from the UCI Machine Learning Repository, and I have modified them slightly for the purposes of this implementation.
Each row in the data set represents one fire. We have the following variables:
- `fire_id`: a variable to identify each fire
- `X`, `Y`: coordinates for the location of the fire
- `month`: month of year
- `day`: day of week
- `FFMC`: Fine Fuel Moisture Code, represents fuel moisture of forest litter fuels under the shade of a forest canopy
- `DMC`: Duff Moisture Code, represents fuel moisture of decomposed organic material underneath the litter
- `drought`: drought status of location (“Low”, “Moderate”, “Extreme”)
- `ISI`: Initial Spread Index, used to predict fire behavior
- `temp`: temperature (Celsius)
- `RH`: relative humidity (%)
- `wind`: wind speed (km/h)
- `rain`: outside rain (mm/m2)
- `area`: the burned area of the forest (hectares). An `area` of 0 means that less than 100 square meters was burned.
Goal
Using regression trees, we will predict the size of a fire given some of these features. We will also compare prediction performance under different modeling choices.
Prepare data
The data are in `forest_fires.csv` in the `data` folder of this project. We will also require the `tidyverse` and `tree` packages. Go ahead and load the data and libraries now.
Please save the data using the variable name `fire_dat`.
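As a rough sketch of the loading step: the real file is not available in this example, so the code below writes two made-up rows to a temporary CSV and reads that instead. The path `data/forest_fires.csv` in the comment is assumed from the description above, and `read_csv()` is only one of several ways to read the file.

```r
library(tidyverse)
library(tree)

# Stand-in for the real file: two made-up rows written to a temporary CSV
tmp_path <- tempfile(fileext = ".csv")
write_lines(
  c("fire_id,month,temp,area", "1,aug,22.5,0", "2,sep,18.1,3.4"),
  tmp_path
)

# In the lab itself this would be something like read_csv("data/forest_fires.csv")
fire_dat <- read_csv(tmp_path, show_col_types = FALSE)
glimpse(fire_dat)
```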
Wrangle
Wrangle your data to only retain observations from the months of March, July, August, and September, then remove `month` from the data set. Also, we will not consider `fire_id`, day of week, or geographic location as predictors. Lastly, recall that the `tree()` function requires all categorical variables to be coded as factors.
Modify your data to make all the required changes.
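The wrangling steps above could look roughly like the sketch below. It runs on a small made-up data frame (`toy_fires`), and the lowercase month labels such as `"mar"` are an assumption; check how `month` is actually coded in your data before filtering.

```r
library(dplyr)

# Toy stand-in for fire_dat; all values and month labels are made up
toy_fires <- tibble(
  fire_id = 1:6,
  X = c(7, 7, 8, 4, 6, 2),
  Y = c(5, 4, 6, 3, 5, 2),
  month = c("mar", "jul", "aug", "sep", "oct", "feb"),
  day = c("fri", "tue", "sat", "sun", "mon", "wed"),
  drought = c("Low", "Moderate", "Extreme", "Low", "Moderate", "Low"),
  temp = c(8.2, 18.0, 22.8, 19.3, 14.6, 5.1),
  area = c(0, 0.5, 12.3, 0, 2.1, 0)
)

toy_wrangled <- toy_fires %>%
  filter(month %in% c("mar", "jul", "aug", "sep")) %>%  # keep the four months
  select(-month, -fire_id, -day, -X, -Y) %>%            # drop non-predictors
  mutate(across(where(is.character), as.factor))        # tree() needs factors

str(toy_wrangled)
```

Filtering before dropping `month` matters here: once the column is removed, there is nothing left to filter on.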
EDA
Visualize and describe the distribution of the burned `area`.
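One possible way to look at a distribution like this is a histogram plus a numeric summary. The sketch below uses made-up values in `toy_fires`, not the lab data, and the number of bins is an arbitrary choice.

```r
library(ggplot2)

# Made-up burned areas (hectares), chosen to be right-skewed
toy_fires <- data.frame(area = c(0, 0, 0, 0.4, 0.9, 1.6, 2.3, 6.4, 11.2, 88.5))

area_hist <- ggplot(toy_fires, aes(x = area)) +
  geom_histogram(bins = 20) +
  labs(x = "Burned area (hectares)", y = "Number of fires")
print(area_hist)

# A mean well above the median is one numeric sign of right skew
summary(toy_fires$area)
```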
Wrangle (again)
Lastly, if you were to make a histogram of the response variable `area`, you would notice it is heavily right-skewed. One way to address this issue is to log-transform `area`. However, many observations have an observed `area = 0`, and the log of 0 is \(-\infty\). A common way to get around this is to take the log of (response variable + 1).
Create a new data frame called `fire_dat_log`, where you over-write the `area` variable using the appropriate log transform described above.
Using your new `fire_dat_log`, create a summary table where for each level of `drought`, the table displays the mean and standard deviation of the log-burned `area`, and the total number of observations that fall into that level. Based on what you see, do you think `drought` will be an important variable in our regression tree? Why or why not?
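To see why adding 1 helps, here is a small sketch on made-up areas (`toy_fires` is a hypothetical stand-in). Zeros map to `log(1) = 0` rather than \(-\infty\), and base R's `log1p()` computes the same quantity.

```r
library(dplyr)

# Made-up burned areas, including zeros
toy_fires <- tibble(area = c(0, 0, 0.5, 2.1, 12.3, 150))

toy_log <- toy_fires %>%
  mutate(area = log(area + 1))  # over-write area; log(0 + 1) = 0 stays finite

toy_log

# log1p(x) is the same transform as log(x + 1)
all.equal(toy_log$area, log1p(toy_fires$area))
```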
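A grouped summary of this kind can be sketched with `group_by()` and `summarise()`. The data frame `toy_log` and its values are made up, and the output column names are just illustrative choices.

```r
library(dplyr)

# Made-up logged areas with a drought factor
toy_log <- tibble(
  drought = factor(c("Low", "Low", "Moderate", "Moderate", "Extreme", "Extreme")),
  area = log1p(c(0, 1.2, 0, 5.4, 0.8, 30))
)

drought_summary <- toy_log %>%
  group_by(drought) %>%
  summarise(
    mean_log_area = mean(area),  # mean of log-burned area per level
    sd_log_area = sd(area),      # standard deviation per level
    n = n()                      # number of fires in the level
  )

drought_summary
```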
Regression tree for log area
Train/test ids
We will compare a pruned regression tree to an unpruned regression tree to see if the pruning is actually helpful for predictions.
Using a seed of 346, split your `fire_dat_log` data into an 80% training set and a 20% test set.
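One common way to make such a split is to sample row indices for the training set and use the remaining rows for testing. The sketch below does this on a made-up 50-row data frame; the names `train_ids` and `test_ids` are hypothetical, and rounding 80% to a whole number of rows is one choice among several.

```r
# Made-up stand-in for fire_dat_log with 50 rows
toy_log <- data.frame(id = 1:50)
n <- nrow(toy_log)

set.seed(346)  # seed given in the lab text
train_ids <- sample(1:n, size = round(0.8 * n))  # 80% of rows, without replacement
test_ids <- setdiff(1:n, train_ids)              # the remaining 20%

train_dat <- toy_log[train_ids, , drop = FALSE]
test_dat <- toy_log[test_ids, , drop = FALSE]

c(train = nrow(train_dat), test = nrow(test_dat))
```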
Grow large tree
Fit a regression tree to the training data for your logged `area` of the forest fires using all of the other variables as predictors. Explicitly let `R` grow a large tree by setting the `control` arguments in `tree()` to have `minsize = 2` (see live code for refresher).
Display a `summary()` of your regression tree. How many leaves are there? Was your intuition correct about whether or not `drought` would be an important predictor for the log burned `area`?
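As a sketch of growing a large tree, the code below passes `minsize = 2` through `tree.control()`. This may differ in detail from the live code. The training data are simulated (`toy_train`, with an arbitrary seed), so the resulting tree says nothing about the real fires.

```r
library(tree)

# Simulated training data; not the forest fire data
set.seed(1)
n <- 120
toy_train <- data.frame(
  temp = runif(n, 5, 30),
  wind = runif(n, 0, 9),
  drought = factor(sample(c("Low", "Moderate", "Extreme"), n, replace = TRUE))
)
toy_train$area <- log1p(pmax(0, 0.3 * toy_train$temp - 3 + rnorm(n, sd = 2)))

# minsize = 2 allows very small nodes, so the tree can grow large
big_tree <- tree(
  area ~ .,
  data = toy_train,
  control = tree.control(nobs = nrow(toy_train), minsize = 2)
)

summary(big_tree)  # reports the variables used and the number of terminal nodes

# Leaves are the rows of the tree's frame labelled "<leaf>"
n_leaves <- sum(big_tree$frame$var == "<leaf>")
n_leaves
```

The `nobs` argument should be the number of rows in the training set, not in the full data.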
Cost-complexity pruning
Now, we will prune back the tree using cost-complexity. Because we will be performing k-fold CV, we should set a seed again in order to have reproducibility of the assignment of observations to folds.
Set a seed of 346 again, and perform cost-complexity pruning using 10-fold CV.
From your output, make a plot of the size of the candidate pruned trees on the x-axis and the CV deviance estimates on the y-axis (see live code for example plot). Based on your plot, which size tree should we use?
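The cross-validation step can be sketched with `cv.tree()`, whose output stores the candidate sizes in `$size` and the CV deviances in `$dev`. The data here are simulated with an arbitrary seed, and picking the size with the smallest deviance is only one possible rule; you may prefer a smaller tree with nearly the same deviance.

```r
library(tree)

# Simulated training data; not the forest fire data
set.seed(1)
toy_train <- data.frame(temp = runif(150, 5, 30), wind = runif(150, 0, 9))
toy_train$area <- log1p(pmax(0, 0.3 * toy_train$temp - 3 + rnorm(150, sd = 2)))

big_tree <- tree(
  area ~ .,
  data = toy_train,
  control = tree.control(nobs = nrow(toy_train), minsize = 2)
)

set.seed(346)  # fixes the random assignment of observations to folds
cv_out <- cv.tree(big_tree, FUN = prune.tree, K = 10)

# Candidate tree size against CV deviance
plot(cv_out$size, cv_out$dev, type = "b",
     xlab = "Tree size (number of leaves)", ylab = "CV deviance")

# Size with the smallest CV deviance (first such entry if there are ties)
best_size <- cv_out$size[which.min(cv_out$dev)]
best_size
```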
Prune the tree
Based on your previous answer, prune your original tree to obtain the “best” tree. Plot the pruned tree. How does it compare to your original large tree in terms of the predictors used and number of leaves?
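Pruning to a chosen size can be sketched with `prune.tree()` and its `best` argument. The value `best = 4` below is a placeholder rather than a recommendation, and the data are simulated with an arbitrary seed.

```r
library(tree)

# Simulated training data; not the forest fire data
set.seed(1)
toy_train <- data.frame(temp = runif(150, 5, 30), wind = runif(150, 0, 9))
toy_train$area <- log1p(pmax(0, 0.3 * toy_train$temp - 3 + rnorm(150, sd = 2)))

big_tree <- tree(
  area ~ .,
  data = toy_train,
  control = tree.control(nobs = nrow(toy_train), minsize = 2)
)

# best = 4 is a placeholder; use the size suggested by your own CV plot
pruned_tree <- prune.tree(big_tree, best = 4)

plot(pruned_tree)
text(pruned_tree, pretty = 0)  # add split labels to the plot

# Compare the number of leaves before and after pruning
leaves_big <- sum(big_tree$frame$var == "<leaf>")
leaves_pruned <- sum(pruned_tree$frame$var == "<leaf>")
c(big = leaves_big, pruned = leaves_pruned)
```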
Model comparison
Now, compare your pruned and unpruned trees by making predictions on the test data. You can use the `predict()` function just like we did for linear regression, passing in the fitted model first and specifying the `newdata` argument.
Obtain and report the estimated test RMSEs from both models. Based on your results, did pruning seem to help? Why or why not?
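A test RMSE is the square root of the mean squared difference between the observed and predicted responses on the test set. The sketch below computes it for an unpruned and a pruned tree on simulated data. The helper `rmse()` is a hypothetical convenience function, `best = 4` is a placeholder size, and the seeds are arbitrary.

```r
library(tree)

# Simulated data; not the forest fire data
set.seed(1)
toy <- data.frame(temp = runif(150, 5, 30), wind = runif(150, 0, 9))
toy$area <- log1p(pmax(0, 0.3 * toy$temp - 3 + rnorm(150, sd = 2)))

set.seed(2)
train_ids <- sample(1:nrow(toy), size = round(0.8 * nrow(toy)))
train_dat <- toy[train_ids, ]
test_dat <- toy[-train_ids, ]

big_tree <- tree(
  area ~ .,
  data = train_dat,
  control = tree.control(nobs = nrow(train_dat), minsize = 2)
)
pruned_tree <- prune.tree(big_tree, best = 4)  # placeholder size

# Hypothetical helper: root mean squared error
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

rmse_big <- rmse(test_dat$area, predict(big_tree, newdata = test_dat))
rmse_pruned <- rmse(test_dat$area, predict(pruned_tree, newdata = test_dat))
c(unpruned = rmse_big, pruned = rmse_pruned)
```

Because `area` is on the log scale here, both RMSEs are also on the log scale.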
Examining variability
In class, I mentioned that one disadvantage of regression trees is that they are highly variable. We will explore that here.
Repeat the same analysis from above, but now use a seed of 5 everywhere you previously set a seed.
Maybe helpful hint: remember that the prediction for a new observation \(x_{0}\) is the average of the training responses in the terminal node that \(x_{0}\) falls into.
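To make the hint concrete, here is a base-`R` sketch with a hand-built tree rather than a fitted one. The data, the cutoff of 18, and the function names are all made up for illustration. With one split, a new observation gets the mean training response on its side of the split; with no splits at all (a single terminal node), every observation gets the overall training mean.

```r
# Made-up training data
train <- data.frame(
  temp = c(10, 12, 15, 20, 24, 28),
  area = c(0.0, 0.2, 0.4, 1.0, 1.6, 2.0)
)

# Hypothetical one-split tree: temp < cutoff goes left, otherwise right
predict_stump <- function(temp_new, cutoff = 18) {
  left_mean <- mean(train$area[train$temp < cutoff])    # mean of left leaf
  right_mean <- mean(train$area[train$temp >= cutoff])  # mean of right leaf
  ifelse(temp_new < cutoff, left_mean, right_mean)
}

# Tree with a single terminal node: same prediction for every observation
predict_root <- function(temp_new) {
  rep(mean(train$area), length(temp_new))
}

predict_stump(c(11, 25))
predict_root(c(11, 25))
```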
You only need to provide code to fit, prune, and predict from the tree (i.e. I won’t be looking for plots). The only outputs I am looking for are the test RMSEs from the pruned and unpruned trees.
Instead of answering the questions from the previous section, answer the following:
- Based on this test/train split using a seed of 5, does the pruned or unpruned tree perform better on the test data?
- How “useful” would you say your pruned tree here is for someone who is trying to understand what may impact the area burned in a forest fire in Portugal?
- How does your pruned tree here (seed of 5) compare to the pruned tree in the previous section (seed of 346)?
Submission
When you’re finished, knit to PDF one last time and upload the PDF to Canvas. Commit and push your code back to GitHub one last time.