Lab 04: Regression trees

Forest fires

Introduction

The purpose of this lab is to gain familiarity and practice with fitting and evaluating regression trees in R. Go ahead and clone your new lab-04-forest-fires GitHub project. You will work with the .Rmd file called lab-04-forest-fires.Rmd

Data

For this assignment, you will predict the size of forest fires in the northeast region of Portugal using meteorological and other covariates. The original data were obtained from the UCI Machine Learning Repository, and I have modified them slightly for the purposes of this implementation.

Each row in the data set represents one fire. We have the following variables:

fire_id: a variable to identify each fire
X, Y: coordinates for the location of the fire
month: month of year
day: day of week
FFMC: Fine Fuel Moisture Code, represents fuel moisture of forest litter fuels under the shade of a forest canopy
DMC: Duff Moisture Code, represents fuel moisture of decomposed organic material underneath the litter
drought: drought status of location (“Low”, “Moderate”, “Extreme”)
ISI: Initial Spread Index, used to predict fire behavior
temp: temperature (Celsius)
RH: relative humidity (%)
wind: wind speed (km/h)
rain: outside rain (mm/m2)
area: the burned area of the forest (hectares). An area of 0 means that an area of lower than 100 square meters was burned.

Goal

Using regression trees, we will predict the size of a fire given some of these features. We will also compare prediction performance under different modeling choices.

Prepare data

The data are in forest_fires.csv in your data folder of this project. We will also require the tidyverse and tree packages. Go ahead and load the data and libraries now.

Please save the data using the variable name fire_dat.

Wrangle

Wrangle your data to only retain observations from the months of March, July, August, and September, then remove month from the data set. Also, we will not consider the fire_id, day of week, nor geographic location as predictors. Lastly, recall that the tree() function requires all categorical variables to be coded as factors.

Modify your data to make all the required changes.

EDA

Visualize and describe the distribution of the burned `area`.

Wrangle (again)

Lastly, if you were to make a histogram of the response variable area, you would notice it is heavily right-skewed. One way to address this issue is to log-transform area. However, many observations have an observed area = 0, and the log of 0 is \(-\infty\). A common way to get around this is to take the log of (response variable + 1).

Create a new data frame called fire_dat_log, where you over-write the area variable using the appropriate log transform described above.

Using your new fire_dat_log, create a summary table where for each level of drought, the table displays the mean and standard deviation of the log-burned area, and the total number of observations that fall into that level. Based on what you see, do you think drought will be an important variable in our regression tree? Why or why not?

Commit reminder

This would be a good time to knit, commit, and push your changes to GitHub!

Regression tree for log area

Train/test ids

We will compare a pruned regression tree to an unpruned regression tree to see if the pruning is actually helpful for predictions.

Using a seed of 346, split your fire_dat_log data into an 80% training set and a 20% test set.

Grow large tree

Fit a regression tree to the training data for your logged area of the forest fires using all of the other variables as predictors. Explicitly let R grow a large tree by setting the control arguments in tree() to have minsize = 2 (see live code for refresher).

Display a summary() of your regression tree. How many leaves are there? Was your intuition correct about whether or not drought would be an important predictor for the log burned area?

Cost-complexity pruning

Now, we will prune back the tree using cost-complexity. Because we will be performing k-fold CV, we should set a seed again in order to have reproducibility of the assignment of observations to folds.

Set a seed of 346 again, and perform cost-complexity pruning using 10-fold CV.

From your output, make a plot of the size of the candidate pruned trees on the x-axis and the CV deviance estimates on the y-axis (see live code for example plot). Based on your plot, which size tree should we use?

Prune the tree

Based on your previous answer, prune your original tree to obtain the “best” tree. Plot the pruned tree. How does it compare to your original large tree in terms of the predictors used and number of leaves?

Model comparison

Now, compare your pruned and unpruned trees by making predictions on the test data. You can use the predict() function just like we did for linear regression, passing in the fitted model first and specifying the newdata argument.

Obtain and report the estimated test RMSEs from both models. Based on your results, did pruning seem to help? Why or why not?

Commit reminder

This would be a good time to knit, commit, and push your changes to GitHub!

Examining variability

In class, I mentioned that one disadvantage of regression trees is that they are highly variable. We will explore that here.

Repeat your same analysis from above, but now setting seeds of 5.

Maybe helpful hint: remember that the prediction for a new observation \(x_{0}\) is the average of the training responses in the terminal node that \(x_{0}\) falls into.

You only need to provide code and to fit, prune, and predict from the tree (i.e. I won’t be looking for plots). The only outputs I am looking for are the test RMSEs from the pruned and unpruned trees.

Instead of answering the questions from the previous section, answer the following:

Based on this test/train split using a seed of 5, does the pruned or unpruned tree perform better on the test data?
How “useful” would you say your pruned tree here is for someone who is trying to understand what may impact the area burned in a forest fire in Portugal?
How does your pruned tree here (seed of 5) compare to the pruned tree in the previous section (seed of 346)?

Submission

When you’re finished, knit to PDF one last time and upload the PDF to Canvas. Commit and push your code back to GitHub one last time.