3/28/23
Lab 04: Forest Fires due to Canvas this Thursday 11:59pm
Lab 03 should be graded by end of today!
Take-home midterm assigned this Friday (will introduce it today)
Economically use a collected dataset by repeatedly drawing samples from the same training dataset and fitting a model of interest on each sample
Two methods: cross-validation and the bootstrap
These slides will focus on the bootstrap
The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method
Example: can be used to estimate the standard errors of the \(\beta\) coefficients in linear regression
One goal of statistics: learn about a population.
Bootstrapping operates by resampling this sample data to create many simulated samples
Bootstrapping resamples the original dataset with replacement
If the original dataset has \(n\) observations, then each bootstrap/resampled dataset also has \(n\) observations
Suppose I want to know the true proportion of plain M&M candies that are colored red
My sample is a bag of M&Ms that I purchased at a gas station, which contains 56 pieces with the following distribution:
[1] orange orange yellow red blue red red green green blue
[11] blue blue green orange brown orange green red orange brown
[21] red blue green blue orange orange blue orange brown orange
[31] orange yellow orange blue brown green brown blue green orange
[41] brown green brown yellow yellow brown blue orange green green
[51] orange brown orange blue blue blue
Levels: red orange yellow green blue brown
obs
red orange yellow green blue brown
5 15 4 10 13 9
Good first guess for the true proportion of red candies?
How would we go about creating a range of plausible estimates? We could bootstrap!
To obtain a single bootstrap sample, we repeatedly pull out an M&M, note its color, and return it to the bag until we have pulled out a total of \(n = 56\) candies
We typically repeat this process many times, to simulate taking multiple observations from the population
[1] blue red green orange green red blue yellow orange orange
[11] yellow red blue orange blue orange orange orange orange red
[21] green yellow red orange green blue green green green orange
[31] orange orange orange green brown brown green brown green blue
[41] brown orange yellow orange orange yellow red orange blue green
[51] green green green orange blue green
Levels: red orange yellow green blue brown
\(\hat{p}^{(2)}_{red} = 0.036\)
Repeat this process thousands of times!
…
After 1000 bootstrap samples, we end up with 1000 estimates for \(p_{red}\)
Average over all estimates is \(\hat{p} _{red}= 0.091\)
Suppose my original sample has the following \(n = 5\) observations: (1, 0, -2, 0.5, 4).
Which of the following are possible bootstrap samples we could obtain from the original sample?
(0, 0, 0, 0, 0)
(1, -2, -2, 3, 4)
(1, 0, -2, 0.5, 4)
(4, -2, 0)
Real world vs bootstrap world
Pros:
Cons: