Lab 02: Linear regression

Lab assignment

Moneyball

Note

Throughout this document, you will see text in different colors. The text in maroon/red denotes the “deliverable” (i.e. what I will be looking for you to code/answer/address in your submission.

Introduction

Moneyball tells the success story of the Oakland A baseball team in 2002. In 2001, the team was extremely poor, and so the General Manager named Billy Beane needed ideas on how to improve the team with limited financial resources.

Billy Beane and his colleague Paul DePodesta did some analysis and explored models to assemble a competitive baseball team. Beane hypothesized that some skills of a baseball player were overvalued, whereas others undervalued. If they could detect the undervalued skills, they could find good players at a bargain (i.e. cheaper) contract. We will re-create their findings here.

Load in the data using the following code (if you get an error, make sure you are set to the correct file directory):

baseball <- read.csv("data/baseball.csv")

For terminology, according to Wikipedia, “a run is scored when a player advances around first, second and third base and returns safely to home plate, touching the bases in that order, before three outs are recorded…The object of the game is for a team to score more runs than its opponent.”

Each observation represents a team in a given year. The data dictionary is as follows:

Team: MLB team
League: National League (NL) or American League (AL)
Year: Year
RS: Total runs scored
RA: Total runs allowed (the amount of runs that score against a pitcher)
W: Total wins in the season
OBP: On-base percentage (how frequently a batter reaches base per plate appearance)
SLG: Slugging percentage (total bases divided by at bats)
BA: Batting average (total hit divided by total at-bats)
Playoffs: If the team made it to the playoffs (1) or did not (0)
OOBP: Opponent’s on-base percentage
OSLG: Opponent’s slugging percentage

Part 1: EDA and Data Wrangling

Data wrangling

The data provided to you have data ranging from 1962 to 2012. To re-create this famous analysis, we need to pretend it’s the year 2002 and thus we only have data through 2001.

Create a new data frame called moneyball with the appropriate subset of the baseball data.

EDA

The goal of any MLB team (I think) is to make it to the playoffs. Billy Beane determined that a team needs to win at least 95 games to make it to the playoffs.

Using some appropriate EDA, demonstrate how you think Billy Beane arrived at the number 95.

More data wrangling

So, Billy Beane needed some way to understand what influences/determines the number of wins a team had in a given season. It was determined that the run differential was an important metric, calculated as the overall runs scored minus the runs allowed. A positive run differential means the team scores more than it allows (this is good).

Modify your moneyball data frame to add a new variable called RD that is the run differential.

More EDA

Create a scatterplot of a team’s wins versus its run differential in each season. Does there appear to be a linear relationship?

Commit reminder

This would be a good time to knit, commit, and push your changes to GitHub!

Part 2: Model for wins

Fit the model

Fit a simple linear regression model with a team’s wins as the response variable and the run differential as the predictor. Call this model mod_wins. Interpret the coefficients.

Recall that we need at least 95 wins to enter the playoffs. Based on your model, how large of a run differential do we need to get into the playoffs?

Commit reminder

This would be a good time to knit, commit, and push your changes to GitHub!

Part 3: Components of run differential

Recall that the runs scored and the runs allowed determine the run differential. So, we also need to understand which variables impact both of these components and how.

Model for runs scored

The Oakland A’s discovered that a team’s on-base percentage (OBP), the slugging percentage (SLG), and the batting average (BA) were important for determining how many runs are scored (RS).

Fit a linear regression model called mod_rs1 for the runs scored as the response regressed on these three predictors. Is a team’s batting average important for explaining its runs scored? Why or why not?

The Oakland A’s determined that a team’s batting average was overvalued. Because of this, the Oakland A’s decided to not consider batting average.

Use your fitted model mod_rs1 to explain how they came to this conclusion. Then fit another linear regression model called mod_rs2 for runs scored regressed only on on-base percentage and the slugging percentage.

Model for runs allowed

The Oakland A’s found that the runs allowed (RA) were influenced by the opponents on-base percentage (OOBP) and the opponents slugging percentage (OSLG).

Fit a multiple linear regression model called mod_ra for this relationship.

Commit reminder

This would be a good time to knit, commit, and push your changes to GitHub!

Part 4: Putting it all together

Recap: for the upcoming 2002 baseball season, we need at least 95 wins to enter the playoffs. We fit a model (mod_wins) for a team’s wins based on its run differential. So, we need to predict the run differential for our team in the upcoming season. To do this, we can predict the runs scored and runs allowed for our new team given some statistics.

Create new team

Paul DePodesta ultimately formulated a team of players with the following statistics:

OBP: 0.339
SLG: 0.430
OOBP: 0.307
OSLG: 0.373

Create a new data frame called pauls_team that contains these four statistics of the new team (i.e. pauls_team should have one row and four columns).

Predictions

Using your models mod_rs2 and mod_ra and the predict() function, predict the runs scored and runs allowed for pauls_team. Based on these two predictions, what is the predicted run differential?

Based on the predicted run differential, what is the predicted number of wins for our team in the upcoming season? Should we expect to enter the playoffs?

Commit reminder

This would be a good time to knit, commit, and push your changes to GitHub!

Part 5: Prediction performance

Billy Beane and Paul DePodesta ultimately decided to remove batting average from their model for runs scored because they had limited financial resources and wanted to find the skills that were undervalued. However, it could be the case that knowing the batting average of a player is important to explaining the runs scored, and also predicting the runs scored. After all, we had to make predictions for the 2002 season, so we would like a model that predicts the runs scored well. Let’s examine this now to see if the team would have made better predictions of runs scored if they had kept the batting average in the model. We will predict for the upcoming year 2002.

Data preparation

Create a new data frame called baseball2002 that contains the observations from the year 2002 in the original baseball data.

Then, create a new a variable called RS_true that holds the vector of true runs scored in 2002.

Predict

Obtain predictions of runs scored for the baseball2002 data using both mod_rs1 and mod_rs2 (i.e. you should have two vectors of predictions).

Please do not report/display/print the vectors of predictions! Just store them using an appropriate variable name.

Prediction error

Lastly, calculate and report the root mean squared error (RMSE) for the predictions from both of the models.

Based on your results, do you think the Oakland A’s were correct to remove batting average from their model for runs scored? Why or why not?

Brief comprehension questions

Please answer each of the following questions. Each question only requires a 1-2 sentence response at most!

For this analysis, did we have train and test datasets? If so, what was the train set and what was the test set?
For this analysis, did we ask questions concerning prediction? If so, where?
For this analysis, did we ask questions concerning inference? If so, where?

Submission

When you’re finished, knit to PDF one last time and upload the PDF to Canvas. Commit and push your code back to GitHub one last time.