Live code:
Running logistic regression
To run a logistic regression model in R, we will use the glm() function, which will take similar inputs to lm(). Besides the different function name, we need to specify we would like to run a logistic regression model as opposed to a different kind of generalized linear model. This is achieved by setting the family argument equal to “binomial”:
Then, we can use the summary() function and predict() function as usual:
Call:
glm(formula = present ~ WatrCont + Topo, family = "binomial",
data = presence_dat[-test_ids, ])
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6817 -0.5995 0.3407 0.5027 1.9611
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.538291 1.284897 -0.419 0.67526
WatrCont 0.006958 0.003160 2.202 0.02769 *
TopoHummock -2.237473 0.714437 -3.132 0.00174 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 77.977 on 66 degrees of freedom
Residual deviance: 51.959 on 64 degrees of freedom
AIC: 57.959
Number of Fisher Scoring iterations: 5
For logistic regression using glm(), your response variable must be encoded as either:
A numeric 0 or 1, where 1 represents the success class
A factor variable with two levels, where the base level is the 0 class
You will get a strange error if your response variable is categorical!
Predictions
Log-odds and probabilities
Notice that this prediction doesn’t seem the most intuitive. We are interested in the probability of the mite being present, but the prediction here is negative! This is because the default option from predict() for logistic regression is to return the predicted log-odds of the success outcome, as that is the response for the regression, i.e. \(\log\left(\frac{p(x)}{1-p(x)}\right) = \beta_{0} + \beta_{1}\text{WatrCont} + \beta_{2} \text{Topo}\).
If we want the predictions on the probability scale, we need specify an additional argument in predict():
1 2 3
0.4159841 0.5621314 0.4528434
As a sanity check, we can confirm this matches the predicted log-odds based on the model:
Success/failure
If we want to explicitly predict a class, we need to choose a threshold value and classify a test observation as a “success” or “failure” depending on the threshold the predicted probability of success:
n_test <- length(test_ids)
threshold <- 0.5
pred_class <- rep(NA, n_test)
for(i in 1:n_test){
if(pred_prob[i] >= threshold){
pred_class[i] <- 1
}
else{
pred_class[i] <- 0
}
}
pred_class[1] 0 1 0
This for() loop seems like too much code for such a simple idea! Let’s try taking advantage of vectors and booleans. Note that in R, the boolean TRUE is encoded as a numeric 1, and FALSE as a 0:
1 2 3
FALSE TRUE FALSE
[1] 0 1 0
What is our misclassification error rate here?