Live code:
Running logistic regression
To run a logistic regression model in R
, we will use the glm()
function, which will take similar inputs to lm()
. Besides the different function name, we need to specify we would like to run a logistic regression model as opposed to a different kind of g
eneralized linear model. This is achieved by setting the family
argument equal to “binomial”:
Then, we can use the summary()
function and predict()
function as usual:
Call:
glm(formula = present ~ WatrCont + Topo, family = "binomial",
data = presence_dat[-test_ids, ])
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6817 -0.5995 0.3407 0.5027 1.9611
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.538291 1.284897 -0.419 0.67526
WatrCont 0.006958 0.003160 2.202 0.02769 *
TopoHummock -2.237473 0.714437 -3.132 0.00174 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 77.977 on 66 degrees of freedom
Residual deviance: 51.959 on 64 degrees of freedom
AIC: 57.959
Number of Fisher Scoring iterations: 5
Predictions
Log-odds and probabilities
Notice that this prediction doesn’t seem the most intuitive. We are interested in the probability of the mite being present
, but the prediction here is negative! This is because the default option from predict()
for logistic regression is to return the predicted log-odds of the success outcome, as that is the response for the regression, i.e. \(\log\left(\frac{p(x)}{1-p(x)}\right) = \beta_{0} + \beta_{1}\text{WatrCont} + \beta_{2} \text{Topo}\).
If we want the predictions on the probability scale, we need specify an additional argument in predict()
:
1 2 3
0.4159841 0.5621314 0.4528434
As a sanity check, we can confirm this matches the predicted log-odds based on the model:
Success/failure
If we want to explicitly predict a class, we need to choose a threshold value and classify a test observation as a “success” or “failure” depending on the threshold the predicted probability of success:
n_test <- length(test_ids)
threshold <- 0.5
pred_class <- rep(NA, n_test)
for(i in 1:n_test){
if(pred_prob[i] >= threshold){
pred_class[i] <- 1
}
else{
pred_class[i] <- 0
}
}
pred_class
[1] 0 1 0
This for()
loop seems like too much code for such a simple idea! Let’s try taking advantage of vectors and booleans. Note that in R
, the boolean TRUE
is encoded as a numeric 1, and FALSE
as a 0:
1 2 3
FALSE TRUE FALSE
[1] 0 1 0
What is our misclassification error rate here?