Live code:

Live code
Logistic regression
Published

April 4, 2023

library(tidyverse)
library(vegan)
data(mite)
data(mite.env)
presence_dat <- mite.env %>%
  add_column(abundance = mite$LRUG) %>%
  mutate(present = ifelse(abundance > 0, 1, 0)) %>%
  select(-abundance)

Running logistic regression

To run a logistic regression model in R, we will use the glm() function, which will take similar inputs to lm(). Besides the different function name, we need to specify we would like to run a logistic regression model as opposed to a different kind of generalized linear model. This is achieved by setting the family argument equal to “binomial”:

test_ids <- 1:3
presence_mod <- glm(present ~ WatrCont + Topo, data = presence_dat[-test_ids,],
                    family = "binomial")

Then, we can use the summary() function and predict() function as usual:

summary(presence_mod)

Call:
glm(formula = present ~ WatrCont + Topo, family = "binomial", 
    data = presence_dat[-test_ids, ])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6817  -0.5995   0.3407   0.5027   1.9611  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -0.538291   1.284897  -0.419  0.67526   
WatrCont     0.006958   0.003160   2.202  0.02769 * 
TopoHummock -2.237473   0.714437  -3.132  0.00174 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 77.977  on 66  degrees of freedom
Residual deviance: 51.959  on 64  degrees of freedom
AIC: 57.959

Number of Fisher Scoring iterations: 5
Warning

For logistic regression using glm(), your response variable must be encoded as either:

  • A numeric 0 or 1, where 1 represents the success class

  • A factor variable with two levels, where the base level is the 0 class

You will get a strange error if your response variable is categorical!

Predictions

Log-odds and probabilities

predict(presence_mod, newdata = presence_dat[test_ids,])
         1          2          3 
-0.3392814  0.2498166 -0.1891887 

Notice that this prediction doesn’t seem the most intuitive. We are interested in the probability of the mite being present, but the prediction here is negative! This is because the default option from predict() for logistic regression is to return the predicted log-odds of the success outcome, as that is the response for the regression, i.e. \(\log\left(\frac{p(x)}{1-p(x)}\right) = \beta_{0} + \beta_{1}\text{WatrCont} + \beta_{2} \text{Topo}\).

If we want the predictions on the probability scale, we need specify an additional argument in predict():

predict(presence_mod, newdata = presence_dat[test_ids,], type = "response")
        1         2         3 
0.4159841 0.5621314 0.4528434 

As a sanity check, we can confirm this matches the predicted log-odds based on the model:

pred_prob <- predict(presence_mod, newdata = presence_dat[test_ids,], type = "response")
log(pred_prob / (1-pred_prob))
         1          2          3 
-0.3392814  0.2498166 -0.1891887 

Success/failure

If we want to explicitly predict a class, we need to choose a threshold value and classify a test observation as a “success” or “failure” depending on the threshold the predicted probability of success:

n_test <- length(test_ids)
threshold <- 0.5
pred_class <- rep(NA, n_test)
for(i in 1:n_test){
  if(pred_prob[i] >= threshold){
    pred_class[i] <- 1
  }
  else{
   pred_class[i] <- 0
  }
}

pred_class
[1] 0 1 0

This for() loop seems like too much code for such a simple idea! Let’s try taking advantage of vectors and booleans. Note that in R, the boolean TRUE is encoded as a numeric 1, and FALSE as a 0:

pred_prob >= threshold
    1     2     3 
FALSE  TRUE FALSE 
pred_class <- as.numeric(pred_prob >= threshold)
pred_class
[1] 0 1 0

What is our misclassification error rate here?

true_class <- presence_dat$present[test_ids]
true_class
[1] 0 0 0
mean(true_class == pred_class)
[1] 0.6666667