Live code
Encoding categorical variables
Published: March 7, 2023

Set-up

library(tidyverse)
library(vegan)
data(mite)
data(mite.env)
mite_dat <- mite.env %>%
  add_column(abundance = mite$LRUG)

Recall that in our mite_dat, we have the following three categorical predictors:

  1. Shrub, which takes values “None”, “Few”, and “Many”

  2. Topo, which takes values “Hummock” and “Blanket”

  3. Substrate, which takes values “Sphagn1”, “Sphagn2”, “Sphagn3”, “Sphagn4”, “Litter”, “Barepeat”, and “Interface”

We would like to be able to convert these categorical predictors into quantitative ones in order to compute distances.

Integer encoding

# integer encoding: map each Shrub level to a number that preserves its order
mite_dat <- mite_dat %>%
  mutate(Shrub_encode = case_when(
    Shrub == "None" ~ 0,
    Shrub == "Few" ~ 1,
    Shrub == "Many" ~ 2
  ))

# compare your new variable to confirm it's correct:
mite_dat %>%
  select(Shrub, Shrub_encode) %>%
  View()
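
Because View() opens an interactive viewer, a tabulation can be easier to check at a glance. The sketch below uses a small made-up tibble, toy, in place of mite_dat (an assumption for illustration only). It also shows a possible shortcut, Shrub_encode_alt, which assumes Shrub is stored as an ordered factor with levels None < Few < Many; with a different level order the shortcut would give a different coding.

```r
library(dplyr)

# Hypothetical toy data standing in for mite_dat
toy <- tibble(
  Shrub = factor(c("None", "Few", "Many", "Few"),
                 levels = c("None", "Few", "Many"), ordered = TRUE)
)

toy <- toy %>%
  mutate(
    # explicit mapping, as in the code above
    Shrub_encode = case_when(
      Shrub == "None" ~ 0,
      Shrub == "Few" ~ 1,
      Shrub == "Many" ~ 2
    ),
    # shortcut: factor levels are stored as 1, 2, 3, so subtract 1
    Shrub_encode_alt = as.integer(Shrub) - 1
  )

# one row per distinct (level, code) combination;
# each level should pair with exactly one code
toy %>%
  count(Shrub, Shrub_encode, Shrub_encode_alt)
```

If a level appeared with more than one code, or with NA, the mapping in case_when() would need another look.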

One-hot encoding (few levels)

# one-hot encoding: one 0/1 indicator column per level of Topo
mite_dat <- mite_dat %>%
  mutate(
    Topo_hummock = if_else(Topo == "Hummock", 1, 0),
    Topo_blanket = if_else(Topo == "Blanket", 1, 0)
  )

# compare your new variables to confirm they're correct:
mite_dat %>%
  select(Topo, Topo_hummock, Topo_blanket) %>%
  View()
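
One way to run this check without an interactive viewer is sketched below. It uses a made-up tibble, toy, rather than mite_dat, and the helper column n_hot is a hypothetical name introduced only for this example. In a valid one-hot encoding, every row has exactly one indicator equal to 1.

```r
library(dplyr)

# Hypothetical toy data standing in for mite_dat
toy <- tibble(Topo = factor(c("Hummock", "Blanket", "Blanket"))) %>%
  mutate(
    Topo_hummock = if_else(Topo == "Hummock", 1, 0),
    Topo_blanket = if_else(Topo == "Blanket", 1, 0)
  )

# tabulate each level against its indicator columns
toy %>%
  count(Topo, Topo_hummock, Topo_blanket)

# every row should have exactly one indicator switched on
check <- toy %>%
  mutate(n_hot = Topo_hummock + Topo_blanket) %>%
  summarise(all_one = all(n_hot == 1))
check
```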

One-hot encoding (many levels)

The Substrate variable has 7 levels! We could write 7 different if_else() statements, but that seems rather inefficient…

Instead, we will make clever use of the pivot_wider() function. In the code below:

  • Line 2: create a new placeholder variable, value, that equals 1 in every row. It supplies the values for the dummy variables.

  • Line 3: pivot_wider() to create new variables, one for each level of Substrate. Each new variable takes its value from value (i.e. a 1) if the original Substrate variable belonged to that level, and is NA otherwise.

mite_dat %>%
  mutate(value = 1) %>% 
  pivot_wider(names_from = Substrate, values_from = value) 

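You should notice that we get a lot of NA values! They appear because each row belongs to only one level of Substrate, so the columns for all other levels have nothing to fill in. We just need to replace those NAs with 0s. In the code below: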

  • Line 4: use the values_fill argument to specify that NAs should be 0s

  • Line 5: modify the names of our new variables to more clearly indicate that they correspond to the same original variable

mite_dat <- mite_dat %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = Substrate, values_from = value, 
              values_fill = 0,
              names_prefix = "Sub_")
mite_dat %>%
  slice(1:6)
# A tibble: 6 × 12
  SubsDens WatrCont Shrub Topo   abund…¹ Sub_S…² Sub_L…³ Sub_I…⁴ Sub_S…⁵ Sub_S…⁶
     <dbl>    <dbl> <ord> <fct>    <int>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1     39.2     350. Few   Hummo…       0       1       0       0       0       0
2     55.0     435. Few   Hummo…       0       0       1       0       0       0
3     46.1     372. Few   Hummo…       0       0       0       1       0       0
4     48.2     360. Few   Hummo…       0       1       0       0       0       0
5     23.6     204. Few   Hummo…       0       1       0       0       0       0
6     57.3     312. Few   Hummo…       0       1       0       0       0       0
# … with 2 more variables: Sub_Sphagn2 <dbl>, Sub_Barepeat <dbl>, and
#   abbreviated variable names ¹​abundance, ²​Sub_Sphagn1, ³​Sub_Litter,
#   ⁴​Sub_Interface, ⁵​Sub_Sphagn3, ⁶​Sub_Sphagn4
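
To connect the encodings back to the goal of computing distances, here is a minimal sketch using base R's dist() (Euclidean by default) on a made-up three-row tibble, toy. The column names mirror those created above, but the values are invented for illustration and are not taken from the mite data.

```r
library(dplyr)

# Hypothetical toy data: three observations, each with a different
# Shrub level and a different Substrate level
toy <- tibble(
  Shrub_encode  = c(0, 1, 2),
  Sub_Litter    = c(1, 0, 0),
  Sub_Sphagn1   = c(0, 1, 0),
  Sub_Interface = c(0, 0, 1)
)

# distances based on the integer-encoded variable only
d_int <- dist(toy %>% select(Shrub_encode))

# distances based on the one-hot encoded variable only
d_hot <- dist(toy %>% select(starts_with("Sub_")))

d_int
d_hot
```

Under the integer encoding, “None” and “Many” are twice as far apart as “None” and “Few”, so the encoding carries the ordering into the distance. Under the one-hot encoding, every pair of distinct levels is the same distance apart (the square root of 2), which suits a variable with no natural order.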