Live code:
Set-up
Recall that in our mite_dat, we have the following three categorical predictors:
Shrub, which takes values “None”, “Few”, and “Many”
Topo, which takes values “Hummock” and “Blanket”
Substrate, which takes values “Sphagn1”, “Sphagn2”, “Sphagn3”, “Sphagn4”, “Litter”, “Barepeat”, “Interface”
We would like to be able to convert these categorical predictors into quantitative ones in order to compute distances.
Integer encoding
One-hot encoding (few levels)
One-hot encoding (many levels)
The Substrate variable has 7 levels! We could write 7 different if_else() statements, but that seems rather inefficient…
Instead, we will make clever use of the pivot_wider() function. In the code below:
Line 2: create a new place-holder variable value that gives us the mechanism to create dummy variables
Line 3: use pivot_wider() to create new variables, one for each level of Substrate. Each new variable gets its value from value (i.e. a 1) if the original Substrate variable belonged to that level.
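The two steps described above can be sketched on a small example. This is only an illustration under stated assumptions: toy_dat is a made-up tibble standing in for mite_dat, and its id column is hypothetical, playing the role of the other columns that pivot_wider() uses to tell rows apart.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for mite_dat (not the real data set)
toy_dat <- tibble(
  id = 1:4,
  Substrate = c("Sphagn1", "Litter", "Interface", "Sphagn1")
)

toy_wide <- toy_dat %>%
  mutate(value = 1) %>%                                     # place-holder column of 1s
  pivot_wider(names_from = Substrate, values_from = value)  # one new column per level

toy_wide
```

Each row has a 1 only in the column matching its original Substrate level. The remaining cells have no value to draw from, so they are left as NA.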
You should notice that we get a lot of NA values! We just need to replace those NAs with 0s. In the code below:
Line 4: use the values_fill argument to specify that NAs should be 0s
Line 5: use the names_prefix argument to modify the names of our new variables so that they more clearly indicate that they correspond to the same original variable
mite_dat <- mite_dat %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = Substrate, values_from = value,
              values_fill = 0,
              names_prefix = "Sub_")
mite_dat %>%
  slice(1:6)
# A tibble: 6 × 12
  SubsDens WatrCont Shrub Topo   abund…¹ Sub_S…² Sub_L…³ Sub_I…⁴ Sub_S…⁵ Sub_S…⁶
     <dbl>    <dbl> <ord> <fct>    <int>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1     39.2     350. Few   Hummo…       0       1       0       0       0       0
2     55.0     435. Few   Hummo…       0       0       1       0       0       0
3     46.1     372. Few   Hummo…       0       0       0       1       0       0
4     48.2     360. Few   Hummo…       0       1       0       0       0       0
5     23.6     204. Few   Hummo…       0       1       0       0       0       0
6     57.3     312. Few   Hummo…       0       1       0       0       0       0
# … with 2 more variables: Sub_Sphagn2 <dbl>, Sub_Barepeat <dbl>, and
#   abbreviated variable names ¹abundance, ²Sub_Sphagn1, ³Sub_Litter,
#   ⁴Sub_Interface, ⁵Sub_Sphagn3, ⁶Sub_Sphagn4
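One way to sanity-check a one-hot encoding like this is to confirm that every row has exactly one 1 among the new Sub_ columns. The sketch below is an illustration only: toy_dat and its id column are made up, not taken from mite_dat, and the row-sum check is a suggested extra step rather than part of the original walkthrough.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for mite_dat (not the real data set)
toy_dat <- tibble(
  id = 1:4,
  Substrate = c("Sphagn1", "Litter", "Interface", "Sphagn1")
)

toy_encoded <- toy_dat %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = Substrate, values_from = value,
              values_fill = 0,
              names_prefix = "Sub_")

# Each observation belongs to exactly one level, so each row should sum to 1
row_totals <- toy_encoded %>%
  select(starts_with("Sub_")) %>%
  rowSums()

all(row_totals == 1)
```

The same check should carry over to mite_dat, since starts_with("Sub_") does not pick up SubsDens (that name has no underscore after "Sub").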