library(tidyverse)
Live code:
This lab is intended to re-familiarize yourself with R and RStudio, as well as begin practicing to code in base R. You will need the tidyverse
package.
Vectors
In R, a vector is a data structure that holds or stores elements of the same type. Type may be numeric, integer, character, boolean, etc.
The c()
function
Generally, we create vectors using the c()
function and then save the vector into a variable. In the code below, I create a vector of three values (10, 11, and 12), and save the results into v
:
<- c(10, 11, 12) v
The :
operator
Now, sometimes it’s really useful to create a vector of consecutive numbers, for example, the values 1 through 10. Rather than type out every number explicitly and wrap it in c()
, I can use the :
operator, which looks like a:b
where a
and b
are integers of your choosing. If a < b
, R
will then create a vector of numbers a, a+1, a+2,…, b-1, b
.
<- 1:10
x x
[1] 1 2 3 4 5 6 7 8 9 10
What do you think happens if a > b
? Try the following code for yourself:
<- 10:1 y
The rep()
function
One function that I personally use a lot to create vectors is the rep(a,b)
function, which takes in two argument. The first is the value a
you wish to rep
eat, and the second argument is the number of times b
you’d like to rep
eat it. How would we create a vector of 20 0’s? Think about it, and check:
Code
rep(0, 20)
Matrices
Matrices are the 2D extension of the one-dimensional vectors. When a matrix has n
rows and p
columns, we denote its dimensions as n x p
or “n by p”. We create matrices using the matrix()
function. Because of the multiple dimensions, we need to specify the number of rows and the number of columns:
matrix(NA, nrow = 2, ncol = 3)
[,1] [,2] [,3]
[1,] NA NA NA
[2,] NA NA NA
This code above creates a 2 x 3
matrix of NA
values. The first argument takes in the elements you want to fill the matrix with. This can either be a single value, or a single vector of values.
Data frames
We will create a data frame called my_df
here, which holds the two vectors we created before.
<- data.frame(xvar = x, yvar = y) my_df
Now, if I wanted to only take the variable xvar
from my_df
, how would I do so using dplyr
functions? Take a second to think about it, then check:
Code
%>%
my_df select(xvar)
We will now use base R to access that xvar
variable by using $
notation: <df>$<var_name>
. If you do this yourself, you should notice that immediately after typing the $
, a menu pops up with all the variables contained in the data frame.
$xvar my_df
Now, do you notice the difference between the two outputs?
%>%
my_df select(xvar)
xvar
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
$xvar my_df
[1] 1 2 3 4 5 6 7 8 9 10
Indexing
One of the most useful tools we will use is indexing and index notation.
An index is essentially a numerical representation of an item’s position in a list or vector. It is typically an integer, starting from either 0 or 1 depending on the programming language. In R, our index positions always start at 1!
For example, in the word “learning”, the l is at index 1. Similarly, in our vector v
of the numbers \((10, 11, 12)\), the value at index 1 is 10. We can confirm this with code:
1] v[
[1] 10
Notice that we access the item held in index 1 using the square bracket notation [ ]
Now, I can also replace or modify an element at a given index. I will still access the location using [ ]
, but now I will store/save the new value:
1] <- 13
v[ v
[1] 13 11 12
We can also modify multiple elements at once by passing in a vector of indices to modify, as well as a vector of new values:
2:3] <- c(14, 15) v[
What does v
look like now?
We can also use indices to refer to elements or entire rows and columns of data frames! Unlike vectors, data frame are two-dimensional, i.e. there are both rows and columns. Thus, our index notation will need to adapt to accommodate this feature. We will still use [ ]
notation, but now commas will be introduced:
1,2] my_df[
[1] 10
Based on my_df
, what do you think the [1,2]
means?
Now, we already saw how to access a column of a data frame using the $
notation, but we can also use index notation. To access the first column, we would type:
1] my_df[,
The 1
after the comma tells R
that we want to focus on column 1
.
As I do not type anything before the comma, R
reads this as: “since you did not want a specific row, you must want all the rows”.
How do you think we would access the third column? How about both the first and second row?
Functions
There are a lot of simple functions in R
that we will rely on. We already saw c()
and rep()
. Most of the functions we will use take in a vector or matrix of numeric values, and return either a single number or vector in return.
The function mean()
takes in a vector, and returns the mean of the vector.
mean(x)
[1] 5.5
The function length()
takes in a vector, and returns the number of elements in the vector:
length(x)
[1] 10
The functions max()
and min()
return what you would expect them to!
An extremely useful function we will use is the which()
function. Unlike the previous functions, which()
does not take in a numeric vector. Rather, it takes a vector of boolean values (i.e. TRUE/FALSE values). Then, it returns the indices of the TRUE values in the vector. For example, an input might be:
== -5 y
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
This is comparing each value in y
to see if it equals -5
(recall the double equals sign check for equality). Notice that only one value evaluates to TRUE, specifically the element in index 6. Therefore, if we wrap the which()
function around that argument, we should get 6
in return:
which(y == -5)
integer(0)
Personally, I tend to read this line of code as a question: Which element(s) of y
are exactly equal to -5
?
We know that y
only holds negative values. What do you think happens if we try to evaluate the following. Try it yourself!
which(y == 0)
It’s also entirely possible that many values in the boolean vector are true, in which case the function would return multiple indices. For example, if I want to know which values in y
are negative, I could code:
which(y < 0)
integer(0)
Indexing with boolean vectors
Above, we saw how to access elements held at specific indices of interest. We can also use boolean vectors to return values. Recall our vector v
:
v
[1] 13 14 15
I can index v
by indexing using TRUE
’s for each value that I want, and FALSE
’s otherwise.
c(F, T, F)] v[
[1] 14
Your turn!
Please complete the following exercises in order:
- Create a vector called
my_vec
that holds the values 50 through 100. - Create a new vector called
less60
where an element isTRUE
if the corresponding element inmy_vec
is less than60
, andFALSE
otherwise. - Confirm that the length of your two vectors are the same.
- Pass
less60
into the functionsum()
function. Relate the value obtained to the elements ofless60
. - Modify
my_vec
such that the value at index 10 is 100. - Obtain the index of the maximum values of
my_vec
using functions described above. - Now, pass
my_vec
into thewhich.max()
function. Even though we haven’t seen it before, based on the function name, the name of the function is intuitive. Does the result fromwhich.max()
differ from what you obtained in Ex. 6? How so? - Create a
2 x 5
matrix of the values 1 through 10, where the first row holds the values 1-5, and the second row holds the values 6-10. Hint: look at the help file formatrix
.