Subsetting is an important operation on data objects in R. This section will cover how to:
- Subset vectors using brackets ([), a range of elements (:), the dollar sign ($), and a vector of positions supplied with c()
- Subset matrices and arrays using [ , ], [, and c()
- Create data.frames and tibbles, and subset them using a variety of methods
- Subset lists using [[, [, and $, and recognize the class of each element that’s returned
Load the tidyverse package.
library(tidyverse)
Subsetting is an important topic because it’s how we can get access to the information stored in R objects. Inevitably we’ll end up with some kind of data stored in some kind of object, and in order to do anything to that data, we need to access it.
There are three primary methods for subsetting objects in R: [, [[, and $.
Vectors are the fundamental data type in R. Below we assign five numbers to the vector num_vec.
num_vec <- c(2, 9, 4, 3, 7)
num_vec
#> [1] 2 9 4 3 7
These five numbers exist in five different positions within the vector, and we can subset them using a bracket with the numerical index.
[]
Here are the numbers located at position [1] and position [5]:
num_vec[1]
#> [1] 2
num_vec[5]
#> [1] 7
:
If we want a range of values from num_vec, we can use the colon in the index:
num_vec[1:3]
#> [1] 2 9 4
This returns a ‘subvector’ of num_vec consisting of elements 1 through 3.
str(num_vec[1:3])
#> num [1:3] 2 9 4
is.vector(num_vec[1:3])
#> [1] TRUE
[] with c()
We can also use vectors to subset other vectors. The code below returns the same result as num_vec[1:3]:
num_vec[c(1, 2, 3)]
#> [1] 2 9 4
We can also create a new vector, and use this to subset num_vec.
x <- c(1, 2, 3)
num_vec[x]
#> [1] 2 9 4
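Because the index is itself just a vector of positions, its order and contents decide what comes back. The short sketch below (the names reversed, idx, and repeated are only for illustration) shows that positions can be requested in any order and more than once.

```r
num_vec <- c(2, 9, 4, 3, 7)

# positions are returned in the order they are requested
reversed <- num_vec[c(5, 1)]
reversed
#> [1] 7 2

# a position can be requested more than once
idx <- c(1, 1, 3)
repeated <- num_vec[idx]
repeated
#> [1] 2 2 4
```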
[] vs. <-
So far, we’ve been subsetting num_vec in a way that only returns the requested elements. If we wanted to subset num_vec and store the output in a new vector, we would need the assignment operator (<-):
small_num_vec <- num_vec[1:3]
small_num_vec
#> [1] 2 9 4
We’ll now move into subsetting higher-dimensional objects. We went over these objects in a previous lesson.
Here we create mat_data, a matrix with three rows and two columns. We also supply a set of dimnames.
mat_data <- matrix(
data = c(0.2, 0.4, 0.8, 5, 15, 150),
nrow = 3,
ncol = 2,
dimnames = list(
c("row_1", "row_2", "row_3"),
c("col_1", "col_2")
),
byrow = FALSE
)
mat_data
#> col_1 col_2
#> row_1 0.2 5
#> row_2 0.4 15
#> row_3 0.8 150
We can see this is a two-dimensional object with rows and columns.
[, ]
To subset a matrix, the syntax is object[row, column]. So if we wanted the number at the intersection of the third row and second column (150), we can pass these positions inside brackets [3, 2].
mat_data[3, 2]
#> [1] 150
If we only want a single row or column from mat_data, we can leave the other index empty (keeping the comma):
mat_data[3, ]
#> col_1 col_2
#> 0.8 150.0
mat_data[, 2]
#> row_1 row_2 row_3
#> 5 15 150
[] with c()
We can also control the order in which the matrix elements are returned. If we want the second and first rows of mat_data (in that order), we can pass c(2, 1) as the row index inside [] and leave the column index empty, and R will return both columns for those rows.
mat_data[c(2, 1), ]
#> col_1 col_2
#> row_2 0.4 15
#> row_1 0.2 5
It’s also important to note that if we subset a matrix in a way that returns a single element, it will return a vector.
mat_data[2, 1]
#> [1] 0.4
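One way to confirm what kind of object each subset returns is to test it with is.matrix(). The sketch below rebuilds mat_data so it runs on its own; it also uses the dimnames in place of positions, which works because the matrix was created with row and column names. The names single, two_rows, and by_name are only for illustration.

```r
mat_data <- matrix(
  data = c(0.2, 0.4, 0.8, 5, 15, 150),
  nrow = 3,
  ncol = 2,
  dimnames = list(
    c("row_1", "row_2", "row_3"),
    c("col_1", "col_2")
  )
)

# a single element is a plain numeric vector, not a matrix
single <- mat_data[2, 1]
is.matrix(single)
#> [1] FALSE

# two rows with both columns are still a matrix
two_rows <- mat_data[c(2, 1), ]
is.matrix(two_rows)
#> [1] TRUE

# the dimnames can be used in place of positions
by_name <- mat_data["row_3", "col_2"]
by_name
#> [1] 150
```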
Arrays contain a collection of equal-dimension matrices. Just like matrices, they have a fixed number of rows and columns, but they also have a third dimension called a layer. See the image below for a conceptual illustration of a three-row, three-column, two-layer array (3 × 3 × 2).
Now we’ll create an array (array_dat) with 3 rows, 3 columns, and 3 layers.
array_dat <- array(
data = c(
seq(0.3, 2.7, by = 0.3),
seq(0.5, 4.5, by = 0.5),
seq(3, 27, by = 3)
),
dim = c(3, 3, 3)
)
array_dat
#> , , 1
#>
#> [,1] [,2] [,3]
#> [1,] 0.3 1.2 2.1
#> [2,] 0.6 1.5 2.4
#> [3,] 0.9 1.8 2.7
#>
#> , , 2
#>
#> [,1] [,2] [,3]
#> [1,] 0.5 2.0 3.5
#> [2,] 1.0 2.5 4.0
#> [3,] 1.5 3.0 4.5
#>
#> , , 3
#>
#> [,1] [,2] [,3]
#> [1,] 3 12 21
#> [2,] 6 15 24
#> [3,] 9 18 27
[]
As the number of dimensions increases, so does the number of commas required for subsetting. If we want the third row of the second layer of array_dat, we would use the following:
array_dat[3, , 2]
#> [1] 1.5 3.0 4.5
Note that this returns a vector.
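A quick way to check this is to look at the dim attribute of the result, which is NULL for a plain vector. The sketch below rebuilds array_dat so it is self-contained, and also shows the base R drop = FALSE argument, which keeps all three dimensions when that is preferred. The name row_3_layer_2 is only for illustration.

```r
array_dat <- array(
  data = c(
    seq(0.3, 2.7, by = 0.3),
    seq(0.5, 4.5, by = 0.5),
    seq(3, 27, by = 3)
  ),
  dim = c(3, 3, 3)
)

row_3_layer_2 <- array_dat[3, , 2]

# the result has no dim attribute, so it is a plain vector
dim(row_3_layer_2)
#> NULL
is.vector(row_3_layer_2)
#> [1] TRUE

# drop = FALSE keeps the result as a 1 x 3 x 1 array
dim(array_dat[3, , 2, drop = FALSE])
#> [1] 1 3 1
```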
[] with c()
We can also use a vector to subset array_dat by index (or position). Below we get the third and second columns (in that order) of the first row of layer two.
array_dat[1, c(3, 2), 2]
#> [1] 3.5 2.0
If we only supply a row index (array_dat[1, , ]), R returns the first row of every layer in a single matrix. Each layer’s first row becomes a column of that matrix, so the values are arranged by columns, not rows.
array_dat[1, , ]
#> [,1] [,2] [,3]
#> [1,] 0.3 0.5 3
#> [2,] 1.2 2.0 12
#> [3,] 2.1 3.5 21
Here is the original arrangement of the first rows:
And here is the returned matrix, presented as columns:
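The same rearrangement can be checked in code. The sketch below rebuilds array_dat, pulls the first row out of each layer separately, and compares it with the matching column of array_dat[1, , ]. The name first_rows is only for illustration.

```r
array_dat <- array(
  data = c(
    seq(0.3, 2.7, by = 0.3),
    seq(0.5, 4.5, by = 0.5),
    seq(3, 27, by = 3)
  ),
  dim = c(3, 3, 3)
)

first_rows <- array_dat[1, , ]

# the first row of each layer, taken one layer at a time
array_dat[1, , 1]
#> [1] 0.3 1.2 2.1
array_dat[1, , 2]
#> [1] 0.5 2.0 3.5

# column k of the result is the first row of layer k
identical(first_rows[, 2], array_dat[1, , 2])
#> [1] TRUE

# transposing puts each layer's first row back into a row
t(first_rows)
```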
Data frames and tibbles are rectangular representations of data (like spreadsheets). Data frames and tibbles contain vectors of equal length.
data.frame/tibble
To create the Simpsons data.frame we can use the following function:
Simpsons <- data.frame(
character = c("Homer", "Marge", "Bart", "Lisa"),
age = c(39, 36, 10, 8),
sex = factor(c("Male", "Female", "Male", "Female"))
)
str(Simpsons)
#> 'data.frame': 4 obs. of 3 variables:
#> $ character: chr "Homer" "Marge" "Bart" "Lisa"
#> $ age : num 39 36 10 8
#> $ sex : Factor w/ 2 levels "Female","Male": 2 1 2 1
To create the AmericanDad tibble we can use the following function:
AmericanDad <- tibble::tribble(
~character, ~age, ~sex,
"Stan", 42, "Male",
"Francine", 40, "Female",
"Steven", 15, "Male",
"Hayley", 19, "Female"
) %>%
# convert to factor
mutate(sex = factor(sex, levels = c("Female", "Male")))
str(AmericanDad)
#> tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
#> $ character: chr [1:4] "Stan" "Francine" "Steven" "Hayley"
#> $ age : num [1:4] 42 40 15 19
#> $ sex : Factor w/ 2 levels "Female","Male": 2 1 2 1
$
The dollar sign ($) can be used to extract a column from a data frame or tibble by name.
Simpsons$character
#> [1] "Homer" "Marge" "Bart" "Lisa"
Note that both of these return vectors.
AmericanDad$character
#> [1] "Stan" "Francine" "Steven" "Hayley"
[]
We can use the row and column index to subset data frames and tibbles just like matrices and arrays.
# homer's age
Simpsons[1, 2]
#> [1] 39
The output is a little different when subsetting tibbles, which return a 1 × 1 tibble instead of a vector:
# Stan's age
AmericanDad[1, 2]
Note that when we subset the data frame with a value in the row index (i.e. Simpsons[2, ]), R returns a data frame. However, if we subset Simpsons with a value in the column index (i.e. Simpsons[, 2]), we get a vector.
Simpsons[2, ]
Simpsons[, 2]
#> [1] 39 36 10 8
But when we subset a tibble, both return a tibble:
AmericanDad[2, ]
AmericanDad[, 2]
The same is true if we supply values to both the row and column indexes: the data frame returns a vector, and the tibble returns a tibble.
# check structure
Simpsons[1, 2] # Homer's age
#> [1] 39
AmericanDad[1, 2] # Stan's age
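The difference in return types can be confirmed with class(). The sketch below uses a two-column copy of Simpsons so it runs on its own, and shows the base R drop = FALSE argument, which asks a data frame to keep its rectangular shape. The tibble comparison is wrapped in a check so that it only runs when the tibble package is installed.

```r
Simpsons <- data.frame(
  character = c("Homer", "Marge", "Bart", "Lisa"),
  age = c(39, 36, 10, 8)
)

# a single column is simplified to a vector by default
class(Simpsons[, 2])
#> [1] "numeric"

# drop = FALSE keeps the result as a data frame
class(Simpsons[, 2, drop = FALSE])
#> [1] "data.frame"

# the tibble comparison only runs if the tibble package is installed
if (requireNamespace("tibble", quietly = TRUE)) {
  AmericanDad <- tibble::tibble(
    character = c("Stan", "Francine", "Steven", "Hayley"),
    age = c(42, 40, 15, 19)
  )
  # a tibble stays a tibble without any extra argument
  print(class(AmericanDad[, 2]))
}
```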
[] & c()
We can use a numeric index for rows along with the names of the vectors (or columns) to subset data frames and tibbles:
# Lisa's age and sex
Simpsons[4, c("age", "sex")]
Note that the Simpsons data frame gives us a row-name (4), while the AmericanDad tibble only returns the two columns.
# Francine's age and sex
AmericanDad[2, c("age", "sex")]
$ and []
Because the dollar sign returns a vector, we can subset this output by combining it with brackets ([]):
# Bart's age
Simpsons$age[3]
#> [1] 10
Both of these return a vector.
# Steven's age
AmericanDad$age[3]
#> [1] 15
$ & ==
We can combine $ with == to return a logical vector:
Simpsons$age == 36
#> [1] FALSE TRUE FALSE FALSE
AmericanDad$age == 15
#> [1] FALSE FALSE TRUE FALSE
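We can also compare $ output against a set of values created with c() to return a logical vector. Note that == compares the two vectors position by position, recycling the shorter one, so these examples only match because the values line up with the positions of the ages; %in% is the safer operator for testing membership in a set.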
Simpsons$age == c(39, 8)
#> [1] TRUE FALSE FALSE TRUE
AmericanDad$age == c(42, 40)
#> [1] TRUE TRUE FALSE FALSE
This might not seem very helpful, but it comes in handy when we combine this with [].
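The way == recycles its shorter argument is easy to see in a small sketch. The example below uses a standalone vector (ages, a hypothetical copy of the Simpsons ages) and compares == with %in% when the same two values are supplied in a different order.

```r
ages <- c(39, 36, 10, 8)

# == compares position by position and recycles the shorter vector,
# so the same two values in a different order match nothing
in_order <- ages == c(39, 8)
out_order <- ages == c(8, 39)
in_order
#> [1]  TRUE FALSE FALSE  TRUE
out_order
#> [1] FALSE FALSE FALSE FALSE

# %in% asks whether each age appears anywhere in the set
by_set <- ages %in% c(8, 39)
by_set
#> [1]  TRUE FALSE FALSE  TRUE
```

When the goal is "rows whose value is any of these", %in% gives the same answer regardless of the order of the values.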
[], $ and ==
Below we combine the logical output from $ and == with [] to subset data frames and tibbles:
# return the rows for Stan and Hayley (ages 42 and 19)
AmericanDad[AmericanDad$age == c(42, 19), ]
# return the rows in Simpsons where age is 36
Simpsons[Simpsons$age == 36, ]
Again, the output from the data frame (Simpsons) is a data frame with a row-name (2).
[[]]
We can control the type of the subsetted result for both tibbles and data frames using double brackets ([[]]).
Let’s review the behavior of single brackets. If we use a single bracket (without commas) and a numerical index of 1, we get the first column in both Simpsons and AmericanDad as a rectangular object.
# character column from data frame
str(Simpsons[1])
#> 'data.frame': 4 obs. of 1 variable:
#> $ character: chr "Homer" "Marge" "Bart" "Lisa"
# character column from tibble
str(AmericanDad[1])
#> tibble [4 × 1] (S3: tbl_df/tbl/data.frame)
#> $ character: chr [1:4] "Stan" "Francine" "Steven" "Hayley"
If we use double-brackets, we get the same first column, but as a vector.
# character column as vectors
str(Simpsons[[1]])
#> chr [1:4] "Homer" "Marge" "Bart" "Lisa"
str(AmericanDad[[1]])
#> chr [1:4] "Stan" "Francine" "Steven" "Hayley"
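Double brackets also accept a column name in quotes. As a small sketch on a two-column copy of Simpsons, the example below checks that position, name, and $ all return the same vector, while single brackets with a name keep the data frame. The names by_position and by_name are only for illustration.

```r
Simpsons <- data.frame(
  character = c("Homer", "Marge", "Bart", "Lisa"),
  age = c(39, 36, 10, 8)
)

# double brackets by position and by name return the same vector
by_position <- Simpsons[[2]]
by_name <- Simpsons[["age"]]
identical(by_position, by_name)
#> [1] TRUE

# and both match the result of $
identical(by_name, Simpsons$age)
#> [1] TRUE

# single brackets with a name keep the data frame
class(Simpsons["age"])
#> [1] "data.frame"
```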
Lists are special kinds of objects. Their contents can be items of different data types and lengths. Read more about lists in Advanced R.
list
Below we’ll create a list of lesser-known Star Wars characters called sw_list.
list(
name = c(
"Wedge Antilles", "Boba Fett",
"Mon Mothma", "Darth Maul", "Dud Bolt"
),
height = c(170L, 183L, 150L, 175L, 94L),
sex = factor(c(2L, 2L, 1L, 2L, 2L), labels = c("female", "male")),
films = list(
c(
"The Empire Strikes Back", "Return of the Jedi",
"A New Hope"
),
c(
"The Empire Strikes Back", "Attack of the Clones",
"Return of the Jedi"
),
"Return of the Jedi",
"The Phantom Menace",
"The Phantom Menace"
)
) -> sw_list
str(sw_list)
#> List of 4
#> $ name : chr [1:5] "Wedge Antilles" "Boba Fett" "Mon Mothma" "Darth Maul" ...
#> $ height: int [1:5] 170 183 150 175 94
#> $ sex : Factor w/ 2 levels "female","male": 2 2 1 2 2
#> $ films :List of 5
#> ..$ : chr [1:3] "The Empire Strikes Back" "Return of the Jedi" "A New Hope"
#> ..$ : chr [1:3] "The Empire Strikes Back" "Attack of the Clones" "Return of the Jedi"
#> ..$ : chr "Return of the Jedi"
#> ..$ : chr "The Phantom Menace"
#> ..$ : chr "The Phantom Menace"
We can see the first few vectors in sw_list look like the columns of a data.frame (name through sex), but films is itself a list whose elements have different lengths, because each character has been in a varying number of films.
$
If we use the $ symbol, R returns the named element itself, keeping its original type.
# heights
str(sw_list$height)
#> int [1:5] 170 183 150 175 94
The films are stored as a list in sw_list, so using $ will return a list of character vectors (chr).
# films
str(sw_list$films)
#> List of 5
#> $ : chr [1:3] "The Empire Strikes Back" "Return of the Jedi" "A New Hope"
#> $ : chr [1:3] "The Empire Strikes Back" "Attack of the Clones" "Return of the Jedi"
#> $ : chr "Return of the Jedi"
#> $ : chr "The Phantom Menace"
#> $ : chr "The Phantom Menace"
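Because $ hands back the nested list, further subsetting can be chained onto it. The sketch below uses sw_small, a hypothetical two-character copy of sw_list, to pull out one character's films and then one film from that vector.

```r
# a trimmed, illustrative copy of sw_list with two characters
sw_small <- list(
  name = c("Wedge Antilles", "Boba Fett"),
  films = list(
    c("The Empire Strikes Back", "Return of the Jedi", "A New Hope"),
    c("The Empire Strikes Back", "Attack of the Clones", "Return of the Jedi")
  )
)

# [[2]] pulls the second character's films out of the nested list
boba_films <- sw_small$films[[2]]
length(boba_films)
#> [1] 3

# a further [] subsets that character vector
boba_films[2]
#> [1] "Attack of the Clones"
```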
[] & [[]]
There are two levels for subsetting lists with brackets: [] and [[]]. A great way to think about these two levels of subsetting is captured in the tweet below:
If the #rstats list “x” is a train carrying objects, then x[[5]] is the object in car 5; x[4:6] is a train of cars 4-6. One R Tip a Day [@RLangTip](https://twitter.com/RLangTip/)
So, if sw_list is the ‘train’, then sw_list[[4]] is the object in car 4.
# object in car 4
str(sw_list[[4]])
#> List of 5
#> $ : chr [1:3] "The Empire Strikes Back" "Return of the Jedi" "A New Hope"
#> $ : chr [1:3] "The Empire Strikes Back" "Attack of the Clones" "Return of the Jedi"
#> $ : chr "Return of the Jedi"
#> $ : chr "The Phantom Menace"
#> $ : chr "The Phantom Menace"
And sw_list[1:4] is the train of cars 1-4:
str(sw_list[1:4])
#> List of 4
#> $ name : chr [1:5] "Wedge Antilles" "Boba Fett" "Mon Mothma" "Darth Maul" ...
#> $ height: int [1:5] 170 183 150 175 94
#> $ sex : Factor w/ 2 levels "female","male": 2 2 1 2 2
#> $ films :List of 5
#> ..$ : chr [1:3] "The Empire Strikes Back" "Return of the Jedi" "A New Hope"
#> ..$ : chr [1:3] "The Empire Strikes Back" "Attack of the Clones" "Return of the Jedi"
#> ..$ : chr "Return of the Jedi"
#> ..$ : chr "The Phantom Menace"
#> ..$ : chr "The Phantom Menace"
$, [], and [[]]
Below we compare subsetting lists with $, [], and [[]]. We can see $ and [[]] return identical objects.
# check $
str(sw_list$name)
#> chr [1:5] "Wedge Antilles" "Boba Fett" "Mon Mothma" "Darth Maul" ...
# check [[]]
str(sw_list[[1]])
#> chr [1:5] "Wedge Antilles" "Boba Fett" "Mon Mothma" "Darth Maul" ...
# test for identical? (compare the objects themselves, since str() returns NULL)
identical(x = sw_list$name, y = sw_list[[1]])
#> [1] TRUE
However, if we use [], we get a list.
str(sw_list[1])
#> List of 1
#> $ name: chr [1:5] "Wedge Antilles" "Boba Fett" "Mon Mothma" "Darth Maul" ...
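The class of each result makes the distinction explicit. The sketch below uses sw_small, a hypothetical two-element copy of sw_list, to compare what [[]], $, and [] return.

```r
# a trimmed, illustrative copy of sw_list
sw_small <- list(
  name = c("Wedge Antilles", "Boba Fett"),
  height = c(170L, 183L)
)

# [[ ]] and $ return the element itself
class(sw_small[[1]])
#> [1] "character"
class(sw_small$height)
#> [1] "integer"

# [ ] always returns a list, even for a single element
class(sw_small[1])
#> [1] "list"

# [ ] can also return several elements at once, still as a list
length(sw_small[c("name", "height")])
#> [1] 2
```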