This lesson outlines some common data manipulation tasks with dplyr.
View the slides for this section here.
View the exercises for this section here.
Read more about dplyr here on the tidyverse website, or in the Data Transformation chapter of R for Data Science.
dplyr is part of the core tidyverse packages, so we install and load this meta-package below.
install.packages("tidyverse")
library(tidyverse)
We’ll cover two methods for importing data into RStudio.
We have the path to the original_starwars data stored in our params, but we will also go over how to build this dataset from dplyr::starwars.
Below we import the original_starwars dataset from the slides using the URL. This is similar to providing a local file path (data/original-starwars.csv).
original_starwars <- read_csv("https://bit.ly/3qgjqSC")
#> Rows: 6 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): name, hair_color, species, homeworld
#> dbl (2): height, mass
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
params
We have the params list from our YAML header, which we can also use to import the data.
params:
  data_file: !r file.path("https://bit.ly/3qgjqSC")
original_starwars <- readr::read_csv(params$data_file)
#> Rows: 6 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): name, hair_color, species, homeworld
#> dbl (2): height, mass
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
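The second method, building the data from dplyr::starwars, could look roughly like the sketch below. The six columns match the column specification printed above. Taking the first six rows is an assumption, since this section does not say which characters the CSV contains, and starwars_sketch is a hypothetical name chosen so the imported data is not overwritten.

```r
library(dplyr)

# Keep the six columns listed in the column specification above.
six_columns <- select(
  dplyr::starwars,
  name, height, mass, hair_color, species, homeworld
)

# Assumption: the first six rows stand in for the six characters in the CSV.
starwars_sketch <- head(six_columns, 6)

dim(starwars_sketch)
```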
dplyr verbs
This section covers four common dplyr verbs for data manipulation:
select
filter
arrange
mutate
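As a quick orientation before the exercises, here is a minimal sketch of what each of the four verbs does. The toy tibble and its column names are hypothetical and are not part of the lesson's data.

```r
library(dplyr)

# Hypothetical toy data, used only to illustrate the four verbs.
toy <- tibble(
  id    = c("a", "b", "c"),
  score = c(10, 30, 20),
  group = c("x", "y", "x")
)

select(toy, id, score)           # keep only some columns
filter(toy, group == "x")        # keep only rows that meet a condition
arrange(toy, score)              # reorder rows by a column
mutate(toy, double = score * 2)  # add a column computed from another
```

Each verb takes a data frame as its first argument and returns a new data frame, leaving toy itself unchanged.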
These are exercises to try on your own using the select() function.
Alter the code below to select just the name and homeworld columns:
select(original_starwars, name, species, homeworld)
select(original_starwars, name, homeworld)
Select only the columns starting with the letter h.
select(original_starwars, starts_with("_"))
select(original_starwars, starts_with("h"))
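starts_with() is one of several selection helpers that match columns by name. The sketch below uses a hypothetical one-row tibble, with column names borrowed from the lesson's data, to show which columns three of these helpers pick out.

```r
library(dplyr)

# Hypothetical one-row tibble, used only to show which columns each helper matches.
toy <- tibble(name = "A", height = 1, hair_color = "none", homeworld = "X")

names(select(toy, starts_with("h")))    # "height" "hair_color" "homeworld"
names(select(toy, ends_with("color")))  # "hair_color"
names(select(toy, contains("o")))       # "hair_color" "homeworld"
```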
These are some additional exercises for filtering data with filter().
Change the code below so original_starwars only includes the droids.
filter(original_starwars, species == "____")
filter(original_starwars, species == "Droid")
Change the code below so original_starwars only includes data from the homeworlds of Tatooine and Alderaan.
filter(original_starwars,
       homeworld %in% c("________", "________"))
filter(original_starwars,
       homeworld %in% c("Tatooine", "Alderaan"))
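The %in% operator keeps rows whose value matches any element of the vector on its right, which is a compact alternative to chaining == comparisons with |. A minimal sketch on hypothetical data, including a missing homeworld to show that both forms drop it:

```r
library(dplyr)

# Hypothetical data, not the lesson's dataset.
toy <- tibble(
  name      = c("A", "B", "C", "D"),
  homeworld = c("Tatooine", "Naboo", "Alderaan", NA)
)

with_in <- filter(toy, homeworld %in% c("Tatooine", "Alderaan"))
with_or <- filter(toy, homeworld == "Tatooine" | homeworld == "Alderaan")

# Both versions keep the same rows.
identical(with_in$name, with_or$name)
```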
arrange() sorts rows by the values of one or more variables, numeric or character.
Sort original_starwars according to hair_color.
arrange(original_starwars, "____ _____")
Note that the missing values are sorted to the bottom.
arrange(original_starwars, hair_color)
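The placement of missing values can be checked on a small example. The sketch below uses hypothetical data and shows that NA ends up last whether the sort is ascending or wrapped in desc().

```r
library(dplyr)

# Hypothetical data with one missing hair colour.
toy <- tibble(
  name       = c("A", "B", "C", "D"),
  hair_color = c("brown", NA, "blond", "none")
)

arrange(toy, hair_color)$name        # "C" "A" "D" "B"
arrange(toy, desc(hair_color))$name  # "D" "A" "C" "B"
```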
Sort original_starwars by height and mass, descending.
arrange(original_starwars, desc(______), desc(____))
Including two variables is helpful if some of the values ‘tie’. Each variable needs its own desc() call.
arrange(original_starwars, desc(height), desc(mass))
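To see the tie-breaking in action, here is a small sketch on hypothetical data (not the lesson's dataset) in which two heights tie and mass, wrapped in its own desc(), decides the order. desc() accepts a single vector, so each sorting column gets a separate call.

```r
library(dplyr)

# Hypothetical data: A and B have the same height.
toy <- tibble(
  name   = c("A", "B", "C"),
  height = c(180, 180, 170),
  mass   = c(70, 80, 90)
)

# height ties for A and B, so mass (also descending) decides their order.
arrange(toy, desc(height), desc(mass))$name  # "B" "A" "C"
```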
mutate() can create new columns, or change existing columns.
Alter the code below to create a bmi column for the Star Wars characters in original_starwars.
mutate(original_starwars,
       bmi = ____ / ((______ / 100) ^ 2))
Note the use of parentheses here.
mutate(original_starwars,
       bmi = mass / ((height / 100) ^ 2))
Round the new bmi variable to 1 digit.
mutate(original_starwars,
       bmi = mass / ((height / 100) ^ 2),
       bmi = _____(___, digits = _))
mutate(original_starwars,
       bmi = mass / ((height / 100) ^ 2),
       bmi = round(bmi, digits = 1))
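The two uses of mutate(), creating a column and changing an existing one, can be traced on a single row. The sketch below assumes a hypothetical character who is 172 cm tall and weighs 77 kg. The intermediate height_m column is an illustration and not part of the exercise.

```r
library(dplyr)

# Hypothetical character: 172 cm tall, 77 kg.
toy <- tibble(name = "A", height = 172, mass = 77)

result <- mutate(
  toy,
  height_m = height / 100,          # new column: centimetres to metres
  bmi      = mass / height_m^2,     # new column built from the one just created
  bmi      = round(bmi, digits = 1) # same name again, so this overwrites bmi
)

result$bmi  # 77 / 1.72^2 is about 26.03, which rounds to 26
```

Arguments to mutate() are evaluated in order, which is why the second bmi line can refer to the value created on the line before it.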
Clearly written code is easier for both machines and humans to read. The pipe (%>%) from the magrittr package allows us to chain together multiple operations into functional ‘pipelines’.
The pipe (%>%) takes the object that comes before it and passes it as the first argument to the function that comes after it.
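A minimal sketch of that behaviour, using base R functions and hypothetical values, compares each piped call with its ordinary equivalent:

```r
library(magrittr)

x <- c(1, 4, 9)

# x %>% f() is the same as f(x)
identical(x %>% sqrt(), sqrt(x))

# x %>% f(y) is the same as f(x, y): the left-hand side becomes the first argument
identical(3.14159 %>% round(digits = 2), round(3.14159, digits = 2))
```

Both comparisons return TRUE.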
Rewrite the code below to use the pipe:
select(filter(
  original_starwars, mass < ___),
  ____, hair_color, _______, homeworld)
The nested call and the pipeline return the same result. Note that filter() has to come before select() here, because select() drops the mass column that the filter condition needs.
original_starwars %>%
  filter(mass < 100) %>%
  select(name, hair_color, species, homeworld)
Perform the following operations without using the pipe.
1) Create x with three values (3, 7, 12)
2) Take the mean() of x, and store it in mean_x
3) Take the square root of mean_x
# 1)
_ <- c(_, _, __)
# 2)
mean_x <- ____(x)
# 3)
sqrt(______)
This returns a vector, not a tibble.
# 1)
x <- c(3, 7, 12)
# 2)
mean_x <- mean(x)
# 3)
sqrt(mean_x)
#> [1] 2.708013
Perform the following operations with the pipe.
1) Create x with three values (3, 7, 12)
2) Take the mean() of x, and store it in mean_x
3) Take the square root of mean_x
c(_, _, __) %>%
  ____() %>%
  ____()
Note that we can create a pipeline without even creating an object.
c(3, 7, 12) %>%
  mean() %>%
  sqrt()
#> [1] 2.708013