This lesson outlines some common data manipulation tasks with dplyr.
The slides for this presentation are here.
There is also an accompanying RStudio.Cloud project here.
Read more about dplyr here on the tidyverse website, or in the Data Transformation chapter of R for Data Science.
dplyr is part of the core tidyverse packages, so we install and load this meta-package below.
install.packages("tidyverse")
library(tidyverse)
We’ll cover two methods for importing data into RStudio.
We have the path to the original_starwars data stored in our params, but we will also go over how to build this dataset from dplyr::starwars.
Below we import the original_starwars dataset from the slides using the URL. This is similar to providing a local file path (data/original-starwars.csv).
original_starwars <- read_csv("https://bit.ly/3qgjqSC")
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> name = col_character(),
#> height = col_double(),
#> mass = col_double(),
#> hair_color = col_character(),
#> species = col_character(),
#> homeworld = col_character()
#> )
params
We have the params list from our YAML header, which we can also use to import the data.
params:
  data_file: !r file.path("https://bit.ly/3qgjqSC")
readr::read_csv(params$data_file)
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> name = col_character(),
#> height = col_double(),
#> mass = col_double(),
#> hair_color = col_character(),
#> species = col_character(),
#> homeworld = col_character()
#> )
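The introduction mentions building this dataset from dplyr::starwars. A minimal sketch of the column step follows. The column list is taken from the column specification printed above; which rows the slides keep is not stated in this lesson, so no row filter is applied, and starwars_cols is a hypothetical name used only for illustration.

```r
library(dplyr)

# Keep only the six columns listed in the column specification above.
# ASSUMPTION: original_starwars may contain fewer rows than dplyr::starwars;
# this sketch does not try to reproduce that row subset.
starwars_cols <- select(
  dplyr::starwars,
  name, height, mass, hair_color, species, homeworld
)

# Inspect the column names and types
glimpse(starwars_cols)
```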
dplyr verbs
This section covers four common dplyr verbs for data manipulation:
select
filter
arrange
mutate
These are exercises to try on your own using the select() function.
Alter the code below to select just the name and homeworld columns:
select(original_starwars, name, species, homeworld)
select(original_starwars, name, homeworld)
Select only the columns starting with the letter h.
select(original_starwars, starts_with("_"))
select(original_starwars, starts_with("h"))
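Beyond starts_with(), selection helpers can be combined, and columns can be dropped with a minus sign. The sketch below uses dplyr::starwars as a stand-in for original_starwars (an assumption, made so the example runs without downloading the CSV); sw, h_or_s, and dropped are hypothetical names.

```r
library(dplyr)

# Stand-in for original_starwars with the same six columns (assumption)
sw <- select(dplyr::starwars, name, height, mass, hair_color, species, homeworld)

# Combine helpers with |: columns starting with "h" OR ending with "s"
h_or_s <- select(sw, starts_with("h") | ends_with("s"))
names(h_or_s)

# A minus sign drops columns instead of keeping them
dropped <- select(sw, -mass, -height)
names(dropped)
```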
These are some additional exercises for filtering data with filter().
Change the code below so original_starwars only includes the droids.
filter(original_starwars, species == "____")
filter(original_starwars, species == "Droid")
Change the code below so original_starwars only includes data from the homeworlds of Tatooine and Alderaan.
filter(original_starwars,
       homeworld %in% c("________", "________"))
filter(original_starwars,
       homeworld %in% c("Tatooine", "Alderaan"))
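To make the role of %in% concrete, the sketch below shows that it gives the same rows as chaining == comparisons with |, and that comma-separated conditions in filter() are combined with AND. It uses dplyr::starwars as a stand-in for original_starwars (an assumption), and the object names are hypothetical.

```r
library(dplyr)

sw <- dplyr::starwars  # stand-in for original_starwars (assumption)

# %in% is shorthand for several == comparisons joined with |
with_in <- filter(sw, homeworld %in% c("Tatooine", "Alderaan"))
with_or <- filter(sw, homeworld == "Tatooine" | homeworld == "Alderaan")
identical(with_in$name, with_or$name)

# Conditions separated by commas must all be TRUE for a row to be kept
tatooine_droids <- filter(sw, species == "Droid", homeworld == "Tatooine")
tatooine_droids$name
```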
arrange() sorts the rows of a dataset by the values of one or more variables, numeric or character.
Sort original_starwars according to hair_color.
arrange(original_starwars, "____ _____")
Note that the missing values are sorted to the bottom.
arrange(original_starwars, hair_color)
Sort original_starwars by height and mass, descending.
arrange(original_starwars, desc(______), desc(____))
arrange(original_starwars, desc(height), desc(mass))
Including two variables is helpful if some of the values ‘tie’.
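The tie-breaking behaviour can be seen on a tiny example. The sketch below uses a hypothetical four-row tibble (toy, with made-up values) in which two rows share the same height, so the second sorting variable decides their order. Note that desc() takes a single variable, so each column gets its own desc() call.

```r
library(dplyr)

# Hypothetical toy data: rows "a" and "c" tie on height
toy <- tibble::tibble(
  name   = c("a", "b", "c", "d"),
  height = c(180, 170, 180, NA),
  mass   = c(70, 60, 90, 80)
)

# Sort by height (descending); ties are broken by mass (descending)
sorted <- arrange(toy, desc(height), desc(mass))
sorted$name
#> [1] "c" "a" "b" "d"
```

The row with the missing height still ends up last, even though the sort is descending.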
mutate() can create new columns, or change existing columns.
Alter the code below to create a bmi column for starwars characters in original_starwars.
mutate(original_starwars,
       bmi = ____ / ((______ / 100) ^ 2))
Note the use of parentheses here.
mutate(original_starwars,
       bmi = mass / ((height / 100) ^ 2))
Round the new bmi variable to 1 digit.
mutate(original_starwars,
       bmi = mass / ((height / 100) ^ 2),
       bmi = _____(___, digits = _))
mutate(original_starwars,
       bmi = mass / ((height / 100) ^ 2),
       bmi = round(bmi, digits = 1))
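To check the arithmetic by hand, the sketch below runs the same calculation on a single row. It assumes height is in centimetres and mass in kilograms, which is why height is divided by 100 before squaring. The one-row tibble and the height_m helper column are hypothetical and only there to make the intermediate step visible.

```r
library(dplyr)

# One hand-typed row (illustrative values)
one_row <- tibble::tibble(name = "Luke Skywalker", height = 172, mass = 77)

with_bmi <- mutate(one_row,
                   height_m = height / 100,               # 172 cm -> 1.72 m
                   bmi = round(mass / height_m^2, digits = 1))

# 77 / 1.72^2 = 77 / 2.9584, roughly 26.03, which rounds to 26.0
with_bmi$bmi
#> [1] 26
```

A column created earlier in a mutate() call (height_m here) can be used by later expressions in the same call, which is also what lets bmi be created and then rounded in one step.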
Clearly written code is easier for both machines and humans to read. The pipe (%>%) from the magrittr package allows us to chain together multiple operations into functional ‘pipelines’.
The pipe (%>%) takes the object that comes before it and passes it as the first argument to the function that comes after it.
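Put differently, x %>% f(y) is evaluated as f(x, y). The short sketch below illustrates this with base R functions; the vector values and object names are hypothetical.

```r
library(dplyr)  # dplyr re-exports the magrittr pipe, %>%

vals <- c(1.234, 5.678)

# The left-hand side becomes the first argument of the call on the right
piped  <- vals %>% round(digits = 1)
nested <- round(vals, digits = 1)
identical(piped, nested)
#> [1] TRUE

# Chained steps read left to right instead of inside out
c(4, 9, 16) %>% sqrt() %>% sum()
#> [1] 9
```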
Rewrite the code below to use the pipe:
select(filter(
  original_starwars, mass < ___),
  ____, hair_color, _______, homeworld)
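The nested call and the piped version return the same result; the pipe only changes how the code reads. Note that filter() has to come before select() here, because select() drops the mass column that filter() needs.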
original_starwars %>%
  filter(mass < 100) %>%
  select(name, hair_color, species, homeworld)
Perform the following operations without using the pipe.
1) Create an object x with three values (3, 7, 12)
2) Get the mean() of x, and store it in mean_x
3) Take the square root of mean_x
# 1)
_ <- c(_, _, __)
# 2)
mean_x <- ____(x)
# 3)
sqrt(______)
This returns a vector, not a tibble.
# 1)
x <- c(3, 7, 12)
# 2)
mean_x <- mean(x)
# 3)
sqrt(mean_x)
#> [1] 2.708013
Perform the following operations with the pipe.
1) Create an object x with three values (3, 7, 12)
2) Get the mean() of x, and store it in mean_x
3) Take the square root of mean_x
c(_, _, __) %>%
  ____() %>%
  ____()
Note that we can create a pipeline without even creating an object.
c(3, 7, 12) %>%
  mean() %>%
  sqrt()
#> [1] 2.708013