This lesson outlines some common data manipulation tasks with dplyr.
View the slides for this section here.
View the exercises for this section here.
Read more about dplyr here on the tidyverse website, or in the Data Transformation chapter of R for Data Science.
dplyr is part of the core tidyverse packages, so we install and load this meta-package below.
install.packages("tidyverse")
library(tidyverse)
We’ll cover two methods for importing data into RStudio.
We have the path to the original_starwars data stored in our params, but we will also go over how to build this dataset from dplyr::starwars.
Below we import the original_starwars dataset from the slides using the url. This is similar to providing a local file path (data/original-starwars.csv).
read_csv("https://bit.ly/3qgjqSC")
#> Rows: 6 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): name, hair_color, species, homeworld
#> dbl (2): height, mass
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
paramsWe have the params list from our YAML header, which we can also use to import the data.
params:
data_file: !r file.path("https://bit.ly/3qgjqSC")
readr::read_csv(params$data_file)
#> Rows: 6 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): name, hair_color, species, homeworld
#> dbl (2): height, mass
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dplyr verbsThis section covers four common dplyr verbs for data manipulation:
selectfilterarrangemutateThese are exercises to try on your own using the select() function.
Alter the code below to select just the name and homeoworld column:
select(original_starwars, name, species, homeworld)
select(original_starwars, name, homeworld)
Select only the columns starting with the letter h.
select(original_starwars, starts_with("_"))
select(original_starwars, starts_with("h"))
These are some additional exercises for filter()ing data.
Change the code below so original_starwars only includes the droids.
filter(original_starwars, species == "____")
filter(original_starwars, species == "Droid")
Change the code below so original_starwars only includes data from the homeworlds of Tatooine and Alderaan
filter(original_starwars,
homeworld %in% c("________", "________"))
filter(original_starwars,
homeworld %in% c("Tatooine", "Alderaan"))
arrange() sorts variables on their content, numeric or character.
Sort original_starwars according to the hair_color.
arrange(original_starwars, "____ _____")
Note that the missing values are sorted to the bottom.
arrange(original_starwars, hair_color)
Sort original_starwars by height and mass, descending.
arrange(original_starwars, desc(______, ____))
Including two variables is helpful if some of the values ‘tie’.
arrange(original_starwars, desc(height, mass))
#> Error: `desc()` must be called with exactly one argument.
mutate() can create new columns, or change existing columns.
Alter the code below to create a bmi column for starwars characters in original_starwars.
mutate(original_starwars,
bmi = ____ / ((______ / 100) ^ 2))
Note the use of parentheses here.
mutate(original_starwars,
bmi = mass / ((height / 100) ^ 2))
Round the new bmi variable to 1 digit.
mutate(original_starwars,
bmi = mass / ((height / 100) ^ 2),
bmi = _____(___, digits = _))
mutate(original_starwars,
bmi = mass / ((height / 100) ^ 2),
bmi = round(bmi, digits = 1))
Clearly written code makes it easier for machines and humans to read. The pipe (%>%) from magrittr package allows us to chain together multiple operations into functional ‘pipelines’.
The pipe (%>%) takes an object that comes before it, and it drops the object into the function that comes after it.
Rewrite the code below to use the pipe
select(filter(
original_starwars, mass < ___),
____, hair_color, _______, homeworld)
This actually wouldn’t matter what order it was in–both would return the same result.
original_starwars %>%
filter(mass < 100) %>%
select(name, hair_color, species, homeworld)
Perform the following operations without using the pipe.
x with three values, (3, 7, 12)mean() of x, and store it in mean_xmean_x# 1)
_ <- c(_, _, __)
# 2)
mean_x <- ____(x)
# 3
sqrt(______)
This returns a vector, not a tibble.
# 1)
x <- c(3, 7, 12)
# 2)
mean_x <- mean(x)
# 3
sqrt(mean_x)
#> [1] 2.708013
Perform the following operations with the pipe.
x with three values, (3, 7, 12)mean() of x, and store it in mean_xmean_xc(_, _, __) %>%
____() %>%
____()
Note that we can create a pipeline without even creating an object.
c(3, 7, 12) %>%
mean() %>%
sqrt()
#> [1] 2.708013