1 Objectives

This lesson outlines some common data manipulation tasks with dplyr.

2 Materials

View the slides for this section here.

View the exercises for this section here.

Read more about dplyr here on the tidyverse website, or in the Data Transformation chapter of R for Data Science.

3 Load packages

dplyr is part of the core tidyverse packages, so we install and load this meta-package below.

install.packages("tidyverse")
library(tidyverse)

4 Import data

We’ll cover two methods for importing data into RStudio.

4.1 Import data

We have the path to the original_starwars data stored in our params, but we will also go over how to build this dataset from dplyr::starwars.

4.1.1 URLs

Below we import the original_starwars dataset from the slides using the url. This is similar to providing a local file path (data/original-starwars.csv).

read_csv("https://bit.ly/3qgjqSC")
#> Rows: 6 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): name, hair_color, species, homeworld
#> dbl (2): height, mass
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

4.1.2 `params`

We have the params list from our YAML header, which we can also use to import the data.

params: 
  data_file: !r file.path("https://bit.ly/3qgjqSC")

readr::read_csv(params$data_file)
#> Rows: 6 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): name, hair_color, species, homeworld
#> dbl (2): height, mass
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

5 `dplyr` verbs

This section covers four common dplyr verbs for data manipulation:

select
filter
arrange
mutate

5.1 Selecting columns

These are exercises to try on your own using the select() function.

5.1.1 exercise

Alter the code below to select just the name and homeoworld column:

select(original_starwars, name, species, homeworld)

5.1.2 solution

select(original_starwars, name, homeworld)

5.1.3 exercise

Select only the columns starting with the letter h.

select(original_starwars, starts_with("_"))

5.1.4 solution

select(original_starwars, starts_with("h"))

5.2 Filtering rows

These are some additional exercises for filter()ing data.

5.2.1 exercise

Change the code below so original_starwars only includes the droids.

filter(original_starwars, species == "____")

5.2.2 solution

filter(original_starwars, species == "Droid")

5.2.3 exercise

Change the code below so original_starwars only includes data from the homeworlds of Tatooine and Alderaan

filter(original_starwars, 
       homeworld %in% c("________", "________"))

5.2.4 solution

filter(original_starwars, 
       homeworld %in% c("Tatooine", "Alderaan"))

5.3 Arranging data

arrange() sorts variables on their content, numeric or character.

5.3.1 exercise

Sort original_starwars according to the hair_color.

arrange(original_starwars, "____ _____")

5.3.2 solution

Note that the missing values are sorted to the bottom.

arrange(original_starwars, hair_color)

5.3.3 exercise

Sort original_starwars by height and mass, descending.

arrange(original_starwars, desc(______, ____))

5.3.4 solution

Including two variables is helpful if some of the values ‘tie’.

arrange(original_starwars, desc(height, mass))
#> Error: `desc()` must be called with exactly one argument.

5.4 Mutate columns

mutate() can create new columns, or change existing columns.

5.4.1 exercise

Alter the code below to create a bmi column for starwars characters in original_starwars.

mutate(original_starwars, 
       bmi = ____ / ((______ / 100) ^ 2))

5.4.2 solution

Note the use of parentheses here.

mutate(original_starwars, 
       bmi = mass / ((height / 100) ^ 2))

5.4.3 exercise

Round the new bmi variable to 1 digit.

mutate(original_starwars, 
       bmi = mass / ((height / 100) ^ 2),
       bmi = _____(___, digits = _))

5.4.4 solution

mutate(original_starwars, 
       bmi = mass / ((height / 100) ^ 2),
       bmi = round(bmi, digits = 1))

6 The pipe

Clearly written code makes it easier for machines and humans to read. The pipe (%>%) from magrittr package allows us to chain together multiple operations into functional ‘pipelines’.

6.1 Function pipelines

The pipe (%>%) takes an object that comes before it, and it drops the object into the function that comes after it.

6.1.1 exercise

Rewrite the code below to use the pipe

select(filter(
          original_starwars, mass < ___), 
              ____, hair_color, _______, homeworld)

6.1.2 solution

This actually wouldn’t matter what order it was in–both would return the same result.

original_starwars %>% 
  filter(mass < 100) %>% 
  select(name, hair_color, species, homeworld)

6.1.3 exercise

Perform the following operations without using the pipe.

create object x with three values, (3, 7, 12)
get the mean() of x, and store it in mean_x
take the squared root of mean_x

# 1) 
_ <- c(_, _, __)
# 2) 
mean_x <- ____(x)
# 3 
sqrt(______)

6.1.4 soluton

This returns a vector, not a tibble.

# 1) 
x <- c(3, 7, 12)
# 2) 
mean_x <- mean(x)
# 3 
sqrt(mean_x)
#> [1] 2.708013

6.1.5 exercise

Perform the following operations with the pipe.

create object x with three values, (3, 7, 12)
get the mean() of x, and store it in mean_x
take the squared root of mean_x

c(_, _, __) %>% 
  ____() %>% 
  ____()

6.1.6 solution

Note that we can create a pipeline without even creating an object.

c(3, 7, 12) %>% 
  mean() %>% 
  sqrt()
#> [1] 2.708013

Introduction to Data Manipulation