This document outlines some common data reshaping tasks with tidyr and introduces the concept of tidy data.
View the slides for this section here.
Read more about dplyr here and tidyr on the tidyverse website, or in the Data Transformation chapter of R for Data Science.
tidyr is part of the core tidyverse packages, so we install and load this meta-package below.
install.packages("tidyverse")
library(tidyverse)
library(fivethirtyeight)
There are two principles for tidy data:
For these first few exercises, we’re going to be using data from the fivethirtyeight package.
EndorseSampleThere are multiple ways data can be ‘less-than tidy.’ We’ll start with a common example, which is two variables (and their values) stored in a single column.
Below is a subset of the endorsements_2020 dataset named EndorseSample. Read more about the endorsements_2020 dataset by typing ??endorsements_2020 into the console.
EndorseSample <- readr::read_csv("https://bit.ly/3x6wfTg")
head(EndorseSample)
Notice the location variable has two values (state and city) stored in the same column.
FandangoSampleIn another example of un-tidy data, we’ll look at the dataset below which comes from the fandango dataset.
FandangoSample <- readr::read_csv(file = "https://bit.ly/3kS34yj")
head(FandangoSample)
This dataset has two similar scores combined into a single column rottentomatoes_scores
rottentomatoes_user: The Rotten Tomatoes user score for the film (expressed as a percent)
rt_user_norm: The Rotten Tomatoes user score for the film , normalized to a 0 to 5 point system (expressed as ‘score out of 5’)
DailyShowSampleBelow is a sample from the daily_show_guests dataset. The guests names are separated into first_name and last_name columns.
DailyShowSample <- readr::read_csv("https://bit.ly/3HGx8XG")
head(DailyShowSample)
In the next few exercises, we’ll go over how (and when) to use separate(), separate_rows(), and unite() these variables.
separate()This allows us to split columns apart into multiple variables.
Take a look at EndorseSample below:
EndorseSample
Separate location into state and city
EndorseSample %>%
separate(col = ___, into = c("_______", "_______"), sep = " ")
EndorseSample %>%
separate(col = location, into = c("state", "city"), sep = ", ")
separate_rows()separate_rows() is similar to separate(), but drops the values down the rows. This is commonly used when we have more than one similar value in each column.
Take a look at FandangoSample below.
FandangoSample
Put all the values into their own cell.
FandangoSample %>%
separate_rows(____________________, sep = ___)
This isn’t as clear as it could be, so we will use mutate(), case_when(), and str_detect() to create an index variable for the rottentomatoes_scores, which we will call rottentomatoes_score. We’ll also remove the % and out of 5 text from rottentomatoes_scores, and rename this to rottentomatoes_value.
FandangoSampleTidy <- FandangoSample %>%
separate_rows(rottentomatoes_scores, sep = "/", convert = TRUE) %>%
mutate(rottentomatoes_score = case_when(
str_detect(rottentomatoes_scores, "%") ~ "raw user score (%)",
str_detect(rottentomatoes_scores, "out of 5") ~ "normalized user score (out of 5)",
TRUE ~ NA_character_),
rottentomatoes_value = str_remove_all(string = rottentomatoes_scores, "%|out of 5"),
rottentomatoes_value = as.numeric(rottentomatoes_value)
) %>%
select(-rottentomatoes_scores)
FandangoSampleTidy
Notice this new dataset, FandangoSampleTidy, has twice the number of rows as the original FandangoSample.
FandangoSample
unite()Just like separate/separate_rows allows us to split up columns, unite() lets us stick them back together! Take a look at DailyShowSample.
DailyShowSample
We can see the guests have had their names in separate columns (first_name and last_name).
Unite first_name and last_name into guest_name (don’t remove the previous columns with remove = FALSE).
DailyShowSample %>%
unite(col = "_______", first_name:last_name, sep = " ", remove = _____)
Unite first_name and last_name into guest_name (don’t remove the previous columns with remove = FALSE).
DailyShowSample %>%
unite(col = "guest_name", first_name:last_name, sep = " ", remove = FALSE)
Joins give us the ability to combine multiple datasets on a common column. We’re going to be using the starwarsdb package, which contains data from the Star Wars API.
We’ll load the data below:
library(starwarsdb)
library(dm)
starwarsdbThe code below creates and displays the data model for the tables in this package
StarWarsDataModel <- dm_draw(dm = starwars_dm(),
graph_name = "StarWarsDataModel")
StarWarsDataModel
Load the starwarsdb::films dataset, removing the opening_crawl column, and filter to only episodes 4-6. Store these data in StarWarsFilms
StarWarsFilms <- starwarsdb::films %>%
select(-____________) %>%
filter(episode_id >= ___ & episode_id <= ___)
StarWarsFilms
StarWarsFilms <- starwarsdb::films %>%
select(-c(opening_crawl, director, producer)) %>%
filter(episode_id >= 4 & episode_id <= 6)
StarWarsFilms
left_joinA left_join keeps all of x, and joins it to all matching rows from dataset y
Left-join StarWarsFilms to starwarsdb::films_vehicles on "title", then select() the vehicle column, then everything() else.
Assign to StarWarsFilmVeh
left_join(x = _____________,
y = starwarsdb::films_vehicles,
by = ______) %>%
select(_____________,
everything()) -> StarWarsFilmVeh
StarWarsFilmVeh
left_join(x = StarWarsFilms,
y = starwarsdb::films_vehicles,
by = "title") %>%
select(vehicle,
everything()) -> StarWarsFilmVeh
StarWarsFilmVeh
inner_joinThe inner_join() joins dataset x and y, and keeps only matching rows from both.
Use an inner_join to join StarWarsFilmVeh to starwarsdb::pilots on "vehicle" and assign it to StarWarsFilmVehPil.
inner_join(x = StarWarsFilmVeh,
y = _________,
by = _________) -> _____________
StarWarsFilmVehPil
inner_join(x = StarWarsFilmVeh,
y = starwarsdb::pilots,
by = "vehicle") -> StarWarsFilmVehPil
StarWarsFilmVehPil
right_joinThe right_join() includes all the rows in dataset y, and joins it the matching rows in dataset x.
Use a right_join() to join StarWarsFilmVehPlt to starwarsdb::films_planets on "title", keeping all the rows in starwarsdb::films_planets. Use select() to rename planet to film_planet and assign this to StarWarsFilmVehPltPlnt
right_join(x = __________________,
y = __________________,
by = "title") %>%
select(vehicle:release_date,
film_planet = _______,
pilot) -> StarWarsFilmVehPltPlnt
right_join(x = StarWarsFilmVehPil,
y = starwarsdb::films_planets,
by = "title") %>%
select(vehicle:release_date,
film_planet = planet,
pilot) -> StarWarsFilmVehPltPlnt
StarWarsFilmVehPltPlnt
anti_joinThe anti_join() keeps all rows in dataset x that do not have a match in dataset y. Note the last join created empty values for vehicle, episode_id, release_date, and pilot.
StarWarsFilmVehPltPlnt %>%
filter(is.na(episode_id))
Use the anti_join() to join StarWarsFilmVehPltPlnt to StarWarsFilms and create a dataset with only title and film_planet (these should only contain the titles not in StarWarsFilms).
anti_join(x = StarWarsFilmVehPltPlnt,
y = ____________,
by = c("title", "episode_id", "release_date")) %>%
select(________, ________)
anti_join(x = StarWarsFilmVehPltPlnt,
y = StarWarsFilms,
by = c("title", "episode_id", "release_date")) %>%
select(title, film_planet)
Nested data refers to a dataset in which a column is holding a list of data frames (or tibbles). These can be useful for manipulating data, and we’re going to give an example of using nested data to reshape a column combining two sets of dates.
unnest() deals with nested data, which can be complicated to explain, so we’ll use an example. Run the code below to download data from the Anti-Doping Agency Sanction database.
source("code/create-usada-data.R")
These data have two dates (original and updated) in a single column (sanction_announced). We only want updated dates in their own columns.
UsadaBadDates
str_splitUse the stringr package’s str_split function to split the bad dates on the "updated" pattern.
UsadaBadDates %>%
dplyr::mutate(sanction_dates =
# 1) split this on the "updated" pattern
stringr::str_split(string = sanction_announced,
pattern = "updated")) %>%
# only return what we're interested in
dplyr::select(athlete, sanction_announced, sanction_dates)
This creates a nested column (sanction_dates), which we can deal with by unnest()ing it. Let’s take a better look with listviewer (but only look at athlete and sanction_dates).
library(listviewer)
UsadaBadDates %>%
dplyr::mutate(sanction_dates =
# 1) split this on the "updated" pattern
stringr::str_split(string = sanction_announced,
pattern = "updated")) %>%
# only return what we're interested in
dplyr::select(athlete, sanction_dates) %>%
listviewer::jsonedit(mode = "view")
unnestAfter unnesting, we can see the dates have been repeated down the rows (like separate_rows()).
UsadaBadDates %>%
dplyr::mutate(sanction_dates =
# 1) split this on the "updated" pattern
stringr::str_split(string = sanction_announced,
pattern = "updated")) %>%
# convert the output from split into multiple rows
tidyr::unnest(sanction_dates) %>%
# check this to see what is happening
dplyr::select(athlete, sanction_announced, sanction_dates)
filterWe can see all of the updated dates are on a new line, and the original dates still contain the text. Now I can filter out the rows with an original date (because I am only interested in the updated dates).
UsadaBadDates %>%
dplyr::mutate(sanction_dates =
# 1) split this on the "updated" pattern
stringr::str_split(string = sanction_announced,
pattern = "updated")) %>%
# 2) convert the output from split into multiple rows
tidyr::unnest(sanction_dates) %>%
# 3) remove the "original" dates
dplyr::filter(!str_detect(string = sanction_dates,
pattern = "original")) %>%
# check this to see what is happening
dplyr::select(athlete, sanction_announced, sanction_dates)
str_remove_allNow I can clean up the updated sanction_dates and format it.
UsadaBadDates %>%
dplyr::mutate(sanction_dates =
# 1) split this on the "updated" pattern
stringr::str_split(string = sanction_announced,
pattern = "updated")) %>%
# 2) convert the output from split into multiple rows
tidyr::unnest(sanction_dates) %>%
# 3) remove the "original" dates
dplyr::filter(!str_detect(string = sanction_dates,
pattern = "original")) %>%
# 4) remove the colon from sanction_dates
dplyr::mutate(sanction_dates = stringr::str_remove_all(
string = sanction_dates,
pattern = ":"),
# 5) remove any whitespace
sanction_dates = stringr::str_trim(sanction_dates),
# 6) format as date
sanction_dates = lubridate::mdy(sanction_dates)) %>%
# check this to see what is happening
dplyr::select(athlete, sanction_announced, sanction_dates)
ggplot2How far can this pipe go??
Well, when the data are arranged in the way we want, we can test it with a visualization to check the number of sanctions per sport with bad dates.
UsadaBadDates %>%
dplyr::mutate(sanction_dates =
# 1) split this on the "updated" pattern
stringr::str_split(string = sanction_announced,
pattern = "updated")) %>%
# 2) convert the output from split into multiple rows
tidyr::unnest(sanction_dates) %>%
# 3) remove the "original" dates
dplyr::filter(!str_detect(string = sanction_dates,
pattern = "original")) %>%
# 4) remove the colon from sanction_dates
dplyr::mutate(sanction_dates = stringr::str_remove_all(
string = sanction_dates,
pattern = ":"),
# 5) remove any whitespace
sanction_dates = stringr::str_trim(sanction_dates),
# 6) format as date
sanction_dates = lubridate::mdy(sanction_dates)) %>%
# count by sport
dplyr::count(sport, sort = TRUE) %>%
# rename n
rename(`Sport Count` = n,
Sport = sport) %>%
ggplot2::ggplot(aes(x = `Sport Count`,
y = forcats::fct_reorder(as.factor(Sport), `Sport Count`),
fill = as.character(`Sport Count`))) +
geom_col(show.legend = FALSE) +
labs(x = "Number of Sanctions", y = "Sport",
title = "Number of Sanctions with Bad Dates",
caption = "source: https://www.usada.org/news/sanctions/") +
ggthemes::theme_fivethirtyeight() +
ggthemes::scale_fill_few()