separate()
separate_rows()
unite()
starwarsdb
left_join
inner_join
right_join
anti_join
This document outlines some common data reshaping tasks with tidyr
.
The slides for this presentation are here
There is also an accompanying RStudio.Cloud project
Read more about dplyr
here and tidyr
on the tidyverse website, or in the Data Transformation chapter of R for Data Science.
tidyr
is part of the core tidyverse
packages, so we install and load this meta-package below.
install.packages("tidyverse")
library(tidyverse)
We’re going to use some of the fs
package functions for managing our files and data.
Check your files and folders with dir_tree
or dir_ls
.
fs::dir_tree("data", recurse = FALSE)
## data
## ├── 2020-11-24-TopPharmCompRaw.csv
## ├── 2020-12-11-BioTechDrugStocks.csv
## ├── 2020-12-11-BmrmGoogle.rds
## ├── 2020-12-20-BioTechStocks.csv
## ├── 2020-12-24-BioTechStocks.csv
## ├── 2020-12-29-PricesWide.csv
## ├── 2020-12-30-PricesWide.csv
## ├── AppleMobRaw.csv
## ├── FasterCures.csv
## ├── Infected.csv
## ├── LabData.csv
## ├── LabProc.csv
## ├── SmallLabData.csv
## ├── TopPharmComp.csv
## ├── UsadaBadDates.csv
## ├── UsadaRaw.csv
## ├── VisitNAData.csv
## ├── original-starwars.csv
## └── starwars.rds
We can also use dir_ls
to get the paths to print to the screen so we can easily import.
fs::dir_ls("data", recurse = FALSE)
## data/2020-11-24-TopPharmCompRaw.csv data/2020-12-11-BioTechDrugStocks.csv
## data/2020-12-11-BmrmGoogle.rds data/2020-12-20-BioTechStocks.csv
## data/2020-12-24-BioTechStocks.csv data/2020-12-29-PricesWide.csv
## data/2020-12-30-PricesWide.csv data/AppleMobRaw.csv
## data/FasterCures.csv data/Infected.csv
## data/LabData.csv data/LabProc.csv
## data/SmallLabData.csv data/TopPharmComp.csv
## data/UsadaBadDates.csv data/UsadaRaw.csv
## data/VisitNAData.csv data/original-starwars.csv
## data/starwars.rds
fs
also gives us the metadata of any folder we have access to. This can come in handy if we’re looking for a specific version of a file, or a file that was created at a specific date.
fs::dir_info("data", all = TRUE) %>%
dplyr::arrange(desc(modification_time))
path <fs_path> | type <fct> | size <fs_bytes> | permissions <fs_perms> | modification_time <dttm> | user <chr> | group <chr> | device_id <dbl> | |
---|---|---|---|---|---|---|---|---|
data/2020-12-30-PricesWide.csv | file | 734 | rw-rw-rw- | 2020-12-30 00:35:31 | mjfrigaard | staff | 16777220 | |
data/2020-12-24-BioTechStocks.csv | file | 2.31K | rw-rw-rw- | 2020-12-30 00:35:15 | mjfrigaard | staff | 16777220 | |
data/TopPharmComp.csv | file | 16.14K | rw-r--r-- | 2020-12-30 00:31:28 | mjfrigaard | staff | 16777220 | |
data/2020-12-29-PricesWide.csv | file | 734 | rw-rw-rw- | 2020-12-29 23:25:49 | mjfrigaard | staff | 16777220 | |
data/.DS_Store | file | 6K | rw-r--r-- | 2020-12-29 22:35:31 | mjfrigaard | staff | 16777220 | |
data/2020-12-20-BioTechStocks.csv | file | 2.29K | rw-rw-rw- | 2020-12-20 23:51:38 | mjfrigaard | staff | 16777220 | |
data/2020-12-11-BioTechDrugStocks.csv | file | 2.29K | rw-rw-rw- | 2020-12-11 17:43:51 | mjfrigaard | staff | 16777220 | |
data/2020-12-11-BmrmGoogle.rds | file | 26.5K | rw-rw-rw- | 2020-12-11 17:19:02 | mjfrigaard | staff | 16777220 | |
data/FasterCures.csv | file | 233.1K | rw-rw-rw- | 2020-12-11 16:41:47 | mjfrigaard | staff | 16777220 | |
data/original-starwars.csv | file | 265 | rw-r--r-- | 2020-12-01 20:31:19 | mjfrigaard | staff | 16777220 |
For example, lets look for the largest file
in the data
folder.
fs::dir_info("data", all = TRUE) %>%
filter(type == "file") %>%
arrange(desc(size)) %>%
select(path, size) %>%
slice(1)
path <fs_path> | size <fs_bytes> | |||
---|---|---|---|---|
data/AppleMobRaw.csv | 9.39M |
Lets import all the Lab
data files in data
using fs
and purrr
.
fs::dir_ls(path = "data", regexp = "Lab") %>%
purrr::map_dfr(read_csv, .id = "source") -> AllLabs
AllLabs
source <chr> | lab <chr> | value <chr> | ref <chr> | visit <date> | |
---|---|---|---|---|---|
data/LabData.csv | Albumin | 3.5, 4.2, 5.0 | 3.4 - 5.4 g/dL | <NA> | |
data/LabData.csv | Potassium, Whole Blood | 1.7, 4.3 | >6.5-7.0 mmol/L | <NA> | |
data/LabData.csv | % Hemoglobin A1c | 4.1, 4.8 | 6.50% | <NA> | |
data/LabProc.csv | NA | NA | NA | 2020-10-01 | |
data/LabProc.csv | NA | NA | NA | 2020-10-10 | |
data/LabProc.csv | NA | NA | NA | 2020-10-20 | |
data/SmallLabData.csv | Albumin | NA | 3.4-5.4 g/dL | <NA> | |
data/SmallLabData.csv | Potassium, Whole Blood | NA | 6.5-7.0 mmol/L | <NA> | |
data/SmallLabData.csv | % Hemoglobin A1c | NA | 6.50 % | <NA> |
separate()
This allows us to split columns apart into multiple variables.
Take a look at SmallLabData
below (we’ve used some dplyr
and the janitor::remove_empty()
function to get SmallLabData
from AllLabs
).
AllLabs %>%
filter(source == "data/SmallLabData.csv") %>%
select(-source) %>%
janitor::remove_empty() -> SmallLabData
SmallLabData
lab <chr> | ref <chr> | |||
---|---|---|---|---|
Albumin | 3.4-5.4 g/dL | |||
Potassium, Whole Blood | 6.5-7.0 mmol/L | |||
% Hemoglobin A1c | 6.50 % |
Separate ref
into ref_value
and ref_units
SmallLabData %>%
separate(col = ___, into = c("_______", "_______"), sep = " ")
SmallLabData %>%
separate(col = ref, into = c("ref_value", "ref_units"), sep = " ")
lab <chr> | ref_value <chr> | ref_units <chr> | ||
---|---|---|---|---|
Albumin | 3.4-5.4 | g/dL | ||
Potassium, Whole Blood | 6.5-7.0 | mmol/L | ||
% Hemoglobin A1c | 6.50 | % |
separate_rows()
separate_rows()
is similar to separate()
, but drops the values down the rows. This is commonly used when we have more than one value in each column.
Take a look at LabData
below.
AllLabs %>%
filter(source == "data/LabData.csv") %>%
select(-source) %>%
janitor::remove_empty() -> LabData
LabData
lab <chr> | value <chr> | ref <chr> | ||
---|---|---|---|---|
Albumin | 3.5, 4.2, 5.0 | 3.4 - 5.4 g/dL | ||
Potassium, Whole Blood | 1.7, 4.3 | >6.5-7.0 mmol/L | ||
% Hemoglobin A1c | 4.1, 4.8 | 6.50% |
Put all the value
s into their own cell, and make sure the new variable is numeric.
LabData %>%
separate_rows(_____, sep = _____, convert = TRUE)
Note the convert
argument to make value
numeric.
LabData %>%
separate_rows(value, sep = ", ", convert = TRUE)
lab <chr> | value <dbl> | ref <chr> | ||
---|---|---|---|---|
Albumin | 3.5 | 3.4 - 5.4 g/dL | ||
Albumin | 4.2 | 3.4 - 5.4 g/dL | ||
Albumin | 5.0 | 3.4 - 5.4 g/dL | ||
Potassium, Whole Blood | 1.7 | >6.5-7.0 mmol/L | ||
Potassium, Whole Blood | 4.3 | >6.5-7.0 mmol/L | ||
% Hemoglobin A1c | 4.1 | 6.50% | ||
% Hemoglobin A1c | 4.8 | 6.50% |
unite()
Just like separate
/separate_rows
allows us to split up columns, unite()
lets us stick them back together!
Take a look at LabProc
.
AllLabs %>%
filter(source == "data/LabProc.csv") %>%
select(-source) %>%
janitor::remove_empty() -> LabProc
LabProc
visit <date> | lab_proc <chr> | |||
---|---|---|---|---|
2020-10-01 | Albumin 3.4 g/dL; Potassium 3.8 mmol/L; A1c 4.8 % | |||
2020-10-10 | Albumin 41 g/dL; A1c 5.0 % | |||
2020-10-20 | Albumin 4.8 g/dL; Potassium 2.2 mmol/L |
Separate the labs from each other, and drop them down the rows.
LabProc %>%
separate_rows(lab_proc, sep = "___")
Separate the labs from each other, and drop them down the rows.
LabProc %>%
separate_rows(lab_proc, sep = "; ")
visit <date> | lab_proc <chr> | |||
---|---|---|---|---|
2020-10-01 | Albumin 3.4 g/dL | |||
2020-10-01 | Potassium 3.8 mmol/L | |||
2020-10-01 | A1c 4.8 % | |||
2020-10-10 | Albumin 41 g/dL | |||
2020-10-10 | A1c 5.0 % | |||
2020-10-20 | Albumin 4.8 g/dL | |||
2020-10-20 | Potassium 2.2 mmol/L |
Put lab
, value
, and units
in to their own columns.
LabProc %>%
separate_rows(lab_proc, sep = "___") %>%
separate(col = lab_proc,into = c('__', '_____', '_____'),
sep = " ", convert = ____)
Put lab
, value
, and units
in to their own columns.
LabProc %>%
separate_rows(lab_proc, sep = "; ") %>%
separate(col = lab_proc,into = c('lab', 'value', 'units'),
sep = " ", convert = TRUE)
visit <date> | lab <chr> | value <dbl> | units <chr> | |
---|---|---|---|---|
2020-10-01 | Albumin | 3.4 | g/dL | |
2020-10-01 | Potassium | 3.8 | mmol/L | |
2020-10-01 | A1c | 4.8 | % | |
2020-10-10 | Albumin | 41.0 | g/dL | |
2020-10-10 | A1c | 5.0 | % | |
2020-10-20 | Albumin | 4.8 | g/dL | |
2020-10-20 | Potassium | 2.2 | mmol/L |
Unite lab
, value
and units
into lab_name
(don’t remove the previous columns with remove = FALSE
).
LabProc %>%
separate_rows(lab_proc, sep = "___") %>%
separate(col = lab_proc,into = c('__', '_____', '_____'),
sep = " ", convert = ____)
unite(col = "_______", lab:units, sep = " ", remove = _____)
Unite lab
, value
and units
into lab_name
(don’t remove the previous columns with remove = FALSE
).
LabProc %>%
separate_rows(lab_proc, sep = "; ") %>%
separate(col = lab_proc,into = c('lab', 'value', 'units'),
sep = " ", convert = TRUE) %>%
unite(col = "lab_name", lab:units, sep = " ", remove = FALSE)
visit <date> | lab_name <chr> | lab <chr> | value <dbl> | units <chr> |
---|---|---|---|---|
2020-10-01 | Albumin 3.4 g/dL | Albumin | 3.4 | g/dL |
2020-10-01 | Potassium 3.8 mmol/L | Potassium | 3.8 | mmol/L |
2020-10-01 | A1c 4.8 % | A1c | 4.8 | % |
2020-10-10 | Albumin 41 g/dL | Albumin | 41.0 | g/dL |
2020-10-10 | A1c 5 % | A1c | 5.0 | % |
2020-10-20 | Albumin 4.8 g/dL | Albumin | 4.8 | g/dL |
2020-10-20 | Potassium 2.2 mmol/L | Potassium | 2.2 | mmol/L |
Joins give us the ability to combine multiple datasets on a common column. We’re going to be using the starwarsdb
package, which contains data from the Star Wars API.
We’ll load the data below:
library(starwarsdb)
library(dm)
starwarsdb
The code below creates and displays the data model for the tables in this package
StarWarsDataModel <- dm_draw(dm = starwars_dm(),
graph_name = "StarWarsDataModel")
StarWarsDataModel
Load the starwarsdb::films
dataset, removing the opening_crawl
column, and filter to only episodes 4-6. Store these data in StarWarsFilms
StarWarsFilms <- starwarsdb::films %>%
select(-____________) %>%
filter(episode_id >= ___ & episode_id <= ___)
StarWarsFilms
StarWarsFilms <- starwarsdb::films %>%
select(-c(opening_crawl, director, producer)) %>%
filter(episode_id >= 4 & episode_id <= 6)
StarWarsFilms
title <chr> | episode_id <int> | release_date <date> | ||
---|---|---|---|---|
A New Hope | 4 | 1977-05-25 | ||
The Empire Strikes Back | 5 | 1980-05-17 | ||
Return of the Jedi | 6 | 1983-05-25 |
The first join we will cover is dplyr::left_join()
.
left_join
A left_join
keeps all of x
, and joins it to all matching rows from dataset y
Left-join StarWarsFilms
to starwarsdb::films_vehicles
on "title"
, then select()
the vehicle
column, then everything()
else.
Assign to StarWarsFilmVeh
left_join(x = _____________,
y = starwarsdb::films_vehicles,
by = ______) %>%
select(_____________,
everything()) -> StarWarsFilmVeh
StarWarsFilmVeh
See below:
left_join(x = StarWarsFilms,
y = starwarsdb::films_vehicles,
by = "title") %>%
select(vehicle,
everything()) -> StarWarsFilmVeh
StarWarsFilmVeh
vehicle <chr> | title <chr> | episode_id <int> | release_date <date> | |
---|---|---|---|---|
CR90 corvette | A New Hope | 4 | 1977-05-25 | |
Star Destroyer | A New Hope | 4 | 1977-05-25 | |
Sentinel-class landing craft | A New Hope | 4 | 1977-05-25 | |
Death Star | A New Hope | 4 | 1977-05-25 | |
Millennium Falcon | A New Hope | 4 | 1977-05-25 | |
Y-wing | A New Hope | 4 | 1977-05-25 | |
X-wing | A New Hope | 4 | 1977-05-25 | |
TIE Advanced x1 | A New Hope | 4 | 1977-05-25 | |
Sand Crawler | A New Hope | 4 | 1977-05-25 | |
T-16 skyhopper | A New Hope | 4 | 1977-05-25 |
inner_join
The inner_join()
joins dataset x
and y
, and keeps only matching rows from both.
Use an inner_join
to join StarWarsFilmVeh
to starwarsdb::pilots
on "vehicle"
and assign it to StarWarsFilmVehPil
.
inner_join(x = StarWarsFilmVeh,
y = _________,
by = _________) -> _____________
StarWarsFilmVehPil
See below:
inner_join(x = StarWarsFilmVeh,
y = starwarsdb::pilots,
by = "vehicle") -> StarWarsFilmVehPil
StarWarsFilmVehPil
vehicle <chr> | title <chr> | episode_id <int> | release_date <date> | pilot <chr> |
---|---|---|---|---|
Millennium Falcon | A New Hope | 4 | 1977-05-25 | Chewbacca |
Millennium Falcon | A New Hope | 4 | 1977-05-25 | Han Solo |
Millennium Falcon | A New Hope | 4 | 1977-05-25 | Lando Calrissian |
Millennium Falcon | A New Hope | 4 | 1977-05-25 | Nien Nunb |
X-wing | A New Hope | 4 | 1977-05-25 | Luke Skywalker |
X-wing | A New Hope | 4 | 1977-05-25 | Biggs Darklighter |
X-wing | A New Hope | 4 | 1977-05-25 | Wedge Antilles |
X-wing | A New Hope | 4 | 1977-05-25 | Jek Tono Porkins |
TIE Advanced x1 | A New Hope | 4 | 1977-05-25 | Darth Vader |
Millennium Falcon | The Empire Strikes Back | 5 | 1980-05-17 | Chewbacca |
right_join
The right_join()
includes all the rows in dataset y
, and joins it the matching rows in dataset x
.
Use a right_join()
to join StarWarsFilmVehPlt
to starwarsdb::films_planets
on "title"
, keeping all the rows in starwarsdb::films_planets
. Use select()
to rename planet
to film_planet
and assign this to StarWarsFilmVehPltPlnt
right_join(x = __________________,
y = __________________,
by = "title") %>%
select(vehicle:release_date,
film_planet = _______,
pilot) -> StarWarsFilmVehPltPlnt
See below:
right_join(x = StarWarsFilmVehPil,
y = starwarsdb::films_planets,
by = "title") %>%
select(vehicle:release_date,
film_planet = planet,
pilot) -> StarWarsFilmVehPltPlnt
StarWarsFilmVehPltPlnt
vehicle <chr> | title <chr> | episode_id <int> | release_date <date> | film_planet <chr> | pilot <chr> |
---|---|---|---|---|---|
Millennium Falcon | A New Hope | 4 | 1977-05-25 | Tatooine | Chewbacca |
Millennium Falcon | A New Hope | 4 | 1977-05-25 | Alderaan | Chewbacca |
Millennium Falcon | A New Hope | 4 | 1977-05-25 | Yavin IV | Chewbacca |
Millennium Falcon | A New Hope | 4 | 1977-05-25 | Tatooine | Han Solo |
Millennium Falcon | A New Hope | 4 | 1977-05-25 | Alderaan | Han Solo |
Millennium Falcon | A New Hope | 4 | 1977-05-25 | Yavin IV | Han Solo |
Millennium Falcon | A New Hope | 4 | 1977-05-25 | Tatooine | Lando Calrissian |
Millennium Falcon | A New Hope | 4 | 1977-05-25 | Alderaan | Lando Calrissian |
Millennium Falcon | A New Hope | 4 | 1977-05-25 | Yavin IV | Lando Calrissian |
Millennium Falcon | A New Hope | 4 | 1977-05-25 | Tatooine | Nien Nunb |
anti_join
The anti_join()
keeps all rows in dataset x
that do not have a match in dataset y
. Note the last join created empty values for vehicle
, episode_id
, release_date
, and pilot
.
StarWarsFilmVehPltPlnt %>%
filter(is.na(episode_id))
vehicle <chr> | title <chr> | episode_id <int> | release_date <date> | film_planet <chr> | pilot <chr> |
---|---|---|---|---|---|
NA | The Phantom Menace | NA | <NA> | Tatooine | NA |
NA | The Phantom Menace | NA | <NA> | Naboo | NA |
NA | The Phantom Menace | NA | <NA> | Coruscant | NA |
NA | Attack of the Clones | NA | <NA> | Tatooine | NA |
NA | Attack of the Clones | NA | <NA> | Naboo | NA |
NA | Attack of the Clones | NA | <NA> | Coruscant | NA |
NA | Attack of the Clones | NA | <NA> | Kamino | NA |
NA | Attack of the Clones | NA | <NA> | Geonosis | NA |
NA | Revenge of the Sith | NA | <NA> | Tatooine | NA |
NA | Revenge of the Sith | NA | <NA> | Alderaan | NA |
Use the anti_join()
to join StarWarsFilmVehPltPlnt
to StarWarsFilms
and create a dataset with only title
and film_planet
(these should only contain the title
s not in StarWarsFilms
).
anti_join(x = StarWarsFilmVehPltPlnt,
y = ____________,
by = c("title", "episode_id", "release_date")) %>%
select(________, ________)
anti_join(x = StarWarsFilmVehPltPlnt,
y = StarWarsFilms,
by = c("title", "episode_id", "release_date")) %>%
select(title, film_planet)
title <chr> | film_planet <chr> | |||
---|---|---|---|---|
The Phantom Menace | Tatooine | |||
The Phantom Menace | Naboo | |||
The Phantom Menace | Coruscant | |||
Attack of the Clones | Tatooine | |||
Attack of the Clones | Naboo | |||
Attack of the Clones | Coruscant | |||
Attack of the Clones | Kamino | |||
Attack of the Clones | Geonosis | |||
Revenge of the Sith | Tatooine | |||
Revenge of the Sith | Alderaan |
Nested data refers to a dataset in which a column is holding a list of data frames (or tibble
s). These can be useful for manipulating data, and we’re going to give an example of using nested data to reshape a column combining two sets of dates.
unnest()
deals with nested data, which can be complicated to explain, so we’ll use an example. Run the code below to download data from the Anti-Doping Agency Sanction database.
source("code/01-create-usada-data.R")
These data have two dates (original
and updated
) in a single column (sanction_announced
). We only want updated
dates in their own columns.
UsadaBadDates
athlete <chr> | sanction_announced <chr> | sport <chr> | |
---|---|---|---|
fischbach, dylan | original: 12/18/2015; updated: 04/11/2016 | paralympic basketball | |
trafeh, mohamed | original: 12/18/2014; updated: 08/25/2015 | track and field | |
lea, robert | original: 12/17/2015; updated: 02/25/2016 | cycling | |
young, jerome | original: 11/10/2004; updated: 06/17/2008 | track and field | |
hay, amy | original: 10/31/2017; updated: 12/16/2019 | weightlifting | |
hunter, adam | original: 10/28/2016; updated: 09/26/2018 | mixed martial arts | |
tovar, oscar | original: 10/28/2015; updated: 10/04/2016 | cycling | |
dotti, juan pablo | original: 10/20/2011; updated: 06/05/2013 | cycling | |
murdock, vince | original: 09/05/2019; updated: 08/26/2020 | mixed martial arts | |
thomas, tammy | original: 08/30/2002; updated: 02/13/2017 | cycling |
Use the stringr
package’s str_split
function to split the bad dates on the "updated"
pattern.
UsadaBadDates %>%
dplyr::mutate(sanction_dates =
# 1) split this on the "updated" pattern
stringr::str_split(string = sanction_announced,
pattern = "updated")) %>%
# only return what we're interested in
dplyr::select(athlete, sanction_announced, sanction_dates)
athlete <chr> | sanction_announced <chr> | sanction_dates <list> | ||
---|---|---|---|---|
fischbach, dylan | original: 12/18/2015; updated: 04/11/2016 | <chr [2]> | ||
trafeh, mohamed | original: 12/18/2014; updated: 08/25/2015 | <chr [2]> | ||
lea, robert | original: 12/17/2015; updated: 02/25/2016 | <chr [2]> | ||
young, jerome | original: 11/10/2004; updated: 06/17/2008 | <chr [2]> | ||
hay, amy | original: 10/31/2017; updated: 12/16/2019 | <chr [2]> | ||
hunter, adam | original: 10/28/2016; updated: 09/26/2018 | <chr [2]> | ||
tovar, oscar | original: 10/28/2015; updated: 10/04/2016 | <chr [2]> | ||
dotti, juan pablo | original: 10/20/2011; updated: 06/05/2013 | <chr [2]> | ||
murdock, vince | original: 09/05/2019; updated: 08/26/2020 | <chr [2]> | ||
thomas, tammy | original: 08/30/2002; updated: 02/13/2017 | <chr [2]> |
This creates a nested column (sanction_dates
), which we can deal with by unnest()
ing it. Let’s take a better look with listviewer
(but only look at athlete
and sanction_dates
).
library(listviewer)
UsadaBadDates %>%
dplyr::mutate(sanction_dates =
# 1) split this on the "updated" pattern
stringr::str_split(string = sanction_announced,
pattern = "updated")) %>%
# only return what we're interested in
dplyr::select(athlete, sanction_dates) %>%
listviewer::jsonedit(mode = "view")
After unnesting, we can see the dates have been repeated down the rows (like separate_rows()
).
UsadaBadDates %>%
dplyr::mutate(sanction_dates =
# 1) split this on the "updated" pattern
stringr::str_split(string = sanction_announced,
pattern = "updated")) %>%
# convert the output from split into multiple rows
tidyr::unnest(sanction_dates) %>%
# check this to see what is happening
dplyr::select(athlete, sanction_announced, sanction_dates)
athlete <chr> | sanction_announced <chr> | sanction_dates <chr> | ||
---|---|---|---|---|
fischbach, dylan | original: 12/18/2015; updated: 04/11/2016 | original: 12/18/2015; | ||
fischbach, dylan | original: 12/18/2015; updated: 04/11/2016 | : 04/11/2016 | ||
trafeh, mohamed | original: 12/18/2014; updated: 08/25/2015 | original: 12/18/2014; | ||
trafeh, mohamed | original: 12/18/2014; updated: 08/25/2015 | : 08/25/2015 | ||
lea, robert | original: 12/17/2015; updated: 02/25/2016 | original: 12/17/2015; | ||
lea, robert | original: 12/17/2015; updated: 02/25/2016 | : 02/25/2016 | ||
young, jerome | original: 11/10/2004; updated: 06/17/2008 | original: 11/10/2004; | ||
young, jerome | original: 11/10/2004; updated: 06/17/2008 | : 06/17/2008 | ||
hay, amy | original: 10/31/2017; updated: 12/16/2019 | original: 10/31/2017; | ||
hay, amy | original: 10/31/2017; updated: 12/16/2019 | : 12/16/2019 |
We can see all of the updated dates are on a new line, and the original dates still contain the text. Now I can filter out the rows with an original date (because I am only interested in the updated dates).
UsadaBadDates %>%
dplyr::mutate(sanction_dates =
# 1) split this on the "updated" pattern
stringr::str_split(string = sanction_announced,
pattern = "updated")) %>%
# 2) convert the output from split into multiple rows
tidyr::unnest(sanction_dates) %>%
# 3) remove the "original" dates
dplyr::filter(!str_detect(string = sanction_dates,
pattern = "original")) %>%
# check this to see what is happening
dplyr::select(athlete, sanction_announced, sanction_dates)
athlete <chr> | sanction_announced <chr> | sanction_dates <chr> | ||
---|---|---|---|---|
fischbach, dylan | original: 12/18/2015; updated: 04/11/2016 | : 04/11/2016 | ||
trafeh, mohamed | original: 12/18/2014; updated: 08/25/2015 | : 08/25/2015 | ||
lea, robert | original: 12/17/2015; updated: 02/25/2016 | : 02/25/2016 | ||
young, jerome | original: 11/10/2004; updated: 06/17/2008 | : 06/17/2008 | ||
hay, amy | original: 10/31/2017; updated: 12/16/2019 | : 12/16/2019 | ||
hunter, adam | original: 10/28/2016; updated: 09/26/2018 | : 09/26/2018 | ||
tovar, oscar | original: 10/28/2015; updated: 10/04/2016 | : 10/04/2016 | ||
dotti, juan pablo | original: 10/20/2011; updated: 06/05/2013 | : 06/05/2013 | ||
murdock, vince | original: 09/05/2019; updated: 08/26/2020 | : 08/26/2020 | ||
thomas, tammy | original: 08/30/2002; updated: 02/13/2017 | : 02/13/2017 |
Now I can clean up the updated sanction_dates
and format it.
UsadaBadDates %>%
dplyr::mutate(sanction_dates =
# 1) split this on the "updated" pattern
stringr::str_split(string = sanction_announced,
pattern = "updated")) %>%
# 2) convert the output from split into multiple rows
tidyr::unnest(sanction_dates) %>%
# 3) remove the "original" dates
dplyr::filter(!str_detect(string = sanction_dates,
pattern = "original")) %>%
# 4) remove the colon from sanction_dates
dplyr::mutate(sanction_dates = stringr::str_remove_all(
string = sanction_dates,
pattern = ":"),
# 5) remove any whitespace
sanction_dates = stringr::str_trim(sanction_dates),
# 6) format as date
sanction_dates = lubridate::mdy(sanction_dates)) %>%
# check this to see what is happening
dplyr::select(athlete, sanction_announced, sanction_dates)
athlete <chr> | sanction_announced <chr> | sanction_dates <date> | ||
---|---|---|---|---|
fischbach, dylan | original: 12/18/2015; updated: 04/11/2016 | 2016-04-11 | ||
trafeh, mohamed | original: 12/18/2014; updated: 08/25/2015 | 2015-08-25 | ||
lea, robert | original: 12/17/2015; updated: 02/25/2016 | 2016-02-25 | ||
young, jerome | original: 11/10/2004; updated: 06/17/2008 | 2008-06-17 | ||
hay, amy | original: 10/31/2017; updated: 12/16/2019 | 2019-12-16 | ||
hunter, adam | original: 10/28/2016; updated: 09/26/2018 | 2018-09-26 | ||
tovar, oscar | original: 10/28/2015; updated: 10/04/2016 | 2016-10-04 | ||
dotti, juan pablo | original: 10/20/2011; updated: 06/05/2013 | 2013-06-05 | ||
murdock, vince | original: 09/05/2019; updated: 08/26/2020 | 2020-08-26 | ||
thomas, tammy | original: 08/30/2002; updated: 02/13/2017 | 2017-02-13 |
How far can this pipe go??
Well, when the data are arranged in the way we want, we can test it with a visualization to check the number of sanctions per sport with bad dates.
UsadaBadDates %>%
dplyr::mutate(sanction_dates =
# 1) split this on the "updated" pattern
stringr::str_split(string = sanction_announced,
pattern = "updated")) %>%
# 2) convert the output from split into multiple rows
tidyr::unnest(sanction_dates) %>%
# 3) remove the "original" dates
dplyr::filter(!str_detect(string = sanction_dates,
pattern = "original")) %>%
# 4) remove the colon from sanction_dates
dplyr::mutate(sanction_dates = stringr::str_remove_all(
string = sanction_dates,
pattern = ":"),
# 5) remove any whitespace
sanction_dates = stringr::str_trim(sanction_dates),
# 6) format as date
sanction_dates = lubridate::mdy(sanction_dates)) %>%
# count by sport
dplyr::count(sport, sort = TRUE) %>%
# rename n
rename(`Sport Count` = n,
Sport = sport) %>%
ggplot2::ggplot(aes(x = `Sport Count`,
y = forcats::fct_reorder(as.factor(Sport), `Sport Count`),
fill = as.character(`Sport Count`))) +
geom_col(show.legend = FALSE) +
labs(x = "Number of Sanctions", y = "Sport",
title = "Number of Sanctions with Bad Dates",
caption = "source: https://www.usada.org/news/sanctions/") +
ggthemes::theme_fivethirtyeight() +
ggthemes::scale_fill_few()