• 1 Objectives
  • 2 Materials
    • 2.1 Load packages
    • 2.2 Manging data
  • 3 Reshaping Columns and Rows
    • 3.1 separate()
    • 3.2 separate_rows()
    • 3.3 unite()
  • 4 Joins
    • 4.1 starwarsdb
    • 4.2 left_join
    • 4.3 inner_join
    • 4.4 right_join
    • 4.5 anti_join
  • 5 Nested data
    • 5.1 Anti-Doping Sanctions Database

1 Objectives

This document outlines some common data reshaping tasks with tidyr.

2 Materials

The slides for this presentation are here

There is also an accompanying RStudio.Cloud project

Read more about dplyr here and tidyr on the tidyverse website, or in the Data Transformation chapter of R for Data Science.

2.1 Load packages

tidyr is part of the core tidyverse packages, so we install and load this meta-package below.

install.packages("tidyverse")
library(tidyverse)

2.2 Manging data

We’re going to use some of the fs package functions for managing our files and data.

Check your files and folders with dir_tree or dir_ls.

fs::dir_tree("data", recurse = FALSE)
## data
## ├── 2020-11-24-TopPharmCompRaw.csv
## ├── 2020-12-11-BioTechDrugStocks.csv
## ├── 2020-12-11-BmrmGoogle.rds
## ├── 2020-12-20-BioTechStocks.csv
## ├── 2020-12-24-BioTechStocks.csv
## ├── 2020-12-29-PricesWide.csv
## ├── 2020-12-30-PricesWide.csv
## ├── AppleMobRaw.csv
## ├── FasterCures.csv
## ├── Infected.csv
## ├── LabData.csv
## ├── LabProc.csv
## ├── SmallLabData.csv
## ├── TopPharmComp.csv
## ├── UsadaBadDates.csv
## ├── UsadaRaw.csv
## ├── VisitNAData.csv
## ├── original-starwars.csv
## └── starwars.rds

We can also use dir_ls to get the paths to print to the screen so we can easily import.

fs::dir_ls("data", recurse = FALSE)
## data/2020-11-24-TopPharmCompRaw.csv   data/2020-12-11-BioTechDrugStocks.csv 
## data/2020-12-11-BmrmGoogle.rds        data/2020-12-20-BioTechStocks.csv     
## data/2020-12-24-BioTechStocks.csv     data/2020-12-29-PricesWide.csv        
## data/2020-12-30-PricesWide.csv        data/AppleMobRaw.csv                  
## data/FasterCures.csv                  data/Infected.csv                     
## data/LabData.csv                      data/LabProc.csv                      
## data/SmallLabData.csv                 data/TopPharmComp.csv                 
## data/UsadaBadDates.csv                data/UsadaRaw.csv                     
## data/VisitNAData.csv                  data/original-starwars.csv            
## data/starwars.rds

fs also gives us the metadata of any folder we have access to. This can come in handy if we’re looking for a specific version of a file, or a file that was created at a specific date.

fs::dir_info("data", all = TRUE) %>% 
    dplyr::arrange(desc(modification_time))
ABCDEFGHIJ0123456789
path
<fs_path>
type
<fct>
size
<fs_bytes>
permissions
<fs_perms>
modification_time
<dttm>
user
<chr>
group
<chr>
device_id
<dbl>
data/2020-12-30-PricesWide.csvfile734rw-rw-rw-2020-12-30 00:35:31mjfrigaardstaff16777220
data/2020-12-24-BioTechStocks.csvfile2.31Krw-rw-rw-2020-12-30 00:35:15mjfrigaardstaff16777220
data/TopPharmComp.csvfile16.14Krw-r--r--2020-12-30 00:31:28mjfrigaardstaff16777220
data/2020-12-29-PricesWide.csvfile734rw-rw-rw-2020-12-29 23:25:49mjfrigaardstaff16777220
data/.DS_Storefile6Krw-r--r--2020-12-29 22:35:31mjfrigaardstaff16777220
data/2020-12-20-BioTechStocks.csvfile2.29Krw-rw-rw-2020-12-20 23:51:38mjfrigaardstaff16777220
data/2020-12-11-BioTechDrugStocks.csvfile2.29Krw-rw-rw-2020-12-11 17:43:51mjfrigaardstaff16777220
data/2020-12-11-BmrmGoogle.rdsfile26.5Krw-rw-rw-2020-12-11 17:19:02mjfrigaardstaff16777220
data/FasterCures.csvfile233.1Krw-rw-rw-2020-12-11 16:41:47mjfrigaardstaff16777220
data/original-starwars.csvfile265rw-r--r--2020-12-01 20:31:19mjfrigaardstaff16777220

For example, lets look for the largest file in the data folder.

fs::dir_info("data", all = TRUE) %>% 
    filter(type == "file") %>% 
    arrange(desc(size)) %>% 
    select(path, size) %>% 
    slice(1)
ABCDEFGHIJ0123456789
path
<fs_path>
size
<fs_bytes>
data/AppleMobRaw.csv9.39M

Lets import all the Lab data files in data using fs and purrr.

fs::dir_ls(path = "data", regexp = "Lab") %>% 
    purrr::map_dfr(read_csv, .id = "source") -> AllLabs
AllLabs
ABCDEFGHIJ0123456789
source
<chr>
lab
<chr>
value
<chr>
ref
<chr>
visit
<date>
data/LabData.csvAlbumin3.5, 4.2, 5.03.4 - 5.4 g/dL<NA>
data/LabData.csvPotassium, Whole Blood1.7, 4.3>6.5-7.0 mmol/L<NA>
data/LabData.csv% Hemoglobin A1c4.1, 4.86.50%<NA>
data/LabProc.csvNANANA2020-10-01
data/LabProc.csvNANANA2020-10-10
data/LabProc.csvNANANA2020-10-20
data/SmallLabData.csvAlbuminNA3.4-5.4 g/dL<NA>
data/SmallLabData.csvPotassium, Whole BloodNA6.5-7.0 mmol/L<NA>
data/SmallLabData.csv% Hemoglobin A1cNA6.50 %<NA>

3 Reshaping Columns and Rows

3.1 separate()

This allows us to split columns apart into multiple variables.

Take a look at SmallLabData below (we’ve used some dplyr and the janitor::remove_empty() function to get SmallLabData from AllLabs).

AllLabs %>% 
    filter(source == "data/SmallLabData.csv") %>% 
    select(-source) %>% 
    janitor::remove_empty() -> SmallLabData
SmallLabData
ABCDEFGHIJ0123456789
lab
<chr>
ref
<chr>
Albumin3.4-5.4 g/dL
Potassium, Whole Blood6.5-7.0 mmol/L
% Hemoglobin A1c6.50 %

Separate ref into ref_value and ref_units

SmallLabData %>% 
    separate(col = ___, into = c("_______", "_______"), sep = " ")
SmallLabData %>% 
    separate(col = ref, into = c("ref_value", "ref_units"), sep = " ")
ABCDEFGHIJ0123456789
lab
<chr>
ref_value
<chr>
ref_units
<chr>
Albumin3.4-5.4g/dL
Potassium, Whole Blood6.5-7.0mmol/L
% Hemoglobin A1c6.50%

3.2 separate_rows()

separate_rows() is similar to separate(), but drops the values down the rows. This is commonly used when we have more than one value in each column.

Take a look at LabData below.

AllLabs %>% 
    filter(source == "data/LabData.csv") %>% 
    select(-source) %>% 
    janitor::remove_empty() -> LabData
LabData
ABCDEFGHIJ0123456789
lab
<chr>
value
<chr>
ref
<chr>
Albumin3.5, 4.2, 5.03.4 - 5.4 g/dL
Potassium, Whole Blood1.7, 4.3>6.5-7.0 mmol/L
% Hemoglobin A1c4.1, 4.86.50%

Put all the values into their own cell, and make sure the new variable is numeric.

LabData %>% 
  separate_rows(_____, sep = _____, convert = TRUE)

Note the convert argument to make value numeric.

LabData %>% 
  separate_rows(value, sep = ", ", convert = TRUE)
ABCDEFGHIJ0123456789
lab
<chr>
value
<dbl>
ref
<chr>
Albumin3.53.4 - 5.4 g/dL
Albumin4.23.4 - 5.4 g/dL
Albumin5.03.4 - 5.4 g/dL
Potassium, Whole Blood1.7>6.5-7.0 mmol/L
Potassium, Whole Blood4.3>6.5-7.0 mmol/L
% Hemoglobin A1c4.16.50%
% Hemoglobin A1c4.86.50%

3.3 unite()

Just like separate/separate_rows allows us to split up columns, unite() lets us stick them back together!

Take a look at LabProc.

AllLabs %>% 
    filter(source == "data/LabProc.csv") %>% 
    select(-source) %>% 
    janitor::remove_empty() -> LabProc
LabProc
ABCDEFGHIJ0123456789
visit
<date>
lab_proc
<chr>
2020-10-01Albumin 3.4 g/dL; Potassium 3.8 mmol/L; A1c 4.8 %
2020-10-10Albumin 41 g/dL; A1c 5.0 %
2020-10-20Albumin 4.8 g/dL; Potassium 2.2 mmol/L

Separate the labs from each other, and drop them down the rows.

LabProc %>% 
    separate_rows(lab_proc, sep = "___")

Separate the labs from each other, and drop them down the rows.

LabProc %>% 
     separate_rows(lab_proc, sep = "; ") 
ABCDEFGHIJ0123456789
visit
<date>
lab_proc
<chr>
2020-10-01Albumin 3.4 g/dL
2020-10-01Potassium 3.8 mmol/L
2020-10-01A1c 4.8 %
2020-10-10Albumin 41 g/dL
2020-10-10A1c 5.0 %
2020-10-20Albumin 4.8 g/dL
2020-10-20Potassium 2.2 mmol/L

Put lab, value, and units in to their own columns.

LabProc %>% 
    separate_rows(lab_proc, sep = "___") %>% 
    separate(col = lab_proc,into = c('__', '_____', '_____'), 
           sep = " ", convert = ____)

Put lab, value, and units in to their own columns.

LabProc %>% 
     separate_rows(lab_proc, sep = "; ") %>% 
     separate(col = lab_proc,into = c('lab', 'value', 'units'), 
            sep = " ", convert = TRUE) 
ABCDEFGHIJ0123456789
visit
<date>
lab
<chr>
value
<dbl>
units
<chr>
2020-10-01Albumin3.4g/dL
2020-10-01Potassium3.8mmol/L
2020-10-01A1c4.8%
2020-10-10Albumin41.0g/dL
2020-10-10A1c5.0%
2020-10-20Albumin4.8g/dL
2020-10-20Potassium2.2mmol/L

Unite lab, value and units into lab_name (don’t remove the previous columns with remove = FALSE).

LabProc %>% 
    separate_rows(lab_proc, sep = "___") %>% 
    separate(col = lab_proc,into = c('__', '_____', '_____'), 
           sep = " ", convert = ____)
    unite(col = "_______", lab:units, sep = " ", remove = _____)

Unite lab, value and units into lab_name (don’t remove the previous columns with remove = FALSE).

LabProc %>% 
     separate_rows(lab_proc, sep = "; ") %>% 
     separate(col = lab_proc,into = c('lab', 'value', 'units'), 
            sep = " ", convert = TRUE) %>% 
     unite(col = "lab_name", lab:units, sep = " ", remove = FALSE)
ABCDEFGHIJ0123456789
visit
<date>
lab_name
<chr>
lab
<chr>
value
<dbl>
units
<chr>
2020-10-01Albumin 3.4 g/dLAlbumin3.4g/dL
2020-10-01Potassium 3.8 mmol/LPotassium3.8mmol/L
2020-10-01A1c 4.8 %A1c4.8%
2020-10-10Albumin 41 g/dLAlbumin41.0g/dL
2020-10-10A1c 5 %A1c5.0%
2020-10-20Albumin 4.8 g/dLAlbumin4.8g/dL
2020-10-20Potassium 2.2 mmol/LPotassium2.2mmol/L

4 Joins

Joins give us the ability to combine multiple datasets on a common column. We’re going to be using the starwarsdb package, which contains data from the Star Wars API.

We’ll load the data below:

library(starwarsdb)
library(dm)

4.1 starwarsdb

The code below creates and displays the data model for the tables in this package

StarWarsDataModel <- dm_draw(dm = starwars_dm(), 
                             graph_name = "StarWarsDataModel")
StarWarsDataModel
%0 films films title films_people films_people title character films_people:title->films:title people people name homeworld species films_people:character->people:name films_planets films_planets title planet films_planets:title->films:title planets planets name films_planets:planet->planets:name films_vehicles films_vehicles title vehicle films_vehicles:title->films:title vehicles vehicles name films_vehicles:vehicle->vehicles:name people:homeworld->planets:name species species name people:species->species:name pilots pilots pilot vehicle pilots:pilot->people:name pilots:vehicle->vehicles:name

Load the starwarsdb::films dataset, removing the opening_crawl column, and filter to only episodes 4-6. Store these data in StarWarsFilms

StarWarsFilms <- starwarsdb::films %>% 
                    select(-____________) %>% 
                    filter(episode_id >= ___ & episode_id <= ___)
StarWarsFilms
StarWarsFilms <- starwarsdb::films %>% 
                    select(-c(opening_crawl, director, producer)) %>% 
                    filter(episode_id >= 4 & episode_id <= 6)
StarWarsFilms
ABCDEFGHIJ0123456789
title
<chr>
episode_id
<int>
release_date
<date>
A New Hope41977-05-25
The Empire Strikes Back51980-05-17
Return of the Jedi61983-05-25

The first join we will cover is dplyr::left_join().

4.2 left_join

A left_join keeps all of x, and joins it to all matching rows from dataset y

Left-join StarWarsFilms to starwarsdb::films_vehicles on "title", then select() the vehicle column, then everything() else.

Assign to StarWarsFilmVeh

left_join(x = _____________, 
          y = starwarsdb::films_vehicles, 
          by = ______) %>% 
    select(_____________, 
           everything()) -> StarWarsFilmVeh
StarWarsFilmVeh

See below:

left_join(x = StarWarsFilms, 
          y = starwarsdb::films_vehicles, 
          by = "title") %>% 
  select(vehicle, 
         everything()) -> StarWarsFilmVeh
StarWarsFilmVeh
ABCDEFGHIJ0123456789
vehicle
<chr>
title
<chr>
episode_id
<int>
release_date
<date>
CR90 corvetteA New Hope41977-05-25
Star DestroyerA New Hope41977-05-25
Sentinel-class landing craftA New Hope41977-05-25
Death StarA New Hope41977-05-25
Millennium FalconA New Hope41977-05-25
Y-wingA New Hope41977-05-25
X-wingA New Hope41977-05-25
TIE Advanced x1A New Hope41977-05-25
Sand CrawlerA New Hope41977-05-25
T-16 skyhopperA New Hope41977-05-25

4.3 inner_join

The inner_join() joins dataset x and y, and keeps only matching rows from both.

Use an inner_join to join StarWarsFilmVeh to starwarsdb::pilots on "vehicle" and assign it to StarWarsFilmVehPil.

inner_join(x = StarWarsFilmVeh, 
           y = _________, 
           by = _________) -> _____________
StarWarsFilmVehPil

See below:

inner_join(x = StarWarsFilmVeh, 
           y = starwarsdb::pilots, 
           by = "vehicle") -> StarWarsFilmVehPil
StarWarsFilmVehPil
ABCDEFGHIJ0123456789
vehicle
<chr>
title
<chr>
episode_id
<int>
release_date
<date>
pilot
<chr>
Millennium FalconA New Hope41977-05-25Chewbacca
Millennium FalconA New Hope41977-05-25Han Solo
Millennium FalconA New Hope41977-05-25Lando Calrissian
Millennium FalconA New Hope41977-05-25Nien Nunb
X-wingA New Hope41977-05-25Luke Skywalker
X-wingA New Hope41977-05-25Biggs Darklighter
X-wingA New Hope41977-05-25Wedge Antilles
X-wingA New Hope41977-05-25Jek Tono Porkins
TIE Advanced x1A New Hope41977-05-25Darth Vader
Millennium FalconThe Empire Strikes Back51980-05-17Chewbacca

4.4 right_join

The right_join() includes all the rows in dataset y, and joins it the matching rows in dataset x.

Use a right_join() to join StarWarsFilmVehPlt to starwarsdb::films_planets on "title", keeping all the rows in starwarsdb::films_planets. Use select() to rename planet to film_planet and assign this to StarWarsFilmVehPltPlnt

right_join(x = __________________, 
             y = __________________, 
             by = "title") %>% 
  select(vehicle:release_date,
         film_planet = _______,
         pilot) -> StarWarsFilmVehPltPlnt

See below:

right_join(x = StarWarsFilmVehPil, 
           y = starwarsdb::films_planets, 
           by = "title") %>% 
  select(vehicle:release_date,
         film_planet = planet, 
         pilot) -> StarWarsFilmVehPltPlnt
StarWarsFilmVehPltPlnt
ABCDEFGHIJ0123456789
vehicle
<chr>
title
<chr>
episode_id
<int>
release_date
<date>
film_planet
<chr>
pilot
<chr>
Millennium FalconA New Hope41977-05-25TatooineChewbacca
Millennium FalconA New Hope41977-05-25AlderaanChewbacca
Millennium FalconA New Hope41977-05-25Yavin IVChewbacca
Millennium FalconA New Hope41977-05-25TatooineHan Solo
Millennium FalconA New Hope41977-05-25AlderaanHan Solo
Millennium FalconA New Hope41977-05-25Yavin IVHan Solo
Millennium FalconA New Hope41977-05-25TatooineLando Calrissian
Millennium FalconA New Hope41977-05-25AlderaanLando Calrissian
Millennium FalconA New Hope41977-05-25Yavin IVLando Calrissian
Millennium FalconA New Hope41977-05-25TatooineNien Nunb

4.5 anti_join

The anti_join() keeps all rows in dataset x that do not have a match in dataset y. Note the last join created empty values for vehicle, episode_id, release_date, and pilot.

StarWarsFilmVehPltPlnt %>% 
  filter(is.na(episode_id))
ABCDEFGHIJ0123456789
vehicle
<chr>
title
<chr>
episode_id
<int>
release_date
<date>
film_planet
<chr>
pilot
<chr>
NAThe Phantom MenaceNA<NA>TatooineNA
NAThe Phantom MenaceNA<NA>NabooNA
NAThe Phantom MenaceNA<NA>CoruscantNA
NAAttack of the ClonesNA<NA>TatooineNA
NAAttack of the ClonesNA<NA>NabooNA
NAAttack of the ClonesNA<NA>CoruscantNA
NAAttack of the ClonesNA<NA>KaminoNA
NAAttack of the ClonesNA<NA>GeonosisNA
NARevenge of the SithNA<NA>TatooineNA
NARevenge of the SithNA<NA>AlderaanNA

Use the anti_join() to join StarWarsFilmVehPltPlnt to StarWarsFilms and create a dataset with only title and film_planet (these should only contain the titles not in StarWarsFilms).

anti_join(x = StarWarsFilmVehPltPlnt, 
            y = ____________, 
            by = c("title", "episode_id", "release_date")) %>% 
  select(________, ________)
anti_join(x = StarWarsFilmVehPltPlnt, 
            y = StarWarsFilms, 
            by = c("title", "episode_id", "release_date")) %>% 
  select(title, film_planet)
ABCDEFGHIJ0123456789
title
<chr>
film_planet
<chr>
The Phantom MenaceTatooine
The Phantom MenaceNaboo
The Phantom MenaceCoruscant
Attack of the ClonesTatooine
Attack of the ClonesNaboo
Attack of the ClonesCoruscant
Attack of the ClonesKamino
Attack of the ClonesGeonosis
Revenge of the SithTatooine
Revenge of the SithAlderaan

5 Nested data

Nested data refers to a dataset in which a column is holding a list of data frames (or tibbles). These can be useful for manipulating data, and we’re going to give an example of using nested data to reshape a column combining two sets of dates.

5.1 Anti-Doping Sanctions Database

unnest() deals with nested data, which can be complicated to explain, so we’ll use an example. Run the code below to download data from the Anti-Doping Agency Sanction database.

source("code/01-create-usada-data.R")

These data have two dates (original and updated) in a single column (sanction_announced). We only want updated dates in their own columns.

UsadaBadDates
ABCDEFGHIJ0123456789
athlete
<chr>
sanction_announced
<chr>
sport
<chr>
fischbach, dylanoriginal: 12/18/2015; updated: 04/11/2016paralympic basketball
trafeh, mohamedoriginal: 12/18/2014; updated: 08/25/2015track and field
lea, robertoriginal: 12/17/2015; updated: 02/25/2016cycling
young, jeromeoriginal: 11/10/2004; updated: 06/17/2008track and field
hay, amyoriginal: 10/31/2017; updated: 12/16/2019weightlifting
hunter, adamoriginal: 10/28/2016; updated: 09/26/2018mixed martial arts
tovar, oscaroriginal: 10/28/2015; updated: 10/04/2016cycling
dotti, juan pablooriginal: 10/20/2011; updated: 06/05/2013cycling
murdock, vinceoriginal: 09/05/2019; updated: 08/26/2020mixed martial arts
thomas, tammyoriginal: 08/30/2002; updated: 02/13/2017cycling

Use the stringr package’s str_split function to split the bad dates on the "updated" pattern.

UsadaBadDates %>%
        dplyr::mutate(sanction_dates = 
    # 1) split this on the "updated" pattern
             stringr::str_split(string = sanction_announced, 
                                pattern = "updated")) %>% 
    # only return what we're interested in 
    dplyr::select(athlete, sanction_announced, sanction_dates)
ABCDEFGHIJ0123456789
athlete
<chr>
sanction_announced
<chr>
sanction_dates
<list>
fischbach, dylanoriginal: 12/18/2015; updated: 04/11/2016<chr [2]>
trafeh, mohamedoriginal: 12/18/2014; updated: 08/25/2015<chr [2]>
lea, robertoriginal: 12/17/2015; updated: 02/25/2016<chr [2]>
young, jeromeoriginal: 11/10/2004; updated: 06/17/2008<chr [2]>
hay, amyoriginal: 10/31/2017; updated: 12/16/2019<chr [2]>
hunter, adamoriginal: 10/28/2016; updated: 09/26/2018<chr [2]>
tovar, oscaroriginal: 10/28/2015; updated: 10/04/2016<chr [2]>
dotti, juan pablooriginal: 10/20/2011; updated: 06/05/2013<chr [2]>
murdock, vinceoriginal: 09/05/2019; updated: 08/26/2020<chr [2]>
thomas, tammyoriginal: 08/30/2002; updated: 02/13/2017<chr [2]>

This creates a nested column (sanction_dates), which we can deal with by unnest()ing it. Let’s take a better look with listviewer (but only look at athlete and sanction_dates).

library(listviewer)
UsadaBadDates %>%
        dplyr::mutate(sanction_dates = 
    # 1) split this on the "updated" pattern
             stringr::str_split(string = sanction_announced, 
                                pattern = "updated")) %>% 
    # only return what we're interested in 
    dplyr::select(athlete, sanction_dates) %>% 
    listviewer::jsonedit(mode = "view")
object
{2}
athlete
[20]
sanction_dates
[20]

After unnesting, we can see the dates have been repeated down the rows (like separate_rows()).

UsadaBadDates %>%
        dplyr::mutate(sanction_dates = 
    # 1) split this on the "updated" pattern
             stringr::str_split(string = sanction_announced, 
                                pattern = "updated")) %>% 
    # convert the output from split into multiple rows
        tidyr::unnest(sanction_dates) %>% 
    # check this to see what is happening 
    dplyr::select(athlete, sanction_announced, sanction_dates)
ABCDEFGHIJ0123456789
athlete
<chr>
sanction_announced
<chr>
sanction_dates
<chr>
fischbach, dylanoriginal: 12/18/2015; updated: 04/11/2016original: 12/18/2015;
fischbach, dylanoriginal: 12/18/2015; updated: 04/11/2016: 04/11/2016
trafeh, mohamedoriginal: 12/18/2014; updated: 08/25/2015original: 12/18/2014;
trafeh, mohamedoriginal: 12/18/2014; updated: 08/25/2015: 08/25/2015
lea, robertoriginal: 12/17/2015; updated: 02/25/2016original: 12/17/2015;
lea, robertoriginal: 12/17/2015; updated: 02/25/2016: 02/25/2016
young, jeromeoriginal: 11/10/2004; updated: 06/17/2008original: 11/10/2004;
young, jeromeoriginal: 11/10/2004; updated: 06/17/2008: 06/17/2008
hay, amyoriginal: 10/31/2017; updated: 12/16/2019original: 10/31/2017;
hay, amyoriginal: 10/31/2017; updated: 12/16/2019: 12/16/2019

We can see all of the updated dates are on a new line, and the original dates still contain the text. Now I can filter out the rows with an original date (because I am only interested in the updated dates).

UsadaBadDates %>%
        dplyr::mutate(sanction_dates = 
    # 1) split this on the "updated" pattern
             stringr::str_split(string = sanction_announced, 
                                pattern = "updated")) %>% 
    # 2) convert the output from split into multiple rows
        tidyr::unnest(sanction_dates) %>% 
    
    # 3) remove the "original" dates 
        dplyr::filter(!str_detect(string = sanction_dates, 
                                  pattern = "original")) %>% 
    # check this to see what is happening 
    dplyr::select(athlete, sanction_announced, sanction_dates)
ABCDEFGHIJ0123456789
athlete
<chr>
sanction_announced
<chr>
sanction_dates
<chr>
fischbach, dylanoriginal: 12/18/2015; updated: 04/11/2016: 04/11/2016
trafeh, mohamedoriginal: 12/18/2014; updated: 08/25/2015: 08/25/2015
lea, robertoriginal: 12/17/2015; updated: 02/25/2016: 02/25/2016
young, jeromeoriginal: 11/10/2004; updated: 06/17/2008: 06/17/2008
hay, amyoriginal: 10/31/2017; updated: 12/16/2019: 12/16/2019
hunter, adamoriginal: 10/28/2016; updated: 09/26/2018: 09/26/2018
tovar, oscaroriginal: 10/28/2015; updated: 10/04/2016: 10/04/2016
dotti, juan pablooriginal: 10/20/2011; updated: 06/05/2013: 06/05/2013
murdock, vinceoriginal: 09/05/2019; updated: 08/26/2020: 08/26/2020
thomas, tammyoriginal: 08/30/2002; updated: 02/13/2017: 02/13/2017

Now I can clean up the updated sanction_dates and format it.

UsadaBadDates %>%
        dplyr::mutate(sanction_dates = 
    # 1) split this on the "updated" pattern
             stringr::str_split(string = sanction_announced, 
                                pattern = "updated")) %>% 
    # 2) convert the output from split into multiple rows
        tidyr::unnest(sanction_dates) %>% 
    
    # 3) remove the "original" dates 
        dplyr::filter(!str_detect(string = sanction_dates, 
                                  pattern = "original")) %>% 
    # 4) remove the colon from sanction_dates
        dplyr::mutate(sanction_dates = stringr::str_remove_all(
                                                      string = sanction_dates,
                                                      pattern = ":"),
                      # 5) remove any whitespace
                      sanction_dates = stringr::str_trim(sanction_dates),
                      # 6) format as date
                      sanction_dates = lubridate::mdy(sanction_dates)) %>%
    # check this to see what is happening 
    dplyr::select(athlete, sanction_announced, sanction_dates)
ABCDEFGHIJ0123456789
athlete
<chr>
sanction_announced
<chr>
sanction_dates
<date>
fischbach, dylanoriginal: 12/18/2015; updated: 04/11/20162016-04-11
trafeh, mohamedoriginal: 12/18/2014; updated: 08/25/20152015-08-25
lea, robertoriginal: 12/17/2015; updated: 02/25/20162016-02-25
young, jeromeoriginal: 11/10/2004; updated: 06/17/20082008-06-17
hay, amyoriginal: 10/31/2017; updated: 12/16/20192019-12-16
hunter, adamoriginal: 10/28/2016; updated: 09/26/20182018-09-26
tovar, oscaroriginal: 10/28/2015; updated: 10/04/20162016-10-04
dotti, juan pablooriginal: 10/20/2011; updated: 06/05/20132013-06-05
murdock, vinceoriginal: 09/05/2019; updated: 08/26/20202020-08-26
thomas, tammyoriginal: 08/30/2002; updated: 02/13/20172017-02-13

How far can this pipe go??

Well, when the data are arranged in the way we want, we can test it with a visualization to check the number of sanctions per sport with bad dates.

UsadaBadDates %>%
        dplyr::mutate(sanction_dates = 
                        
    # 1) split this on the "updated" pattern
             stringr::str_split(string = sanction_announced, 
                                pattern = "updated")) %>% 
  
    # 2) convert the output from split into multiple rows
        tidyr::unnest(sanction_dates) %>% 
  
    # 3) remove the "original" dates 
        dplyr::filter(!str_detect(string = sanction_dates, 
                                  pattern = "original")) %>% 
    # 4) remove the colon from sanction_dates
        dplyr::mutate(sanction_dates = stringr::str_remove_all(
                                                      string = sanction_dates,
                                                      pattern = ":"),
                      
                      # 5) remove any whitespace
                      sanction_dates = stringr::str_trim(sanction_dates),
                      
                      # 6) format as date
                      sanction_dates = lubridate::mdy(sanction_dates)) %>% 
  
    # count by sport
    dplyr::count(sport, sort = TRUE) %>% 
  
    # rename n 
    rename(`Sport Count` = n,
            Sport = sport) %>% 
    ggplot2::ggplot(aes(x = `Sport Count`, 
                        y = forcats::fct_reorder(as.factor(Sport), `Sport Count`), 
                        fill = as.character(`Sport Count`))) + 
    geom_col(show.legend = FALSE) + 
    labs(x = "Number of Sanctions", y = "Sport",
         title = "Number of Sanctions with Bad Dates",
         caption = "source: https://www.usada.org/news/sanctions/") + 
    ggthemes::theme_fivethirtyeight() + 
    ggthemes::scale_fill_few()