layout: true <!-- this adds the link footer to all slides, depends on footer-small class in css--> <div class="footer-small"><span>https://github.com/mjfrigaard/csuc-data-journalism</div> --- name: title-slide class: title-slide, center, middle, inverse # Data Reshaping in R #.fancy[Reshaping and joining data in R] <br> .large[by Martin Frigaard] Written: September 21 2021 Updated: November 30 2021 .footer-large[.right[.fira[ <br><br><br><br><br>[Created using the "λέξις" theme](https://jhelvy.github.io/lexis/index.html#what-does-%CE%BB%CE%AD%CE%BE%CE%B9%CF%82-mean) ]]] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 7% background-size: 7% # Objectives <br><br> ## 1) Recap `dplyr`'s data manipulation verbs -- ## 2) `tidyr`'s pivoting functions -- ## 3) Joins with `dplyr` --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 7% background-size: 7% # Materials ### Follow along with the exercises: https://mjfrigaard.github.io/csuc-data-journalism/lessons.html ### A web version of these slides is located: https://mjfrigaard.github.io/csuc-data-journalism/slides.html --- class: left, top background-image: url(img/dplyr.png) background-position: 95% 7% background-size: 7% # Data Manipulation Recap ## We previously learned how to: -- .leftcol[ > #### 1) View data with `glimpse()` > #### 2) Select columns with `select()` > #### 3) Filter rows with `filter()` ] -- .rightcol[ > #### 4) Arrange data with `arrange()` > #### 5) Create/change columns with `mutate()` ] --- class: inverse, center, bottom background-image: url(img/dplyr.png) background-position: 50% 10% background-size: 35% # `dplyr` = a package for *manipulating* data --- class: inverse, center, bottom background-image: url(img/tidyr.png) background-position: 50% 10% background-size: 35% # `tidyr` = a package for *reshaping* data --- class: left, top background-image: url(img/tidyr.png) background-position: 90% 10% background-size: 12% # Tidy data ## What are tidy data? -- .leftcol[ ### Observations are in rows ### Variables are in columns ### Values are in cells ] -- .rightcol[ <img src="img/tidy-data.png" width="80%" height="80%" style="display: block; margin: auto;" /> ] --- class: left, top background-image: url(img/tidyr.png) background-position: 90% 10% background-size: 12% # Tidy data ### Values are the *intersection* of observations and variables -- <img src="img/intersection.png" width="50%" height="50%" style="display: block; margin: auto;" /> --- class: left, top background-image: url(img/tidyr.png) background-position: 90% 10% background-size: 8% # Non-tidy data ### Copy and paste the code below to create `NotTidy` ```r # copy and paste me! NotTidy <- tibble::tribble( ~group, ~`2019`, ~`2020`, ~`2021`, "A", "102/100", "123/100", "161/100", "B", "179/100", "199/100", "221/100", "C", "223/100", "146/100", "288/100") ``` .small[ <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["group"],"name":[1],"type":["chr"],"align":["left"]},{"label":["2019"],"name":[2],"type":["chr"],"align":["left"]},{"label":["2020"],"name":[3],"type":["chr"],"align":["left"]},{"label":["2021"],"name":[4],"type":["chr"],"align":["left"]}],"data":[{"1":"A","2":"102/100","3":"123/100","4":"161/100"},{"1":"B","2":"179/100","3":"199/100","4":"221/100"},{"1":"C","2":"223/100","3":"146/100","4":"288/100"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ] --- class: left, top background-image: url(img/tidyr.png) background-position: 90% 10% background-size: 12% # Non-tidy data ## Why aren't they tidy? .leftcol[ ```r NotTidy %>% View("NotTidy") ``` .border[ <img src="img/NotTidy.png" width="90%" height="90%" style="display: block; margin: auto;" /> ]] -- .rightcol[.large[ > `year` is across columns... > multiple values in cells... ]] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 8% background-size: 8% ## Tidying un-tidy data -- *covered in slides* + `pivot_longer()` - wide to long + `pivot_wider()` - long to wide *** -- *covered in exercises* .leftcol[ `separate()` - pull columns apart `separate_rows()` - split columns down rows `unite()` - stick columns together ] .rightcol[ `unnest()` - flatten columns `uncount()` - duplicate rows according to a weighting variable ] --- class: left, top background-image: url(img/tidyr.png) background-position: 90% 10% background-size: 7% # Quick tip: viewing your data <br><br> ## Make sure you view the data before assigning it to an object <br><br> ## Use `glimpse()` or `View("Name")` --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 8% background-size: 7% ## `tidyr::pivot_longer()` .leftcol[ Make wide data long ```r NotTidy %>% pivot_longer( cols = -group, names_to = "year", values_to = "rate") %>% View("Tidy") ``` ] -- .rightcol[ How it `pivot_longer()` works <img src="img/NotTidy-2-Tidy.png" width="57%" height="57%" style="display: block; margin: auto;" /> ] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 8% background-size: 7% ## `tidyr::pivot_longer()` #### How it works: -- .leftcol[ `cols` = these are the **columns we want to reshape** .small[ ```r NotTidy %>% * pivot_longer(cols = -group, ``` ] `names_to` = this is the new variable that will contain the previous **column names** .small[ ```r NotTidy %>% pivot_longer(cols = -group, * names_to = "year", ``` ] ] -- .rightcol[ `values_to` = this is the new variable that will contain the reshaped **values** .small[ ```r NotTidy %>% pivot_longer(cols = -group, names_to = "year", * values_to = "rate") ``` ] ] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 8% background-size: 7% ## `tidyr::pivot_longer()` .leftcol[ ### Looks correct? ```r NotTidy %>% pivot_longer(cols = -group, names_to = "year", values_to = "rate") %>% View("Tidy") ``` ] -- .rightcol[ ### Not quite .border[ <img src="img/TidyView.png" width="60%" height="60%" style="display: block; margin: auto;" /> ]] --- class: left, top background-image: url(img/tidyr.png) background-position: 93% 7% background-size: 7% ## `pivot_longer()` = names_transform ### Print `Tidy` to console .leftcol[ ```r Tidy ``` .small[ <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["group"],"name":[1],"type":["chr"],"align":["left"]},{"label":["year"],"name":[2],"type":["chr"],"align":["left"]},{"label":["rate"],"name":[3],"type":["chr"],"align":["left"]}],"data":[{"1":"A","2":"2019","3":"102/100"},{"1":"A","2":"2020","3":"123/100"},{"1":"A","2":"2021","3":"161/100"},{"1":"B","2":"2019","3":"179/100"},{"1":"B","2":"2020","3":"199/100"},{"1":"B","2":"2021","3":"221/100"},{"1":"C","2":"2019","3":"223/100"},{"1":"C","2":"2020","3":"146/100"},{"1":"C","2":"2021","3":"288/100"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ]] -- .rightcol[ <br><br> #### *Note the format of the columns* #### *The new `year` variable should be numeric* ] --- class: left, top background-image: url(img/tidyr.png) background-position: 93% 7% background-size: 7% ## `pivot_longer()` = `names_transform` #### We can control this behavior with `names_transform = list()` -- .leftcol[ ```r NotTidy %>% pivot_longer( cols = -group, names_to = "year", values_to = "rate", * names_transform = list( * year = as.numeric)) ``` ] -- .rightcol[.small[ <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["group"],"name":[1],"type":["chr"],"align":["left"]},{"label":["year"],"name":[2],"type":["dbl"],"align":["right"]},{"label":["rate"],"name":[3],"type":["chr"],"align":["left"]}],"data":[{"1":"A","2":"2019","3":"102/100"},{"1":"A","2":"2020","3":"123/100"},{"1":"A","2":"2021","3":"161/100"},{"1":"B","2":"2019","3":"179/100"},{"1":"B","2":"2020","3":"199/100"},{"1":"B","2":"2021","3":"221/100"},{"1":"C","2":"2019","3":"223/100"},{"1":"C","2":"2020","3":"146/100"},{"1":"C","2":"2021","3":"288/100"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ]] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 8% background-size: 7% ## `pivot_longer()` = names_sep <br><br><br> .leftcol40[ Create the `SiteRates` data (example of COVID rates as various hospitals) ] .rightcol60[ .code60[ ```r SiteRates <- tibble::tribble( ~site, ~`2019_Q1`, ~`2019_Q2`, ~`2019_Q3`, ~`2019_Q4`, "Boston", 52, 31, 26, 33.4, "Philadelphia", 7.42, 5.51, 5.82, 6.99, "Cincinnati", 6.73, 4.87, 5.02, 4.66, "Texas", 18.2, 16.6, 17, 19) ``` ]] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 8% background-size: 7% ## `pivot_longer()` = names_sep Create the `SiteRates` data (example of COVID rates as various hospitals) ```r SiteRates %>% View("SiteRates") ``` .border[ <img src="img/SiteRates.png" width="85%" height="85%" style="display: block; margin: auto;" /> ] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## `pivot_longer()` = names_sep #### `SiteRates` has *two* variables in the same column #### We can use `names_sep ` uses a pattern to split the column (`_Q`) .border[ <img src="img/SiteRatesView.png" width="75%" height="75%" style="display: block; margin: auto;" /> ] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## `pivot_longer()` = names_sep + Add `names_sep = "_Q"` + `year` and `quarter` should also be numeric .leftcol[ ```r SiteRates %>% pivot_longer( -site, names_to = c("year", "quarter"), values_to = "rate", * names_sep = "_Q", names_transform = list( * year = as.integer, * quarter = as.integer)) %>% View("TidySites") ``` ] -- .rightcol[ .border[ <img src="img/TidySites.png" width="47%" height="47%" style="display: block; margin: auto;" /> ]] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## More `pivot_longer()` options Create the `SpaceDogs` data. These are names of dogs in the [Soviet Space Dogs database](https://airtable.com/universe/expG3z2CFykG1dZsp/sovet-space-dogs). *Some dogs have multiple names, and some share names.* .code70[ ```r # data.frame prints to the screen better SpaceDogs <- data.frame( date = c("1966-02-22", "1961-03-25", "1961-03-09", "1960-12-22", "1960-12-01", "1960-09-22"), result = c("recovered safely", "recovered safely", "recovered safely", "recovered", "both dogs died","recovered safely"), name_1 = c("Ugolyok / Snezhok", "Zvezdochka", "Chernuskha", "Shutka", "Mushka", "Kusachka / Otvazhnaya"), name_2 = c("Veterok / Bzdunok", NA, NA, "Kometka", "Pchyolka", "Neva")) ``` ] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## More `pivot_longer()` options ```r SpaceDogs %>% View("SpaceDogs") ``` .border[ <img src="img/SpaceDogs.png" width="50%" height="50%" style="display: block; margin: auto;" /> ] The `SpaceDogs` data has missing values in `name_2` -- *...two of the columns have a similar prefix (`name_`)* --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## More `pivot_longer()` options ### To tidy these data, we can combine what we've learned: -- #### 1. `dplyr::starts_with()` to select the `name_` columns -- #### 2. `names_to` has a special `".value"` argument, which is the name of the new column with the `name_` values. We need to include an index column to track the values across different names (`dog_id`). -- #### 3. We can remove missing values with `values_drop_na` --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## More `pivot_longer()` options .code70[ ```r SpaceDogs %>% pivot_longer( # columns with `name_` prefix * starts_with("name_"), ``` ] -- .code70[ ```r SpaceDogs %>% pivot_longer( starts_with("name_"), names_sep = "_", names_to = # new column for `name_` values * c(".value", "dog_id"), ``` ] -- .code70[ ```r SpaceDogs %>% pivot_longer( starts_with("name_"), names_sep = "_", names_to = c(".value", "dog_id"), # remove missing * values_drop_na = TRUE) ``` ] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## More `pivot_longer()` options ### Add these arguments to `pivot_longer()` and add `View()` .leftcol[ ```r SpaceDogs %>% pivot_longer( * starts_with("name_"), names_sep = "_", names_to = c(".value", "dog_id"), * values_drop_na = TRUE) %>% View("TidySpaceDogs") ``` ] -- .rightcol[ .border[ <img src="img/TidySpaceDogs.png" width="95%" height="95%" style="display: block; margin: auto;" /> ] ] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## More `pivot_longer()` options #### We can see the new `dog_id` index #### The missing values have been removed -- <img src="img/SpaceDogs-2-TidySpaceDogs.png" width="90%" height="90%" style="display: block; margin: auto;" /> --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% # `tidyr::pivot_wider()` <br> ## We've made wide data long, now we will make long data wide ### *But, why???* ### Wide data is usually better for displaying summaries ### Long data is better for graphing/modeling --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% # `tidyr::pivot_wider()` ### Consider the `CatVsDogWide` data .leftcol[ #### These data come from this article in the [Washington Post](https://www.washingtonpost.com/news/wonk/wp/2014/07/28/where-cats-are-more-popular-than-dogs-in-the-u-s-and-all-over-the-world/?utm_term=.b50cb49b78b5). ] -- .rightcol[ .border[ <img src="img/wapo-article.png" width="100%" height="100%" style="display: block; margin: auto;" /> ]] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## `tidyr::pivot_wider()` <br><br><br> .leftcol35[ #### Create `CatVsDogWide` ] -- .rightcol65[ .code70[ ```r CatVsDogWide <- tibble::tribble( ~metric, ~CA, ~TX, ~FL, ~NY, ~PA, "no_of_households", 12974, 9002, 7609, 7512, 5172, "no_of_pet_households", 6865, 5265, 4138, 3802, 2942, "no_of_dog_households", 4260, 3960, 2718, 2177, 1702, "no_of_cat_households", 3687, 2544, 2079, 2189, 1748) ``` ]] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## `tidyr::pivot_wider()` ```r CatVsDogWide %>% View("CvDWide") ``` .border[ <img src="img/CvDWide.png" width="88%" height="88%" style="display: block; margin: auto;" /> ] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## `tidyr::pivot_wider()` Step 1. `pivot_long()` states into `state` column, values into `value` column -- .leftcol35[ ```r CatVsDogWide %>% pivot_longer(-metric, names_to = "state", values_to = "value") %>% # view View("CvDLong") ``` ] -- .rightcol65[.border[ <img src="img/CvDLong.png" width="34%" height="34%" style="display: block; margin: auto;" /> ]] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## `tidyr::pivot_wider()` **Calculate percent pets per household, dog owners, and cat owners** Step 1. `pivot_long()` states into `state` column, values into `value` column -- Step 2. `pivot_wider()` the `metric` across columns -- ```r CatVsDogWide %>% pivot_longer(-metric, names_to = "state", values_to = "value") %>% pivot_wider(names_from = "metric", values_from = "value") %>% View("CvDWide") ``` -- .border[ <img src="img/CvDWide2.png" width="70%" height="70%" style="display: block; margin: auto;" /> ] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## `tidyr::pivot_wider()` **Calculate percent pets per household, dog owners, and cat owners** Step 1. `pivot_long()` states into `state` column, values into `value` column Step 2. `pivot_wider()` the `metric` and `values` across columns -- Step 3. `mutate()` the `perc` (percentage) columns -- ```r CatVsDogWide %>% pivot_longer(-metric, names_to = "state", values_to = "value") %>% pivot_wider(names_from = "metric", values_from = "value") %>% mutate( perc_pet_household = no_of_pet_households / no_of_households * 100, perc_pet_household = round(perc_pet_household, digits = 1), perc_dog_owners = no_of_dog_households / no_of_households * 100, perc_dog_owners = round(perc_dog_owners, digits = 1), perc_cat_owners = no_of_cat_households / no_of_households * 100, perc_cat_owners = round(perc_cat_owners, digits = 1)) %>% View("CvDWide") ``` --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## `tidyr::pivot_wider()` **Calculate percent pets per household, dog owners, and cat owners** Step 1. `pivot_long()` states into `state` column, values into `value` column Step 2. `pivot_wider()` the `metric` and `values` across columns Step 3. `mutate()` the `perc` (percentage) columns -- .border[ <img src="img/CvDWide2-perc.png" width="100%" height="100%" style="display: block; margin: auto;" /> ] --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## `tidyr::pivot_wider()` #### *We could assign here, but why stop?* -- Add `select()` helpers to reorganize data! -- ```r # ...code omitted perc_cat_owners = no_of_cat_households / no_of_households * 100, perc_cat_owners = round(perc_cat_owners, digits = 1)) %>% # pass the output from pivot_wider() to select() select(state, contains("perc_dog"), contains("perc_cat")) ``` --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## `tidyr::pivot_wider()` (***Final code***) ```r CatVsDogWide %>% # pivots pivot_longer( -metric, names_to = "state", values_to = "value") %>% pivot_wider( names_from = "metric", values_from = "value") %>% # mutate mutate( perc_pet_household = no_of_pet_households / no_of_households * 100, perc_pet_household = round(perc_pet_household, digits = 1), perc_dog_owners = no_of_dog_households / no_of_households * 100, perc_dog_owners = round(perc_dog_owners, digits = 1), perc_cat_owners = no_of_cat_households / no_of_households * 100, perc_cat_owners = round(perc_cat_owners, digits = 1)) %>% # select select(state, contains("perc_dog"), contains("perc_cat")) ``` --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 5% background-size: 7% ## `tidyr::pivot_wider()` (***Final output***) <img src="img/CvsDPercent.png" width="75%" height="75%" style="display: block; margin: auto;" /> --- class: left, top background-image: url(img/tidyr.png) background-position: 95% 90% background-size: 8% # More tidying in the exercises! ### `separate()` - pull columns apart ### `separate_rows()` - split columns down rows ### `unite()` - stick columns together ### `unnest()` - flatten columns ### `uncount()` - duplicate rows according to a weighting variable --- class: inverse, center, bottom background-image: url(img/dplyr.png) background-position: 50% 20% background-size: 32% <br><br><br><br><br> # `dplyr` = a package for *manipulating relational* data --- background-image: url(img/dplyr.png) background-position: 90% 10% background-size: 8% class: left, top # The `dplyr` joining functions ### ***`dplyr` has functions for joining multiple tibbles or data.frames*** -- ### `left_join()` -- ### `right_join()` -- ### `inner_join()` -- ### `full_join()` -- *Recall that `tibble`s and `data.frame`'s are nearly identical --- background-image: url(img/dplyr.png) background-position: 90% 10% background-size: 12% class: left, top # dplyr joins .leftcol[ ### Toy data `X` .code70[ ```r # create X table X <- tibble::tribble( ~A, ~B, ~C, "a", "t", 1L, "b", "u", 2L, "c", "v", 3L) ``` .small[ <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["A"],"name":[1],"type":["chr"],"align":["left"]},{"label":["B"],"name":[2],"type":["chr"],"align":["left"]},{"label":["C"],"name":[3],"type":["int"],"align":["right"]}],"data":[{"1":"a","2":"t","3":"1"},{"1":"b","2":"u","3":"2"},{"1":"c","2":"v","3":"3"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ] ] ] -- .rightcol[ ### Toy data `Y` .code70[ ```r # create Y table Y <- tibble::tribble( ~A, ~B, ~D, "a", "t", 3L, "b", "u", 2L, "d", "W", 1L) ``` .small[ <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["A"],"name":[1],"type":["chr"],"align":["left"]},{"label":["B"],"name":[2],"type":["chr"],"align":["left"]},{"label":["D"],"name":[3],"type":["int"],"align":["right"]}],"data":[{"1":"a","2":"t","3":"3"},{"1":"b","2":"u","3":"2"},{"1":"d","2":"W","3":"1"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ] ] ] --- background-image: url(img/dplyr.png) background-position: 90% 10% background-size: 12% class: left, top # `dplyr` left joins ### `left_join(x = , y = )` #### ...joins on matches *from* right-hand table (`Y`) *to* left-hand table (`X`) -- .leftcol[ Keep all data from `X`, and only matching data from `Y` ```r left_join( * x = X, * y = Y ) ``` ] -- .rightcol[ This creates: .small[ <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["A"],"name":[1],"type":["chr"],"align":["left"]},{"label":["B"],"name":[2],"type":["chr"],"align":["left"]},{"label":["C"],"name":[3],"type":["int"],"align":["right"]},{"label":["D"],"name":[4],"type":["int"],"align":["right"]}],"data":[{"1":"a","2":"t","3":"1","4":"3"},{"1":"b","2":"u","3":"2","4":"2"},{"1":"c","2":"v","3":"3","4":"NA"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ]] --- background-image: url(img/left-join-01.png) background-position: 50% 80% background-size: 45% class: left, top # `dplyr` left joins (1) ### Left joins use all the data from `X` (the left-hand table) --- background-image: url(img/left-join-02.png) background-position: 50% 70% background-size: 90% class: left, top # `dplyr` left joins (2) ### Left joins include matched data in the right-hand table `Y`, and it carries over any new corresponding columns --- background-image: url(img/left-join-03.png) background-position: 50% 75% background-size: 90% class: left, top # `dplyr` left join (3) ### The final data includes the new column(s) from `Y` (the right-hand table), and missing values for the unmatched rows. --- background-image: url(img/dplyr.png) background-position: 90% 10% background-size: 12% class: left, top # `dplyr` right joins ### `right_join(x = , y = )` #### ...join on matches *from* right-hand table (`Y`) *to* left-hand table (`X`) -- .leftcol[ Keep all data from `Y`, and only matching data from `X` ```r right_join(x = X, y = Y) ``` ] -- .rightcol[ This creates: .small[ <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["A"],"name":[1],"type":["chr"],"align":["left"]},{"label":["B"],"name":[2],"type":["chr"],"align":["left"]},{"label":["C"],"name":[3],"type":["int"],"align":["right"]},{"label":["D"],"name":[4],"type":["int"],"align":["right"]}],"data":[{"1":"a","2":"t","3":"1","4":"3"},{"1":"b","2":"u","3":"2","4":"2"},{"1":"d","2":"W","3":"NA","4":"1"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ] ] --- background-image: url(img/right-join-01.png) background-position: 50% 75% background-size: 40% class: left, top # `dplyr` right joins (1) ### Right joins use all the data from `Y` (the right-hand table) --- background-image: url(img/right-join-02.png) background-position: 50% 75% background-size: 85% class: left, top # `dplyr` right joins (2) ### Right joins include the matched data in the left-hand table `X`, and they carry over any new corresponding columns --- background-image: url(img/right-join-03.png) background-position: 50% 75% background-size: 92% class: left, top # `dplyr` right joins (3) ### The final data includes the new column(s) from `X` (the left-hand table), and missing values for the unmatched rows. --- background-image: url(img/dplyr.png) background-position: 90% 10% background-size: 12% class: left, top # `dplyr` inner joins ### `inner_join(x = , y = )` #### ...keep only matches in *both* `x` *and* `y` -- .leftcol[ Keep only the matching data from `X` and `Y` ```r inner_join(x = X, y = Y) ``` ] -- .rightcol[ This creates: .small[ <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["A"],"name":[1],"type":["chr"],"align":["left"]},{"label":["B"],"name":[2],"type":["chr"],"align":["left"]},{"label":["C"],"name":[3],"type":["int"],"align":["right"]},{"label":["D"],"name":[4],"type":["int"],"align":["right"]}],"data":[{"1":"a","2":"t","3":"1","4":"3"},{"1":"b","2":"u","3":"2","4":"2"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ]] --- background-image: url(img/inner-join-01.png) background-position: 50% 75% background-size: 40% class: left, top # `dplyr` inner joins (1) ### Inner joins use only the matched data from both the `X` and `Y` tables --- background-image: url(img/inner-join-02.png) background-position: 50% 60% background-size: 55% class: left, top # `dplyr` inner joins (2) ### Columns `A` and `B` are matched in both `X` and `Y` tables --- background-image: url(img/inner-join-03.png) background-position: 50% 60% background-size: 85% class: left, top # `dplyr` inner joins (3) ### Column `C` from table `X` gets joined on matching columns `A` and `B` --- background-image: url(img/inner-join-04.png) background-position: 50% 60% background-size: 90% class: left, top # `dplyr` inner joins (4) ### Column `D` from table `Y` gets joined on matching columns `A` and `B` --- background-image: url(img/dplyr.png) background-position: 90% 10% background-size: 12% class: left, top # `dplyr` full joins ## `full_join(x = X, y = Y)` ## ...keep *all* data in both `x` and `y` -- .leftcol[ Keep all data from `Y` and `X` ```r full_join(x = X, y = Y) ``` ] -- .rightcol[ This creates: .small[ <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["A"],"name":[1],"type":["chr"],"align":["left"]},{"label":["B"],"name":[2],"type":["chr"],"align":["left"]},{"label":["C"],"name":[3],"type":["int"],"align":["right"]},{"label":["D"],"name":[4],"type":["int"],"align":["right"]}],"data":[{"1":"a","2":"t","3":"1","4":"3"},{"1":"b","2":"u","3":"2","4":"2"},{"1":"c","2":"v","3":"3","4":"NA"},{"1":"d","2":"W","3":"NA","4":"1"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ]] --- background-image: url(img/full-join-01.png) background-position: 50% 60% background-size: 50% class: left, top # `dplyr` full joins (1) ### Full joins include all data from both tables `X` and `Y` --- background-image: url(img/full-join-02.png) background-position: 50% 60% background-size: 70% class: left, top # `dplyr` full joins (2) ### Full joins start with all data in table `X` --- background-image: url(img/full-join-03.png) background-position: 50% 70% background-size: 95% class: left, top # `dplyr` full joins (3) ### Full joins start with all data in table `X` and include the columns and rows from table `Y` --- class: left, top background-image: url(img/dplyr.png) background-position: 7% 92% background-size: 8% # Resources for Data Tidying #### 1. [R for Data Science](https://r4ds.had.co.nz/tidy-data.html) #### 2. [Data Wrangling with R](https://cengel.github.io/R-data-wrangling/) #### 3. [Stack Overflow questions tagged with `tidyr`](https://stackoverflow.com/questions/tagged/tidyr) #### 4. [RStudio Community posts tagged `tidyr`](https://community.rstudio.com/tag/tidyr)