1 Objectives

In order to use graphs and figures to effectively communicate with our audience, we need to consider a few things:

  1. Recognize the needs of your audience (who are they, and where are they coming from?)
  • level of data literacy, subject matter expertise, etc.
  1. Communicate the quality of the data with stakeholders (can we answer their question(s) with the available data?)
  • let them know the good and the bad news
  1. Identify the correct data visualization (based on the data and the problem/question)
  • single variable, bivariate, and multivariate graphs

2 Materials

3 Previous lessons

All of the exercises and lessons are available here. Read more about ggplot2 on the tidyverse website, and in the Data Visualisation chapter of R for Data Science.

4 Load the packages

The main packages we’re going to use are dplyr, tidyr, and ggplot2. These are all part of the tidyverse, so we’ll import this package below:

install.packages("tidyverse")
library(tidyverse)

5 Example: COVID and Mobility

Assume we received the following questions from a stakeholder:

How has COVID changed our modes of transportation?

Or

Are people using fewer or different forms of transportation since the COVID pandemic?

Questions we should be considering:

  1. What kind of measurements would this be?
  • how people travel (walk, drive, etc.)
  1. What would these data look like?
  • what would the columns and rows look like?

5.1 Data Import

We’re going to use the following data to answer the stakeholder’s questions:

Import the data below

AppleMobRaw <- readr::read_csv("https://bit.ly/36tTVpe")

5.1.1 Tidy AppleMobRaw

These data need to be restructured into a tidy format.

AppleMobRaw %>% 
  tidyr::pivot_longer(cols = -c(geo_type:country), 
                      names_to = "date", 
                      values_to = "dir_request")

5.1.2 Wrangle AppleMobRaw

We will remove the missing values from country and sub-region

AppleMobRaw %>% 
  tidyr::pivot_longer(cols = -c(geo_type:country), 
                      names_to = "date", values_to = "dir_request") %>% 
    # remove missing country and missing sub-region data
  dplyr::filter(!is.na(country) & !is.na(`sub-region`))

Use mutate() to create a properly formatted date variable, and rename() the transportation_type variable to trans_type. Apply janitor::clean_names() to the entire dataset and assign the final output to TidyApple.

AppleMobRaw %>% 
  tidyr::pivot_longer(cols = -c(geo_type:country), 
                      names_to = "date", values_to = "dir_request") %>% 
    # remove missing country and missing sub-region data
  dplyr::filter(!is.na(country) & !is.na(`sub-region`)) %>% 
  # format date
  mutate(date = lubridate::ymd(date)) %>% 
  # change name of transportation types
  rename(trans_type = transportation_type) %>% 
  # clean names 
  janitor::clean_names() -> TidyApple

5.2 Counting

One of the most important jobs of analytic work is counting things. There are many ways to accomplish this in R, but we’ll stick with the dplyr package because it’s part of the tidyverse.

The dplyr function for counting responses of a categorical or factor variable is count(), and it works like this:

Data %>% 
  count(variable)

5.2.1 Count

So, if we wanted to count the number of different transportation types in the TidyApple data frame, it would look like this,

TidyApple %>% 
  dplyr::count(trans_type)

5.2.2 Count & sort

We can also sort the responses using the sort = TRUE argument.

TidyApple %>% 
  dplyr::count(trans_type, sort = TRUE)

5.2.3 Iterate with count

We can also combine dplyr::select_if() and purrr::map() to pass the count() function to all the character variables in TidyApple.

TidyApple %>% 
  select_if(is.character) %>% 
  map(~count(data.frame(x = .x), x, sort = TRUE)) -> tidy_apple_counts

We can example the counts of each value by using the $ to subset the tidy_apple_counts list.

tidy_apple_counts$sub_region
tidy_apple_counts$region

6 Visualizing Distributions

Before we start looking at relationships between variables, we should examine each variable’s underlying distribution. In the next section, we’re going to cover a few graphs that display variable distributions: histograms, density, violin, and ridgeline plots,

6.1 Histograms

A histogram is a special kind of bar graph–it only takes a single continuous variable (in this case, dir_request), and it displays a relative breakdown of the values.

The x axis for the histogram will have the direction requests, and the y variable will display a count of the values.

lab_hist <- labs(x = "Apple directions requests",
                 y = "Count",
     title = "Distribution of Direction Requests",
     subtitle = "source: https://covid19.apple.com/mobility")

6.1.1 exercise

Create a histogram of direction requests using dir_request

TidyApple %>% ggplot() + 
  geom_histogram(aes(x = ____________)) + 
  lab_hist

6.1.2 solution

See blow:

TidyApple %>% ggplot() + 
  geom_histogram(aes(x = dir_request)) + 
  lab_hist

6.2 Adjusting Y Axes

We can see the y axis of the histogram is in scientific notation. This might be hard for some audiences to interpret, so we will change this to use the whole number with commas with the scales package.

6.2.1 exercise

Add the scales::comma value to the scale_y_continuous() function.

library(scales)
TidyApple %>% ggplot() + 
  geom_histogram(aes(x = dir_request)) + 
  scale_y_continuous(labels = __________) +
  lab_hist

6.2.2 solution

See below:

library(scales)
TidyApple %>% ggplot() + 
  geom_histogram(aes(x = dir_request)) + 
  scale_y_continuous(labels = scales::comma) +
  lab_hist

6.3 Histogram Shape

We can control the shape of the histogram with the bins argument. The default is 30.

6.3.1 exercie

Set bins to 15.

TidyApple %>% ggplot() + 
  geom_histogram(aes(x = dir_request), bins = __) + 
  scale_y_continuous(labels = scales::comma) +
  lab_hist

6.3.2 solution

See below:

TidyApple %>% ggplot() + 
  geom_histogram(aes(x = dir_request), bins = 15) + 
  scale_y_continuous(labels = scales::comma) +
  lab_hist

6.3.3 exercise

Set bins to 45 and assign it to gg_hist45.

TidyApple %>% ggplot() + 
  geom_histogram(aes(x = dir_request), bins = __) + 
  scale_y_continuous(labels = scales::comma) +
  lab_hist -> _____________

6.3.4 solution

See below:

TidyApple %>% ggplot() + 
  geom_histogram(aes(x = dir_request), bins = 45) + 
  scale_y_continuous(labels = scales::comma) +
  lab_hist -> gg_hist45

6.4 Density Plots

What if we want to see how a continuous variable is distributed across a categorical variable? We did this in the previous lesson with a boxplot.

Density plots come in handy here (so do geom_boxplot()s!). Read more about the density geom here.

We are going to create the graph labels so we know what to expect when we build our graph.

We want to see the distribution of the directions request, filled by the levels of transportation type.

lab_density <- labs(x = "Apple directions requests",
                    fill = "Transit Type",
     title = "Distribution of Direction Requests vs. Transportation Type",
     subtitle = "source: https://covid19.apple.com/mobility")

Now we build the density plot, passing the variables so they match our labels above.

6.4.1 exercise

Create a density plot of direction requests colored by the type of transportation.

TidyApple %>% 
  ggplot() +
  geom_density(aes(x = __________, fill = __________)) + 
  lab_density

6.4.2 solution

See below:

TidyApple %>% 
  ggplot() +
  geom_density(aes(x = dir_request, fill = trans_type)) + 
  lab_density

One drawback to density plots is the y axis can be hard to interpret

6.4.3 exercise

Adjust the overlapping densities by setting alpha to 1/3. Assign this plot to gg_density.

TidyApple %>% 
  ggplot() +
  geom_density(aes(x = dir_request, fill = trans_type), 
               alpha = __________) + 
  
  lab_density -> __________

6.4.4 solution

See below:

TidyApple %>% 
  ggplot() +
  geom_density(aes(x = dir_request, fill = trans_type), 
               alpha = 1/3) + 
  lab_density -> gg_density

gg_density

6.5 Ridgeline Plots

Another option is a ridgeline plot (from the ggridges package). These display multiple densities.

lab_ridges <- labs(
     title = "Direction Requests by Transportation Type",
     subtitle = "source: https://covid19.apple.com/mobility",
     fill = "Transit type",
     x = "Apple directions requests",
     y = "Transportation Types")
library(ggridges)
TidyApple %>%  
  ggplot() + 
  geom_density_ridges(aes(x = dir_request, 
                          y = trans_type, 
                          fill = trans_type), 
                      alpha = 1/5) + 
  lab_ridges

6.6 Violin Plots

Another alternative to the density plot is the violin plot.

6.6.1 exercise

  • assign "Apple directions requests" to the x axis
  • assign "Transit Type" to the y axis
lab_violin <- labs(x = _________________________,
                    y = _________________________,
                   fill = "Transit Type",
     title = "Distribution of Direction Requests vs. Transportation Type",
     subtitle = "source: https://covid19.apple.com/mobility")

6.6.2 solution

lab_violin <- labs(x = "Apple directions requests",
                   y = "Transit Type",
                   fill = "Transit Type",
     title = "Distribution of Direction Requests vs. Transportation Type",
     subtitle = "source: https://covid19.apple.com/mobility")

6.6.3 exercise

Add a geom_violin() to the code below:

TidyApple %>% 
  ggplot() +
  ____________(aes(y = dir_request, x = trans_type, 
                  fill = trans_type)) + 
  lab_violin

6.6.4 solution

TidyApple %>% 
  ggplot() +
  geom_violin(aes(y = dir_request, x = trans_type, 
                  fill = trans_type)) + 
  lab_violin

6.6.5 exercise

The great thing about ggplot2s layered syntax, is that we can add geoms with similar aesthetics to the same graph! For example, we can see how geom_violins and geom_boxplots are related by adding a geom_boxplot() layer to the graph above.

TidyApple %>% 
  ggplot() +
  geom_violin(aes(y = dir_request, x = trans_type, 
                  fill = trans_type), alpha = 1/5) + 
  ___________(aes(y = dir_request, x = trans_type, 
                   color = trans_type)) + 
  lab_violin

6.6.6 solution

Note we set the alpha to 1/5 for the geom_violin(), and the color to trans_type for the geom_boxplot().

TidyApple %>% 
  ggplot() +
  geom_violin(aes(y = dir_request, x = trans_type, 
                  fill = trans_type), alpha = 1/5) + 
  geom_boxplot(aes(y = dir_request, x = trans_type, 
                   color = trans_type)) + 
  lab_violin

8 Adding Text to Graphs

We can see there is a gap in the direct request data (this is documented in the data source).

“Data for May 11-12 is not available and will appear as blank columns in the data set.”

We should communicate this gap with our audience, and we might include a text annotation on the graph so our audience isn’t distracted by the gap.

In the previous lesson, we introduced the ggrepel package to show the points on this graph of top performing pharmaceutical companies.

We’re going to use labels to annotate and highlight US cities between March and June of 2020.

8.1 Annotations

Being able to manually add text and annotations as layers to your graph makes it easier to communicate the nuances of your data to your audience. We are going to start by accounting for the missing data in TidyApple.

8.1.1 exercise

Build a dataset from TidyApple that only has US cities, and ranges from March 1, 2020 to June 30, 2020.

USCitiesMarJun <-  TidyApple %>% 
  filter(geo_type == ___________ &
              country == ___________, 
                date >= as_date(___________) & 
                  date <= as_date(___________)) 

USCitiesMarJun %>% 
                  skimr::skim()

8.1.2 solution

See below:

USCitiesMarJun <-  TidyApple %>% 
  filter(geo_type == "city" &
              country == "United States", 
                date >= as_date("2020-03-01") & 
                  date <= as_date("2020-07-01")) 

USCitiesMarJun %>% skimr::skim()
Data summary
Name Piped data
Number of rows 38622
Number of columns 7
_______________________
Column type frequency:
character 5
Date 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
geo_type 0 1 4 4 0 1 0
region 0 1 5 39 0 111 0
trans_type 0 1 7 7 0 3 0
sub_region 0 1 4 14 0 39 0
country 0 1 13 13 0 1 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2020-03-01 2020-07-01 2020-05-01 123

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
dir_request 628 0.98 82.06 41.11 4.37 50.87 74.38 109.11 525.44 ▇▃▁▁▁

8.1.3 exercise

We’re going to build labels using the paste0() function. Fill in the appropriate dataset for the min() and max() date.

lab_annotate <- labs(x = "Date",
                 y = "Direction Requests",
                 title = "Direction Requests Over Time (US Cities Only)",
                 subtitle = paste0(min(______________$date), 
                                   " through ", 
                                   max(______________$date)),
                 caption = "source: https://covid19.apple.com/mobility", 
                 color = "Transit Type")

8.1.4 solution

See below:

lab_annotate <- labs(x = "Date",
                 y = "Direction Requests",
                 title = "Direction Requests Over Time (US Cities Only)",
                 subtitle = paste0(min(USCitiesMarJun$date), 
                                   " through ", 
                                   max(USCitiesMarJun$date)),
                 caption = "source: https://covid19.apple.com/mobility", 
                 color = "Transit Type")

8.1.5 exercise

The previous code for the graph has been added. We’re going to add the following layers:

Inside coord_cartesian():

  • map min(USCitiesMarJun$date) and max(USCitiesMarJun$date) inside c() to xlim
  • map min(USCitiesMarJun$dir_request, na.rm = TRUE) and max(USCitiesMarJun$dir_request, na.rm = TRUE) inside c() to ylim

Inside the # horizontal annotate():

  • set geom to "segment"
  • map 0.5 to size
  • map "firebrick3" to color
  • map lubridate::as_date("2020-05-10") to x
  • map lubridate::as_date("2020-05-13") to xend
  • map 100 to y and yend

Inside the # big vertical annotate():

  • set geom to "segment"
  • map 1 to size
  • map "firebrick3" to color
  • map lubridate::as_date("2020-05-11") to x
  • map lubridate::as_date("2020-05-11") to xend
  • map 270 to y and 100 to yend

Inside the # text annotate():

  • set geom to "text"
  • map 8 to size
  • map "red" to color
  • map 0.5 to hjust
  • map lubridate::as_date("2020-05-07") to x
  • map 280 to y
  • map "Data not available" to label
USCitiesMarJun %>% 
    ggplot() +
    geom_line(aes(x = date, y = dir_request, 
             group = trans_type, color = trans_type), 
             # make these slightly larger...
             size = 0.30) + 
   # coordinate system
   coord_cartesian(xlim = c(_______________, _______________), 
                   
                   ylim = c(_______________, na.rm = __________), 
                            _______________, na.rm = __________))) +
   # horizontal
   annotate(geom = ___________, size = ___________, color = ___________,
                  x = ___________, 
                  xend = ___________,
                  y = ___________, 
                  yend = ___________) + 
    # big vertical 
    annotate(geom = ___________, 
            size = ___________, 
            color = ___________,
            x = ___________, 
            xend = ___________, 
            y = ___________, yend = ___________) + 
    # text
    annotate(geom = "text",
               size = 8,
               color = "red", 
               hjust = 0.5, 
               x = lubridate::as_date("2020-05-07"), 
               y = 280, 
               label = "Data not available") +

    lab_annotate

8.1.6 solution

# plot
USCitiesMarJun %>% 
    ggplot() +
    geom_line(aes(x = date, y = dir_request, 
             group = trans_type, color = trans_type), 
             # make these slightly larger...
             size = 0.30) + 
   # coordinate system
   coord_cartesian(xlim = c(min(USCitiesMarJun$date), 
                            max(USCitiesMarJun$date)), 
                   
                   ylim = c(min(USCitiesMarJun$dir_request, na.rm = TRUE), 
                            max(USCitiesMarJun$dir_request, na.rm = TRUE))) +
   # horizontal
   annotate(geom = "segment", 
            size = 0.5, 
            color = "firebrick3",
                  x = lubridate::as_date("2020-05-10"), 
                  xend = lubridate::as_date("2020-05-13"),
                             y = 100, 
                             yend = 100) + 
    # big vertical 
    annotate(geom = "segment", 
            size = 1, 
            color = "firebrick3",
            x = lubridate::as_date("2020-05-11"), 
            xend = lubridate::as_date("2020-05-11"), 
            y = 270, yend = 100) + 
  
    # text
    annotate(geom = "text",
               color = "red", 
               hjust = 0.5, 
               size = 8,
               x = lubridate::as_date("2020-05-07"), 
               y = 280, 
               label = "Data not available") +

    lab_annotate

8.1.7 exercise

Add a second and third vertical segment to create a fence or bracket for the dates with missing data.

# plot
USCitiesMarJun %>% 
    ggplot() +
    geom_line(aes(x = date, y = dir_request, 
             group = trans_type, color = trans_type), 
             # make these slightly larger...
             size = 0.30) + 
   # coordinate system
   coord_cartesian(xlim = c(min(USCitiesMarJun$date), 
                            max(USCitiesMarJun$date)), 
                   
                   ylim = c(min(USCitiesMarJun$dir_request, na.rm = TRUE), 
                            max(USCitiesMarJun$dir_request, na.rm = TRUE))) +
   # horizontal
   annotate(geom = "segment", 
            size = 0.5, 
            color = "firebrick3",
                  x = lubridate::as_date("2020-05-10"), 
                  xend = lubridate::as_date("2020-05-13"),
                             y = 100, 
                             yend = 100) +
    # big vertical 
    annotate(geom = "segment", 
            size = 1, 
            color = "firebrick3",
            x = lubridate::as_date("2020-05-11"), 
            xend = lubridate::as_date("2020-05-11"), 
            y = 270, yend = 100) + 
    # text
    annotate(geom = "text",
               color = "red", 
               hjust = 0.5, 
               size = 8,
               x = lubridate::as_date("2020-05-07"), 
               y = 280, 
               label = "Data not available") +
   # second vertical 
   annotate(geom = "segment", 
            size = 0.5, 
            color = "firebrick3",
            x = _____________________________, 
            xend = _____________________________, 
            y = _____________________________, 
            yend = _____________________________) + 
    # third vertical 
     annotate(geom = "segment", 
            size = 0.5, 
            color = "firebrick3",
            x = _____________________________, 
            xend = _____________________________, 
            y = _____________________________, 
            yend = _____________________________) + 
  
   lab_annotate 

8.1.8 solution

See below:

# plot
USCitiesMarJun %>% 
    ggplot() +
    geom_line(aes(x = date, y = dir_request, 
             group = trans_type, color = trans_type), 
             # make these slightly larger...
             size = 0.30) + 
   # coordinate system
   coord_cartesian(xlim = c(min(USCitiesMarJun$date), 
                            max(USCitiesMarJun$date)), 
                   
                   ylim = c(min(USCitiesMarJun$dir_request, na.rm = TRUE), 
                            max(USCitiesMarJun$dir_request, na.rm = TRUE))) +
   # horizontal
   annotate(geom = "segment", 
            size = 0.5, 
            color = "firebrick3",
                  x = lubridate::as_date("2020-05-10"), 
                  xend = lubridate::as_date("2020-05-13"),
                             y = 100, 
                             yend = 100) +
    # big vertical 
    annotate(geom = "segment", 
            size = 1, 
            color = "firebrick3",
            x = lubridate::as_date("2020-05-11"), 
            xend = lubridate::as_date("2020-05-11"), 
            y = 270, yend = 100) + 
    # text
    annotate(geom = "text",
               color = "red", 
               hjust = 0.5, 
               size = 8,
               x = lubridate::as_date("2020-05-07"), 
               y = 280, 
               label = "Data not available") +
   # second vertical 
   annotate(geom = "segment", 
            size = 0.5, 
            color = "firebrick3",
            x = lubridate::as_date("2020-05-10"), 
            xend = lubridate::as_date("2020-05-10"), 
            y = 100, yend = 90) + 
    # third vertical 
     annotate(geom = "segment", 
            size = 0.5, 
            color = "firebrick3",
            x = lubridate::as_date("2020-05-13"), 
            xend = lubridate::as_date("2020-05-13"), 
            y = 100, yend = 90) + 
  
   lab_annotate 

8.2 Plotting area

Another option is to use geom_rect() to black out the missing data.

# plot
USCitiesMarJun %>% 
    ggplot() +
    geom_line(aes(x = date, y = dir_request, 
             group = trans_type, color = trans_type), 
             # make these slightly larger...
             size = 0.30) + 
   # coordinate system
   coord_cartesian(xlim = c(min(USCitiesMarJun$date), 
                            max(USCitiesMarJun$date)), 
                   
                   ylim = c(min(USCitiesMarJun$dir_request, na.rm = TRUE), 
                            max(USCitiesMarJun$dir_request, na.rm = TRUE))) +
  geom_rect(xmin = lubridate::as_date("2020-05-10"), 
            xmax = lubridate::as_date("2020-05-12"), 
            ymin = -Inf, 
            ymax = Inf, 
            color = NA) + 
  geom_text(x = as.Date("2020-05-11"), 
            y = 100, label = "Data Not Available", 
            angle = 90, color = "white") +
  lab_annotate

8.3 Labeling Values

Max Driving Requests

The code below creates a subset of the data (TopUSCities). We will use this to add the labels.

TopUSCities <- TidyApple %>% 
  filter(country == "United States" & 
           region %in% c("New York City", "Los Angeles", 
                         "Chicago", "Houston", "Phoenix")) 
TopUSCities

8.3.1 exercise

Create MaxUSCitiesDriving by filtering trans_type, grouping on the region variable, and using dplyr::slice_max() to get the top value in dir_request.

TopUSCities %>% 
  filter(trans_type == __________) %>% 
  group_by(__________) %>% 
  slice_max(dir_request) %>% 
  ungroup() -> MaxUSCitiesDriving
MaxUSCitiesDriving

8.3.2 solution

See below:

TopUSCities %>% 
  filter(trans_type == "driving") %>% 
  group_by(region) %>% 
  slice_max(dir_request) %>% 
  ungroup() -> MaxUSCitiesDriving
MaxUSCitiesDriving

8.3.3 exercise

Create graph labels:

  • assign "Peak Driving Direction Requests in Largest US Cities" to title

  • assign "Max Driving Direction Requests & Date" to subtitle

lab_line_max_drivers <- labs(
                 x = "Date",
                 y = "Direction Requests",
                 title = "_________________________________",
                 subtitle = paste0(min(___________$date), 
                                   " through ", 
                                   max(___________$date)),
                 caption = "source: https://covid19.apple.com/mobility", 
                 color = "Transit Type")

8.3.4 solution

See below:

lab_line_max_drivers <- labs(
                 x = "Date",
                 y = "Direction Requests",
                 title = "Peak Driving Direction Requests in Largest US Cities",
                 subtitle = paste0(min(TopUSCities$date), 
                                   " through ", 
                                   max(TopUSCities$date)),
                 caption = "source: https://covid19.apple.com/mobility", 
                 color = "Transit Type")

8.3.5 exercise

Create max_driving_labels using paste0() with region and date.

MaxUSCitiesDriving %>% 
  mutate(
    max_driving_labels = paste0(______, ", ", ______)) -> MaxUSCitiesDriving
MaxUSCitiesDriving %>% 
  select(max_driving_labels)

8.3.6 solution

See below:

MaxUSCitiesDriving %>% 
  mutate(max_driving_labels = paste0(region, ", ", date)) -> MaxUSCitiesDriving
MaxUSCitiesDriving %>% 
  select(max_driving_labels)

8.3.7 exercise

Create a line plot, assigning the following values in geom_label_repel():

  • set the data argument to MaxUSCitiesDriving

Inside the aes():

  • map label to max_driving_labels

Outside the aes()

  • map color to "red"
  • map size to 3
TopUSCities %>% 
    ggplot() + 
    geom_line(aes(x = date, y = dir_request, 
                  group = trans_type, 
                  color = trans_type), 
                  # make these slightly smaller again...
                  size = 0.15) + 
    geom_label_repel(data = _____________,
              aes(x = date, y = dir_request, 
                  label = _____________), 
                  # set color and size...
                  color = _____, 
                  size = _) + 
  lab_line_max_drivers

8.3.8 solution

See below:

TopUSCities %>% 
    ggplot() + 
    geom_line(aes(x = date, y = dir_request, 
                  group = trans_type, 
                  color = trans_type), 
                  # make these slightly smaller again...
                  size = 0.15) + 
    geom_label_repel(data = MaxUSCitiesDriving,
              aes(x = date, y = dir_request, 
                  label = max_driving_labels), 
                  # set color and size...
                  color = "red", 
                  size = 3) + 
  lab_line_max_drivers

8.4 Labeling Values 2

Min Walking Requests

We are going to repeat the process above, but use the minimum value for walking direction requests.

8.4.1 exercise

  • filter() the trans_type to "walking"
  • group_by() the region
  • use slice_min() to get the minimum value for dir_request
  • Assign to MinUSCitiesWalking
TopUSCities %>% 
  filter(________ == ________) %>% 
  group_by(________) %>% 
  slice_min(dir_request) %>% 
  ungroup() -> MinUSCitiesWalking
MinUSCitiesWalking

8.4.2 solution

See below:

TopUSCities %>% 
  filter(trans_type == "walking") %>% 
  group_by(region) %>% 
  slice_min(dir_request) %>% 
  ungroup() -> MinUSCitiesWalking
MinUSCitiesWalking

8.4.3 exercise

  • assign "Lowest Walking Direction Requests in Largest US Cities" to title
lab_line_min_walking <- labs(
               x = "Date",
               y = "Direction Requests",
               title = "__________________________________________",
               subtitle = paste0(min(___________$date), 
                                   " through ", 
                                 max(___________$date)),
               caption = "source: https://covid19.apple.com/mobility", 
               color = "Transit Type")

8.4.4 solution

See below:

lab_line_min_walking <- labs(
               x = "Date",
               y = "Direction Requests",
               title = "Lowest Walking Direction Requests in Largest US Cities",
               subtitle = paste0(min(TopUSCities$date), 
                                   " through ", 
                                 max(TopUSCities$date)),
               caption = "source: https://covid19.apple.com/mobility", 
               color = "Transit Type")

8.4.5 exercise

Create min_walking_labels using paste0() with region and date

MinUSCitiesWalking %>% 
  mutate(min_walking_labels = paste0(_____, ", ", _____)) -> MinUSCitiesWalking
MinUSCitiesWalking

8.4.6 solution

See below:

MinUSCitiesWalking %>% 
  mutate(min_walking_labels = paste0(region, ", ", date)) -> MinUSCitiesWalking
MinUSCitiesWalking

8.4.7 exercise

Create a line plot, assigning the following values in geom_label_repel():

  • set data to MinUSCitiesWalking

Inside aes():

  • map min_walking_labels to label

Outside aes():

  • map "blue" to color
  • map 3 to size
TopUSCities %>% 
    ggplot() + 
    geom_line(aes(x = date, y = dir_request, 
                  group = trans_type, 
                  color = trans_type), 
                  # make these slightly smaller again...
                  size = 0.15) + 
    geom_label_repel(data = ____________,
              aes(x = date, y = dir_request, 
                  label = ____________), 
                  # set color and size...
                  color = _______, 
                  size = _) + 
  lab_line_min_walking

8.4.8 solution

See below:

TopUSCities %>% 
    ggplot() + 
    geom_line(aes(x = date, y = dir_request, 
                  group = trans_type, 
                  color = trans_type), 
                  # make these slightly smaller again...
                  size = 0.15) + 
    geom_label_repel(data = MinUSCitiesWalking,
              aes(x = date, y = dir_request, 
                  label = min_walking_labels), 
                  # set color and size...
                  color = "blue", 
                  size = 3) + 
  lab_line_min_walking

9 Reference lines

We’re going to focus on the top 8 cites according to their population (the date of this writing is 2020-11-26).

9.1 Top Cities

Top Cities

We’re going to introduce another filtering method in this section to create the TopUSCities dataset.

9.1.1 exercise

Store top eight cities in the focus_on vector and use it to filter the TidyApple dataset.

focus_on <- c("New York City", "Los Angeles", 
                       "Chicago", "Houston", 
                       "Phoenix", "Philadelphia", 
                       "San Antonio", "San Diego")

TopUSCities <- TidyApple %>% 
  filter(region %in% _____________)

TopUSCities %>% glimpse(60)

9.1.2 solution

See below:

focus_on <- c("New York City", "Los Angeles", 
              "Chicago", "Houston", 
              "Phoenix", "Philadelphia", 
              "San Antonio", "San Diego")

TopUSCities <- TidyApple %>% 
  filter(region %in% focus_on)

TopUSCities %>% glimpse(60)
## Rows: 7,608
## Columns: 7
## $ geo_type    <chr> "city", "city", "city", "city", "city…
## $ region      <chr> "Chicago", "Chicago", "Chicago", "Chi…
## $ trans_type  <chr> "driving", "driving", "driving", "dri…
## $ sub_region  <chr> "Illinois", "Illinois", "Illinois", "…
## $ country     <chr> "United States", "United States", "Un…
## $ date        <date> 2020-01-13, 2020-01-14, 2020-01-15, …
## $ dir_request <dbl> 100.00, 103.68, 104.45, 108.72, 132.8…

9.1.3 exercise

Graph Labels

We’re going to place date on the x axis, and dir_request on the y. The tite will reflect a general description of what we’re expecting to see, and we’ll list the cities in the subtitle. color will be used to give a better description than trans_type.

Fill in names for the x, y, and color.

lab_top_cities <- labs(x = _____, y  = __________, 
     title = "Trends of Relative Activity in Selected US Cities",
     subtitle = "NY, LA, CH, HOU, PHA, PHL, SATX, SD", 
     color = _________)

9.1.4 solution

See below:

lab_top_cities <- labs(x = "Date", y  = "Direction Requests", 
     title = "Trends of Relative Activity in Selected US Cities",
     subtitle = "NY, LA, CH, HOU, PHA, PHL, SATX, SD", 
     color = "Type")

9.2 Set Global aes()

We’re going to set the global graph aesthetics inside ggplot(aes()) using our labels as a guide. This will serve as a base layer for us to add our reference line to!

9.2.1 exercise

Global aes()

  • Map trans_type to group and color

  • Also add a geom_line() layer with the size set to 0.1 (not inside the aes()!)

TopUSCities %>% 
  ggplot(aes(x = date, y = dir_request, 
             _____ = __________, 
             _____ = __________)) + 
  ____________(_____ = ___) + 
  lab_top_cities

9.2.2 solution

See below:

TopUSCities %>% 
  ggplot(aes(x = date, y = dir_request, 
             group = trans_type, 
             color = trans_type)) + 
  geom_line(size = 0.1) + 
  lab_top_cities

9.3 Reference Line Layer

The documentation for the data tells us each dir_request has a baseline value of 100. We’re going to add this as a reference line on the graph using ggplot2::geom_hline().

9.3.1 exercise

The geom_hline() function takes yintercept, size, and color arguments.

  • use our baseline value of 100 as the yintercept
  • set the size to 0.2
  • make the color of this line "gray20"
TopUSCities %>% 
  ggplot(aes(x = date, y = dir_request, 
             group = trans_type, 
             color = trans_type)) +
  geom_line(size = 0.1) + 
  # add reference line
  geom_hline(yintercept = ___, size = ___, color = _______) +
  lab_top_cities

9.3.2 solution

See below:

TopUSCities %>% 
  ggplot(aes(x = date, y = dir_request, 
             group = trans_type, 
             color = trans_type)) + 
  geom_line(size = 0.1) + 
  geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
  lab_top_cities

Reference lines are helpful when we want to examine trends in relation to a particular value.

10 Advanced Facets

In the previous lesson, we introduced the facet_wrap() function for viewing the relationship between two variables across the levels of a categorical variable. In the next section, we’re going to show how faceting can be used to explore ‘small multiples’ in a dataset with variation across multiple levels.

10.1 facet_wrap()

Now that we have a graph we can use to compare the 8 cities, we will use facet_wrap to create a subplot for each level of region.

10.1.1 exercise

Fill in the facet_wrap() (note the use of the ~) function with region and set the ncol to 2.

TopUSCities %>% 
  ggplot(aes(x = date, y = dir_request, 
             group = trans_type, 
             color = trans_type)) + 
  geom_line(size = 0.1) + 
  geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
  facet_wrap(~ _______,  ncol = _) + 
  lab_top_cities

10.1.2 solution

See below:

TopUSCities %>% 
  ggplot(aes(x = date, y = dir_request, 
             group = trans_type, 
             color = trans_type)) + 
  geom_line(size = 0.1) + 
  geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
  facet_wrap(~ region,  ncol = 2) + 
  lab_top_cities

10.1.3 exercise

Now map both region and trans_type to facet_wrap() and set the ncol to 6.

TopUSCities %>% 
  ggplot(aes(x = date, y = dir_request, 
             group = region, 
             color = trans_type)) + 
  geom_line(size = 0.1) + 
  geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
  facet_wrap(_______ ~ _______, ncol = _) + 
  lab_top_cities

10.1.4 solution

See below:

TopUSCities %>% 
  ggplot(aes(x = date, y = dir_request, 
             group = region, 
             color = trans_type)) + 
  geom_line(size = 0.1) + 
  geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
  facet_wrap(region ~ trans_type, ncol = 6) + 
  lab_top_cities

10.2 Complete Themes

ggplot2 comes with a few default theme settings. Read more about them here. These themes come with arguments to adjust the font_size and font_family on our plot.

theme_minimal(
  base_size = 11,
  base_family = "",
  base_line_size = base_size/22,
  base_rect_size = base_size/22
)

10.2.1 exercise

We’ll use the theme_minimal() function to reduce the number of elements on our graph (don’t add any arguments).

TopUSCities %>% 
  ggplot(aes(x = date, y = dir_request, 
             group = region, 
             color = trans_type)) + 
  geom_line(size = 0.1) + 
  geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
  facet_wrap(region ~ trans_type, ncol = 6) + 
  ______________() +
  lab_top_cities 

10.2.2 solution

See below:

TopUSCities %>% 
  ggplot(aes(x = date, y = dir_request, 
             group = region, 
             color = trans_type)) + 
  geom_line(size = 0.1) + 
  geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
  facet_wrap(region ~ trans_type, ncol = 6) + 
  theme_minimal() +
  lab_top_cities 

10.3 Theme elements

Themes (accessible with the theme function) give up the ability to customize various element_() settings, which affect the ‘look’ of our graph.

10.3.1 exercise

We will start by moving the legend.position to "top" so it’s not crowding the x axis on our graph.

TopUSCities %>% 
  ggplot(aes(x = date, y = dir_request, 
             group = region, 
             color = trans_type)) + 
  geom_line(size = 0.1) + 
  geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
  facet_wrap(region ~ trans_type, ncol = 6) + 
  theme_minimal() + 
  theme(___________ = _____) +
  lab_top_cities 

10.3.2 solution

See below:

TopUSCities %>% 
  ggplot(aes(x = date, y = dir_request, 
             group = region, 
             color = trans_type)) + 
  geom_line(size = 0.1) + 
  geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
  facet_wrap(region ~ trans_type, ncol = 6) + 
  theme_minimal() + 
  theme(legend.position = "top") +
  lab_top_cities 

Notice we pass the theme() layer after the theme_minimal() layer (if we ordered these the other way around, theme_minimal() would overwrite any custom settings we made in theme()).

10.4 Small Multiples

What if we wanted to facet more than 8 cities? Fortunately, we have a the ggforce and geofacet packages for doing just that!

library(sf)
library(geofacet)
library(ggforce)
library(jcolors)

Building the Graph Data

  1. We will start by filtering the TidyApple to only the 50 US states (we’ve removed three US territories) and storing these data in TidyAppleUS.

  2. Next we limit the date range to the beginning of the shelter in place (from February 1, 2020 to May 1, 2020). These data get stored in TidyAppleUST1.

  3. We then create a dataset with only "transit" direction requests, and we count these by state (sub_region), arrange the data descending with sort = TRUE, and take the top 25 rows (Top25TransitStates).

Finally, we filter TidyAppleUST1 using the 25 states in Top25TransitStates to create our graphing dataset, Top25TransitUSAllT1.

# create only US states (TidyAppleUS)
TidyApple %>% 
  filter(country == "United States" & 
         !sub_region %in% c("Guam", "Puerto Rico", 
                            "Virgin Islands")) -> TidyAppleUS
# create shelter in place time 1 (TidyAppleUST1)
TidyAppleUS %>% 
  filter(date >= as_date("2020-02-01") &
           date <= as_date("2020-05-01")) -> TidyAppleUST1

# create top 25 states (Top25TransitStates)
Top25TransitStates <- TidyAppleUST1 %>% 
  filter(trans_type == "transit") %>% 
  count(sub_region, trans_type, sort = TRUE) %>% 
  head(25)

# filter T1 to states with the most transit requests (Top25TransitUSAllT1)
TidyAppleUST1 %>% 
  filter(sub_region %in% 
           unique(Top25TransitStates$sub_region)) -> Top25TransitUSAllT1

Top25TransitUSAllT1 %>% skimr::skim()
Data summary
Name Piped data
Number of rows 188643
Number of columns 7
_______________________
Column type frequency:
character 5
Date 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
geo_type 0 1 4 6 0 2 0
region 0 1 5 39 0 1058 0
trans_type 0 1 7 7 0 3 0
sub_region 0 1 4 14 0 25 0
country 0 1 13 13 0 1 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2020-02-01 2020-05-01 2020-03-17 91

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
dir_request 0 1 92.68 36.02 0.44 68.64 93.6 113.14 1379.02 ▇▁▁▁▁

10.4.1 exercise

  • set title to "States With Highest Transit Direction Requests"

  • set subtitle to "Top 25 states based on number of total transit requests"

lab_facet_wrap_paginate <- labs(
       x = "Date", y = "Direction Requests", 
       title = _____________________________________,
       subtitle = _____________________________________)

10.4.2 solution

lab_facet_wrap_paginate <- labs(
       x = "Date", y = "Direction Requests", 
       title = "States With Highest Transit Direction Requests",
       subtitle = "Top 25 states based on number of total transit requests")

10.4.3 exercise

Inside ggforce::facet_wrap_paginate():

  • map sub_region as the variable to facet using the ~

  • map 5 to ncol

  • map "free_y" to scales

Inside theme()

  • map element_blank() to panel.border and panel.background

  • map element_text(size = 6) to axis.text.x and axis.text.y

  • map element_text(colour = 'black') to strip.text

  • map element_rect(fill = "gray93") to strip.background

  • map "top" to legend.position

Top25TransitUSAllT1 %>% 
  # global settings
  ggplot(aes(x = date, y = dir_request, 
             group = trans_type, 
             color = trans_type)) + 
  # lines
  geom_hline(yintercept = 100, size = 0.3, color = "black") + 
  geom_line(size = 0.2) +
  # faceting
  ggforce::facet_wrap_paginate(~ __________, 
                               ncol = _, 
                               scales = _______) + 
  # theme settings
  theme(__________ = __________(), 
        __________ = __________(),
        
        __________ = __________(size = _), 
        __________ = __________(size = _),
        
        __________ = __________(colour = __________),
        
        __________ = __________(fill = __________), 
        
        __________ = __________) + 
  # labels
  lab_facet_wrap_paginate

10.4.4 solution

See below:

Top25TransitUSAllT1 %>% 
  # global settings
  ggplot(aes(x = date, y = dir_request, 
             group = trans_type, 
             color = trans_type)) + 
  # lines
  geom_hline(yintercept = 100, size = 0.3, color = "black") + 
  geom_line(size = 0.2) +
  # faceting
  ggforce::facet_wrap_paginate(~ sub_region, 
                               ncol = 5, 
                               scales = "free_y") + 
  # theme settings
  theme(panel.border = element_blank(), 
        panel.background = element_blank(),
        axis.text.x = element_text(size = 6), 
        axis.text.y = element_text(size = 6),
        strip.text = element_text(colour = 'black'),
        strip.background = element_rect(fill = "gray93"), 
        legend.position = "top") + 
  # labels
  lab_facet_wrap_paginate

10.5 Adjusting Colors

Changing the colors on graphs gives us the ability to further customize their look. We can set these manually, or use one of the many complete color palettes from a user-written package. Below we’ll use the jcolors package to highlight the transit direction requests from the previous graph.

10.5.1 exercise

  • add scale_color_jcolors() and play with the palette argument to make the graph look like the solution.
Top25TransitUSAllT1 %>% 
  # global settings
  ggplot(aes(x = date, y = dir_request, 
             group = trans_type, 
             color = trans_type)) + 
  # lines
  geom_hline(yintercept = 100, size = 0.3, color = "black") + 
  geom_line(size = 0.2) +
  # faceting
  ggforce::facet_wrap_paginate(~ sub_region, 
                               ncol = 5, 
                               scales = "free_y") + 
  # theme settings
  theme(panel.border = element_blank(), 
        panel.background = element_blank(),
        axis.text.x = element_text(size = 6), 
        axis.text.y = element_text(size = 6),
        strip.text = element_text(colour = 'black'),
        strip.background = element_rect(fill = "gray93"), 
        legend.position = "top") + 
  # adjust colors
  __________________________(palette = ____) + 
  
  lab_facet_wrap_paginate

10.5.2 solution

See below:

Top25TransitUSAllT1 %>% 
  # global settings
  ggplot(aes(x = date, y = dir_request, 
             group = trans_type, 
             color = trans_type)) + 
  # lines
  geom_hline(yintercept = 100, size = 0.3, color = "black") + 
  geom_line(size = 0.2) +
  # faceting
  ggforce::facet_wrap_paginate(~ sub_region, 
                               ncol = 5, 
                               scales = "free_y") + 
  # theme settings
  theme(panel.border = element_blank(), 
        panel.background = element_blank(),
        axis.text.x = element_text(size = 6), 
        axis.text.y = element_text(size = 6),
        strip.text = element_text(colour = 'black'),
        strip.background = element_rect(fill = "gray93"), 
        legend.position = "top") + 
  
  scale_color_jcolors(palette = "pal3") + 
  
  lab_facet_wrap_paginate

10.6 Extra: geofacet

We’re now going to look at all 50 states using the facet_geo() function from the geofacet package. To make this graph easier to interpret, we’re going to focus only on walking mobility data, and adjust the dir_request value to absolute change from baseline (set to 100 on 2020-01-13).

10.6.1 Adjusted Walking Data

Create the USWalkingAdj data by filtering the trans_type to "walking" and creating two new variables: above_below (a logical indicator for values being above or below the baseline value of 100), and dir_request_adj (the adjusted direction request value).

10.6.2 exercise

Fill in the correct variables in the wrangling steps below:

USWalkingAdj <- TidyAppleUS %>% 
  filter(trans_type == _________) %>%
  mutate(above_below = _________ < 100, 
         dir_request_adj = _________ - 100)

USWalkingAdj %>% 
  skimr::skim()

10.6.3 solution

See below:

USWalkingAdj <- TidyAppleUS %>% 
  filter(trans_type == "walking") %>%
  mutate(above_below = dir_request < 100, 
         dir_request_adj = dir_request - 100)

USWalkingAdj %>% 
  skimr::skim()
Data summary
Name Piped data
Number of rows 160085
Number of columns 9
_______________________
Column type frequency:
character 5
Date 1
logical 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
geo_type 0 1 4 6 0 2 0
region 0 1 5 39 0 452 0
trans_type 0 1 7 7 0 1 0
sub_region 0 1 4 14 0 48 0
country 0 1 13 13 0 1 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2020-01-13 2020-11-24 2020-06-19 317

Variable type: logical

skim_variable n_missing complete_rate mean count
above_below 220 1 0.32 FAL: 108071, TRU: 51794

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
dir_request 220 1 129.36 63.68 0.44 89.44 122.7 163.82 1379.02 ▇▁▁▁▁
dir_request_adj 220 1 29.36 63.68 -99.56 -10.56 22.7 63.82 1279.02 ▇▁▁▁▁

10.6.4 exercise

Assign the following to the labels:

  • set "US Trends in Mobility Data for Walking (Adjusted)" to title

  • set "https://covid19.apple.com/mobility" to caption

lab_facet_geo <- labs(x = "Date", 
                      y = "Direction Requests (Walking)", 
       title = ____________________________________________, 
       subtitle = paste0("Original data uses 100 as baseline for usage at ", 
                         min(USWalkingAdj$date)), 
       caption = ____________________________________________, 
       fill = "Below Baseline")

10.6.5 solution

See below:

lab_facet_geo <- labs(x = "Date", 
                      y = "Direction Requests (Walking)", 
       title = "US Trends in Mobility Data for Walking (Adjusted)", 
       subtitle = paste0("Original data uses 100 as baseline for usage at ", 
                         min(USWalkingAdj$date)), 
       caption = "https://covid19.apple.com/mobility",
       fill = "Below Baseline")

10.6.6 exercise

  • set the colors in color_bl_or as c("#8470FF", "#7FFFD4")

  • set yintercept to 0 in geom_hline()

  • set the values in scale_fill_manual() to color_bl_or

  • map sub_region to facet_geo using ~

Inside theme()

  • set the panel.border and panel.background to element_blank()
  • set the axis.text.x and axis.text.y to element_text(size = 6)
  • set the strip.text.x to element_text(size = 7)
  • set strip.text to element_text(colour = 'white')
  • set strip.background to element_rect(fill = "black")
  • set legend.position to "bottom"
# set colors
color_bl_or <- c(____________, ____________)

USWalkingAdj %>% 
  ggplot(aes(x = date, y = dir_request_adj, 
             group = sub_region, fill = above_below)) + 
  
  geom_col() + 
  
  geom_hline(yintercept = _, color = "gray7") + 
  
  scale_fill_manual(values = ____________) +
  
  facet_geo(~ sub_region) +
  
  theme_bw() + 
  
  theme(______________ = ______________(), 
        ______________ = ______________(),
        
        ______________ = ______________(size = _), 
        ______________ = ______________(size = _),
        ______________ = ______________(size = _),
        
        ______________ = ______________(colour = ______________),
        ______________ = ______________(fill = ______________), 
        
        ______________ = ______________) + 
  
  lab_facet_geo

10.6.7 solution

See below:

# set colors
color_bl_or <- c("#8470FF", "#7FFFD4")

USWalkingAdj %>% 
  ggplot(aes(x = date, y = dir_request_adj, 
             group = sub_region, fill = above_below)) + 
  
  geom_col() + 
  
  geom_hline(yintercept = 0, 
             color = "gray7") + 
  
  scale_fill_manual(values = color_bl_or) +
  
  facet_geo(~ sub_region) +
  
  theme_bw() + 
  
  theme(panel.border = element_blank(), 
        panel.background = element_blank(),
        axis.text.x = element_text(size = 6), 
        axis.text.y = element_text(size = 6),
        strip.text.x = element_text(size = 7),
        strip.text = element_text(colour = 'white'),
        strip.background = element_rect(fill = "black"), 
        legend.position = "bottom") + 
  
  lab_facet_geo

11 Wrap Up

Original Question: How has COVID changed our modes of transportation?

  1. Which graphs do you feel are best at answering this question? Why?

  2. What other information (tables, annotations, etc.) would you include with the graphs?