View the slides for this section here.
TidyApple dataWe’re going to import the data from the previous exercises.
TidyApple <- vroom::vroom(file = "https://bit.ly/3IiwcJs", delim = ",")
head(TidyApple)
We’ve seen the distribution of dir_request, and how it varies across trans_type.
Now we’ll look at how the relationship between these two variables varies over in a subset of US cities and across a specific date range.
We’ll start by narrowing down the data by filtering to only US cities.
Filter the geo_type to "city" and the country to "United States", and pass the date variable to skimr::skim()
TidyApple %>%
filter(geo_type == ___________ &
country == ___________) %>%
# use skimr to check date
skimr::skim(___________)
Here we reduce the dataset to only cities in the US, and we check the date range with skimr::skim(). If this looks OK, we assign to USCities
TidyApple %>%
filter(geo_type == "city" &
country == "United States") %>%
# use skimr to check date
skimr::skim(date)
| Name | Piped data |
| Number of rows | 99538 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2020-01-13 | 2020-11-24 | 2020-06-19 | 317 |
TidyApple %>%
filter(geo_type == "city" &
country == "United States") -> USCities
We can see this date range is from 2020-01-13 to 2020-11-24.
Use paste0() to combine the first and last date in the filtered USCities dataset.
paste0(min(______________$____),
" through ",
max(______________$____))
paste0(min(USCities$date),
" through ",
max(USCities$date))
## [1] "2020-01-13 through 2020-11-24"
We want to specify this in our labels object (lab_line_update), so we will use the paste0() function to have the labels update every time the data changes.
labs(x = "Date",
y = "Direction Requests",
title = "Direction Requests Over Time (US Cities Only)",
subtitle = paste0(min(USCities$date),
" through ",
max(USCities$date)),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type") -> lab_line_update
We’re going to create a line graph of direction requests over time, colored by color.
Pass the filtered data to the geom_line(), mapping the following variables to their relative aesthetics:
date to xdir_request to ytrans_type to both group and colorInclude lab_line_update to see how the new labels look!
USCities %>%
ggplot() +
geom_line(aes(x = __________, y = __________,
group = __________, color = __________)) +
__________
The dates updated to the min and max date in USCities.
USCities %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type)) +
lab_line_update
These lines in this graph are overlapping each other, so we will adjust the size to 0.20.
Change the size of the geom_line() (outside of aes()).
USCities %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
size = ____________) +
lab_line01
Now the trends are easier to see.
USCities %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
size = 0.20) +
lab_line_update
We are going to only look at the trends between February and August of 2020, but we’re going to use an alternative method to filter the data and create the labels.
We will create two new objects (start_date and end_date), which we can use to narrow the dates using the filter() function (and anywhere else we need to use this date range).
This method is better than passing the dates as a character (i.e. in quotes), because we would only have to change it in one place. However, the option above makes better use of R functional programming syntax.
Pass the start_date and end_date to the as_date() functions, and take a look at the date variable with skimr::skim() and if it looks correct, assign it to USCitiesFebJul
# create date objects
start_date <- "2020-02-01"
end_date <- "2020-08-01"
# check with skimr
TidyApple %>%
filter(geo_type == "city" &
country == "United States",
date >= as_date(_____________) &
date <= as_date(_____________)) %>%
skimr::skim(_____________)
# create date objects
start_date <- "2020-02-01"
end_date <- "2020-08-01"
# check with skimr
TidyApple %>%
filter(geo_type == "city" &
country == "United States",
date >= as_date(start_date) &
date <= as_date(end_date)) -> USCitiesFebJul
USCitiesFebJul %>%
skimr::skim(date)
| Name | Piped data |
| Number of rows | 57462 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2020-02-01 | 2020-08-01 | 2020-05-02 | 183 |
Create the new labels (lab_line_paste) with the paste0() function by passing both start_date and end_date.
lab_line_paste <- labs(x = "Date",
y = "Direction Requests",
title = "Direction Requests Over Time (US Cities Only)",
subtitle = paste0(___________, " through ", ___________),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
lab_line_paste <- labs(x = "Date",
y = "Direction Requests",
title = "Direction Requests Over Time (US Cities Only)",
subtitle = paste0(start_date, " through ", end_date),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
USCitiesFebJul %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
# make these slightly larger...
size = 0.30) +
lab_line_paste
We can see there is a gap in the direct request data (this is documented in the data source).
“Data for May 11-12 is not available and will appear as blank columns in the data set.”
We should communicate this gap with our audience, and we might include a text annotation on the graph so our audience isn’t distracted by the gap.
In the previous lesson, we introduced the ggrepel package to show the points on this graph of top performing pharmaceutical companies.
We’re going to use labels to annotate and highlight US cities between March and June of 2020.
Being able to manually add text and annotations as layers to your graph makes it easier to communicate the nuances of your data to your audience. We are going to start by accounting for the missing data in TidyApple.
Build a dataset from TidyApple that only has US cities, and ranges from March 1, 2020 to June 30, 2020.
USCitiesMarJun <- TidyApple %>%
filter(geo_type == ___________ &
country == ___________,
date >= as_date(___________) &
date <= as_date(___________))
USCitiesMarJun %>%
skimr::skim()
USCitiesMarJun <- TidyApple %>%
filter(geo_type == "city" &
country == "United States",
date >= as_date("2020-03-01") &
date <= as_date("2020-07-01"))
USCitiesMarJun %>% skimr::skim()
| Name | Piped data |
| Number of rows | 38622 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| Date | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| geo_type | 0 | 1 | 4 | 4 | 0 | 1 | 0 |
| region | 0 | 1 | 5 | 39 | 0 | 111 | 0 |
| trans_type | 0 | 1 | 7 | 7 | 0 | 3 | 0 |
| sub_region | 0 | 1 | 4 | 14 | 0 | 39 | 0 |
| country | 0 | 1 | 13 | 13 | 0 | 1 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2020-03-01 | 2020-07-01 | 2020-05-01 | 123 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| dir_request | 628 | 0.98 | 82.06 | 41.11 | 4.37 | 50.87 | 74.38 | 109.11 | 525.44 | ▇▃▁▁▁ |
We’re going to build labels using the paste0() function. Fill in the appropriate dataset for the min() and max() date.
lab_annotate <- labs(x = "Date",
y = "Direction Requests",
title = "Direction Requests Over Time (US Cities Only)",
subtitle = paste0(min(______________$date),
" through ",
max(______________$date)),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
lab_annotate <- labs(x = "Date",
y = "Direction Requests",
title = "Direction Requests Over Time (US Cities Only)",
subtitle = paste0(min(USCitiesMarJun$date),
" through ",
max(USCitiesMarJun$date)),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
The previous code for the graph has been added. We’re going to add the following layers:
Inside coord_cartesian():
min(USCitiesMarJun$date) and max(USCitiesMarJun$date) inside c() to xlimmin(USCitiesMarJun$dir_request, na.rm = TRUE) and max(USCitiesMarJun$dir_request, na.rm = TRUE) inside c() to ylimInside the # horizontal annotate():
geom to "segment"0.5 to size"firebrick3" to colorlubridate::as_date("2020-05-10") to xlubridate::as_date("2020-05-13") to xend100 to y and yendInside the # big vertical annotate():
geom to "segment"1 to size"firebrick3" to colorlubridate::as_date("2020-05-11") to xlubridate::as_date("2020-05-11") to xend270 to y and 100 to yendInside the # text annotate():
geom to "text"8 to size"red" to color0.5 to hjustlubridate::as_date("2020-05-07") to x280 to y"Data not available" to labelUSCitiesMarJun %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
# make these slightly larger...
size = 0.30) +
# coordinate system
coord_cartesian(xlim = c(_______________, _______________),
ylim = c(_______________, na.rm = __________),
_______________, na.rm = __________))) +
# horizontal
annotate(geom = ___________, size = ___________, color = ___________,
x = ___________,
xend = ___________,
y = ___________,
yend = ___________) +
# big vertical
annotate(geom = ___________,
size = ___________,
color = ___________,
x = ___________,
xend = ___________,
y = ___________, yend = ___________) +
# text
annotate(geom = "text",
size = 8,
color = "red",
hjust = 0.5,
x = lubridate::as_date("2020-05-07"),
y = 280,
label = "Data not available") +
lab_annotate
# plot
USCitiesMarJun %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
# make these slightly larger...
size = 0.30) +
# coordinate system
coord_cartesian(xlim = c(min(USCitiesMarJun$date),
max(USCitiesMarJun$date)),
ylim = c(min(USCitiesMarJun$dir_request, na.rm = TRUE),
max(USCitiesMarJun$dir_request, na.rm = TRUE))) +
# horizontal
annotate(geom = "segment",
size = 0.5,
color = "firebrick3",
x = lubridate::as_date("2020-05-10"),
xend = lubridate::as_date("2020-05-13"),
y = 100,
yend = 100) +
# big vertical
annotate(geom = "segment",
size = 1,
color = "firebrick3",
x = lubridate::as_date("2020-05-11"),
xend = lubridate::as_date("2020-05-11"),
y = 270, yend = 100) +
# text
annotate(geom = "text",
color = "red",
hjust = 0.5,
size = 8,
x = lubridate::as_date("2020-05-07"),
y = 280,
label = "Data not available") +
lab_annotate
Add a second and third vertical segment to create a fence or bracket for the dates with missing data.
# plot
USCitiesMarJun %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
# make these slightly larger...
size = 0.30) +
# coordinate system
coord_cartesian(xlim = c(min(USCitiesMarJun$date),
max(USCitiesMarJun$date)),
ylim = c(min(USCitiesMarJun$dir_request, na.rm = TRUE),
max(USCitiesMarJun$dir_request, na.rm = TRUE))) +
# horizontal
annotate(geom = "segment",
size = 0.5,
color = "firebrick3",
x = lubridate::as_date("2020-05-10"),
xend = lubridate::as_date("2020-05-13"),
y = 100,
yend = 100) +
# big vertical
annotate(geom = "segment",
size = 1,
color = "firebrick3",
x = lubridate::as_date("2020-05-11"),
xend = lubridate::as_date("2020-05-11"),
y = 270, yend = 100) +
# text
annotate(geom = "text",
color = "red",
hjust = 0.5,
size = 8,
x = lubridate::as_date("2020-05-07"),
y = 280,
label = "Data not available") +
# second vertical
annotate(geom = "segment",
size = 0.5,
color = "firebrick3",
x = _____________________________,
xend = _____________________________,
y = _____________________________,
yend = _____________________________) +
# third vertical
annotate(geom = "segment",
size = 0.5,
color = "firebrick3",
x = _____________________________,
xend = _____________________________,
y = _____________________________,
yend = _____________________________) +
lab_annotate
See below:
# plot
USCitiesMarJun %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
# make these slightly larger...
size = 0.30) +
# coordinate system
coord_cartesian(xlim = c(min(USCitiesMarJun$date),
max(USCitiesMarJun$date)),
ylim = c(min(USCitiesMarJun$dir_request, na.rm = TRUE),
max(USCitiesMarJun$dir_request, na.rm = TRUE))) +
# horizontal
annotate(geom = "segment",
size = 0.5,
color = "firebrick3",
x = lubridate::as_date("2020-05-10"),
xend = lubridate::as_date("2020-05-13"),
y = 100,
yend = 100) +
# big vertical
annotate(geom = "segment",
size = 1,
color = "firebrick3",
x = lubridate::as_date("2020-05-11"),
xend = lubridate::as_date("2020-05-11"),
y = 270, yend = 100) +
# text
annotate(geom = "text",
color = "red",
hjust = 0.5,
size = 8,
x = lubridate::as_date("2020-05-07"),
y = 280,
label = "Data not available") +
# second vertical
annotate(geom = "segment",
size = 0.5,
color = "firebrick3",
x = lubridate::as_date("2020-05-10"),
xend = lubridate::as_date("2020-05-10"),
y = 100, yend = 90) +
# third vertical
annotate(geom = "segment",
size = 0.5,
color = "firebrick3",
x = lubridate::as_date("2020-05-13"),
xend = lubridate::as_date("2020-05-13"),
y = 100, yend = 90) +
lab_annotate
Another option is to use geom_rect() to black out the missing data.
# plot
USCitiesMarJun %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
# make these slightly larger...
size = 0.30) +
# coordinate system
coord_cartesian(xlim = c(min(USCitiesMarJun$date),
max(USCitiesMarJun$date)),
ylim = c(min(USCitiesMarJun$dir_request, na.rm = TRUE),
max(USCitiesMarJun$dir_request, na.rm = TRUE))) +
geom_rect(xmin = lubridate::as_date("2020-05-10"),
xmax = lubridate::as_date("2020-05-12"),
ymin = -Inf,
ymax = Inf,
color = NA) +
geom_text(x = as.Date("2020-05-11"),
y = 100, label = "Data Not Available",
angle = 90, color = "white") +
lab_annotate
Max Driving Requests
The code below creates a subset of the data (TopUSCities). We will use this to add the labels.
TopUSCities <- TidyApple %>%
filter(country == "United States" &
region %in% c("New York City", "Los Angeles",
"Chicago", "Houston", "Phoenix"))
TopUSCities
Create MaxUSCitiesDriving by filtering trans_type, grouping on the region variable, and using dplyr::slice_max() to get the top value in dir_request.
TopUSCities %>%
filter(trans_type == __________) %>%
group_by(__________) %>%
slice_max(dir_request) %>%
ungroup() -> MaxUSCitiesDriving
MaxUSCitiesDriving
TopUSCities %>%
filter(trans_type == "driving") %>%
group_by(region) %>%
slice_max(dir_request) %>%
ungroup() -> MaxUSCitiesDriving
MaxUSCitiesDriving
Create graph labels:
assign "Peak Driving Direction Requests in Largest US Cities" to title
assign "Max Driving Direction Requests & Date" to subtitle
lab_line_max_drivers <- labs(
x = "Date",
y = "Direction Requests",
title = "_________________________________",
subtitle = paste0(min(___________$date),
" through ",
max(___________$date)),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
lab_line_max_drivers <- labs(
x = "Date",
y = "Direction Requests",
title = "Peak Driving Direction Requests in Largest US Cities",
subtitle = paste0(min(TopUSCities$date),
" through ",
max(TopUSCities$date)),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
Create max_driving_labels using paste0() with region and date.
MaxUSCitiesDriving %>%
mutate(
max_driving_labels = paste0(______, ", ", ______)) -> MaxUSCitiesDriving
MaxUSCitiesDriving %>%
select(max_driving_labels)
See below:
MaxUSCitiesDriving %>%
mutate(max_driving_labels = paste0(region, ", ", date)) -> MaxUSCitiesDriving
MaxUSCitiesDriving %>%
select(max_driving_labels)
Create a line plot, assigning the following values in geom_label_repel():
data argument to MaxUSCitiesDrivingInside the aes():
label to max_driving_labelsOutside the aes()
color to "red"size to 3TopUSCities %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type),
# make these slightly smaller again...
size = 0.15) +
geom_label_repel(data = _____________,
aes(x = date, y = dir_request,
label = _____________),
# set color and size...
color = _____,
size = _) +
lab_line_max_drivers
TopUSCities %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type),
# make these slightly smaller again...
size = 0.15) +
geom_label_repel(data = MaxUSCitiesDriving,
aes(x = date, y = dir_request,
label = max_driving_labels),
# set color and size...
color = "red",
size = 3) +
lab_line_max_drivers
Min Walking Requests
We are going to repeat the process above, but use the minimum value for walking direction requests.
filter() the trans_type to "walking"group_by() the regionslice_min() to get the minimum value for dir_requestMinUSCitiesWalkingTopUSCities %>%
filter(________ == ________) %>%
group_by(________) %>%
slice_min(dir_request) %>%
ungroup() -> MinUSCitiesWalking
MinUSCitiesWalking
TopUSCities %>%
filter(trans_type == "walking") %>%
group_by(region) %>%
slice_min(dir_request) %>%
ungroup() -> MinUSCitiesWalking
MinUSCitiesWalking
"Lowest Walking Direction Requests in Largest US Cities" to titlelab_line_min_walking <- labs(
x = "Date",
y = "Direction Requests",
title = "__________________________________________",
subtitle = paste0(min(___________$date),
" through ",
max(___________$date)),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
See below:
lab_line_min_walking <- labs(
x = "Date",
y = "Direction Requests",
title = "Lowest Walking Direction Requests in Largest US Cities",
subtitle = paste0(min(TopUSCities$date),
" through ",
max(TopUSCities$date)),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
Create min_walking_labels using paste0() with region and date
MinUSCitiesWalking %>%
mutate(min_walking_labels = paste0(_____, ", ", _____)) -> MinUSCitiesWalking
MinUSCitiesWalking
MinUSCitiesWalking %>%
mutate(min_walking_labels = paste0(region, ", ", date)) -> MinUSCitiesWalking
MinUSCitiesWalking
Create a line plot, assigning the following values in geom_label_repel():
data to MinUSCitiesWalkingInside aes():
min_walking_labels to labelOutside aes():
"blue" to color3 to sizeTopUSCities %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type),
# make these slightly smaller again...
size = 0.15) +
geom_label_repel(data = ____________,
aes(x = date, y = dir_request,
label = ____________),
# set color and size...
color = _______,
size = _) +
lab_line_min_walking
TopUSCities %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type),
# make these slightly smaller again...
size = 0.15) +
geom_label_repel(data = MinUSCitiesWalking,
aes(x = date, y = dir_request,
label = min_walking_labels),
# set color and size...
color = "blue",
size = 3) +
lab_line_min_walking
We’re going to focus on the top 8 cites according to their population (the date of this writing is 2020-11-26).
Top Cities
We’re going to introduce another filtering method in this section to create the TopUSCities dataset.
Store top eight cities in the focus_on vector and use it to filter the TidyApple dataset.
focus_on <- c("New York City", "Los Angeles",
"Chicago", "Houston",
"Phoenix", "Philadelphia",
"San Antonio", "San Diego")
TopUSCities <- TidyApple %>%
filter(region %in% _____________)
TopUSCities %>% glimpse(60)
focus_on <- c("New York City", "Los Angeles",
"Chicago", "Houston",
"Phoenix", "Philadelphia",
"San Antonio", "San Diego")
TopUSCities <- TidyApple %>%
filter(region %in% focus_on)
TopUSCities %>% glimpse(60)
## Rows: 7,608
## Columns: 7
## $ geo_type <chr> "city", "city", "city", "city", "city"…
## $ region <chr> "Chicago", "Chicago", "Chicago", "Chic…
## $ trans_type <chr> "driving", "driving", "driving", "driv…
## $ sub_region <chr> "Illinois", "Illinois", "Illinois", "I…
## $ country <chr> "United States", "United States", "Uni…
## $ date <date> 2020-01-13, 2020-01-14, 2020-01-15, 2…
## $ dir_request <dbl> 100.00, 103.68, 104.45, 108.72, 132.80…
Graph Labels
We’re going to place date on the x axis, and dir_request on the y. The tite will reflect a general description of what we’re expecting to see, and we’ll list the cities in the subtitle. color will be used to give a better description than trans_type.
Fill in names for the x, y, and color.
lab_top_cities <- labs(x = _____, y = __________,
title = "Trends of Relative Activity in Selected US Cities",
subtitle = "NY, LA, CH, HOU, PHA, PHL, SATX, SD",
color = _________)
See below:
lab_top_cities <- labs(x = "Date", y = "Direction Requests",
title = "Trends of Relative Activity in Selected US Cities",
subtitle = "NY, LA, CH, HOU, PHA, PHL, SATX, SD",
color = "Type")
aes()We’re going to set the global graph aesthetics inside ggplot(aes()) using our labels as a guide. This will serve as a base layer for us to add our reference line to!
Global aes()
Map trans_type to group and color
Also add a geom_line() layer with the size set to 0.1 (not inside the aes()!)
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
_____ = __________,
_____ = __________)) +
____________(_____ = ___) +
lab_top_cities
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type)) +
geom_line(size = 0.1) +
lab_top_cities
The documentation for the data tells us each dir_request has a baseline value of 100. We’re going to add this as a reference line on the graph using ggplot2::geom_hline().
The geom_hline() function takes yintercept, size, and color arguments.
100 as the yinterceptsize to 0.2color of this line "gray20"TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type)) +
geom_line(size = 0.1) +
# add reference line
geom_hline(yintercept = ___, size = ___, color = _______) +
lab_top_cities
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type)) +
geom_line(size = 0.1) +
geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
lab_top_cities
Reference lines are helpful when we want to examine trends in relation to a particular value.