In order to use graphs and figures to effectively communicate with our audience, we need to consider a few things:
The slides for this lesson are here
All of the exercises and lessons are available here. Read more about ggplot2
on the tidyverse website, and in the Data Visualisation chapter of R for Data Science.
The main packages we’re going to use are dplyr
, tidyr
, and ggplot2
. These are all part of the tidyverse
, so we’ll import this package below:
install.packages("tidyverse")
library(tidyverse)
Assume we received the following questions from a stakeholder:
How has COVID changed our modes of transportation?
Or
Are people using fewer or different forms of transportation since the COVID pandemic?
Questions we should be considering:
We’re going to use the following data to answer the stakeholder’s questions:
Import the data below
AppleMobRaw <- readr::read_csv("https://bit.ly/36tTVpe")
AppleMobRaw
These data need to be restructured into a tidy format.
AppleMobRaw %>%
tidyr::pivot_longer(cols = -c(geo_type:country),
names_to = "date",
values_to = "dir_request")
AppleMobRaw
We will remove the missing values from country
and sub-region
AppleMobRaw %>%
tidyr::pivot_longer(cols = -c(geo_type:country),
names_to = "date", values_to = "dir_request") %>%
# remove missing country and missing sub-region data
dplyr::filter(!is.na(country) & !is.na(`sub-region`))
Use mutate()
to create a properly formatted date
variable, and rename()
the transportation_type
variable to trans_type
. Apply janitor::clean_names()
to the entire dataset and assign the final output to TidyApple
.
AppleMobRaw %>%
tidyr::pivot_longer(cols = -c(geo_type:country),
names_to = "date", values_to = "dir_request") %>%
# remove missing country and missing sub-region data
dplyr::filter(!is.na(country) & !is.na(`sub-region`)) %>%
# format date
mutate(date = lubridate::ymd(date)) %>%
# change name of transportation types
rename(trans_type = transportation_type) %>%
# clean names
janitor::clean_names() -> TidyApple
One of the most important jobs of analytic work is counting things. There are many ways to accomplish this in R, but we’ll stick with the dplyr
package because it’s part of the tidyverse
.
The dplyr
function for counting responses of a categorical or factor variable is count()
, and it works like this:
Data %>%
count(variable)
So, if we wanted to count the number of different transportation types in the TidyApple
data frame, it would look like this,
TidyApple %>%
dplyr::count(trans_type)
We can also sort the responses using the sort = TRUE
argument.
TidyApple %>%
dplyr::count(trans_type, sort = TRUE)
We can also combine dplyr::select_if()
and purrr::map()
to pass the count()
function to all the character variables in TidyApple
.
TidyApple %>%
select_if(is.character) %>%
map(~count(data.frame(x = .x), x, sort = TRUE)) -> tidy_apple_counts
We can example the counts of each value by using the $
to subset the tidy_apple_counts
list.
tidy_apple_counts$sub_region
tidy_apple_counts$region
Before we start looking at relationships between variables, we should examine each variable’s underlying distribution. In the next section, we’re going to cover a few graphs that display variable distributions: histograms, density, violin, and ridgeline plots,
A histogram is a special kind of bar graph–it only takes a single continuous variable (in this case, dir_request
), and it displays a relative breakdown of the values.
The x
axis for the histogram will have the direction requests, and the y
variable will display a count of the values.
lab_hist <- labs(x = "Apple directions requests",
y = "Count",
title = "Distribution of Direction Requests",
subtitle = "source: https://covid19.apple.com/mobility")
Create a histogram of direction requests using dir_request
TidyApple %>% ggplot() +
geom_histogram(aes(x = ____________)) +
lab_hist
See blow:
TidyApple %>% ggplot() +
geom_histogram(aes(x = dir_request)) +
lab_hist
We can see the y
axis of the histogram is in scientific notation. This might be hard for some audiences to interpret, so we will change this to use the whole number with commas with the scales
package.
Add the scales::comma
value to the scale_y_continuous()
function.
library(scales)
TidyApple %>% ggplot() +
geom_histogram(aes(x = dir_request)) +
scale_y_continuous(labels = __________) +
lab_hist
See below:
library(scales)
TidyApple %>% ggplot() +
geom_histogram(aes(x = dir_request)) +
scale_y_continuous(labels = scales::comma) +
lab_hist
We can control the shape of the histogram with the bins
argument. The default is 30
.
Set bins
to 15
.
TidyApple %>% ggplot() +
geom_histogram(aes(x = dir_request), bins = __) +
scale_y_continuous(labels = scales::comma) +
lab_hist
See below:
TidyApple %>% ggplot() +
geom_histogram(aes(x = dir_request), bins = 15) +
scale_y_continuous(labels = scales::comma) +
lab_hist
Set bins
to 45
and assign it to gg_hist45
.
TidyApple %>% ggplot() +
geom_histogram(aes(x = dir_request), bins = __) +
scale_y_continuous(labels = scales::comma) +
lab_hist -> _____________
See below:
TidyApple %>% ggplot() +
geom_histogram(aes(x = dir_request), bins = 45) +
scale_y_continuous(labels = scales::comma) +
lab_hist -> gg_hist45
What if we want to see how a continuous variable is distributed across a categorical variable? We did this in the previous lesson with a boxplot.
Density plots come in handy here (so do geom_boxplot()
s!). Read more about the density geom here.
We are going to create the graph labels so we know what to expect when we build our graph.
We want to see the distribution of the directions request, filled by the levels of transportation type.
lab_density <- labs(x = "Apple directions requests",
fill = "Transit Type",
title = "Distribution of Direction Requests vs. Transportation Type",
subtitle = "source: https://covid19.apple.com/mobility")
Now we build the density plot, passing the variables so they match our labels above.
Create a density plot of direction requests colored by the type of transportation.
TidyApple %>%
ggplot() +
geom_density(aes(x = __________, fill = __________)) +
lab_density
See below:
TidyApple %>%
ggplot() +
geom_density(aes(x = dir_request, fill = trans_type)) +
lab_density
One drawback to density plots is the y
axis can be hard to interpret
Adjust the overlapping densities by setting alpha
to 1/3
. Assign this plot to gg_density
.
TidyApple %>%
ggplot() +
geom_density(aes(x = dir_request, fill = trans_type),
alpha = __________) +
lab_density -> __________
See below:
TidyApple %>%
ggplot() +
geom_density(aes(x = dir_request, fill = trans_type),
alpha = 1/3) +
lab_density -> gg_density
gg_density
Another option is a ridgeline plot (from the ggridges
package). These display multiple densities.
lab_ridges <- labs(
title = "Direction Requests by Transportation Type",
subtitle = "source: https://covid19.apple.com/mobility",
fill = "Transit type",
x = "Apple directions requests",
y = "Transportation Types")
library(ggridges)
TidyApple %>%
ggplot() +
geom_density_ridges(aes(x = dir_request,
y = trans_type,
fill = trans_type),
alpha = 1/5) +
lab_ridges
Another alternative to the density plot is the violin plot.
"Apple directions requests"
to the x
axis"Transit Type"
to the y
axislab_violin <- labs(x = _________________________,
y = _________________________,
fill = "Transit Type",
title = "Distribution of Direction Requests vs. Transportation Type",
subtitle = "source: https://covid19.apple.com/mobility")
lab_violin <- labs(x = "Apple directions requests",
y = "Transit Type",
fill = "Transit Type",
title = "Distribution of Direction Requests vs. Transportation Type",
subtitle = "source: https://covid19.apple.com/mobility")
Add a geom_violin()
to the code below:
TidyApple %>%
ggplot() +
____________(aes(y = dir_request, x = trans_type,
fill = trans_type)) +
lab_violin
TidyApple %>%
ggplot() +
geom_violin(aes(y = dir_request, x = trans_type,
fill = trans_type)) +
lab_violin
The great thing about ggplot2
s layered syntax, is that we can add geoms
with similar aesthetics to the same graph! For example, we can see how geom_violin
s and geom_boxplot
s are related by adding a geom_boxplot()
layer to the graph above.
TidyApple %>%
ggplot() +
geom_violin(aes(y = dir_request, x = trans_type,
fill = trans_type), alpha = 1/5) +
___________(aes(y = dir_request, x = trans_type,
color = trans_type)) +
lab_violin
Note we set the alpha
to 1/5
for the geom_violin()
, and the color
to trans_type
for the geom_boxplot()
.
TidyApple %>%
ggplot() +
geom_violin(aes(y = dir_request, x = trans_type,
fill = trans_type), alpha = 1/5) +
geom_boxplot(aes(y = dir_request, x = trans_type,
color = trans_type)) +
lab_violin
We’ve seen the distribution of dir_request
, and how it varies across trans_type
.
Now we’ll look at how the relationship between these two variables varies over in a subset of US cities and across a specific date
range.
We’ll start by narrowing down the data by filtering to only US cities.
Filter the geo_type
to "city"
and the country
to "United States"
, and pass the date
variable to skimr::skim()
TidyApple %>%
filter(geo_type == ___________ &
country == ___________) %>%
# use skimr to check date
skimr::skim(___________)
Here we reduce the dataset to only cities in the US, and we check the date
range with skimr::skim()
. If this looks OK, we assign to USCities
TidyApple %>%
filter(geo_type == "city" &
country == "United States") %>%
# use skimr to check date
skimr::skim(date)
Name | Piped data |
Number of rows | 99538 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
Date | 1 |
________________________ | |
Group variables | None |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 2020-01-13 | 2020-11-24 | 2020-06-19 | 317 |
TidyApple %>%
filter(geo_type == "city" &
country == "United States") -> USCities
We can see this date range is from 2020-01-13 to 2020-11-24.
Use paste0()
to combine the first and last date
in the filtered USCities
dataset.
paste0(min(______________$____),
" through ",
max(______________$____))
paste0(min(USCities$date),
" through ",
max(USCities$date))
## [1] "2020-01-13 through 2020-11-24"
We want to specify this in our labels object (lab_line_update
), so we will use the paste0()
function to have the labels update every time the data changes.
labs(x = "Date",
y = "Direction Requests",
title = "Direction Requests Over Time (US Cities Only)",
subtitle = paste0(min(USCities$date),
" through ",
max(USCities$date)),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type") -> lab_line_update
We’re going to create a line graph of direction requests over time, colored by color.
Pass the filtered data to the geom_line()
, mapping the following variables to their relative aesthetics:
date
to x
dir_request
to y
trans_type
to both group
and color
Include lab_line_update
to see how the new labels look!
USCities %>%
ggplot() +
geom_line(aes(x = __________, y = __________,
group = __________, color = __________)) +
__________
Let’s see what happens when we use lab_line_update
.
USCities %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type)) +
lab_line_update
The dates updated to the min and max date
in USCities
.
These lines in this graph are overlapping each other, so we will adjust the size
to 0.20
.
Change the size of the geom_line()
(outside of aes()
).
USCities %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
size = ____________) +
lab_line01
See below:
USCities %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
size = 0.20) +
lab_line_update
Now the trends are easier to see.
We are going to only look at the trends between February and August of 2020, but we’re going to use an alternative method to filter
the data and create the labels.
We will create two new objects (start_date
and end_date
), which we can use to narrow the dates using the filter()
function (and anywhere else we need to use this date range).
This method is better than passing the dates as a character (i.e. in quotes), because we would only have to change it in one place. However, the option above makes better use of R functional programming syntax.
Pass the start_date
and end_date
to the as_date()
functions, and take a look at the date
variable with skimr::skim()
and if it looks correct, assign it to USCitiesFebJul
# create date objects
start_date <- "2020-02-01"
end_date <- "2020-08-01"
# check with skimr
TidyApple %>%
filter(geo_type == "city" &
country == "United States",
date >= as_date(_____________) &
date <= as_date(_____________)) %>%
skimr::skim(_____________)
See below:
# create date objects
start_date <- "2020-02-01"
end_date <- "2020-08-01"
# check with skimr
TidyApple %>%
filter(geo_type == "city" &
country == "United States",
date >= as_date(start_date) &
date <= as_date(end_date)) -> USCitiesFebJul
USCitiesFebJul %>%
skimr::skim(date)
Name | Piped data |
Number of rows | 57462 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
Date | 1 |
________________________ | |
Group variables | None |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 2020-02-01 | 2020-08-01 | 2020-05-02 | 183 |
Create the new labels (lab_line_paste
) with the paste0()
function by passing both start_date
and end_date
.
lab_line_paste <- labs(x = "Date",
y = "Direction Requests",
title = "Direction Requests Over Time (US Cities Only)",
subtitle = paste0(___________, " through ", ___________),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
lab_line_paste <- labs(x = "Date",
y = "Direction Requests",
title = "Direction Requests Over Time (US Cities Only)",
subtitle = paste0(start_date, " through ", end_date),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
USCitiesFebJul %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
# make these slightly larger...
size = 0.30) +
lab_line_paste
We can see there is a gap in the direct request data (this is documented in the data source).
“Data for May 11-12 is not available and will appear as blank columns in the data set.”
We should communicate this gap with our audience, and we might include a text annotation on the graph so our audience isn’t distracted by the gap.
In the previous lesson, we introduced the ggrepel
package to show the points on this graph of top performing pharmaceutical companies.
We’re going to use labels to annotate and highlight US cities between March and June of 2020.
Being able to manually add text and annotations as layers to your graph makes it easier to communicate the nuances of your data to your audience. We are going to start by accounting for the missing data in TidyApple
.
Build a dataset from TidyApple
that only has US cities, and ranges from March 1, 2020 to June 30, 2020.
USCitiesMarJun <- TidyApple %>%
filter(geo_type == ___________ &
country == ___________,
date >= as_date(___________) &
date <= as_date(___________))
USCitiesMarJun %>%
skimr::skim()
See below:
USCitiesMarJun <- TidyApple %>%
filter(geo_type == "city" &
country == "United States",
date >= as_date("2020-03-01") &
date <= as_date("2020-07-01"))
USCitiesMarJun %>% skimr::skim()
Name | Piped data |
Number of rows | 38622 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
character | 5 |
Date | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
geo_type | 0 | 1 | 4 | 4 | 0 | 1 | 0 |
region | 0 | 1 | 5 | 39 | 0 | 111 | 0 |
trans_type | 0 | 1 | 7 | 7 | 0 | 3 | 0 |
sub_region | 0 | 1 | 4 | 14 | 0 | 39 | 0 |
country | 0 | 1 | 13 | 13 | 0 | 1 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 2020-03-01 | 2020-07-01 | 2020-05-01 | 123 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
dir_request | 628 | 0.98 | 82.06 | 41.11 | 4.37 | 50.87 | 74.38 | 109.11 | 525.44 | ▇▃▁▁▁ |
We’re going to build labels using the paste0()
function. Fill in the appropriate dataset for the min()
and max()
date.
lab_annotate <- labs(x = "Date",
y = "Direction Requests",
title = "Direction Requests Over Time (US Cities Only)",
subtitle = paste0(min(______________$date),
" through ",
max(______________$date)),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
See below:
lab_annotate <- labs(x = "Date",
y = "Direction Requests",
title = "Direction Requests Over Time (US Cities Only)",
subtitle = paste0(min(USCitiesMarJun$date),
" through ",
max(USCitiesMarJun$date)),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
The previous code for the graph has been added. We’re going to add the following layers:
Inside coord_cartesian()
:
min(USCitiesMarJun$date)
and max(USCitiesMarJun$date)
inside c()
to xlim
min(USCitiesMarJun$dir_request, na.rm = TRUE)
and max(USCitiesMarJun$dir_request, na.rm = TRUE)
inside c()
to ylim
Inside the # horizontal
annotate()
:
geom
to "segment"
0.5
to size
"firebrick3"
to color
lubridate::as_date("2020-05-10")
to x
lubridate::as_date("2020-05-13")
to xend
100
to y
and yend
Inside the # big vertical
annotate()
:
geom
to "segment"
1
to size
"firebrick3"
to color
lubridate::as_date("2020-05-11")
to x
lubridate::as_date("2020-05-11")
to xend
270
to y
and 100
to yend
Inside the # text
annotate()
:
geom
to "text"
8
to size
"red"
to color
0.5
to hjust
lubridate::as_date("2020-05-07")
to x
280
to y
"Data not available"
to label
USCitiesMarJun %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
# make these slightly larger...
size = 0.30) +
# coordinate system
coord_cartesian(xlim = c(_______________, _______________),
ylim = c(_______________, na.rm = __________),
_______________, na.rm = __________))) +
# horizontal
annotate(geom = ___________, size = ___________, color = ___________,
x = ___________,
xend = ___________,
y = ___________,
yend = ___________) +
# big vertical
annotate(geom = ___________,
size = ___________,
color = ___________,
x = ___________,
xend = ___________,
y = ___________, yend = ___________) +
# text
annotate(geom = "text",
size = 8,
color = "red",
hjust = 0.5,
x = lubridate::as_date("2020-05-07"),
y = 280,
label = "Data not available") +
lab_annotate
# plot
USCitiesMarJun %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
# make these slightly larger...
size = 0.30) +
# coordinate system
coord_cartesian(xlim = c(min(USCitiesMarJun$date),
max(USCitiesMarJun$date)),
ylim = c(min(USCitiesMarJun$dir_request, na.rm = TRUE),
max(USCitiesMarJun$dir_request, na.rm = TRUE))) +
# horizontal
annotate(geom = "segment",
size = 0.5,
color = "firebrick3",
x = lubridate::as_date("2020-05-10"),
xend = lubridate::as_date("2020-05-13"),
y = 100,
yend = 100) +
# big vertical
annotate(geom = "segment",
size = 1,
color = "firebrick3",
x = lubridate::as_date("2020-05-11"),
xend = lubridate::as_date("2020-05-11"),
y = 270, yend = 100) +
# text
annotate(geom = "text",
color = "red",
hjust = 0.5,
size = 8,
x = lubridate::as_date("2020-05-07"),
y = 280,
label = "Data not available") +
lab_annotate
Add a second and third vertical segment
to create a fence or bracket for the dates with missing data.
# plot
USCitiesMarJun %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
# make these slightly larger...
size = 0.30) +
# coordinate system
coord_cartesian(xlim = c(min(USCitiesMarJun$date),
max(USCitiesMarJun$date)),
ylim = c(min(USCitiesMarJun$dir_request, na.rm = TRUE),
max(USCitiesMarJun$dir_request, na.rm = TRUE))) +
# horizontal
annotate(geom = "segment",
size = 0.5,
color = "firebrick3",
x = lubridate::as_date("2020-05-10"),
xend = lubridate::as_date("2020-05-13"),
y = 100,
yend = 100) +
# big vertical
annotate(geom = "segment",
size = 1,
color = "firebrick3",
x = lubridate::as_date("2020-05-11"),
xend = lubridate::as_date("2020-05-11"),
y = 270, yend = 100) +
# text
annotate(geom = "text",
color = "red",
hjust = 0.5,
size = 8,
x = lubridate::as_date("2020-05-07"),
y = 280,
label = "Data not available") +
# second vertical
annotate(geom = "segment",
size = 0.5,
color = "firebrick3",
x = _____________________________,
xend = _____________________________,
y = _____________________________,
yend = _____________________________) +
# third vertical
annotate(geom = "segment",
size = 0.5,
color = "firebrick3",
x = _____________________________,
xend = _____________________________,
y = _____________________________,
yend = _____________________________) +
lab_annotate
See below:
# plot
USCitiesMarJun %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
# make these slightly larger...
size = 0.30) +
# coordinate system
coord_cartesian(xlim = c(min(USCitiesMarJun$date),
max(USCitiesMarJun$date)),
ylim = c(min(USCitiesMarJun$dir_request, na.rm = TRUE),
max(USCitiesMarJun$dir_request, na.rm = TRUE))) +
# horizontal
annotate(geom = "segment",
size = 0.5,
color = "firebrick3",
x = lubridate::as_date("2020-05-10"),
xend = lubridate::as_date("2020-05-13"),
y = 100,
yend = 100) +
# big vertical
annotate(geom = "segment",
size = 1,
color = "firebrick3",
x = lubridate::as_date("2020-05-11"),
xend = lubridate::as_date("2020-05-11"),
y = 270, yend = 100) +
# text
annotate(geom = "text",
color = "red",
hjust = 0.5,
size = 8,
x = lubridate::as_date("2020-05-07"),
y = 280,
label = "Data not available") +
# second vertical
annotate(geom = "segment",
size = 0.5,
color = "firebrick3",
x = lubridate::as_date("2020-05-10"),
xend = lubridate::as_date("2020-05-10"),
y = 100, yend = 90) +
# third vertical
annotate(geom = "segment",
size = 0.5,
color = "firebrick3",
x = lubridate::as_date("2020-05-13"),
xend = lubridate::as_date("2020-05-13"),
y = 100, yend = 90) +
lab_annotate
Another option is to use geom_rect()
to black out the missing data.
# plot
USCitiesMarJun %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type, color = trans_type),
# make these slightly larger...
size = 0.30) +
# coordinate system
coord_cartesian(xlim = c(min(USCitiesMarJun$date),
max(USCitiesMarJun$date)),
ylim = c(min(USCitiesMarJun$dir_request, na.rm = TRUE),
max(USCitiesMarJun$dir_request, na.rm = TRUE))) +
geom_rect(xmin = lubridate::as_date("2020-05-10"),
xmax = lubridate::as_date("2020-05-12"),
ymin = -Inf,
ymax = Inf,
color = NA) +
geom_text(x = as.Date("2020-05-11"),
y = 100, label = "Data Not Available",
angle = 90, color = "white") +
lab_annotate
Max Driving Requests
The code below creates a subset of the data (TopUSCities
). We will use this to add the labels.
TopUSCities <- TidyApple %>%
filter(country == "United States" &
region %in% c("New York City", "Los Angeles",
"Chicago", "Houston", "Phoenix"))
TopUSCities
Create MaxUSCitiesDriving
by filtering trans_type
, grouping on the region
variable, and using dplyr::slice_max()
to get the top value in dir_request
.
TopUSCities %>%
filter(trans_type == __________) %>%
group_by(__________) %>%
slice_max(dir_request) %>%
ungroup() -> MaxUSCitiesDriving
MaxUSCitiesDriving
See below:
TopUSCities %>%
filter(trans_type == "driving") %>%
group_by(region) %>%
slice_max(dir_request) %>%
ungroup() -> MaxUSCitiesDriving
MaxUSCitiesDriving
Create graph labels:
assign "Peak Driving Direction Requests in Largest US Cities"
to title
assign "Max Driving Direction Requests & Date"
to subtitle
lab_line_max_drivers <- labs(
x = "Date",
y = "Direction Requests",
title = "_________________________________",
subtitle = paste0(min(___________$date),
" through ",
max(___________$date)),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
See below:
lab_line_max_drivers <- labs(
x = "Date",
y = "Direction Requests",
title = "Peak Driving Direction Requests in Largest US Cities",
subtitle = paste0(min(TopUSCities$date),
" through ",
max(TopUSCities$date)),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
Create max_driving_labels
using paste0()
with region
and date
.
MaxUSCitiesDriving %>%
mutate(
max_driving_labels = paste0(______, ", ", ______)) -> MaxUSCitiesDriving
MaxUSCitiesDriving %>%
select(max_driving_labels)
See below:
MaxUSCitiesDriving %>%
mutate(max_driving_labels = paste0(region, ", ", date)) -> MaxUSCitiesDriving
MaxUSCitiesDriving %>%
select(max_driving_labels)
Create a line plot, assigning the following values in geom_label_repel()
:
data
argument to MaxUSCitiesDriving
Inside the aes()
:
label
to max_driving_labels
Outside the aes()
color
to "red"
size
to 3
TopUSCities %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type),
# make these slightly smaller again...
size = 0.15) +
geom_label_repel(data = _____________,
aes(x = date, y = dir_request,
label = _____________),
# set color and size...
color = _____,
size = _) +
lab_line_max_drivers
See below:
TopUSCities %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type),
# make these slightly smaller again...
size = 0.15) +
geom_label_repel(data = MaxUSCitiesDriving,
aes(x = date, y = dir_request,
label = max_driving_labels),
# set color and size...
color = "red",
size = 3) +
lab_line_max_drivers
Min Walking Requests
We are going to repeat the process above, but use the minimum value for walking direction requests.
filter()
the trans_type
to "walking"
group_by()
the region
slice_min()
to get the minimum value for dir_request
MinUSCitiesWalking
TopUSCities %>%
filter(________ == ________) %>%
group_by(________) %>%
slice_min(dir_request) %>%
ungroup() -> MinUSCitiesWalking
MinUSCitiesWalking
See below:
TopUSCities %>%
filter(trans_type == "walking") %>%
group_by(region) %>%
slice_min(dir_request) %>%
ungroup() -> MinUSCitiesWalking
MinUSCitiesWalking
"Lowest Walking Direction Requests in Largest US Cities"
to title
lab_line_min_walking <- labs(
x = "Date",
y = "Direction Requests",
title = "__________________________________________",
subtitle = paste0(min(___________$date),
" through ",
max(___________$date)),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
See below:
lab_line_min_walking <- labs(
x = "Date",
y = "Direction Requests",
title = "Lowest Walking Direction Requests in Largest US Cities",
subtitle = paste0(min(TopUSCities$date),
" through ",
max(TopUSCities$date)),
caption = "source: https://covid19.apple.com/mobility",
color = "Transit Type")
Create min_walking_labels
using paste0()
with region
and date
MinUSCitiesWalking %>%
mutate(min_walking_labels = paste0(_____, ", ", _____)) -> MinUSCitiesWalking
MinUSCitiesWalking
See below:
MinUSCitiesWalking %>%
mutate(min_walking_labels = paste0(region, ", ", date)) -> MinUSCitiesWalking
MinUSCitiesWalking
Create a line plot, assigning the following values in geom_label_repel()
:
data
to MinUSCitiesWalking
Inside aes()
:
min_walking_labels
to label
Outside aes()
:
"blue"
to color
3
to size
TopUSCities %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type),
# make these slightly smaller again...
size = 0.15) +
geom_label_repel(data = ____________,
aes(x = date, y = dir_request,
label = ____________),
# set color and size...
color = _______,
size = _) +
lab_line_min_walking
See below:
TopUSCities %>%
ggplot() +
geom_line(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type),
# make these slightly smaller again...
size = 0.15) +
geom_label_repel(data = MinUSCitiesWalking,
aes(x = date, y = dir_request,
label = min_walking_labels),
# set color and size...
color = "blue",
size = 3) +
lab_line_min_walking
We’re going to focus on the top 8 cites according to their population (the date of this writing is 2020-11-26
).
Top Cities
We’re going to introduce another filter
ing method in this section to create the TopUSCities
dataset.
Store top eight cities in the focus_on
vector and use it to filter the TidyApple
dataset.
focus_on <- c("New York City", "Los Angeles",
"Chicago", "Houston",
"Phoenix", "Philadelphia",
"San Antonio", "San Diego")
TopUSCities <- TidyApple %>%
filter(region %in% _____________)
TopUSCities %>% glimpse(60)
See below:
focus_on <- c("New York City", "Los Angeles",
"Chicago", "Houston",
"Phoenix", "Philadelphia",
"San Antonio", "San Diego")
TopUSCities <- TidyApple %>%
filter(region %in% focus_on)
TopUSCities %>% glimpse(60)
## Rows: 7,608
## Columns: 7
## $ geo_type <chr> "city", "city", "city", "city", "city…
## $ region <chr> "Chicago", "Chicago", "Chicago", "Chi…
## $ trans_type <chr> "driving", "driving", "driving", "dri…
## $ sub_region <chr> "Illinois", "Illinois", "Illinois", "…
## $ country <chr> "United States", "United States", "Un…
## $ date <date> 2020-01-13, 2020-01-14, 2020-01-15, …
## $ dir_request <dbl> 100.00, 103.68, 104.45, 108.72, 132.8…
Graph Labels
We’re going to place date
on the x
axis, and dir_request
on the y
. The tite
will reflect a general description of what we’re expecting to see, and we’ll list the cities in the subtitle
. color
will be used to give a better description than trans_type
.
Fill in names for the x
, y
, and color
.
lab_top_cities <- labs(x = _____, y = __________,
title = "Trends of Relative Activity in Selected US Cities",
subtitle = "NY, LA, CH, HOU, PHA, PHL, SATX, SD",
color = _________)
See below:
lab_top_cities <- labs(x = "Date", y = "Direction Requests",
title = "Trends of Relative Activity in Selected US Cities",
subtitle = "NY, LA, CH, HOU, PHA, PHL, SATX, SD",
color = "Type")
aes()
We’re going to set the global graph aesthetics inside ggplot(aes())
using our labels as a guide. This will serve as a base layer for us to add our reference line to!
Global aes()
Map trans_type
to group
and color
Also add a geom_line()
layer with the size
set to 0.1
(not inside the aes()
!)
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
_____ = __________,
_____ = __________)) +
____________(_____ = ___) +
lab_top_cities
See below:
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type)) +
geom_line(size = 0.1) +
lab_top_cities
The documentation for the data tells us each dir_request
has a baseline value of 100
. We’re going to add this as a reference line on the graph using ggplot2::geom_hline()
.
The geom_hline()
function takes yintercept
, size
, and color
arguments.
100
as the yintercept
size
to 0.2
color
of this line "gray20"
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type)) +
geom_line(size = 0.1) +
# add reference line
geom_hline(yintercept = ___, size = ___, color = _______) +
lab_top_cities
See below:
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type)) +
geom_line(size = 0.1) +
geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
lab_top_cities
Reference lines are helpful when we want to examine trends in relation to a particular value.
In the previous lesson, we introduced the facet_wrap()
function for viewing the relationship between two variables across the levels of a categorical variable. In the next section, we’re going to show how faceting can be used to explore ‘small multiples’ in a dataset with variation across multiple levels.
facet_wrap()
Now that we have a graph we can use to compare the 8
cities, we will use facet_wrap
to create a subplot for each level of region
.
Fill in the facet_wrap()
(note the use of the ~
) function with region
and set the ncol
to 2
.
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type)) +
geom_line(size = 0.1) +
geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
facet_wrap(~ _______, ncol = _) +
lab_top_cities
See below:
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type)) +
geom_line(size = 0.1) +
geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
facet_wrap(~ region, ncol = 2) +
lab_top_cities
Now map both region
and trans_type
to facet_wrap()
and set the ncol
to 6
.
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
group = region,
color = trans_type)) +
geom_line(size = 0.1) +
geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
facet_wrap(_______ ~ _______, ncol = _) +
lab_top_cities
See below:
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
group = region,
color = trans_type)) +
geom_line(size = 0.1) +
geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
facet_wrap(region ~ trans_type, ncol = 6) +
lab_top_cities
ggplot2
comes with a few default theme settings. Read more about them here. These themes come with arguments to adjust the font_size
and font_family
on our plot.
theme_minimal(
base_size = 11,
base_family = "",
base_line_size = base_size/22,
base_rect_size = base_size/22
)
We’ll use the theme_minimal()
function to reduce the number of elements
on our graph (don’t add any arguments).
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
group = region,
color = trans_type)) +
geom_line(size = 0.1) +
geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
facet_wrap(region ~ trans_type, ncol = 6) +
______________() +
lab_top_cities
See below:
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
group = region,
color = trans_type)) +
geom_line(size = 0.1) +
geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
facet_wrap(region ~ trans_type, ncol = 6) +
theme_minimal() +
lab_top_cities
Themes (accessible with the theme
function) give up the ability to customize various element_()
settings, which affect the ‘look’ of our graph.
We will start by moving the legend.position
to "top"
so it’s not crowding the x
axis on our graph.
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
group = region,
color = trans_type)) +
geom_line(size = 0.1) +
geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
facet_wrap(region ~ trans_type, ncol = 6) +
theme_minimal() +
theme(___________ = _____) +
lab_top_cities
See below:
TopUSCities %>%
ggplot(aes(x = date, y = dir_request,
group = region,
color = trans_type)) +
geom_line(size = 0.1) +
geom_hline(yintercept = 100, size = 0.2, color = "gray20") +
facet_wrap(region ~ trans_type, ncol = 6) +
theme_minimal() +
theme(legend.position = "top") +
lab_top_cities
Notice we pass the theme()
layer after the theme_minimal()
layer (if we ordered these the other way around, theme_minimal()
would overwrite any custom settings we made in theme()
).
What if we wanted to facet more than 8 cities? Fortunately, we have a the ggforce
and geofacet
packages for doing just that!
library(sf)
library(geofacet)
library(ggforce)
library(jcolors)
Building the Graph Data
We will start by filtering the TidyApple
to only the 50 US states (we’ve removed three US territories) and storing these data in TidyAppleUS
.
Next we limit the date range to the beginning of the shelter in place (from February 1, 2020 to May 1, 2020). These data get stored in TidyAppleUST1
.
We then create a dataset with only "transit"
direction requests, and we count these by state (sub_region
), arrange the data descending with sort = TRUE
, and take the top 25 rows (Top25TransitStates
).
Finally, we filter TidyAppleUST1
using the 25 states in Top25TransitStates
to create our graphing dataset, Top25TransitUSAllT1
.
# create only US states (TidyAppleUS)
TidyApple %>%
filter(country == "United States" &
!sub_region %in% c("Guam", "Puerto Rico",
"Virgin Islands")) -> TidyAppleUS
# create shelter in place time 1 (TidyAppleUST1)
TidyAppleUS %>%
filter(date >= as_date("2020-02-01") &
date <= as_date("2020-05-01")) -> TidyAppleUST1
# create top 25 states (Top25TransitStates)
Top25TransitStates <- TidyAppleUST1 %>%
filter(trans_type == "transit") %>%
count(sub_region, trans_type, sort = TRUE) %>%
head(25)
# filter T1 to states with the most transit requests (Top25TransitUSAllT1)
TidyAppleUST1 %>%
filter(sub_region %in%
unique(Top25TransitStates$sub_region)) -> Top25TransitUSAllT1
Top25TransitUSAllT1 %>% skimr::skim()
Name | Piped data |
Number of rows | 188643 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
character | 5 |
Date | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
geo_type | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
region | 0 | 1 | 5 | 39 | 0 | 1058 | 0 |
trans_type | 0 | 1 | 7 | 7 | 0 | 3 | 0 |
sub_region | 0 | 1 | 4 | 14 | 0 | 25 | 0 |
country | 0 | 1 | 13 | 13 | 0 | 1 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 2020-02-01 | 2020-05-01 | 2020-03-17 | 91 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
dir_request | 0 | 1 | 92.68 | 36.02 | 0.44 | 68.64 | 93.6 | 113.14 | 1379.02 | ▇▁▁▁▁ |
set title
to "States With Highest Transit Direction Requests"
set subtitle
to "Top 25 states based on number of total transit requests"
lab_facet_wrap_paginate <- labs(
x = "Date", y = "Direction Requests",
title = _____________________________________,
subtitle = _____________________________________)
lab_facet_wrap_paginate <- labs(
x = "Date", y = "Direction Requests",
title = "States With Highest Transit Direction Requests",
subtitle = "Top 25 states based on number of total transit requests")
Inside ggforce::facet_wrap_paginate()
:
map sub_region
as the variable to facet using the ~
map 5
to ncol
map "free_y"
to scales
Inside theme()
map element_blank()
to panel.border
and panel.background
map element_text(size = 6)
to axis.text.x
and axis.text.y
map element_text(colour = 'black')
to strip.text
map element_rect(fill = "gray93")
to strip.background
map "top"
to legend.position
Top25TransitUSAllT1 %>%
# global settings
ggplot(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type)) +
# lines
geom_hline(yintercept = 100, size = 0.3, color = "black") +
geom_line(size = 0.2) +
# faceting
ggforce::facet_wrap_paginate(~ __________,
ncol = _,
scales = _______) +
# theme settings
theme(__________ = __________(),
__________ = __________(),
__________ = __________(size = _),
__________ = __________(size = _),
__________ = __________(colour = __________),
__________ = __________(fill = __________),
__________ = __________) +
# labels
lab_facet_wrap_paginate
See below:
Top25TransitUSAllT1 %>%
# global settings
ggplot(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type)) +
# lines
geom_hline(yintercept = 100, size = 0.3, color = "black") +
geom_line(size = 0.2) +
# faceting
ggforce::facet_wrap_paginate(~ sub_region,
ncol = 5,
scales = "free_y") +
# theme settings
theme(panel.border = element_blank(),
panel.background = element_blank(),
axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6),
strip.text = element_text(colour = 'black'),
strip.background = element_rect(fill = "gray93"),
legend.position = "top") +
# labels
lab_facet_wrap_paginate
Changing the colors on graphs gives us the ability to further customize their look. We can set these manually, or use one of the many complete color palettes from a user-written package. Below we’ll use the jcolors
package to highlight the transit
direction requests from the previous graph.
scale_color_jcolors()
and play with the palette
argument to make the graph look like the solution
.Top25TransitUSAllT1 %>%
# global settings
ggplot(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type)) +
# lines
geom_hline(yintercept = 100, size = 0.3, color = "black") +
geom_line(size = 0.2) +
# faceting
ggforce::facet_wrap_paginate(~ sub_region,
ncol = 5,
scales = "free_y") +
# theme settings
theme(panel.border = element_blank(),
panel.background = element_blank(),
axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6),
strip.text = element_text(colour = 'black'),
strip.background = element_rect(fill = "gray93"),
legend.position = "top") +
# adjust colors
__________________________(palette = ____) +
lab_facet_wrap_paginate
See below:
Top25TransitUSAllT1 %>%
# global settings
ggplot(aes(x = date, y = dir_request,
group = trans_type,
color = trans_type)) +
# lines
geom_hline(yintercept = 100, size = 0.3, color = "black") +
geom_line(size = 0.2) +
# faceting
ggforce::facet_wrap_paginate(~ sub_region,
ncol = 5,
scales = "free_y") +
# theme settings
theme(panel.border = element_blank(),
panel.background = element_blank(),
axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6),
strip.text = element_text(colour = 'black'),
strip.background = element_rect(fill = "gray93"),
legend.position = "top") +
scale_color_jcolors(palette = "pal3") +
lab_facet_wrap_paginate
geofacet
We’re now going to look at all 50 states using the facet_geo()
function from the geofacet
package. To make this graph easier to interpret, we’re going to focus only on walking mobility data, and adjust the dir_request
value to absolute change from baseline (set to 100
on 2020-01-13
).
Create the USWalkingAdj
data by filtering the trans_type
to "walking"
and creating two new variables: above_below
(a logical indicator for values being above or below the baseline value of 100
), and dir_request_adj
(the adjusted direction request value).
Fill in the correct variables in the wrangling steps below:
USWalkingAdj <- TidyAppleUS %>%
filter(trans_type == _________) %>%
mutate(above_below = _________ < 100,
dir_request_adj = _________ - 100)
USWalkingAdj %>%
skimr::skim()
See below:
USWalkingAdj <- TidyAppleUS %>%
filter(trans_type == "walking") %>%
mutate(above_below = dir_request < 100,
dir_request_adj = dir_request - 100)
USWalkingAdj %>%
skimr::skim()
Name | Piped data |
Number of rows | 160085 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
character | 5 |
Date | 1 |
logical | 1 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
geo_type | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
region | 0 | 1 | 5 | 39 | 0 | 452 | 0 |
trans_type | 0 | 1 | 7 | 7 | 0 | 1 | 0 |
sub_region | 0 | 1 | 4 | 14 | 0 | 48 | 0 |
country | 0 | 1 | 13 | 13 | 0 | 1 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 2020-01-13 | 2020-11-24 | 2020-06-19 | 317 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
above_below | 220 | 1 | 0.32 | FAL: 108071, TRU: 51794 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
dir_request | 220 | 1 | 129.36 | 63.68 | 0.44 | 89.44 | 122.7 | 163.82 | 1379.02 | ▇▁▁▁▁ |
dir_request_adj | 220 | 1 | 29.36 | 63.68 | -99.56 | -10.56 | 22.7 | 63.82 | 1279.02 | ▇▁▁▁▁ |
Assign the following to the labels:
set "US Trends in Mobility Data for Walking (Adjusted)"
to title
set "https://covid19.apple.com/mobility"
to caption
lab_facet_geo <- labs(x = "Date",
y = "Direction Requests (Walking)",
title = ____________________________________________,
subtitle = paste0("Original data uses 100 as baseline for usage at ",
min(USWalkingAdj$date)),
caption = ____________________________________________,
fill = "Below Baseline")
See below:
lab_facet_geo <- labs(x = "Date",
y = "Direction Requests (Walking)",
title = "US Trends in Mobility Data for Walking (Adjusted)",
subtitle = paste0("Original data uses 100 as baseline for usage at ",
min(USWalkingAdj$date)),
caption = "https://covid19.apple.com/mobility",
fill = "Below Baseline")
set the colors in color_bl_or
as c("#8470FF", "#7FFFD4")
set yintercept
to 0
in geom_hline()
set the values
in scale_fill_manual()
to color_bl_or
map sub_region
to facet_geo
using ~
Inside theme()
panel.border
and panel.background
to element_blank()
axis.text.x
and axis.text.y
to element_text(size = 6)
strip.text.x
to element_text(size = 7)
strip.text
to element_text(colour = 'white')
strip.background
to element_rect(fill = "black")
legend.position
to "bottom"
# set colors
color_bl_or <- c(____________, ____________)
USWalkingAdj %>%
ggplot(aes(x = date, y = dir_request_adj,
group = sub_region, fill = above_below)) +
geom_col() +
geom_hline(yintercept = _, color = "gray7") +
scale_fill_manual(values = ____________) +
facet_geo(~ sub_region) +
theme_bw() +
theme(______________ = ______________(),
______________ = ______________(),
______________ = ______________(size = _),
______________ = ______________(size = _),
______________ = ______________(size = _),
______________ = ______________(colour = ______________),
______________ = ______________(fill = ______________),
______________ = ______________) +
lab_facet_geo
See below:
# set colors
color_bl_or <- c("#8470FF", "#7FFFD4")
USWalkingAdj %>%
ggplot(aes(x = date, y = dir_request_adj,
group = sub_region, fill = above_below)) +
geom_col() +
geom_hline(yintercept = 0,
color = "gray7") +
scale_fill_manual(values = color_bl_or) +
facet_geo(~ sub_region) +
theme_bw() +
theme(panel.border = element_blank(),
panel.background = element_blank(),
axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6),
strip.text.x = element_text(size = 7),
strip.text = element_text(colour = 'white'),
strip.background = element_rect(fill = "black"),
legend.position = "bottom") +
lab_facet_geo
Original Question: How has COVID changed our modes of transportation?
Which graphs do you feel are best at answering this question? Why?
What other information (tables, annotations, etc.) would you include with the graphs?