In order to use graphs and figures to effectively communicate with our audience, we need to consider a few things:
View the slides for this section here.
Read more about ggplot2 on the tidyverse website, and in the Data Visualisation chapter of R for Data Science.
The main packages we’re going to use are dplyr, tidyr, and ggplot2. These are all part of the tidyverse, so we’ll install and load that package below:
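# install once, if the tidyverse is not already installed
install.packages("tidyverse")
# load the tidyverse packages for this session
library(tidyverse)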
Assume we received the following questions:
How has COVID changed our modes of transportation?
Or
Are people using fewer or different forms of transportation since the COVID pandemic?
Questions we should be considering:
We did some digging and came up with the following dataset to try and answer the questions above:
Import the data below
AppleMobRaw <- readr::read_csv("https://bit.ly/3DEDa8T")
View the head() and tail()
head(AppleMobRaw)
tail(AppleMobRaw)
We can see the dates are structured across the columns, so we need to restructure these into a tidy format. Read more about this format here.
AppleMobRaw %>%
tidyr::pivot_longer(cols = -c(geo_type:country),
names_to = "date",
values_to = "dir_request")
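To see what pivot_longer() does on a scale we can inspect by eye, here is a minimal sketch using a made-up two-row table. The `wide` data frame, its `region` column, and its values are assumptions for illustration only, not part of the Apple data.

```r
library(tidyr)

# Hypothetical toy data: one row per region, one column per date (wide format)
wide <- tibble::tibble(
  region = c("A", "B"),
  `2020-01-13` = c(100, 100),
  `2020-01-14` = c(95.5, 102.3)
)

# Move every column except `region` into date / dir_request pairs
long <- wide %>%
  pivot_longer(cols = -region,
               names_to = "date",
               values_to = "dir_request")

# Two regions x two dates gives four rows, one per region-date combination
long
```

Each former column name ends up as a value in date, which is why the date column is still character at this point and needs to be converted later.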
Now that we have the data in a tidy format, we should remove the missing values from country and sub-region
AppleMobRaw %>%
tidyr::pivot_longer(cols = -c(geo_type:country),
names_to = "date", values_to = "dir_request") %>%
# remove missing country and missing sub-region data
dplyr::filter(!is.na(country) & !is.na(`sub-region`))
Use mutate() to create a properly formatted date variable, and rename() the transportation_type variable to trans_type. Apply janitor::clean_names() to the entire dataset and assign the final output to TidyApple.
AppleMobRaw %>%
tidyr::pivot_longer(cols = -c(geo_type:country),
names_to = "date", values_to = "dir_request") %>%
# remove missing country and missing sub-region data
dplyr::filter(!is.na(country) & !is.na(`sub-region`)) %>%
# format date
mutate(date = lubridate::ymd(date)) %>%
# change name of transportation types
rename(trans_type = transportation_type) %>%
# clean names
janitor::clean_names() -> TidyApple
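The effect of the filtering, date, and name-cleaning steps is easier to check on a tiny example. The sketch below uses a hypothetical `toy` data frame (the values are invented) and assumes the lubridate and janitor packages are installed.

```r
library(dplyr)

# Hypothetical toy data with a non-syntactic column name and a missing value
toy <- tibble::tibble(
  `sub-region` = c("Alabama", NA),
  country = c("United States", "United States"),
  date = c("2020-01-13", "2020-01-14")
)

toy_clean <- toy %>%
  # keep rows where both country and sub-region are present
  filter(!is.na(country) & !is.na(`sub-region`)) %>%
  # convert the character dates to Date values
  mutate(date = lubridate::ymd(date)) %>%
  # convert column names to snake_case (sub-region becomes sub_region)
  janitor::clean_names()

names(toy_clean)
class(toy_clean$date)
```

After clean_names(), the column no longer needs backticks, which is why later code can refer to sub_region directly.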
One of the most important jobs of analytic work is counting things. There are many ways to accomplish this in R, but we’ll stick with the dplyr package because it’s part of the tidyverse.
The dplyr function for counting responses of a categorical or factor variable is count(), and it works like this:
Data %>%
count(variable)
So, if we wanted to count the number of rows for each transportation type in the TidyApple data frame, it would look like this:
TidyApple %>%
dplyr::count(trans_type)
We can also sort the responses using the sort = TRUE argument.
TidyApple %>%
dplyr::count(trans_type, sort = TRUE)
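It can help to see that count() is shorthand for grouping and summarising. The sketch below uses a hypothetical six-row `toy` data frame (invented values) and compares count() with the longer group_by() version.

```r
library(dplyr)

# Hypothetical toy data: six requests across three transportation types
toy <- tibble::tibble(
  trans_type = c("driving", "walking", "driving",
                 "transit", "driving", "walking")
)

# count() with sort = TRUE: largest group first
counts <- toy %>%
  count(trans_type, sort = TRUE)

# The same result written out with group_by(), summarise(), and arrange()
manual <- toy %>%
  group_by(trans_type) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

counts
manual
```

Both tables list driving (3), walking (2), and transit (1), in that order.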
We can also combine dplyr::select_if() and purrr::map() to pass the count() function to all the character variables in TidyApple.
TidyApple %>%
select_if(is.character) %>%
map(~count(data.frame(x = .x), x, sort = TRUE)) -> tidy_apple_counts
We can examine the counts of each value by using the $ to subset the tidy_apple_counts list.
tidy_apple_counts$sub_region
tidy_apple_counts$region
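In recent dplyr releases select_if() is superseded, and select(where(...)) is the suggested replacement. The sketch below shows the same counting pattern written that way, on a hypothetical `toy` data frame (invented values), assuming dplyr 1.0.0 or later.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data: two character columns and one numeric column
toy <- tibble::tibble(
  region = c("Mobile", "Mobile", "Akron"),
  trans_type = c("driving", "walking", "driving"),
  dir_request = c(100, 98.2, 101.5)
)

toy_counts <- toy %>%
  # keep only the character columns
  select(where(is.character)) %>%
  # build one sorted count table per remaining column
  map(~ count(tibble::tibble(x = .x), x, sort = TRUE))

# The list has one element per character column
names(toy_counts)
toy_counts$region
```

The numeric dir_request column is dropped before counting, so the list only contains tables for region and trans_type.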
Before we start looking at relationships between variables, we should examine each variable’s underlying distribution. In the next section, we’re going to cover a few graphs that display variable distributions: histograms, density plots, violin plots, and ridgeline plots.
A histogram is a special kind of bar graph: it takes a single continuous variable (in this case, dir_request), groups its values into bins, and displays how many observations fall into each bin.
The x axis for the histogram will have the direction requests, and the y axis will display a count of the values.
lab_hist <- labs(x = "Apple directions requests",
y = "Count",
title = "Distribution of Direction Requests",
subtitle = "source: https://covid19.apple.com/mobility")
Create a histogram of direction requests using dir_request
TidyApple %>% ggplot() +
geom_histogram(aes(x = ____________)) +
lab_hist
TidyApple %>% ggplot() +
geom_histogram(aes(x = dir_request)) +
lab_hist
We can see the y axis of the histogram is in scientific notation. This might be hard for some audiences to interpret, so we will use the scales package to display whole numbers with commas instead.
Pass scales::comma to the labels argument of the scale_y_continuous() function.
library(scales)
TidyApple %>% ggplot() +
geom_histogram(aes(x = dir_request)) +
scale_y_continuous(labels = __________) +
lab_hist
library(scales)
TidyApple %>% ggplot() +
geom_histogram(aes(x = dir_request)) +
scale_y_continuous(labels = scales::comma) +
lab_hist
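The labels argument accepts a function that turns the numeric axis breaks into text. Calling scales::comma() directly on a few made-up numbers (the values below are only for illustration) shows what that function returns.

```r
library(scales)

# comma() converts numbers to character strings with thousands separators
formatted <- comma(c(1000, 250000, 1e6))
formatted
# "1,000" "250,000" "1,000,000"
```

Because the result is text, the formatting only changes how the axis is labelled, not the underlying counts.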
We can control the shape of the histogram with the bins argument. The default is 30.
Set bins to 15.
TidyApple %>% ggplot() +
geom_histogram(aes(x = dir_request), bins = __) +
scale_y_continuous(labels = scales::comma) +
lab_hist
TidyApple %>% ggplot() +
geom_histogram(aes(x = dir_request), bins = 15) +
scale_y_continuous(labels = scales::comma) +
lab_hist
Set bins to 45 and assign it to gg_hist45.
TidyApple %>% ggplot() +
geom_histogram(aes(x = dir_request), bins = __) +
scale_y_continuous(labels = scales::comma) +
lab_hist -> _____________
TidyApple %>% ggplot() +
geom_histogram(aes(x = dir_request), bins = 45) +
scale_y_continuous(labels = scales::comma) +
lab_hist -> gg_hist45
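Besides bins, geom_histogram() also accepts binwidth, which fixes the width of each bin instead of their number. The sketch below uses simulated data (the `toy` data frame is an assumption, not the Apple data) and layer_data() to look at the bins ggplot2 actually computed.

```r
library(ggplot2)

# Simulated stand-in for dir_request; values are made up
set.seed(123)
toy <- data.frame(dir_request = rnorm(500, mean = 100, sd = 20))

# Same data, two ways of controlling the binning
p_bins  <- ggplot(toy) + geom_histogram(aes(x = dir_request), bins = 15)
p_width <- ggplot(toy) + geom_histogram(aes(x = dir_request), binwidth = 10)

# layer_data() returns the computed bin edges and counts behind the bars
bin_table <- layer_data(p_bins)[, c("xmin", "xmax", "count")]
head(bin_table)

# Every observation lands in exactly one bin
sum(bin_table$count)
```

Choosing binwidth can be easier to explain to an audience when the variable has natural units, while bins is convenient for quick exploration.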
What if we want to see how a continuous variable is distributed across a categorical variable? We covered this in the previous lesson with a boxplot.
Density plots come in handy here (so do geom_boxplot()s!). Read more about the density geom here.
We are going to create the graph labels first so we know what to expect when we build our graph. We want to see the distribution of direction requests, filled by the levels of transportation type.
lab_density <- labs(x = "Apple directions requests",
fill = "Transit Type",
title = "Distribution of Direction Requests vs. Transportation Type",
subtitle = "source: https://covid19.apple.com/mobility")
Now we build the density plot, passing the variables so they match our labels above.
Create a density plot of direction requests filled by the type of transportation.
TidyApple %>%
ggplot() +
geom_density(aes(x = __________, fill = __________)) +
lab_density
One drawback to density plots is that the y axis can be hard to interpret.
TidyApple %>%
ggplot() +
geom_density(aes(x = dir_request, fill = trans_type)) +
lab_density
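One way to make sense of the density y axis is that the curve is scaled so the area underneath it is 1, which means the height is not a count or a percentage. The sketch below checks this with base R's density() on simulated values (the data are made up for illustration).

```r
# Simulated stand-in for dir_request; values are made up
set.seed(42)
x <- rnorm(1000, mean = 100, sd = 20)

# Kernel density estimate on an evenly spaced grid of x values
d <- density(x)

# Approximate the area under the curve: sum of heights times grid spacing
area <- sum(d$y) * (d$x[2] - d$x[1])
area  # close to 1

# The heights themselves are small because x spans a wide range
max(d$y)
```

The wider the range of the variable, the smaller the density values have to be for the area to stay at 1.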
Adjust the overlapping densities by setting alpha to 1/3. Assign this plot to gg_density.
TidyApple %>%
ggplot() +
geom_density(aes(x = dir_request, fill = trans_type),
alpha = __________) +
lab_density -> __________
TidyApple %>%
ggplot() +
geom_density(aes(x = dir_request, fill = trans_type),
alpha = 1/3) +
lab_density -> gg_density
gg_density
Another option is a ridgeline plot (from the ggridges package). These display multiple densities, one for each level of a categorical variable, stacked along the y axis.
lab_ridges <- labs(
title = "Direction Requests by Transportation Type",
subtitle = "source: https://covid19.apple.com/mobility",
fill = "Transit type",
x = "Apple directions requests",
y = "Transportation Types")
library(ggridges)
TidyApple %>%
ggplot() +
geom_density_ridges(aes(x = dir_request,
y = trans_type,
fill = trans_type),
alpha = 1/5) +
lab_ridges
Another alternative to the density plot is the violin plot.
Create the labels for the violin plot by assigning:
"Transit Type" to the x axis
"Apple directions requests" to the y axis
lab_violin <- labs(x = _________________________,
y = _________________________,
fill = "Transit Type",
title = "Distribution of Direction Requests vs. Transportation Type",
subtitle = "source: https://covid19.apple.com/mobility")
lab_violin <- labs(x = "Transit Type",
y = "Apple directions requests",
fill = "Transit Type",
title = "Distribution of Direction Requests vs. Transportation Type",
subtitle = "source: https://covid19.apple.com/mobility")
Add a geom_violin() to the code below:
TidyApple %>%
ggplot() +
____________(aes(y = dir_request, x = trans_type,
fill = trans_type)) +
lab_violin
TidyApple %>%
ggplot() +
geom_violin(aes(y = dir_request, x = trans_type,
fill = trans_type)) +
lab_violin
The great thing about ggplot2’s layered syntax is that we can add geoms with similar aesthetics to the same graph! For example, we can see how geom_violin()s and geom_boxplot()s are related by adding a geom_boxplot() layer to the graph above.
TidyApple %>%
ggplot() +
geom_violin(aes(y = dir_request, x = trans_type,
fill = trans_type), alpha = 1/5) +
___________(aes(y = dir_request, x = trans_type,
color = trans_type)) +
lab_violin
Note we set the alpha to 1/5 for the geom_violin(), and the color to trans_type for the geom_boxplot().
TidyApple %>%
ggplot() +
geom_violin(aes(y = dir_request, x = trans_type,
fill = trans_type), alpha = 1/5) +
geom_boxplot(aes(y = dir_request, x = trans_type,
color = trans_type)) +
lab_violin
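When two layers share the same x and y mappings, an alternative is to declare those once in ggplot() and let each geom add only what differs. The sketch below shows that pattern on simulated data; the `toy` data frame and the boxplot width of 0.2 are assumptions chosen for illustration.

```r
library(ggplot2)

# Simulated stand-in for TidyApple; values are made up
set.seed(1)
toy <- data.frame(
  trans_type = rep(c("driving", "transit", "walking"), each = 50),
  dir_request = c(rnorm(50, 110, 20), rnorm(50, 80, 25), rnorm(50, 100, 15))
)

# Shared x and y mappings live in ggplot(); each geom adds its own aesthetic
p <- ggplot(toy, aes(x = trans_type, y = dir_request)) +
  geom_violin(aes(fill = trans_type), alpha = 1/5) +
  # narrower boxes so they sit inside the violins
  geom_boxplot(aes(color = trans_type), width = 0.2)

p
```

This draws the same kind of graph as the layered version above with less repetition, at the cost of making the shared mappings apply to every layer.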
You’ll want to export the TidyApple dataset for the next set of exercises. The code chunk below exports the dataset as a .csv.
fs::dir_create("../data/wk5-01-intro-to-ggp2-part-02/")
readr::write_csv(x = TidyApple,
file = "../data/wk5-01-intro-to-ggp2-part-02/TidyApple.csv")
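A .csv file does not store column types, so it can be worth spelling them out when reading the data back in. The sketch below does a round trip through a temporary file with a hypothetical `toy` data frame; the column names mirror TidyApple, but the values and the temporary path are made up.

```r
library(readr)

# Hypothetical toy data with a Date column
toy <- tibble::tibble(
  trans_type = c("driving", "walking"),
  date = as.Date(c("2020-01-13", "2020-01-14")),
  dir_request = c(100, 98.2)
)

# Write to a temporary file rather than the course data folder
tmp <- tempfile(fileext = ".csv")
write_csv(x = toy, file = tmp)

# Read it back, stating the expected type of each column
toy_back <- read_csv(tmp, col_types = cols(
  trans_type = col_character(),
  date = col_date(),
  dir_request = col_double()
))

class(toy_back$date)
```

Stating col_types also silences the column specification message that read_csv() prints when it has to guess.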