In order to use graphs and figures to effectively communicate with our audience, we need to consider a few things:
View the slides for this section here.
Read more about ggplot2 on the tidyverse website, and in the Data Visualisation chapter of R for Data Science.
The main packages we’re going to use are dplyr, tidyr, and ggplot2. These are all part of the tidyverse, so we’ll install and load that package below (the install step is only needed once):
install.packages("tidyverse")
library(tidyverse)
Assume we received the following questions:
How has COVID changed our modes of transportation?
Or
Are people using fewer or different forms of transportation since the COVID pandemic?
Questions we should be considering:
We did some digging and came up with the following dataset to try and answer the questions above:
Import the data below
AppleMobRaw <- readr::read_csv("https://bit.ly/3DEDa8T")
View the head() and tail()
head(AppleMobRaw)
tail(AppleMobRaw)
We can see the dates are structured across the columns, so we need to restructure these into a tidy format. Read more about this format here.
AppleMobRaw %>% 
  tidyr::pivot_longer(cols = -c(geo_type:country), 
                      names_to = "date", 
                      values_to = "dir_request")
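To make the reshaping step easier to follow, here is a minimal sketch on a made-up wide table. The `wide` data frame and its values are hypothetical (they are not taken from the mobility file); only `pivot_longer()` and its arguments match the code above.

```r
library(tidyr)

# Hypothetical wide table: one column per date, like the raw mobility file
wide <- data.frame(
  region = c("A", "B"),
  `2020-01-13` = c(100, 100),
  `2020-01-14` = c(95.3, 102.1),
  check.names = FALSE  # keep the date-like column names as they are
)

# Every column except `region` becomes one (date, dir_request) row
long <- pivot_longer(wide,
                     cols = -region,
                     names_to = "date",
                     values_to = "dir_request")
long
```

Two rows and two date columns turn into four rows, one per region and date, which is the shape ggplot2 expects.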
Now that we have the data in a tidy format, we should remove the rows with missing values in country and sub-region.
AppleMobRaw %>% 
  tidyr::pivot_longer(cols = -c(geo_type:country), 
                      names_to = "date", values_to = "dir_request") %>% 
    # remove missing country and missing sub-region data
  dplyr::filter(!is.na(country) & !is.na(`sub-region`))
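As a small illustration of what this filter keeps, the sketch below uses a hypothetical three-row data frame (`toy` and its values are invented for the example). It also shows `tidyr::drop_na()`, which is one alternative way to express the same idea.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: only the first row has both values present
toy <- data.frame(
  country = c("US", NA, "US"),
  `sub-region` = c("Ohio", "Ohio", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Keep rows where both columns are non-missing
kept_filter <- toy %>%
  filter(!is.na(country) & !is.na(`sub-region`))

# drop_na() removes rows with a missing value in any of the listed columns
kept_drop <- toy %>%
  drop_na(country, `sub-region`)

kept_filter
```

Both approaches leave the single complete row. The backticks around `sub-region` are needed because the name contains a hyphen.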
Use mutate() to create a properly formatted date variable, and rename() the transportation_type variable to trans_type. Apply janitor::clean_names() to the entire dataset and assign the final output to TidyApple.
AppleMobRaw %>% 
  tidyr::pivot_longer(cols = -c(geo_type:country), 
                      names_to = "date", values_to = "dir_request") %>% 
    # remove missing country and missing sub-region data
  dplyr::filter(!is.na(country) & !is.na(`sub-region`)) %>% 
  # format date
  mutate(date = lubridate::ymd(date)) %>% 
  # change name of transportation types
  rename(trans_type = transportation_type) %>% 
  # clean names 
  janitor::clean_names() -> TidyApple
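The two cleaning steps are easier to see on a tiny example. This sketch assumes a hypothetical `toy` data frame with a hyphenated column name and dates stored as text; it is not the mobility data.

```r
library(dplyr)

# Hypothetical data: dates arrive as character strings after pivoting
toy <- data.frame(
  `sub-region` = c("X", "Y"),
  date = c("2020-01-13", "2020-01-14"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  # parse year-month-day strings into Date values
  mutate(date = lubridate::ymd(date)) %>%
  # convert column names to snake_case
  janitor::clean_names()

names(toy_clean)
class(toy_clean$date)
```

After `clean_names()`, the hyphenated name becomes `sub_region`, so it can be used without backticks, and `date` is a Date rather than a character column.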
One of the most important jobs of analytic work is counting things. There are many ways to accomplish this in R, but we’ll stick with the dplyr package because it’s part of the tidyverse.
The dplyr function for counting responses of a categorical or factor variable is count(), and it works like this:
Data %>% 
  count(variable)
So, if we wanted to count the number of rows for each transportation type in the TidyApple data frame, it would look like this:
TidyApple %>% 
  dplyr::count(trans_type)
We can also sort the responses using the sort = TRUE argument.
TidyApple %>% 
  dplyr::count(trans_type, sort = TRUE)
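It can help to see what `count()` is doing under the hood. The sketch below, on a hypothetical `toy` data frame, compares `count()` with the longer `group_by()` plus `summarise()` version that produces the same numbers.

```r
library(dplyr)

# Hypothetical data: 3 driving, 2 walking, 1 transit
toy <- data.frame(
  trans_type = c("driving", "walking", "driving",
                 "transit", "driving", "walking"),
  stringsAsFactors = FALSE
)

# Short form
by_count <- toy %>%
  count(trans_type, sort = TRUE)

# Long form: group, count the rows per group, then sort descending
by_summarise <- toy %>%
  group_by(trans_type) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

by_count
```

Both versions list driving first, then walking, then transit, so `count()` is a convenient shortcut for this common pattern.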
We can also combine dplyr::select_if() and purrr::map() to pass the count() function to all the character variables in TidyApple.
TidyApple %>% 
  select_if(is.character) %>% 
  map(~count(data.frame(x = .x), x, sort = TRUE)) -> tidy_apple_counts
We can examine the counts of each value by using the $ operator to subset the tidy_apple_counts list.
tidy_apple_counts$sub_region
tidy_apple_counts$region
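The same pattern can be tried on a small example. This sketch uses a hypothetical `toy` data frame and writes the column selection as `select(where(is.character))`, an alternative spelling that assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(purrr)

# Hypothetical data: two character columns and one numeric column
toy <- data.frame(
  region = c("A", "A", "B"),
  trans_type = c("driving", "walking", "driving"),
  dir_request = c(100, 95, 110),
  stringsAsFactors = FALSE
)

toy_counts <- toy %>%
  # keep only the character columns
  select(where(is.character)) %>%
  # build one sorted count table per column
  map(~ count(data.frame(x = .x), x, sort = TRUE))

names(toy_counts)
toy_counts$region
```

The result is a named list with one count table per character column; the numeric column is dropped before counting.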
Before we start looking at relationships between variables, we should examine each variable’s underlying distribution. In the next section, we’re going to cover a few graphs that display variable distributions: histograms, density, violin, and ridgeline plots.
A histogram is a special kind of bar graph: it takes a single continuous variable (in this case, dir_request), splits its range into bins, and shows how many values fall into each bin.
The x axis of the histogram will show the direction requests, and the y axis will display a count of the values in each bin.
lab_hist <- labs(x = "Apple directions requests",
                 y = "Count",
     title = "Distribution of Direction Requests",
     subtitle = "source: https://covid19.apple.com/mobility")
Create a histogram of direction requests using dir_request
TidyApple %>% ggplot() + 
  geom_histogram(aes(x = ____________)) + 
  lab_hist
TidyApple %>% ggplot() + 
  geom_histogram(aes(x = dir_request)) + 
  lab_hist
We can see the y axis of the histogram is in scientific notation. This might be hard for some audiences to interpret, so we will change this to use the whole number with commas with the scales package.
Pass scales::comma to the labels argument of the scale_y_continuous() function.
library(scales)
TidyApple %>% ggplot() + 
  geom_histogram(aes(x = dir_request)) + 
  scale_y_continuous(labels = __________) +
  lab_hist
library(scales)
TidyApple %>% ggplot() + 
  geom_histogram(aes(x = dir_request)) + 
  scale_y_continuous(labels = scales::comma) +
  lab_hist
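To see what the labelling function does on its own, we can call it directly on a few numbers. The values in `big_counts` are made up for illustration.

```r
library(scales)

# Hypothetical axis break values
big_counts <- c(1500, 250000, 1e6)

# comma() turns numbers into character labels with thousands separators
comma_labels <- comma(big_counts)
comma_labels
```

`scale_y_continuous(labels = scales::comma)` applies this same function to the axis breaks, which is why the scientific notation disappears from the plot.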
We can control the shape of the histogram with the bins argument. The default is 30.
Set bins to 15.
TidyApple %>% ggplot() + 
  geom_histogram(aes(x = dir_request), bins = __) + 
  scale_y_continuous(labels = scales::comma) +
  lab_hist
TidyApple %>% ggplot() + 
  geom_histogram(aes(x = dir_request), bins = 15) + 
  scale_y_continuous(labels = scales::comma) +
  lab_hist
Set bins to 45 and assign it to gg_hist45.
TidyApple %>% ggplot() + 
  geom_histogram(aes(x = dir_request), bins = __) + 
  scale_y_continuous(labels = scales::comma) +
  lab_hist -> _____________
TidyApple %>% ggplot() + 
  geom_histogram(aes(x = dir_request), bins = 45) + 
  scale_y_continuous(labels = scales::comma) +
  lab_hist -> gg_hist45
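Besides the number of bins, `geom_histogram()` also accepts a `binwidth` argument, which fixes how wide each bin is in the units of the x variable. The sketch below uses simulated values (the `toy` data frame is hypothetical) and `layer_data()` to inspect what ggplot2 computed for each version.

```r
library(ggplot2)

# Hypothetical data: 500 simulated direction-request values
set.seed(1)
toy <- data.frame(dir_request = rnorm(500, mean = 100, sd = 20))

# Fix the number of bins...
p_bins <- ggplot(toy) +
  geom_histogram(aes(x = dir_request), bins = 15)

# ...or fix the width of each bin instead
p_width <- ggplot(toy) +
  geom_histogram(aes(x = dir_request), binwidth = 10)

# layer_data() returns the binned values behind each plot
bins_15 <- layer_data(p_bins)
width_10 <- layer_data(p_width)

sum(bins_15$count)
sum(width_10$count)
```

Either way, the bin counts add up to the number of rows in the data; only the way the range is divided changes.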
What if we want to see how a continuous variable is distributed across a categorical variable? We covered this in the previous lesson with a boxplot.
Density plots come in handy here (so do geom_boxplot()s!). Read more about the density geom here.
We’ll create the graph labels first so we know what to expect when we build the graph. We want to see the distribution of direction requests, filled by the levels of transportation type.
lab_density <- labs(x = "Apple directions requests",
                    fill = "Transit Type",
     title = "Distribution of Direction Requests vs. Transportation Type",
     subtitle = "source: https://covid19.apple.com/mobility")
Now we build the density plot, passing the variables so they match our labels above.
Create a density plot of direction requests, filled by the type of transportation.
TidyApple %>% 
  ggplot() +
  geom_density(aes(x = __________, fill = __________)) + 
  lab_density
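One drawback of density plots is that the y axis can be hard to interpret: it shows an estimated probability density rather than a count, so the area under each curve is 1 regardless of how many observations a group has.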
TidyApple %>% 
  ggplot() +
  geom_density(aes(x = dir_request, fill = trans_type)) + 
  lab_density
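One way to get a feel for the density scale is to compute a kernel density estimate directly with base R’s `density()` function and check the area under the curve. This is a rough sketch on simulated values (the vector `x` is hypothetical), using a simple trapezoid rule.

```r
# Hypothetical data: 1000 simulated direction-request values
set.seed(1)
x <- rnorm(1000, mean = 100, sd = 20)

# Kernel density estimate evaluated on a grid of x values
d <- density(x)

# Approximate the area under the curve with the trapezoid rule
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area
```

The area comes out very close to 1, so the height of the curve only makes sense relative to the width of the x axis, not as a number of observations.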
Adjust the overlapping densities by setting alpha to 1/3. Assign this plot to gg_density.
TidyApple %>% 
  ggplot() +
  geom_density(aes(x = dir_request, fill = trans_type), 
               alpha = __________) + 
  
  lab_density -> __________
TidyApple %>% 
  ggplot() +
  geom_density(aes(x = dir_request, fill = trans_type), 
               alpha = 1/3) + 
  lab_density -> gg_density
gg_density
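Another option is a ridgeline plot (from the ggridges package). It draws one density curve per level of a categorical variable and stacks the curves along the y axis, which makes the groups easier to compare.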
lab_ridges <- labs(
     title = "Direction Requests by Transportation Type",
     subtitle = "source: https://covid19.apple.com/mobility",
     fill = "Transit type",
     x = "Apple directions requests",
     y = "Transportation Types")
library(ggridges)
TidyApple %>%  
  ggplot() + 
  geom_density_ridges(aes(x = dir_request, 
                          y = trans_type, 
                          fill = trans_type), 
                      alpha = 1/5) + 
  lab_ridges
Another alternative to the density plot is the violin plot.
Create the labels for the violin plot: assign "Transit Type" to the x axis and "Apple directions requests" to the y axis.
lab_violin <- labs(x = _________________________,
                    y = _________________________,
                   fill = "Transit Type",
     title = "Distribution of Direction Requests vs. Transportation Type",
     subtitle = "source: https://covid19.apple.com/mobility")
lab_violin <- labs(x = "Transit Type",
                   y = "Apple directions requests",
                   fill = "Transit Type",
     title = "Distribution of Direction Requests vs. Transportation Type",
     subtitle = "source: https://covid19.apple.com/mobility")
Add a geom_violin() to the code below:
TidyApple %>% 
  ggplot() +
  ____________(aes(y = dir_request, x = trans_type, 
                  fill = trans_type)) + 
  lab_violin
TidyApple %>% 
  ggplot() +
  geom_violin(aes(y = dir_request, x = trans_type, 
                  fill = trans_type)) + 
  lab_violin
The great thing about ggplot2’s layered syntax is that we can add geoms with similar aesthetics to the same graph! For example, we can see how violin plots and boxplots are related by adding a geom_boxplot() layer to the graph above.
TidyApple %>% 
  ggplot() +
  geom_violin(aes(y = dir_request, x = trans_type, 
                  fill = trans_type), alpha = 1/5) + 
  ___________(aes(y = dir_request, x = trans_type, 
                   color = trans_type)) + 
  lab_violin
Note we set the alpha to 1/5 for the geom_violin(), and the color to trans_type for the geom_boxplot().
TidyApple %>% 
  ggplot() +
  geom_violin(aes(y = dir_request, x = trans_type, 
                  fill = trans_type), alpha = 1/5) + 
  geom_boxplot(aes(y = dir_request, x = trans_type, 
                   color = trans_type)) + 
  lab_violin
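When two layers share the same x and y mappings, one option is to declare those once in `ggplot()` and keep only the layer-specific aesthetics inside each geom. The sketch below uses simulated data (the `toy` data frame is hypothetical) and narrows the boxplot with `width` so it sits inside the violin; both choices are illustrative, not part of the exercise.

```r
library(ggplot2)

# Hypothetical data: 50 simulated values per transportation type
set.seed(1)
toy <- data.frame(
  trans_type = rep(c("driving", "transit", "walking"), each = 50),
  dir_request = c(rnorm(50, 110, 20), rnorm(50, 80, 25), rnorm(50, 100, 15))
)

# x and y are inherited by both layers; fill and color stay layer-specific
p <- ggplot(toy, aes(x = trans_type, y = dir_request)) +
  geom_violin(aes(fill = trans_type), alpha = 1/5) +
  geom_boxplot(aes(color = trans_type), width = 0.2)

# The boxplot layer computes one summary row per transportation type
box_stats <- layer_data(p, 2)
p
```

This draws the same kind of graph with less repetition, and a change to the x or y variable only needs to be made in one place.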
You’ll want to export the TidyApple dataset for the next set of exercises.
The code chunk below exports the dataset as a .csv.
fs::dir_create("../data/wk5-01-intro-to-ggp2-part-02/")
readr::write_csv(x = TidyApple, 
                 file = "../data/wk5-01-intro-to-ggp2-part-02/TidyApple.csv")
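A .csv file does not store column types, so it can be worth checking that the data read back in as expected. This sketch writes a hypothetical `toy` data frame to a temporary file (not the path above) and declares the column types explicitly when importing it again.

```r
library(readr)

# Hypothetical data with the same kinds of columns as TidyApple
toy <- data.frame(
  trans_type = c("driving", "walking"),
  date = as.Date(c("2020-01-13", "2020-01-14")),
  dir_request = c(100, 95.3),
  stringsAsFactors = FALSE
)

# Write to a temporary file so nothing in the project folder is touched
path <- tempfile(fileext = ".csv")
write_csv(x = toy, file = path)

# Declare the column types instead of relying on readr's guesses
toy_back <- read_csv(path, col_types = cols(
  trans_type = col_character(),
  date = col_date(),
  dir_request = col_double()
))

class(toy_back$date)
```

Stating the types up front means the date column comes back as a Date, ready for plotting over time.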