This document outlines and introduction to data visualization with ggplot2
.
The slides for this presentation are here
There is also an accompanying RStudio.Cloud project
All of the exercises and lessons are available here, but you can also read more about dplyr
and tidyr
on the tidyverse website, and in the Data Transformation and Tidy Data chapters of R for Data Science.
The main packages we’re going to use are dplyr
, tidyr
, and ggplot2
. These are all part of the tidyverse
, so we’ll import this package below:
install.packages("tidyverse")
library(tidyverse)
We will begin by importing the data from the wrangling section. These data come from a wikipedia table on largest biotechnology and pharmaceutical companies.
TopPharmComp <- readr::read_csv(file = "https://bit.ly/3gC67HT")
TopPharmComp %>% glimpse(78)
## Rows: 231
## Columns: 10
## $ ranking <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,…
## $ company_name <chr> "Johnson & Johnson", "Johnson & Johnson",…
## $ company_type <chr> "Pharma", "Pharma", "Pharma", "Pharma", "…
## $ stock_exch <chr> "NYSE", "NYSE", "NYSE", "NYSE", "NYSE", "…
## $ stock_id <chr> " JNJ", " JNJ", " JNJ", " JNJ", " JNJ", "…
## $ cap_date_original <chr> "Jan 2018", "Jan 2018", "Jan 2018", "Jan …
## $ largest_market_cap_date <date> 2018-01-01, 2018-01-01, 2018-01-01, 2018…
## $ year <dbl> 2019, 2018, 2017, 2016, 2015, 2014, 2013,…
## $ largest_market_cap_us_bil <dbl> 397.4, 397.4, 397.4, 397.4, 397.4, 397.4,…
## $ market_cap_us_bil <dbl> 385.0, 346.1, 375.4, 314.1, 284.2, 277.8,…
We can see a few of the variables need to be formatted before we can start visualizing.
TopPharmComp %>%
mutate(
# create factor
ranking = factor(ranking, ordered = TRUE),
# remove whitespace
stock_id = str_trim(string = stock_id, side = "both")) -> TopPharmComp
# get skim
TopPharmComp %>% skimr::skim()
Name | Piped data |
Number of rows | 231 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 5 |
Date | 1 |
factor | 1 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
company_name | 0 | 1 | 5 | 25 | 0 | 33 | 0 |
company_type | 0 | 1 | 6 | 7 | 0 | 2 | 0 |
stock_exch | 0 | 1 | 3 | 6 | 0 | 6 | 0 |
stock_id | 0 | 1 | 3 | 5 | 0 | 33 | 0 |
cap_date_original | 0 | 1 | 8 | 8 | 0 | 17 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
largest_market_cap_date | 0 | 1 | 2000-07-01 | 2020-08-01 | 2017-03-01 | 17 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
ranking | 0 | 1 | TRUE | 33 | 1: 7, 2: 7, 3: 7, 4: 7 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 2016.00 | 2.00 | 2013.0 | 2014.0 | 2016.0 | 2018.0 | 2019.0 | ▇▃▃▃▇ |
largest_market_cap_us_bil | 0 | 1.00 | 125.23 | 91.21 | 24.0 | 47.0 | 111.7 | 159.6 | 397.4 | ▇▅▂▂▁ |
market_cap_us_bil | 2 | 0.99 | 85.68 | 75.07 | 2.6 | 28.7 | 63.2 | 117.2 | 385.0 | ▇▅▁▁▁ |
We will start by looking at the trends of market_cap_us_bil
over time (using year
).
I suggest building labels first when making a figure or graph, because it forces us to think about what we should expect to see. For example, if we want to see market_cap_us_bil
on the y
and market_cap_year
on the x
, we can create these with a title using the ggplot2::labs()
function below.
lab_year_x_mrktcap <- ggplot2::labs(title = "Market Cap Trends Over Time",
subtitle = "Largest Pharmaceutical Companies",
x = "Year", y = "Market cap (USD in billions)")
We map variables to positions (x
and y
) inside the aes()
function, either globally inside ggplot()
or locally inside a geom()
.
The exercises below are a refresher on mapping variables to aesthetics and picking geoms.
Map the year
to the x
and market_cap_us_bil
to the y
globally and add a geom_point()
TopPharmComp %>%
ggplot(aes(x = __________, y = __________)) +
geom______() +
lab_year_x_mrktcap
Check the solution below–what is wrong with this plot?
TopPharmComp %>%
ggplot(aes(x = year, y = market_cap_us_bil)) +
geom_point() +
lab_year_x_mrktcap
Change the geom to a geom_line()
. Does it help?
TopPharmComp %>%
ggplot(aes(x = __________, y = __________)) +
geom_____() +
lab_year_x_mrktcap
Check the updated graph below. It’s closer to what we want, but not quite. What do you think the issue is?
TopPharmComp %>%
ggplot(aes(x = year, y = market_cap_us_bil)) +
geom_line() +
lab_year_x_mrktcap
This time add another aes()
inside the geom_line()
function, and apply group
to company_name
.
TopPharmComp %>%
ggplot(aes(x = year, y = market_cap_us_bil)) +
geom_line(aes(_____ = __________)) +
lab_year_x_mrktcap
Now the graph is looking how we want! We can see the trend lines are separated according to each company name, but we can’t tell which line is which.
Let’s add some color to see more.
TopPharmComp %>%
ggplot(aes(x = year, y = market_cap_us_bil)) +
geom_line(aes(group = company_name)) +
lab_year_x_mrktcap
Aesthetics include visual elements like size
, color
, and fill
.
In the previous graph, we were able to get a different line per company. However, we want to identify more about these companies with color.
We know there are 33 levels in company_name
, and this is too many to map with color (check this below with distinct()
)
TopPharmComp %>% distinct(company_name)
Use the stock_exch
to color the lines, assign this graph to gg_trend_line
gg_trend_line <- TopPharmComp %>%
ggplot(aes(x = year, y = market_cap_us_bil)) +
geom_line(aes(group = company_name, color = __________)) +
lab_year_x_mrktcap
gg_trend_line
This graph is showing us more information that the previous plot. For example, we can see the companies with the best performance are on the NYSE
, and those with relatively low performance are on the TSX
.
This graph is better then the previous versions, but it still has too much information to be very helpful. For example, we have no idea which line belongs to which company?
gg_trend_line <- TopPharmComp %>%
ggplot(aes(x = year, y = market_cap_us_bil)) +
geom_line(aes(group = company_name, color = stock_exch)) +
lab_year_x_mrktcap
gg_trend_line
In the next section, we’ll summarize the data a bit more to create different visualizations.
We’re going to summarize the data in TopPharmComp
. Summarizations are helpful when we’re interested in comparing variables across groups.
Whenever we’re grouping by a particular variable (or variables), we need to decide what summary calculation to use. In the section below we will use the mean()
function to get the average market cap (in USD in billions).
We want the TopPharmComp
dataset to have one variable per company_name
name and stock_exch
.
A good measure of central tendency for this is the mean()
(or average). First we define the new labels.
lab_avgmrktcap_compname <- ggplot2::labs(title = "Average Market Cap (2013-2019)",
subtitle = "Largest Pharmaceutical Companies",
x = "Average Market cap (USD in billions)",
y = "Company")
The code below groups the TopPharmComp
data by company_name
and stock_exch
and summarizes the mean market_cap_us_bil
.
TopPharmComp %>%
group_by(company_name, stock_exch) %>%
summarize(
avg_market_cap = mean(market_cap_us_bil, na.rm = TRUE)) %>%
ungroup()
This dataset will allow us to show more information per company.
Complete the code below with the following:
avg_market_cap
to the x
axiscompany_name
to the y
axisstock_exch
to fill
TopPharmComp %>%
group_by(company_name, stock_exch) %>%
summarize(
avg_market_cap = mean(market_cap_us_bil, na.rm = TRUE)) %>%
ungroup() %>%
ggplot() +
geom_col(aes(x = _________,
y = __________,
fill = _________)) +
lab_avgmrktcap_compname
See below:
TopPharmComp %>%
group_by(company_name, stock_exch) %>%
summarize(
avg_market_cap = mean(market_cap_us_bil, na.rm = TRUE)) %>%
ungroup() %>%
ggplot() +
geom_col(aes(x = avg_market_cap,
y = company_name,
fill = stock_exch)) +
lab_avgmrktcap_compname
This graph is better because we can see the average performance of the market cap per company. The lines are very unorganized, though. We can use a combination of factor()
and forcats::fct_reorder()
to fix this.
Use fct_reorder
and factor
to reorder avg_market_cap
according to company_name
TopPharmComp %>%
group_by(company_name, stock_exch) %>%
summarize(
avg_market_cap = mean(market_cap_us_bil, na.rm = TRUE)) %>%
ungroup() %>%
ggplot() +
geom_col(aes(x = avg_market_cap,
y = fct_reorder(factor(_________), __________),
fill = stock_exch)) +
lab_avgmrktcap_compname
See below:
TopPharmComp %>%
group_by(company_name, stock_exch) %>%
summarize(
avg_market_cap = mean(market_cap_us_bil, na.rm = TRUE)) %>%
ungroup() %>%
ggplot() +
geom_col(aes(x = avg_market_cap,
y = fct_reorder(factor(company_name), avg_market_cap),
fill = stock_exch)) +
lab_avgmrktcap_compname
Change the labels below so that the legend is named, "Stock Exchange"
. Re-build the plot with the new labels and assign this to gg_col_graph
.
lab_avgmrktcap_comp_fill <- ggplot2::labs(x = "Company",
y = "Average Market cap (USD in billions)",
title = "Average Market Cap (2013-2019)",
subtitle = "Largest Pharmaceutical Companies",
fill = ___________)
TopPharmComp %>%
group_by(company_name, stock_exch) %>%
summarize(
avg_market_cap = mean(market_cap_us_bil, na.rm = TRUE)) %>%
ungroup() %>%
ggplot() +
geom_col(aes(x = avg_market_cap,
y = fct_reorder(factor(company_name), avg_market_cap),
fill = stock_exch)) +
lab_avgmrktcap_comp_fill -> gg_col_graph
gg_col_graph
See below:
lab_avgmrktcap_comp_fill <- ggplot2::labs(x = "Company",
y = "Average Market cap (USD in billions)",
title = "Average Market Cap (2013-2019)",
subtitle = "Largest Pharmaceutical Companies",
fill = "Stock Exchange")
TopPharmComp %>%
group_by(company_name, stock_exch) %>%
summarize(
avg_market_cap = mean(market_cap_us_bil, na.rm = TRUE)) %>%
ungroup() %>%
ggplot() +
geom_col(aes(x = avg_market_cap,
y = fct_reorder(factor(company_name), avg_market_cap),
fill = stock_exch)) +
lab_avgmrktcap_comp_fill -> gg_col_graph
gg_col_graph
We’re going to focus on the Largest Market Cap (largest_market_cap_us_bil
) and Largest Market Cap Date (largest_market_cap_date
) variables.
First we are going to build a scatter plot (geom_point()
) of Largest Market Cap over time.
We will define a new set of labels below.
lab_lrg_mrkt_cap_trend <- labs(title = "Largest Market Cap (USD in billions)",
subtitle = "Largest Pharmaceutical Companies",
x = "Largest Market Cap Date",
y = "Largest Market Cap (USD in billions)",
color = "Company Type")
select()
the stock_exch
, stock_id
, company_type
, largest_market_cap_date
, and largest_market_cap_us_bil
variables and get the distinct()
values. Assign this to TopPharmComCaps
.
TopPharmComCaps <- TopPharmComp %>%
select(_________, _________, _________,
_________, _________) %>%
dplyr::distinct()
See below:
TopPharmComCaps <- TopPharmComp %>%
select(stock_exch, stock_id, company_type,
largest_market_cap_date, largest_market_cap_us_bil) %>%
dplyr::distinct()
TopPharmComCaps
Use TopPharmComCaps
to create a scatter plot with largest_market_cap_date
to the x
axis, and largest_market_cap_us_bil
to the y
. Color the points by company_type
.
TopPharmComCaps %>%
ggplot() +
geom_point(aes(x = _____________,
y = _____________,
color = _____________)) +
lab_lrg_mrkt_cap_trend
We can see this plot is showing us the date they reached their Largest Market Cap, and the date they reached it. But we still can’t see which company is which.
TopPharmComCaps %>%
ggplot() +
geom_point(aes(x = largest_market_cap_date,
y = largest_market_cap_us_bil,
color = company_type)) +
lab_lrg_mrkt_cap_trend
We’re going to add text labels to each data point. This is as simple as adding another layer to our previous plot! We just need to identify the proper geom, then map the variables to their aesthetics.
Add another layer to the plot above using geom_text()
largest_market_cap_date
to the x
largest_market_cap_us_bil
to the y
stock_id
to label
size
to 3
(outside aes()
)TopPharmComCaps %>%
ggplot() +
geom_point(aes(x = largest_market_cap_date,
y = largest_market_cap_us_bil,
color = company_type)) +
geom_text(aes(x = _____________,
y = _____________,
label = _____________), size = _) +
lab_lrg_mrkt_cap_trend
We can see this plot has labeled each company on their data-point, so we can clearly see the date they reached their Largest Market Cap in comparison to other companies. The labels are hard to see, though. We can try to adjust this with size
, but fortunately there is a package specifically designed to help with text labels.
TopPharmComCaps %>%
ggplot() +
geom_point(aes(x = largest_market_cap_date,
y = largest_market_cap_us_bil,
color = company_type)) +
geom_text(aes(x = largest_market_cap_date,
y = largest_market_cap_us_bil,
label = stock_id), size = 3) +
lab_lrg_mrkt_cap_trend
Use the ggrepel
package to adjust the labels.
stock_id
to geom_text_repel()
library(ggrepel)
TopPharmComCaps %>%
ggplot(aes(x = largest_market_cap_date,
y = largest_market_cap_us_bil)) +
geom_point(aes(color = company_type)) +
geom_text_repel(aes(label = __________), size = 3) +
lab_lrg_mrkt_cap_trend
Now the labels for each company are easier to see! However, this is still a very busy graph. We probably don’t want label all the points on the scatter plot. We will cover more advanced uses of text annotations in the next section.
library(ggrepel)
TopPharmComCaps %>%
ggplot(aes(x = largest_market_cap_date,
y = largest_market_cap_us_bil)) +
geom_point(aes(color = company_type)) +
geom_text_repel(aes(label = stock_id), size = 3) +
lab_lrg_mrkt_cap_trend
In the next few exercises we’re going to see how to use facets to explore values across different levels and scales. Facets show subplots for different levels of a grouping variable.
We’re going to build a few graphs using facet_wrap()
. Each exercise uses a different setting for the scales
argument.
Use facet_wrap()
to split the graphs by company_type
scales
argument to "free"
and nrow
to 2
.TopPharmComCaps %>%
ggplot(aes(x = largest_market_cap_date,
y = largest_market_cap_us_bil)) +
geom_point(aes(color = company_type), show.legend = FALSE) +
geom_text_repel(aes(label = stock_id), size = 3) +
facet_wrap(. ~ company_type, scales = ______, nrow = _) +
lab_lrg_mrkt_cap_trend
This creates two graphs–one for each level of company_type
. Why would setting the scales
to "free"
be an issue?
It makes it difficult to compare points (i.e. companies) when they don’t share common axes.
TopPharmComCaps %>%
ggplot(aes(x = largest_market_cap_date,
y = largest_market_cap_us_bil)) +
geom_point(aes(color = company_type), show.legend = FALSE) +
geom_text_repel(aes(label = stock_id), size = 3) +
facet_wrap(. ~ company_type, scales = "free", nrow = 2) +
lab_lrg_mrkt_cap_trend
Change the scales
argument to "free_x"
and rebuild the graph above.
TopPharmComCaps %>%
filter(stock_exch %in% c("NASDAQ", "NYSE"),
company_type == "Pharma") %>%
ggplot(aes(x = largest_market_cap_date,
y = largest_market_cap_us_bil)) +
geom_point(aes(color = stock_exch), show.legend = FALSE) +
geom_text_repel(aes(label = stock_id), size = 3) +
facet_wrap(. ~ stock_exch, scales = ______, nrow = 2) +
lab_lrg_mrkt_cap_trend
This fixes the y
axis, but allows the x
to adjust according to the data. Now the data are comparable in terms of the y
axis, but over different time-frames (along the x
axis).
TopPharmComCaps %>%
filter(stock_exch %in% c("NASDAQ", "NYSE"),
company_type == "Pharma") %>%
ggplot(aes(x = largest_market_cap_date,
y = largest_market_cap_us_bil)) +
geom_point(aes(color = stock_exch), show.legend = FALSE) +
geom_text_repel(aes(label = stock_id), size = 3) +
facet_wrap(. ~ stock_exch, scales = "free_x", nrow = 2) +
lab_lrg_mrkt_cap_trend
Change the scales
argument to "free_y"
and rebuild the graph.
TopPharmComCaps %>%
filter(stock_exch %in% c("NASDAQ", "NYSE"),
company_type == "Pharma") %>%
ggplot(aes(x = largest_market_cap_date,
y = largest_market_cap_us_bil)) +
geom_point(aes(color = stock_exch), show.legend = FALSE) +
geom_text_repel(aes(label = stock_id), size = 3) +
facet_wrap(. ~ stock_exch, scales = ________, nrow = 2) +
lab_lrg_mrkt_cap_trend
Now we have a fixed x
axis, and the y
axis can adjust according to the data. This presents a similar dilemma–we can compare the time-frames, but the y
axes are misleading.
TopPharmComCaps %>%
filter(stock_exch %in% c("NASDAQ", "NYSE"),
company_type == "Pharma") %>%
ggplot(aes(x = largest_market_cap_date,
y = largest_market_cap_us_bil)) +
geom_point(aes(color = stock_exch), show.legend = FALSE) +
geom_text_repel(aes(label = stock_id), size = 3) +
facet_wrap(. ~ stock_exch, scales = "free_y", nrow = 2) +
lab_lrg_mrkt_cap_trend
This is why it’s preferable to keep both the x
and y
axes identical whenever we use facets. See below:
TopPharmComCaps %>%
filter(stock_exch %in% c("NASDAQ", "NYSE"),
company_type == "Pharma") %>%
ggplot(aes(x = largest_market_cap_date,
y = largest_market_cap_us_bil)) +
geom_point(aes(color = stock_exch), show.legend = FALSE) +
geom_text_repel(aes(label = stock_id), size = 3) +
facet_wrap(. ~ stock_exch, nrow = 2) +
lab_lrg_mrkt_cap_trend
These exercises have covered:
aes()
is used to map variables to x
and y
positionsaes()
is used to map variables to color
, fill
, etc.data.frame
s to ggplot2
functionsgeom_text()
and ggrepel
layersfacet
s to explore graphs across different scales for the x
and y
axes