Give your graphs a custom look using ggplot
extension packages.
Slides: this lesson does not slides.
RStudio Project: this lesson does not have an RStudio.Cloud project.
All of the exercises and lessons are available here, but you can also read more about ggplot2
on the tidyverse website, and in the Data Visualisation chapter of R for Data Science.
tidyverse
The main packages we’re going to use are dplyr
, tidyr
, and ggplot2
. These are all part of the tidyverse
, so we’ll import this package below:
install.packages("tidyverse")
library(tidyverse)
ggplot2
extension packagesggplot2
has an extensive list of user-written packages for just about any data visualization you can think of, and we’re only going to cover a few in this lesson. Check out more here.
library(ggtext)
library(ggfittext)
library(ggdist)
library(ggbeeswarm)
library(plotly)
library(ggthemes)
library(wesanderson)
library(gganimate)
We’re going to be using data from the following packages:
library(tidytuesdayR)
library(Lahman)
library(starwarsdb)
library(dplyr)
In the previous lesson, we covered how to annotate your graphs with ggplot2::annotate()
, and how to label data points with ggrepel
. In this section, we’re going to cover additional text options with ggtext
.
library(ggtext)
We’re going to be visualizing the relationship between height and weight for baseball players in the People
dataset from the Lahman
package.
Adding labels to values on a graph can highlight important values and focus the audiences’ attention on what information we’re trying to display. Good text annotation,
“places descriptions directly in the context of the data so that a reader doesn’t have to look outside a graph for additional information to fully understand what you show.” - Nathan Yau, Data Points
The descriptive information on this dataset is presented below:
Description People
table - Player names, DOB, and biographical info. This file is to be used to get details about players listed in the Batting, Pitching, and other files where players are identified only by playerID
.
nameFirst
- Player’s first name
nameLast
- Player’s last name
weight
- Player’s weight in pounds
height
- Player’s height in inches
Create a BPlayerData
dataset with the variables listed above:
BPlayerData <- Lahman::People %>%
dplyr::select(__________, __________, __________, __________)
BPlayerData
See below:
BPlayerData <- Lahman::People %>%
dplyr::select(nameFirst, nameLast, height, weight)
BPlayerData
Define the graph labels.
lab_bbp_ht_wt <- labs(title = "Relationship between _______ and ________",
subtitle = "Baseball Players from ______ Data Package",
x = "_______ in pounds",
y = "______ in inches")
lab_bbp_ht_wt <- labs(title = "Relationship between height and weight",
subtitle = "Baseball Players from Lahman Data Package",
x = "Weight in pounds",
y = "Height in inches")
Now we can create a scatter-plot
Use geom_point()
. Set the alpha
to 1/3
.
BPlayerData %>%
ggplot(aes(x = ______, y = ______)) +
geom_point(alpha = _/_) +
lab_bbp_ht_wt
See below:
BPlayerData %>%
ggplot(aes(x = weight, y = height)) +
geom_point(alpha = 1/3) +
lab_bbp_ht_wt
We can see there are some outliers in this graph, let’s label them!
First we identify the outliers and determine the names of these players.
315
as BPHeavy
82
as BPTall
60
and height greater than 50
as BPShort
100
and height greater than 70
as BPLight
50
and weight less than 100
as BPTiny
BPlayerLabels
BPlayerData %>% filter(weight > ___) -> BPHeavy
BPlayerData %>% filter(height > 82) -> BPTall
BPlayerData %>% filter(height < __ & height > __) -> BPShort
BPlayerData %>% filter(weight < 100 & height > 70) -> BPLight
BPlayerData %>% filter(height < __ & weight < ___) -> BPTiny
bind_rows(BPHeavy, BPTall, BPShort, BPLight, BPTiny) -> _____________
See below:
BPlayerData %>% filter(weight > 315) -> BPHeavy
BPlayerData %>% filter(height > 82) -> BPTall
BPlayerData %>% filter(height < 60 & height > 50) -> BPShort
BPlayerData %>% filter(weight < 100 & height > 70) -> BPLight
BPlayerData %>% filter(height < 50 & weight < 100) -> BPTiny
bind_rows(BPHeavy, BPTall, BPShort, BPLight, BPTiny) -> BPlayerLabels
BPlayerLabels
Now we’re going to build some labels for these four players:
paste0()
to combine the markdown formatting with the nameFirst
and nameLast
variables.Assign the text to the appropriate player:
"Heaviest"
= https://www.baseball-reference.com/players/y/youngwa01.shtml
"Tallest"
= https://www.baseball-reference.com/players/r/rauchjo01.shtml
"Shortest"
= https://www.baseball-reference.com/players/h/healeto01.shtml
"Lightest"
= https://www.baseball-reference.com/players/s/stallja01.shtml
"Tiniest"
= https://www.baseball-reference.com/players/g/gaedeed01.shtml
Create separate datasets for each group: BPBigSmall
, BPShortTall
, and BPMaybeLightest
put "Young"
and "Gaedel"
in BPBigSmall
put "Rauch"
and "Healey"
in BPShortTall
put "Stallings"
in BPMaybeLightest
BPlayerLabels <- BPlayerLabels %>%
mutate(outlier_label = case_when(
nameFirst == "Walter" ~ paste0("**", __________, " ", __________, ":** ",
"*__________...*"),
nameFirst == "Jon" ~ paste0("**", __________, " ", __________, ":** ",
"*__________...*"),
nameFirst == "Tom" ~ paste0("**", nameFirst, " ", nameLast, ":** ",
"*__________...*"),
nameFirst == "Jacob" ~ paste0("**", nameFirst, " ", nameLast, ":** ",
"*__________..?*"),
nameFirst == "Eddie" ~ paste0("**", nameFirst, " ", nameLast, ":** ",
"*__________...*")))
BPlayerLabels %>%
filter(nameLast %in% c("______", "_______")) -> BPBigSmall
BPlayerLabels %>%
filter(nameLast %in% c("______", "_______")) -> BPShortTall
BPlayerLabels %>%
filter(nameLast == "__________") -> BPMaybeLightest
See below.
BPlayerLabels <- BPlayerLabels %>%
mutate(outlier_label = case_when(
nameFirst == "Walter" ~ paste0("**", nameFirst, " ", nameLast, ":** ",
"*Heaviest...*"),
nameFirst == "Jon" ~ paste0("**", nameFirst, " ", nameLast, ":** ",
"*Tallest...*"),
nameFirst == "Tom" ~ paste0("**", nameFirst, " ", nameLast, ":** ",
"*Shortest...*"),
nameFirst == "Jacob" ~ paste0("**", nameFirst, " ", nameLast, ":** ",
"*Lightest..?*"),
nameFirst == "Eddie" ~ paste0("**", nameFirst, " ", nameLast, ":** ",
"*Tiniest...*")))
BPlayerLabels %>%
filter(nameLast %in% c("Young", "Gaedel")) -> BPBigSmall
BPlayerLabels %>%
filter(nameLast %in% c("Rauch", "Healey")) -> BPShortTall
BPlayerLabels %>%
filter(nameLast == "Stallings") -> BPMaybeLightest
BPBigSmall
Now we create the geom_textbox()
layer. To accomplish this, we’re going to define the following aesthetics in the ggtext::geom_textbox(aes())
: - data
= BPBigSmall
- label
= outlier_label
- orientation
= "upright"
- hjust
= -0.01
- vjust
= -0.01
- fill
= "black"
- color
= "white"
These aesthetics are defined outside the geom_textbox(aes())
function, but inside the geom_textbox()
geom:
size
= 3
width
= unit(0.16, "npc")
Add another geom_point()
layer
data
as BPBigSmall
x = weight
and y = height
inside aes()
size
to 3
color
to "green4"
alpha
to 2/3
In order to get the geom_textbox()
to fit on the graph, we need to extend the x
and y
axes. We can do this with scale_x_continuous()
:
limits
to 0
and 370
lab_bbp_big_small <- labs(title = "Biggest and smallest MLB players",
subtitle = "Baseball Players from Lahman Data Package",
x = "Weight in pounds",
y = "Height in inches")
BPlayerData %>%
ggplot(aes(x = weight,
y = height)) +
geom_point(alpha = 1/3) +
ggtext::geom_textbox(data = ____________,
aes(label = ___________,
orientation = "________",
hjust = -____,
vjust = -____,
fill = "_____",
color = "_____"),
size = _,
width = unit(____, "___")) +
geom_point(data = ____________,
aes(x = _______, y = _______),
size = _,
color = "_______",
alpha = ___) +
scale_discrete_identity(aesthetics = c("color",
"fill",
"orientation")) +
scale_x_continuous(limits = c(_, ___)) +
lab_bbp_big_small
See below:
lab_bbp_big_small <- labs(title = "Biggest and smallest MLB players",
subtitle = "Baseball Players from Lahman Data Package",
x = "Weight in pounds",
y = "Height in inches")
BPlayerData %>%
ggplot(aes(x = weight,
y = height)) +
geom_point(alpha = 1/3) +
ggtext::geom_textbox(data = BPBigSmall,
aes(label = outlier_label,
orientation = "upright",
hjust = -0.01,
vjust = -0.01,
fill = "black",
color = "white"),
size = 3,
width = unit(0.16, "npc")) +
geom_point(data = BPBigSmall,
aes(x = weight, y = height),
size = 3,
color = "green4",
alpha = 2/3) +
scale_discrete_identity(aesthetics = c("color",
"fill",
"orientation")) +
scale_x_continuous(limits = c(0, 370)) +
lab_bbp_big_small
BPShortTall
Change the geom_textbox()
layer by defining the following aesthetics in the ggtext::geom_textbox(aes())
:
data
= BPShortTall
hjust
= -0.04
vjust
= 0.08
Add another geom_point()
layer
color
to "dodgerblue"
We want to ‘zoom in’ on the tallest and shortest players in the dataset, so we’ll adjust the x
and y
axes with scale_x_continuous()
and scale_y_continuous()
:
set the scale_x_continuous()
limits to 50
and 350
set the scale_y_continuous()
limits to 30
and 90
lab_bbp_tall_short <- labs(title = "Tallest and shortest MLB players",
subtitle = "Baseball Players from Lahman Data Package",
x = "Weight in pounds",
y = "Height in inches")
BPlayerData %>%
ggplot(aes(x = weight,
y = height)) +
geom_point(alpha = 1/3) +
ggtext::geom_textbox(data = ____________,
aes(label = outlier_label,
orientation = "upright",
hjust = ______,
vjust = ______,
fill = "black",
color = "white"),
size = 3,
width = unit(0.16, "npc")) +
geom_point(data = ____________,
aes(x = weight,
y = height),
size = 3,
color = "__________",
alpha = 2/3) +
scale_discrete_identity(aesthetics = c("color",
"fill",
"orientation")) +
scale_x_continuous(limits = c(__, ___)) +
scale_y_continuous(limits = c(__, __)) +
lab_bbp_tall_short
lab_bbp_tall_short <- labs(title = "Tallest and shortest MLB players",
subtitle = "Baseball Players from Lahman Data Package",
x = "Weight in pounds",
y = "Height in inches")
BPlayerData %>%
ggplot(aes(x = weight,
y = height)) +
geom_point(alpha = 1/3) +
ggtext::geom_textbox(data = BPShortTall,
aes(label = outlier_label,
orientation = "upright",
hjust = -0.04,
vjust = 0.08,
fill = "black",
color = "white"),
size = 3,
width = unit(0.14, "npc")) +
geom_point(data = BPShortTall,
aes(x = weight,
y = height),
size = 3,
color = "dodgerblue",
alpha = 2/3) +
scale_discrete_identity(aesthetics = c("color",
"fill",
"orientation")) +
scale_x_continuous(limits = c(50, 350)) +
scale_y_continuous(limits = c(30, 90)) +
lab_bbp_tall_short
The Lahman
dataset lists Jacob Stallings
as weighing 76
lbs, but baseball-reference lists his weight as 220lbs. We will include this label with different colors.
BPMaybeLightest
Change the geom_textbox()
layer by defining the following aesthetics in the ggtext::geom_textbox(aes())
:
data
= BPMaybeLightest
hjust
= -0.05
vjust
= -0.01
fill
= "darkred"
color
= "white"
Add another geom_point()
layer
color
to "firebrick"
We don’t need to adjust the axes on this graph.
lab_bbp_maybe_light <- labs(title = "Is Jacob Stallings the lightest player?",
subtitle = "Listed as 6-5, 220lb on baseball-reference.com",
x = "Weight in pounds",
y = "Height in inches",
caption = "https://www.baseball-reference.com/players/s/stallja01.shtml")
BPlayerData %>%
ggplot(aes(x = weight,
y = height)) +
geom_point(alpha = 1/3) +
ggtext::geom_textbox(data = ________________,
aes(label = outlier_label,
orientation = "upright",
hjust = _____,
vjust = _____,
fill = "________",
color = "white"),
size = 3,
width = unit(0.16, "npc")) +
geom_point(data = ________________,
aes(x = weight, y = height),
size = 3,
color = "__________",
alpha = 2/3) +
scale_discrete_identity(aesthetics = c("color", "fill", "orientation")) +
lab_bbp_maybe_light
See below:
lab_bbp_maybe_light <- labs(title = "Is Jacob Stallings the lightest player?",
subtitle = "Listed as 6-5, 220lb on baseball-reference.com",
x = "Weight in pounds",
y = "Height in inches",
caption = "https://www.baseball-reference.com/players/s/stallja01.shtml")
BPlayerData %>%
ggplot(aes(x = weight,
y = height)) +
geom_point(alpha = 1/3) +
ggtext::geom_textbox(data = BPMaybeLightest,
aes(label = outlier_label,
orientation = "upright",
hjust = -0.05,
vjust = -0.01,
fill = "darkred",
color = "white"),
size = 3,
width = unit(0.16, "npc")) +
geom_point(data = BPMaybeLightest,
aes(x = weight, y = height),
size = 3,
color = "firebrick",
alpha = 2/3) +
scale_discrete_identity(aesthetics = c("color", "fill", "orientation")) +
lab_bbp_maybe_light
The colored points and textboxes highlight the outliers on the scatter plot.
We’re going to be using the ggfittext
package to add labels onto a bar (or column) graph. This comes in handy if space is limited, or we have
library(ggfittext)
Calculate the slugging percentage slug_perc
from the Lahman::Batting
table using the following code:
H - X2B - X3B - HR + 2 * X2B + 3 * X3B + 4 * HR) / AB
Also create the batting_era
variable that cut()
s the yearID
into 7 different levels:
Note that cut()
has a breaks
argument that will actually take 8 levels.
BattingStats <- Lahman::Batting %>%
mutate(slug_perc = (_ - ___ - ___ - __ +
_ * ___ + _ * ___ + _ * __) / __,
batting_era = cut(yearID,
breaks = c(____, 1900, 1919, 1941,
1960, 1976, 1993, ____),
labels = c("____________ (1871 - 1900)",
"________ (1901 - 1919)",
"____________ (1920 - 1941)",
"__________ (1942 - 1976)",
"________ (1961 - 1976)",
"____________ (1977 - 1993)",
"__________ (1994 - 2019)")))
BattingStats %>%
group_by(batting_era) %>%
summarize(from = min(yearID),
to = max(yearID))
See below:
BattingStats <- Lahman::Batting %>%
mutate(slug_perc = (H - X2B - X3B - HR +
2 * X2B + 3 * X3B + 4 * HR) / AB,
batting_era = cut(yearID,
breaks = c(1870, 1900, 1919, 1941,
1960, 1976, 1993, 2020),
labels = c("19th Century (1871 - 1900)",
"Dead Ball (1901 - 1919)",
"Lively Ball (1920 - 1941)",
"Integration (1942 - 1976)",
"Expansion (1961 - 1976)",
"Free Agency (1977 - 1993)",
"Long Ball (1994 - 2019)")))
BattingStats %>%
group_by(batting_era) %>%
summarize(from = min(yearID),
to = max(yearID))
Group by batting_era
and summarize
the mean
of slug_perc
, calling it avg_slug_perc
. Remove the missing values with na.rm = TRUE
.
SumBattingStats <- BattingStats %>%
group_by(_________) %>%
summarize(_________ = mean(_____, na.rm = ____))
SumBattingStats %>% str()
See below:
SumBattingStats <- BattingStats %>%
group_by(batting_era) %>%
summarize(avg_slug_perc = mean(slug_perc, na.rm = TRUE))
SumBattingStats %>% str()
## tibble [7 × 2] (S3: tbl_df/tbl/data.frame)
## $ batting_era : Factor w/ 7 levels "19th Century (1871 - 1900)",..: 1 2 3 4 5 6 7
## $ avg_slug_perc: num [1:7] 0.291 0.264 0.304 0.281 0.272 ...
Define the graph labels.
Define the graph labels below:
lab_bat_era <- labs(title = "Average slugging percentage by era",
subtitle = "Batting Statistics from ______ Data Package",
x = "___",
y = "Average Slugging Percentage")
See below:
lab_bat_era <- labs(title = "Average slugging percentage by era",
subtitle = "Batting Statistics from Lahman Data Package",
x = "Era",
y = "Average Slugging Percentage")
Now we’re ready for adding the text to the bars. In this case, the levels of batting_era
aren’t equal, so we want to include the timeframe inside the bars to show the actual range of years.
Initiate a graph below by mapping batting_era
to the x
and the label
, and avg_slug_perc
to the y
. Create a column (or bar) with geom_col()
, flip the coordinate using coord_flip()
.
SumBattingStats %>%
ggplot(aes(x = __________,
y = __________,
label = __________)) +
geom____() +
___________() +
lab_bat_era
See below:
SumBattingStats %>%
ggplot(aes(x = batting_era,
y = avg_slug_perc,
label = batting_era)) +
geom_col() +
coord_flip() +
lab_bat_era
The text on the y
axis is taking up a lot of space–we’re going to move this to inside the columns.
Add the geom_bar_text()
to include labels on the columns. Use theme_bw()
, but also remove the y
axis text
and ticks
inside an additional theme()
layer, using element_blank()
.
SumBattingStats %>%
ggplot(aes(x = batting_era,
y = avg_slug_perc,
label = batting_era)) +
geom_col() +
coord_flip() +
geom__________() +
theme_bw() +
theme(axis.____.y = element_blank(),
axis._____.y = element_blank()) +
coord_flip() +
_____________() +
lab_bat_era
See below:
SumBattingStats %>%
ggplot(aes(x = batting_era,
y = avg_slug_perc,
label = batting_era)) +
geom_col() +
coord_flip() +
geom_bar_text() +
theme_minimal() +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
lab_bat_era
Now we have more ink (data) on the graph. The labels inside the columns also makes sense because these aren’t equal year intervals.
We previously explored variable distributions using histograms, density and violin plots, and ridgeline plots from the ggridges
package.
In this section, we’re going to cover some advanced variable distribution graphs using the ggdist
package. We’ll also need some help from the broom
and distributional
packages. The ggdist
package requires a little more underlying knowledge about modeling in R, and I invite you to read the entire vignette for more information.
library(ggdist)
library(broom)
library(distributional)
We’re going to be using TidyTuesday’s dataset on penguins. We can load these data below:
penguin_data <- tidytuesdayR::tt_load('2020-07-28')
##
## Downloading file 1 of 2: `penguins.csv`
## Downloading file 2 of 2: `penguins_raw.csv`
Penguins <- penguin_data$penguins
We can view the data with skimr::skim()
Penguins %>% skimr::skim()
Name | Piped data |
Number of rows | 344 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 3 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
species | 0 | 1.00 | 6 | 9 | 0 | 3 | 0 |
island | 0 | 1.00 | 5 | 9 | 0 | 3 | 0 |
sex | 11 | 0.97 | 4 | 6 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
The description of each variable in Penguins
is below.
variable | class | description |
---|---|---|
species |
integer | Penguin species (Adelie, Gentoo, Chinstrap) |
island |
integer | Island where recorded (Biscoe, Dream, Torgersen) |
bill_length_mm |
double | Bill length in millimeters (also known as culmen length) |
bill_depth_mm |
double | Bill depth in millimeters (also known as culmen depth) |
flipper_length_mm |
integer | Flipper length in mm |
body_mass_g |
integer | Body mass in grams |
sex |
integer | sex of the animal |
year |
integer | year recorded |
We will start by looking at body mass of the three species in the Penguins
dataset.
Build the labels for a scatter plot of body_mass_g
on the x
axis, and species
on the y
lab_peng_scatter <- labs(title = "Relationship between body mass and species",
subtitle = "Data from the palmerpenguins package",
x = "___________",
y = "_______",
caption = "https://allisonhorst.github.io/palmerpenguins/")
See below:
lab_peng_scatter <- labs(title = "Relationship between body mass and species",
subtitle = "Data from the palmerpenguins package",
x = "Body mass (g)",
y = "Species",
caption = "https://allisonhorst.github.io/palmerpenguins/")
Create a scatter plot using the labels we built above. Set the alpha
to 1/2
.
Penguins %>%
ggplot(aes(x = __________, y = ________)) +
geom_point(alpha = 1/2) +
lab_peng_scatter
See below:
Penguins %>%
ggplot(aes(x = body_mass_g, y = species)) +
geom_point(alpha = 1/2) +
lab_peng_scatter
The graph above doesn’t show much in terms of the probability distributions for body mass across the different levels of species, so we will build a linear model to include additional terms (estimate
and std.error
) for each level of the distribution on the graph.
We will use the lm()
function to create a linear model predicting body_mass_g
with species
.
lmmod_penguins <- lm(___________ ~ _______, data = Penguins)
See below:
lmmod_penguins <- lm(body_mass_g ~ species, data = Penguins)
summary(lmmod_penguins)
##
## Call:
## lm(formula = body_mass_g ~ species, data = Penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1126.02 -333.09 -33.09 316.91 1223.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3700.66 37.62 98.37 <2e-16 ***
## speciesChinstrap 32.43 67.51 0.48 0.631
## speciesGentoo 1375.35 56.15 24.50 <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 462.3 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6697, Adjusted R-squared: 0.6677
## F-statistic: 343.6 on 2 and 339 DF, p-value: < 2.2e-16
The typical output from the summary(lmmod_penguins)
is not very helpful from a graphing standpoint, so we will use the broom
package to make it easier to manipulate.
Pass the output of lm()
to broom::tidy()
to convert the model statistics into a tibble
.
______________ %>% broom::tidy()
See below:
lmmod_penguins %>% broom::tidy()
We can now use the columns from broom::tidy()
to build a graph of the probability distributions.
Fill in the following labels:
title = "Probability distribution of body mass by species"
x = "Student T distribution"
y = "Model Term"
lab_prob_dist <- labs(title = "_______________________________________________",
subtitle = "Data from the palmerpenguins package",
x = "________________________",
y = "__________________",
caption = "https://allisonhorst.github.io/palmerpenguins/")
See below:
lab_prob_dist <- labs(title = "Probability distribution of body mass by species",
subtitle = "Data from the palmerpenguins package",
x = "Student T distribution",
y = "Model Term",
caption = "https://allisonhorst.github.io/palmerpenguins/")
Use the ggdist::stat_dist_halfeye()
to map the dist
aesthetic with some help from the distributional::dist_student_t()
function.
df.residual(lmmod_penguins)
to df
inside the dist_student_t()
functionestimate
to mu
inside the dist_student_t()
functionstd.error
to sigma
inside the dist_student_t()
functionWe’ve extended the x
axis with scale_x_continuous()
for clarity.
Penguins %>%
lm(body_mass_g ~ species, data = .) %>%
broom::tidy() %>%
ggplot(aes(y = term)) +
ggdist::_______________(
aes(dist = distributional::______________(df = df.residual(___________),
mu = _________,
sigma = ___________))) +
scale_x_continuous(limits = c(-400, 4400)) +
lab_prob_dist
See below:
Penguins %>%
lm(body_mass_g ~ species, data = .) %>%
broom::tidy() %>%
ggplot(aes(y = term)) +
ggdist::stat_dist_halfeye(
aes(dist = distributional::dist_student_t(df = df.residual(lmmod_penguins),
mu = estimate,
sigma = std.error))) +
scale_x_continuous(limits = c(-400, 4400)) +
lab_prob_dist
ggplot2
has many options for picking the colors on your graph. We’re going to cover the khroma
and wesanderson
color packages in this section.
library(wesanderson)
library(khroma)
We’re going to be graphing the starwarsdb
package. Below is a relational model for the tables in the package.
library(dm, warn.conflicts = FALSE)
sw_dm <- starwars_dm()
dm_draw(sw_dm)
We covered joins in a previous lesson. Below we will create a dataset with variables from the people
, planets
, species
, and pilots
tables.
Use an inner_join()
to connect people
to planets
by "homeworld"
and "name"
SWPeopPlan <- starwarsdb::people %>%
____________(x = ., y = starwarsdb::planets, by = c("_________" = "____"))
SWPeopPlan
See below:
SWPeopPlan <- starwarsdb::people %>%
inner_join(x = ., y = starwarsdb::planets, by = c("homeworld" = "name"))
SWPeopPlan
Use select()
to remove homeworld
from species
inner_join()
to connect species
to SWPeopPlan
by "name"
and "species"
.
Use the suffix
argument to keep track of the variables original location by supplying c("_species", "_people")
.
rename()
the name
variable as species_name
.
Use another inner_join()
to add the vehicle
column from pilots
, joining by
the "name_people" = "pilot"
.
select()
only the name_people
, height
, mass
, sex
, homeworld
, gravity
, terrain
, population
, species_name
, average_height
, classification
, average_lifespan
, and vehicle
columns
Finally, change population
to an integer
value with mutate()
starwarsdb::species %>%
select(-_________) %>%
_________(x = ., y = SWPeopPlan,
by = c("name" = "species"),
suffix = c("_________", "_________")) %>%
rename(_________ = name) %>%
inner_join(x = ., y = starwarsdb::pilots,
by = c("name_people" = "_________")) %>%
select(_________,
_________,
_________,
_________,
_________,
_________,
_________,
_________,
_________,
_________,
_________,
_________,
_________) %>%
_________(population = as.integer(population)) -> SWDBData
See below:
starwarsdb::species %>%
select(-homeworld) %>%
inner_join(x = ., y = SWPeopPlan,
by = c("name" = "species"),
suffix = c("_species", "_people")) %>%
rename(species_name = name) %>%
inner_join(x = ., y = starwarsdb::pilots,
by = c("name_people" = "pilot")) %>%
select(name_people,
height,
mass,
sex,
homeworld,
gravity,
terrain,
population,
species_name,
average_height,
classification,
average_lifespan,
vehicle) %>%
mutate(population = as.integer(population),
average_lifespan = as.integer(average_lifespan)) -> SWDBData
SWDBData
Before we start building a graph, we should take a look at the available colors each package and palette.
wesanderson
Check the names of the palettes in the wesanderson
package using names(wes_palettes)
.
names(wes_palettes)
## [1] "BottleRocket1" "BottleRocket2" "Rushmore1"
## [4] "Rushmore" "Royal1" "Royal2"
## [7] "Zissou1" "Darjeeling1" "Darjeeling2"
## [10] "Chevalier1" "FantasticFox1" "Moonrise1"
## [13] "Moonrise2" "Moonrise3" "Cavalcanti1"
## [16] "GrandBudapest1" "GrandBudapest2" "IsleofDogs1"
## [19] "IsleofDogs2"
We can view the colors using wes_palette("name of palette")
.
View the "IsleofDogs1"
palette below:
wes_palette("____________")
See below:
wes_palette("IsleofDogs1")
View the "FantasticFox1"
palette below:
wes_palette("_____________")
See below:
wes_palette("FantasticFox1")
khroma
View the available khroma
package using the following syntax:
First we define the colour()
or color()
with a text string (i.e. "vibrant"
) and store in an output (i.e. vibrant
).
Then we use the plot_scheme()
function, which takes the output from colour()
(in this case, vibrant
) along with the number of colors we want displayed from that particular scheme in parentheses (each scheme has an upper limit of colors). It looks like vibrant(7)
.
Additional arguments include colours
and names
(which we set to TRUE
) and size
(which we set to 0.9
).
See the example below for reference.
# set palette
vibrant <- colour("vibrant")
# plot the color scheme
plot_scheme(vibrant(7), colours = TRUE, names = TRUE, size = 0.9)
View the colors in the "bright"
scheme, setting the number of different colors to 6
.
bright <- colour("______")
plot_scheme(bright(_), colours = TRUE, names = TRUE, size = 0.9)
See below:
bright <- colour("bright")
plot_scheme(bright(6), colours = TRUE, names = TRUE, size = 0.9)
Now we’re going to build a few bar and column graphs using the color palettes we’ve outlined above. Bar and column graphs are great for showing amounts (or counts) of data. An important distinction between the geom_bar()
and geom_col()
is the that the geom_bar()
only maps a single x
variable, while the geom_col()
can map both x
and y
variables.
I’ve defined the labels for the graph below. Use them to guide you in building a column graph for the number of species present in the SWDBData
pilots data.
Fill the columns by species_name
Add the khroma::scale_fill_light()
layer after the geom_bar()
(but before the labels)
lab_species_swdb <- labs(title = "Species of the Pilots in Star Wars",
subtitle = "Species for pilot characters",
x = "Species",
y = "Count",
caption = "Data from starwarsdb package",
fill = "Species")
SWDBData %>%
ggplot(aes(x = ___________,
fill = ___________)) +
geom_bar(show.legend = FALSE) +
khroma::____________________() +
lab_species_swdb
See below. Note that the fill
aesthetic is matched with a scale_fill_light()
function.
lab_species_swdb <- labs(title = "Species of the Pilots in Star Wars",
subtitle = "Species for pilot characters",
x = "Species",
y = "Count",
caption = "Data from starwarsdb package",
fill = "Species")
SWDBData %>%
ggplot(aes(x = species_name,
fill = species_name)) +
geom_bar(show.legend = FALSE) +
khroma::scale_fill_light() +
lab_species_swdb
We are going to reorganize the bars in the previous graph according to the values on the y
axis. In order to reorganize our graph, we need a column of counts
. Do this with dplyr::count()
, sorting the output and naming the new column "counts"
.
Reordering the x
axis is accomplished with the forcats::fct_reorder()
function, which takes .f
(the factor or character variable we’re reordering: species_name
) and .x
(the numerical variable we want to use to reorder the factor or character variable: counts
).
We need to switch from using a geom_bar()
to a geom_col()
, because we need to map an x
and y
variable.
Swap the khroma::scale_fill_light()
function for khroma::scale_fill_bright()
lab_species_swdb <- labs(title = "Species of the Pilots in Star Wars",
subtitle = "Species for pilot characters",
x = "Species",
y = "Count",
caption = "Data from starwarsdb package",
fill = "Species")
SWDBData %>%
# here we count species_name and give the new variable name
dplyr::count(name = "_______", species_name, sort = ____) %>%
ggplot(aes(x = forcats::fct_reorder(.f = ___________, .x = ______),
y = counts,
fill = species_name)) +
geom____(show.legend = FALSE) +
khroma::___________________() +
lab_species_swdb
See below. Note that we’ve organized the x
axis according to the values on the y
. Which color scheme do you prefer?
lab_species_swdb <- labs(title = "Species of the Pilots in Star Wars",
subtitle = "Species for pilot characters",
x = "Species",
y = "Count",
caption = "Data from starwarsdb package",
fill = "Species")
SWDBData %>%
# here we count species_name and give the new variable name
count(name = "counts", species_name, sort = TRUE) %>%
ggplot(aes(x = forcats::fct_reorder(.f = species_name, .x = counts),
y = counts,
fill = species_name)) +
geom_col(show.legend = FALSE) +
khroma::scale_fill_bright() +
lab_species_swdb
Now we’re going to use a palette from the wesanderson
package.
x = "Species"
y = "Average lifespan"
Wrangle the data:
+ select()
the species_name
and average_lifespan
+ Remove missing values with tidyr::drop_na()
+ Get only the distinct combinations of species_name
and average_lifespan
using dplyr::distinct()
Initiate a graph and map the global positions: + reorder species_name
on the x
axis according to the descending values of average_lifespan
+ y
as average_lifespan
, and
+ fill
as species_name
Add a geom_col()
layer and set show.legend
to FALSE
Add the scale_fill_manual()
layer, and specify the values
argument to wes_palette("IsleofDogs2")
lab_spec_lfspn_class <- labs(title = "Average lifespan by species",
x = "_______",
y = "_________ _________",
caption = "Data from starwarsdb package")
SWDBData %>%
dplyr::select(____________, ________________) %>%
tidyr::______() %>%
dplyr::________() %>%
ggplot(aes(x = forcats::fct_reorder(.f = ____________,
.x = desc(________________)),
y = ________________,
fill = ____________)) +
geom_col(___________ = ____) +
scale_fill_manual(values = wes_palette("_____________")) +
lab_spec_lfspn_class
See below. Note the different position of the columns compared to the previous graph.
lab_spec_lfspn_class <- labs(title = "Average lifespan by species",
x = "Species",
y = "Average lifespan",
caption = "Data from starwarsdb package")
SWDBData %>%
dplyr::select(species_name, average_lifespan) %>%
tidyr::drop_na() %>%
dplyr::distinct() %>%
ggplot(aes(x = forcats::fct_reorder(.f = species_name,
.x = desc(average_lifespan)),
y = average_lifespan,
fill = species_name)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = wes_palette("IsleofDogs2")) +
lab_spec_lfspn_class
The previous graphs introduced two packages for choosing alternative colors. In this section, we’re going to extend the color selection to other geoms, and include considerations for people with color-vision deficiencies.
The code below filters the SWDBData
data to only the height
and weight
for those characters who were “force sensitive” (either Jedi or Sith).
SWDBData %>%
filter(name_people %in% c("Anakin Skywalker", "Dooku",
"Obi-Wan Kenobi", "Plo Koon",
"Luke Skywalker", "Leia Organa",
"Darth Vader", "Darth Maul")) %>%
select(name_people, height, mass) %>%
distinct() -> SWDBForcePilots
SWDBForcePilots
We’re going to build another column graph, and reorder the x
axis by the mass
variable.
name_people
(reordered by mass
)mass
to the y
variablefill
to name_people
scale_fill_muted()
layer to set the colorslab_ht_wt_cols <- labs(title = "Force and mass in Star Wars",
subtitle = "Mass of force sensitive characters",
caption = "source: https://starwars.fandom.com/wiki/",
x = "Character",
y = "Mass")
SWDBForcePilots %>%
ggplot(aes(x = fct_reorder(.f = ___________,
.x = ____),
y = ____,
fill = ___________)) +
geom_col(show.legend = FALSE) +
____________________() +
lab_ht_wt_cols
See the solution below. Notice how the text along the x
axis is difficult to read.
lab_ht_wt_cols <- labs(title = "Force and mass in Star Wars",
subtitle = "Mass of force sensitive characters",
caption = "source: https://starwars.fandom.com/wiki/",
x = "Character",
y = "Mass")
SWDBForcePilots %>%
ggplot(aes(x = fct_reorder(.f = name_people,
.x = mass),
y = mass,
fill = name_people)) +
geom_col(show.legend = FALSE) +
scale_fill_muted() +
lab_ht_wt_cols
Use the Okabe Ito scale if you’re presenting graphs to a broad audience, because it’s specifically designed for color-blindness.
add a coord_flip()
layer to deal with the x
axis
add the scale_fill_okabeito()
layer
SWDBForcePilots %>%
ggplot(aes(x = fct_reorder(.f = name_people,
.x = mass),
y = mass,
fill = name_people)) +
geom_col(show.legend = FALSE) +
__________() +
___________________() +
lab_ht_wt_cols
See below:
SWDBForcePilots %>%
ggplot(aes(x = fct_reorder(.f = name_people,
.x = mass),
y = mass,
fill = name_people)) +
geom_col(show.legend = FALSE) +
coord_flip() +
scale_fill_okabeito() +
lab_ht_wt_cols
Sometimes it’s better to pick a different aesthetic when presenting a continuous variable across a categorical variable. Below we replace the columns with shapes, and use color as an additional aesthetic to distinguish the different species of mammals.
We have to manually set the color scales here using the hex codes for the Okabe-Ito scale, but we can make things more exciting by randomly assigning the color values to the shapes in the plot.
okabe_scale <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442",
"#0072B2", "#D55E00", "#CC79A7")
sample(x = okabe_scale, size = 4, replace = FALSE)
## [1] "#F0E442" "#D55E00" "#0072B2" "#E69F00"
We also add the theme_bw()
layer to reduce some of the excess chart elements.
lab_mammal_ht <- labs(title = "Human aren't flightless mammals in Star Wars",
subtitle = "Heights of all mammal Star Wars pilots",
caption = "source: https://starwars.fandom.com/wiki/",
x = "Pilot",
y = "Height")
SWDBData %>%
filter(classification == "mammal") %>%
ggplot(aes(x = fct_reorder(.f = name_people,
.x = height),
y = height,
color = species_name,
fill = species_name,
shape = species_name)) +
geom_point(size = 3) +
coord_flip() +
scale_shape_manual(name = "Species",
values = 21:24) +
scale_color_manual(name = "Species",
values = sample(x = okabe_scale,
size = 4,
replace = FALSE)) +
scale_fill_manual(name = "Species",
values = sample(x = okabe_scale,
size = 4,
replace = FALSE)) +
theme_bw() +
lab_mammal_ht
plotly
coming soon!!
plotly
coming soon!!
See below:
plotly
coming soon!!
See below:
plotly
coming soon!!
See below:
gganimate
coming soon!!
gganimate
coming soon!!
See below:
gganimate
coming soon!!
See below:
gganimate
coming soon!!
See below: