Summary bar graphs
Description
Summary bar graphs display the sum (or total) of a numerical variable across the levels of a second categorical variable. Color is used to make comparisons and distinguish between groups (or levels) of the categorical variable.
In ggplot2, we can create summary bar graphs with geom_col().
Getting set up
PACKAGES:
Install packages.
Code
install.packages("palmerpenguins")
library(palmerpenguins)
library(ggplot2)
DATA:
Remove the missing values from body_mass_g and island in the palmerpenguins::penguins data and convert body mass in grams to kilograms (body_mass_kg).
We’ll also reduce the number of columns in the penguins data for clarity.
Code
peng_sum_col <- palmerpenguins::penguins |>
  dplyr::select(body_mass_g, island) |>
  tidyr::drop_na() |>
  # divide the mass value by 1000
  dplyr::mutate(
    body_mass_kg = body_mass_g / 1000
  )
dplyr::glimpse(peng_sum_col)
Rows: 342
Columns: 3
$ body_mass_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250, 330…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, To…
$ body_mass_kg <dbl> 3.750, 3.800, 3.250, 3.450, 3.650, 3.625, 4.675, 3.475, 4…
The grammar
CODE:
Create labels with labs()
Initialize the graph with ggplot() and provide data
Map island to x and body_mass_kg to y
Inside the aes() of geom_col(), map island to fill
Outside the aes() of geom_col(), remove the legend with show.legend = FALSE
Code
labs_sum_col <- labs(
  title = "Total Penguin Mass",
  subtitle = "How many kilograms of penguin per Island?",
  x = "Island",
  y = "Total Penguin Body Mass (kg)")
ggp2_sum_col <- ggplot(data = peng_sum_col,
  aes(x = island,
      y = body_mass_kg)) +
  geom_col(aes(fill = island),
    show.legend = FALSE)
ggp2_sum_col +
  labs_sum_col
GRAPH:
More Info
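Note that we didn’t have to write any code to calculate the total body_mass_kg (displayed on the y axis) by island.
That’s because ggplot2 does this for us: geom_col() draws one bar segment per row and stacks the segments within each island, so the height of each bar is the total.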
SUMMARY:
If we pass a categorical variable to x (like island) and a continuous variable to y (like body_mass_kg), geom_col() stacks the y values within each level of x, so each bar’s height equals the sum() of y for that level.
We can see the underlying summary of body mass using dplyr’s group_by() and summarise() functions.
Code
palmerpenguins::penguins |>
  dplyr::select(body_mass_g, island) |>
  tidyr::drop_na() |>
  # divide the mass value by 1000
  dplyr::mutate(
    body_mass_kg = body_mass_g / 1000
  ) |>
  dplyr::group_by(island) |>
  dplyr::summarise(
    `Total Penguin Body Mass (kg)` = sum(body_mass_kg)) |>
  dplyr::ungroup() |>
  dplyr::select(`Island` = island,
    `Total Penguin Body Mass (kg)`)
Island | Total Penguin Body Mass (kg) |
---|---|
Biscoe | 787.575 |
Dream | 460.400 |
Torgersen | 189.025 |
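One way to check that the bar heights match grouped totals like these is to compare the data ggplot2 builds for the layer against a direct sum. The sketch below uses a small made-up data frame (toy, with hypothetical islands "A" and "B") instead of the penguins data, so it runs with only ggplot2 installed.

```r
library(ggplot2)

# Hypothetical toy data (not from palmerpenguins): three rows on A, two on B
toy <- data.frame(
  island  = c("A", "A", "A", "B", "B"),
  mass_kg = c(3.5, 4.0, 4.5, 5.0, 5.5)
)

p <- ggplot(toy, aes(x = island, y = mass_kg)) +
  geom_col()

# layer_data() returns the data computed for the layer,
# including ymin/ymax for every stacked segment
built <- layer_data(p)

# top of each stack: the largest ymax at each x position
stack_tops <- tapply(built$ymax, as.integer(built$x), max)

# grouped sums computed directly from the raw values
sums <- tapply(toy$mass_kg, toy$island, sum)

# TRUE when the top of every stack equals the grouped sum
all.equal(as.numeric(stack_tops), as.numeric(sums))
```

The same comparison should work on peng_sum_col by swapping in its column names.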
STATS:
The geom_bar() geom will also create summary bar graphs, but we have to switch the stat argument to "identity".
Code
ggplot(data = peng_sum_col,
  aes(x = island,
      y = body_mass_kg)) +
  # geom_bar() defaults to stat = "count", so override it
  geom_bar(aes(fill = island),
    show.legend = FALSE,
    stat = "identity") +
  labs_sum_col
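As a quick sanity check that geom_bar(stat = "identity") and geom_col() give the same bars, we can compare the layer data each one produces. This is a minimal sketch on a made-up data frame (toy is hypothetical, not part of the penguins data).

```r
library(ggplot2)

# Hypothetical toy data used only for this comparison
toy <- data.frame(
  island  = c("A", "A", "A", "B", "B"),
  mass_kg = c(3.5, 4.0, 4.5, 5.0, 5.5)
)

base <- ggplot(toy, aes(x = island, y = mass_kg))

# build the same graph with each geom and extract the computed layer data
d_col <- layer_data(base + geom_col())
d_bar <- layer_data(base + geom_bar(stat = "identity"))

# compare the lower and upper edge of every bar segment
same_bars <- isTRUE(all.equal(d_col$ymin, d_bar$ymin)) &&
  isTRUE(all.equal(d_col$ymax, d_bar$ymax))
same_bars
```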
geom_bar() vs. geom_col():
geom_bar() will map a categorical variable to the x or y and display counts for the discrete levels (see stat_count() for more info)
geom_col() will map both x and y aesthetics, and is used when we want to display numerical (quantitative) values across the levels of a categorical variable. geom_col() assumes these values have been created in their own column (see stat_identity() for more info)
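The difference can be sketched side by side: geom_bar() counts rows per level, while geom_col() plots whatever values are already in the y column. The example below assumes a small made-up data frame (toy) and pre-computes the totals with base R's aggregate(); neither name comes from the steps above.

```r
library(ggplot2)

# Hypothetical toy data: three rows on island A, two on island B
toy <- data.frame(
  island  = c("A", "A", "A", "B", "B"),
  mass_kg = c(3.5, 4.0, 4.5, 5.0, 5.5)
)

# geom_bar(): only x is mapped; stat_count() tallies the rows per level
p_count <- ggplot(toy, aes(x = island)) +
  geom_bar()
counts <- layer_data(p_count)$count

# geom_col(): y comes from a column that already holds the totals
toy_totals <- aggregate(mass_kg ~ island, data = toy, FUN = sum)
p_total <- ggplot(toy_totals, aes(x = island, y = mass_kg)) +
  geom_col()
totals <- layer_data(p_total)$y

counts  # number of rows per island
totals  # total mass per island
```

Pre-computing the totals gives one solid bar per level instead of a stack of per-row segments, which can matter when bars are drawn with outlines.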