34  Instance chart

This graph should be readable but is currently undergoing final polishing.


This graph requires:

✅ a categorical variable

✅ a numeric (count) variable

✅ a numeric (date) variable

34.1 Description

An instance chart (or instance graph) displays frequencies (or ‘instances’) of different categorical values over time time.

The time dimension is placed on the x and each separate categorical item is placed on the y-axis. The instances are typically represented using the vertical hashes from geom_point() (i.e., shape 124 or the ‘pipe’ "|").

Saturation and color are also used to represent different categorical levels and counts.

34.2 Set up

PACKAGES:

Install packages.

show/hide
install.packages("babynames")
library(babynames)
library(ggplot2)

DATA:

  • Filter the babynames::babynames to only those names in 1887, then group by name and sex, arrange descending by the n, and store only the top 6 names in top_bby_nms_1887.

  • Use the names from top_bby_nms_1887 to filter babynames::babynames for names after 1897 and store in popular_bby_nms.

show/hide
library(babynames)
top_bby_nms_1887 <- babynames::babynames |> 
  dplyr::filter(year == 1887) |> 
  dplyr::group_by(name, sex) |> 
  dplyr::slice_max(order_by = sex) |> 
  dplyr::arrange(desc(n)) |> 
  utils::head(n = 6) |> 
  dplyr::ungroup()
top_bby_nms_1887
show/hide
popular_bby_nms <- babynames::babynames |> 
  dplyr::filter(name %in% top_bby_nms_1887[['name']] & 
      year >= 1897)

34.3 Grammar

CODE:

  • Create labels with labs()

  • Initialize the graph with ggplot() and provide data

  • Map year to x, name to y, and color to name

  • Add geom_point(), map n to alpha, and set the shape to 124 and size to 5

show/hide
labs_inst_pop_nms <- labs(
  title = "Popular names: then and now",
  subtitle = "Most popular names from 1887 appearing between 1897 - 2017",
  x = "Year",
  y = "Name")
ggp2_inst_pop_nms <- ggplot(data = popular_bby_nms, 
    mapping = aes(x = year, 
                  y = name, 
                  color = name)) + 
  geom_point(aes(alpha = n),
              shape = 124,
              size = 5)
ggp2_inst_pop_nms + 
  labs_inst_pop_nms

GRAPH:

34.4 More info

34.4.1 Saturation with cut_interval()

ggplot2 has two great functions for splitting numerical variables into intervals or widths. We’re going to create two datasets from babynames::babynames that capture the top names in 1900 and and the top names 2000.

show/hide
top_nms_prop_1900 <- babynames::babynames |>
  dplyr::filter(year == 1900) |>
  dplyr::group_by(sex, name) |> 
  dplyr::summarise(max_prop = max(prop)) |> 
  dplyr::slice_max(n = 2, order_by = max_prop) |>
  dplyr::ungroup()
top_nms_prop_1900
top_nms_prop_2000 <- babynames::babynames |>
  dplyr::filter(year == 2000) |>
  dplyr::group_by(sex, name) |> 
  dplyr::summarise(max_prop = max(prop)) |> 
  dplyr::slice_max(n = 2, order_by = max_prop) |>
  dplyr::ungroup()
top_nms_prop_2000

These two tibbles tell us something about the top names over a century. The top names in 1900 have much higher proportions than the top names in 2000s.

  • We’ll get the names from both tibbles and filter babynames::babynames to only these eight names between the years 1900 and 2000.

  • We’ll create a prop_range variable, which splits the prop variable into intervals based on the length argument.

show/hide
nms_1900 <- top_nms_prop_1900[["name"]]
nms_2000 <- top_nms_prop_2000[["name"]]

top_nms_1900_2000 <- c(nms_1900, nms_2000)

popular_bby_nms_prop <- babynames::babynames |>
  dplyr::filter(name %in% top_nms_1900_2000 &
                  year >= 1900 &
                    year <= 2000) |>
  dplyr::mutate(
    # proportion range 
    prop_range = ggplot2::cut_interval(x = prop, 
                                       length = 0.025))

Below we can see the proportion ranges have been built with the interval notation: "(a,b]"

  • We can also see the proportion of names changes considerably between the two groups of names.
show/hide
labs_inst_pop_nms_prop <- labs(
  title = "100 years of popular names",
  subtitle = "Most popular names from 1900 and 2000",
  caption = "*based on proportion",
  x = "Year",
  y = "Name",
  color = "Proportion ranges",
  alpha = "Proportion")
ggp2_inst_pop_nms_prop <- ggplot(data = popular_bby_nms_prop, 
    mapping = aes(x = year, 
                  y = name, 
              color = prop_range)) + 
  geom_point(aes(alpha = prop),
                shape = 124,
                size = 2.5) 
ggp2_inst_pop_nms_prop + 
  labs_inst_pop_nms_prop 

34.4.2 Facets with cut_number()

To demonstrate facets, we’ll create two other variables: year_range and group:

  • year_range uses ggplot2::cut_number() to create three groups based on the year, orders the results, and removes the labels. We manually assign the labels to this variable with case_when() and factor()

  • group is a factor variable with two levels: 1900 names and 2000 names

show/hide
popular_bby_nms_fct <- popular_bby_nms_prop |>
  dplyr::mutate(
    # year range 
    year_range = ggplot2::cut_number(year, 
                                       n = 3, 
                                  labels = FALSE,
                          ordered_result = TRUE),
  year_range = case_when(
    year_range == 1 ~ "1900 - 1936",
    year_range == 2 ~ "1937 - 1970",
    year_range == 3 ~ "1971 - 2000"),
  year_range = factor(year_range,
    levels = c("1900 - 1936",
               "1937 - 1970",
               "1971 - 2000"),
    ordered = TRUE),
    group = case_when(
      name %in% nms_1900 ~ "1900 names",
      name %in% nms_2000 ~ "2000 names",
      TRUE ~ NA_character_),
    group = factor(group, 
      levels = c("1900 names", 
                 "2000 names")))

Below we can see the total counts of names in the cross-table of year_range and group

show/hide
popular_bby_nms_fct |> 
  dplyr::count(year_range, group) |> 
  tidyr::pivot_wider(names_from = group, 
    values_from = n)

In the graph, we’ll create labels using the levels from the year_range variable.

  • We can also change the shape used in geom_point() to the pipe operator ("|")
show/hide
# create labels from factor levels 
st_lbls <- paste0(
              levels(popular_bby_nms_fct$year_range),
            collapse = ", ")

labs_inst_pop_nms_facet <- labs(
  title = "Baby name popularity over time",
  subtitle = paste0("Time span: ", st_lbls),
  x = "Year",
  y = "Name")

ggp2_inst_pop_nms_facet <- ggplot(popular_bby_nms_fct,
  mapping = aes(x = year,
                y = name)) +
  geom_point(aes(alpha = prop, color = name),
                 shape = "|",
                 size = 3,
                 show.legend = FALSE) +
  facet_wrap(year_range ~ group,
             scales = "free_y", 
             ncol = 2)
ggp2_inst_pop_nms_facet +
  labs_inst_pop_nms_facet