Data Visualization with ggplot2

ODSC West: https://bit.ly/odscw-ggp2

Outline

Part 1

  • Why do we need graphs?

  • An exploratory mindset

  • Surprise or confirm, then communicate

  • The grammar of graphics

Part 2

  • RStudio Cloud
  • Exercises & solutions
  • Creating a graph (layer-by-layer)
  • Applying the grammar

Part 1

  • Intros 👋
  • Workshop materials ⬇️
  • Basic understanding of ggplot2 syntax ✔️
  • Build your first graph! ✔️

Why do we need graphs?

Raw data don’t communicate well


It’s hard to make sense of millions of rows and/or thousands of columns


Fortunately, we are excellent at seeing patterns:


the human brain has a superior ability to mentally manipulate animate and inanimate patterns into a myriad of intangible symbols that can then be recombined to produce new images of the world;


we therefore live partly in worlds of our own mental creation, super-imposed upon or distinct from the natural world.

Graphs allow us to explore complexity with symbols and images

Exploratory Data Analysis


The term “Exploratory Data Analysis (EDA)” was coined by American mathematician John Tukey in 1977

The greatest value of a picture is when it forces us to notice what we never expected to see.

- John Tukey, 1977

Exploration requires ‘listening’



“The role of the data analyst is to listen to the data in as many ways as possible until a plausible ‘story’ of the data is apparent”

Exploration is a ‘state of mind’


“More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends…”


“As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.”


- Hadley Wickham, R for Data Science

An Exploratory Mindset

Exploration requires a Bayesian Mindset (1 of 3)


We all have implicit beliefs, or priors, about the world


What we think we know (i.e., our expectations)

Exploration requires a Bayesian Mindset (2 of 3)


When we encounter new information or data, our priors get updated


Our expectations + new data (i.e., what we see)

Exploration requires a Bayesian Mindset (3 of 3)


Our updated beliefs, or posteriors, depend on our priors and our perceptions of the new information


What we expect + what we see = what we’ve learned

Graphs can confirm our expectations


What if our expectation was that X is related to Y?




…then we graphed the data…

We would say our expectations have been confirmed

Graphs can refute our expectations


What if our expectation was that X is related to Y?




…then we graphed the data…

We would say our expectations have been refuted

ggplot2: grammar & syntax

Grammar


The system of rules for any given language


Includes:

  1. Word meanings
  2. Internal structure
  3. Word arrangement

Syntax


The form, structure and order for constructing statements


[[students][[cook][and][serve grandparents]]]

[[students][[cook and serve][grandparents]]]

ggplot2 : grammar & syntax


Built on top of the grammar & syntax of R


In R, objects are like nouns, and functions (fn) are like verbs


fn(object)


functions do things to objects
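
A minimal base R illustration of fn(object). The object name flipper_mm and its three values are made up for this sketch:

```r
# an object (noun): a numeric vector of made-up flipper lengths
flipper_mm <- c(181, 186, 195)

# functions (verbs) do things to the object
mean(flipper_mm)    # average of the values
length(flipper_mm)  # number of values
```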

ggplot2: a layered language for graphs


A ggplot2 graph is composed of layers

  • Data
  • Mapping
  • Statistics
  • Geometric objects
  • Position adjustments

ggplot2: data


The data layer consists of a rectangular object (like a spreadsheet) with columns and rows


ggplot(data = penguins)

ggplot2: mapping


The mapping layer assigns columns (variables) from the data to visual properties (i.e., graph ‘aes’thetics)


ggplot(data = penguins,
  mapping = 
    aes(x = flipper_length_mm, 
      y = bill_length_mm))

ggplot2: geoms


geom_*() functions combine a geometric shape with a statistical transformation and a position adjustment; together these determine how the data are ‘drawn’ on the graph


ggplot(data = penguins,
  mapping = aes(
    x = flipper_length_mm, 
    y = bill_length_mm)) +
  geom_point()

ggplot2: layers


We can have multiple layers (data, mappings, geoms) in a single graph


ggplot(data = penguins,
  mapping = aes(
    x = flipper_length_mm,
    y = bill_length_mm)) +
  # layer 1
  geom_point() +
  # layer 2
  geom_smooth(
    mapping = aes(
      x = flipper_length_mm,
      y = bill_length_mm,
      color = species))

Layers = infinitely extensible


ggplot2 is a system for

“making infinite use of finite means” - Wilhelm von Humboldt


With a finite number of objects & functions, we can combine ggplot2’s grammar and syntax to create an infinite number of graphs!

ggplot2: templates


Basic Template: Data, aesthetic mappings, geom


ggplot(data = <DATA>) +
  geom_*(mapping = aes(<AESTHETIC MAPPINGS>))

ggplot2: more templates


Template + 1 Layer: more geoms and more aesthetic mappings


ggplot(data = <DATA>) +
    geom_*(mapping = aes(<AESTHETIC MAPPINGS>)) +
    geom_*(mapping = aes(<AESTHETIC MAPPINGS>))

ggplot2: even more!


Template + 1 Layer + Facet Layer: template, more aesthetic mappings, and facets!


ggplot(data = <DATA>) +
    geom_*(mapping = aes(<AESTHETIC MAPPINGS>)) +
    geom_*(mapping = aes(<AESTHETIC MAPPINGS>)) +
    facet_*

Templates = infinitely extensible!


Themes

ggplot(data = <DATA>) +
    geom_*(mapping = aes(<AESTHETIC MAPPINGS>)) +
    geom_*(mapping = aes(<AESTHETIC MAPPINGS>)) +
    facet_* +
    theme_*

Don’t forget labels!

ggplot(data = <DATA>) +
    geom_*(mapping = aes(<AESTHETIC MAPPINGS>)) +
    geom_*(mapping = aes(<AESTHETIC MAPPINGS>)) +
    facet_* +
    theme_* +
    labs()

Part 2

  • RStudio Cloud ✔️
  • Exercises & solutions ✔️
  • Creating a graph (layer-by-layer) ✔️
  • Applying the grammar ✔️

RStudio.Cloud

RStudio.Cloud: Set up (1 of 4)


Head to RStudio.Cloud, where you will see the following:




Log in with your GitHub credentials

RStudio.Cloud: Set up (2 of 4)


On the top of the RStudio IDE, you will see the following:







Click on Save a Permanent Copy to add this project to your workspace

RStudio.Cloud: Set up (3 of 4)


In the Files pane, click on the inst.R file to open it




RStudio.Cloud: Set up (4 of 4)


In the Source pane, click on the Source icon to run inst.R


This sends the code in inst.R to the Console

RStudio.Cloud: Exercises





The exercises are in the exercises/ folder


RStudio.Cloud: Solutions





Each exercise has a solution file in the solutions/ folder


Quick Tip


Tip: writing code can be frustrating, especially in the beginning…


…it doesn’t always produce a tangible result…


…but creating visualizations is rewarding!!!

ggplot2: build the labels first!


Create a title, subtitle (with data source), and x/y axis labels


labs_penguins <- ggplot2::labs(
  title = "Flipper vs. Bill Length",
  subtitle = "source: palmerpenguins::penguins",
  x = "flipper length (mm)",
  y = "bill length (mm)")

Labels = our expectations

ggplot2: build graph, check labels


Build labels, build graphs, then check labels!


labs_penguins <- ggplot2::labs(
  title = "Flipper vs. Bill Length",
  subtitle = "source: palmerpenguins::penguins",
  x = "flipper length (mm)",
  y = "bill length (mm)")
ggp_peng_point <- ggplot(data = penguins,
    mapping = aes(x = bill_length_mm,
                  y = flipper_length_mm)) +
  labs_penguins

What’s wrong here?

ggplot2: build graph, check labels, revise


x and y are flipped!


labs_penguins <- ggplot2::labs(
  title = "Flipper vs. Bill Length",
  subtitle = "source: palmerpenguins::penguins",
  x = "flipper length (mm)",
  y = "bill length (mm)")
ggp_peng_point <- ggplot(data = penguins,
    mapping = aes(x = flipper_length_mm,
                  y = bill_length_mm)) +
  labs_penguins

Fixed!

On the importance of revision:

Revision Sharpens Thinking:

“More particularly, rewriting is the key to improved thinking. It demands a real open-mindedness and objectivity.”

“It demands a willingness to cull verbiage so that ideas stand out clearly. And it demands a willingness to meet logical contradictions head on and trace them to the premises that have created them.”

“In short, it forces a writer to get up his courage and expose his thinking process to his own intelligence.”

The data

Viewing data (1 of 3)


View() opens the RStudio data viewer


Viewing data (2 of 3)


The output of glimpse() and str() is displayed in the console


glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Viewing data (3 of 3)


The output of glimpse() and str() is displayed in the console


str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Build from scratch, layer-by-layer

graph 01: LABELS!


We want to build the labels first:


  • title = “Bill and flipper length of Palmer penguins”
  • subtitle = “Size measurements for adult foraging penguins”
  • x = “Bill length (mm)”
  • y = “Flipper length (mm)”

# build labels
labs_bill_vs_flippper <- ggplot2::labs(
  title = "Bill and flipper length of Palmer penguins",
  subtitle = "Size measurements for adult foraging penguins",
  x = "Bill length (mm)",
  y = "Flipper length (mm)")

graph 01: Initialize plot with data


The ggplot2::ggplot() function initializes the plot:



Place penguins in the data argument

ggplot(data = penguins)

This gives us a blank canvas!

graph 02: Map variables to positions


We have our data and labels; now we just need to add our variables!



Map bill_length_mm to x

ggplot(data = penguins,
  mapping = aes(
    x = bill_length_mm))


Map flipper_length_mm to y

ggplot(data = penguins,
  mapping = aes(
    x = bill_length_mm,
    y = flipper_length_mm))

Now our canvas has x and y axes

graph 03: Adding geoms


Add the geom_point() function with the + symbol


ggplot(data = penguins,
  mapping = aes(
    x = bill_length_mm,
    y = flipper_length_mm)) +
  geom_point()


Don’t confuse the + symbol with the pipes (|> or %>%)

The geom_point() function tells R we want to see the points (or dots) on our canvas:
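
The difference between the pipe and + can be seen in one short sketch. It assumes R 4.1 or later (for the native pipe) and that the ggplot2 and palmerpenguins packages are installed; the object name p_piped is made up for illustration:

```r
library(ggplot2)
library(palmerpenguins)

# the pipe (|>) hands penguins to ggplot() as its first argument (data);
# the + then adds a layer to the plot that ggplot() returns
p_piped <- penguins |>
  ggplot(mapping = aes(
    x = bill_length_mm,
    y = flipper_length_mm)) +
  geom_point(na.rm = TRUE) # na.rm = TRUE silences the missing-value warning

p_piped
```

The pipe moves data between functions; the + only ever joins ggplot2 components.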

graph 04: Don’t forget the labels!


Finally, we want to add the labels we created (labs_bill_vs_flippper)

Add labels with +

ggplot(data = penguins,
  mapping = aes(
    x = bill_length_mm,
    y = flipper_length_mm)) +
  geom_point() +
  labs_bill_vs_flippper

And we have our first graph!

Global vs. local mapping

Global mapping


The previous graphs mapped aesthetics globally

Global = aesthetics are mapped when the graph is initialized with ggplot():


ggplot(data = penguins,
  mapping = aes(
    x = bill_length_mm,
    y = flipper_length_mm)) 

Recall the layers from Part 1:

If we map aesthetics in ggplot(), all the following geom_*() layers will inherit these aesthetics
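
A small sketch of this inheritance, assuming the ggplot2 and palmerpenguins packages are installed. The object name p_inherit and the choice of a linear smoother (method = "lm") are illustrative, not part of the workshop exercises:

```r
library(ggplot2)
library(palmerpenguins)

p_inherit <- ggplot(data = penguins,
  mapping = aes(
    x = bill_length_mm,
    y = flipper_length_mm)) +
  # no mapping argument here: x and y are inherited from ggplot()
  geom_point(na.rm = TRUE) +
  # the same x and y are inherited by this layer too
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

p_inherit
```

Neither geom_*() call names x or y, yet both layers are drawn on the same axes.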

Local mapping


Mapping aesthetics globally and then adding the geom_*() function results in the same graph as when we map aesthetics locally (inside the geom_*() function)

ggplot(data = penguins,
  mapping = aes(
    x = bill_length_mm,
    y = flipper_length_mm)) +
  geom_point() +
    labs_bill_vs_flippper

ggplot(data = penguins) +
  geom_point(mapping = 
      aes(x = bill_length_mm,
          y = flipper_length_mm)) +
  labs_bill_vs_flippper

The ggplot2 templates (refresher)


The template from part 1 uses local mappings (i.e. aesthetic mappings are set inside the geom_* function).

# Recall our template from Part 1
ggplot(data = <DATA>) +
  geom_*(mapping = aes(<AESTHETIC MAPPINGS>))


Below we’ve adjusted the template to include global mappings (and the option to include aesthetic mappings locally)

# Adjusted template
ggplot(data = <DATA>,
  mapping = aes(<AESTHETIC MAPPINGS>)) + # global mappings
  geom_*(mapping = aes(<AESTHETIC MAPPINGS>)) # local mappings


graph 05: Convert global to local mappings


For graph-05.R, convert the global aesthetics to local aesthetics inside the geom_point() function


Global

ggplot(data = penguins,
  mapping = aes(x = bill_length_mm,
                y = flipper_length_mm)) +
  geom_point() +
  labs_bill_vs_flippper

Local

ggplot(data = penguins) +
  geom_point(mapping =
      aes(x = bill_length_mm,
          y = flipper_length_mm)) +
  labs_bill_vs_flippper

Visual encodings

What are visual encodings?


Visual encodings are what we see on the graph


Things like position, size, shape, color, etc.

Ranked by accuracy

graph 06: Adding color (global)


Map color to the species variable using global aesthetic mapping:

Inside the aes() function:

ggplot(data = penguins,
  mapping =
    aes(x = bill_length_mm,
        y = flipper_length_mm,
        color = species)) +
  geom_point() +
  labs_bill_vs_flippper

ggplot2 includes a legend by default

graph 07: Adding color (local)


Map color to the species variable using local aesthetic mapping

The x and y aesthetics are inherited from the ggplot() function…

ggplot(data = penguins,
  mapping =
    aes(x = bill_length_mm,
        y = flipper_length_mm)) +
  geom_point(
    aes(color = species)) +
  labs_bill_vs_flippper

…but the color aesthetic comes from the geom_point() layer
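
The practical difference between global and local color shows up once a second layer is added. The sketch below assumes the ggplot2 and palmerpenguins packages are installed; the geom_smooth() layer and the object names p_global and p_local are additions for illustration:

```r
library(ggplot2)
library(palmerpenguins)

# color mapped globally: every layer is split by species
p_global <- ggplot(data = penguins,
  mapping = aes(
    x = bill_length_mm,
    y = flipper_length_mm,
    color = species)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

# color mapped locally: only the points are split by species
p_local <- ggplot(data = penguins,
  mapping = aes(
    x = bill_length_mm,
    y = flipper_length_mm)) +
  geom_point(aes(color = species), na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

p_global # one trend line per species
p_local  # a single trend line for all penguins
```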

graph 08: Color vs. Fill (1 of 2)


Below we’ll look at the counts of sex vs. species of Palmer penguins

First create labels!

labs_sex_vs_species <- ggplot2::labs(
  title = "Sex by species of Palmer penguins",
  subtitle = "Counts for adult foraging penguins",
  x = "Sex",
  fill = "Species")

Create penguins_no_miss by removing missing values

penguins_no_miss <- drop_na(data = penguins)

View our data:

glimpse(penguins_no_miss, 50)
Rows: 333
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie…
$ island            <fct> Torgersen, Torgersen, …
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3…
$ flipper_length_mm <int> 181, 186, 195, 193, 19…
$ body_mass_g       <int> 3750, 3800, 3250, 3450…
$ sex               <fct> male, female, female, …
$ year              <int> 2007, 2007, 2007, 2007…
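
To see what drop_na() removed, we can count the missing values per column and compare the row counts before and after (344 and 333 in the glimpse() outputs). This is a sketch that assumes the palmerpenguins and tidyr packages are installed:

```r
library(palmerpenguins)
library(tidyr)

# number of missing values in each column
colSums(is.na(penguins))

# drop_na() keeps only the rows with no missing value in any column
penguins_no_miss <- drop_na(data = penguins)

nrow(penguins)         # rows before
nrow(penguins_no_miss) # rows after
```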

graph 08: Color vs. Fill (2 of 2)


Some geom_*() functions take the fill aesthetic instead of color


Build a bar-graph using geom_bar() by locally mapping sex to the x axis and species to fill

ggplot(data = penguins_no_miss) +
  geom_bar(mapping =
      aes(x = sex,
          fill = species)) +
  labs_sex_vs_species


Don’t forget the labels!

graph 09: Bar position


Stacked bar-graphs make it difficult to do side-by-side comparisons using the y axis


Using the same code as graph 08, add the position = "dodge" argument outside the aes() function

ggplot(data = penguins_no_miss) +
  geom_bar(mapping = aes(x = sex,
    fill = species),
    position = "dodge") +
  labs_sex_vs_species
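
Changing the position moves the bars but not the numbers behind them. The sketch below checks this with layer_data(), which returns the data a layer computed. It assumes the ggplot2, palmerpenguins, and tidyr packages are installed; p_stack and p_dodge are made-up names:

```r
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins_no_miss <- drop_na(data = penguins)

# default position: bars are stacked
p_stack <- ggplot(data = penguins_no_miss) +
  geom_bar(mapping = aes(x = sex, fill = species))

# position = "dodge": bars are placed side by side
p_dodge <- ggplot(data = penguins_no_miss) +
  geom_bar(mapping = aes(x = sex, fill = species),
    position = "dodge")

# the counts computed for each sex-by-species bar are the same
layer_data(p_stack)$count
layer_data(p_dodge)$count
```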

graph 10: Histograms (special bar-graphs)


The geom_histogram() function groups the values of a variable into ‘bins’ and draws a bar showing the count in each bin


Create new labels

labs_bodymass_vs_species <- ggplot2::labs(
  title = "Body mass by species of Palmer penguins",
  subtitle = "Counts for adult foraging penguins",
  x = "Body Mass (grams)",
  fill = "Species")

Create a histogram of body_mass_g, colored (filled) by species

ggplot(data = penguins) +
  geom_histogram(
    mapping = aes(
      x = body_mass_g,
      fill = species)) +
    labs_bodymass_vs_species
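
By default geom_histogram() uses 30 bins and prints a message suggesting a better value. A sketch of setting the bin width ourselves follows, assuming the ggplot2 and palmerpenguins packages are installed; the width of 250 grams and the name p_hist are arbitrary choices for illustration:

```r
library(ggplot2)
library(palmerpenguins)

p_hist <- ggplot(data = penguins) +
  geom_histogram(
    mapping = aes(
      x = body_mass_g,
      fill = species),
    binwidth = 250, # each bin spans 250 grams
    na.rm = TRUE)   # drop missing body masses without a warning

p_hist
```

Try a few widths: bins that are too wide hide the shape of the distribution, and bins that are too narrow make it noisy.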

graph 11: Density plots


Density plots are also great for comparing overlapping distributions


Create a density plot with geom_density()

Set the alpha (opacity) to 1/2

ggplot(data = penguins) +
  geom_density(
    mapping =
      aes(x = body_mass_g,
          fill = species),
    alpha = 1/2) +
  labs_bodymass_vs_species

Also check out ridgeline plots

Mapping vs. setting aesthetics

Mapping vs. setting (1 of 2)


Variables are mapped to aesthetics inside aes()

ggplot(data = penguins_no_miss) +
  geom_point(
    mapping =
      aes(x = bill_length_mm,
          y = flipper_length_mm,
          color = sex)) + # inside
  labs_bill_vs_flippper

Values are set outside the aes() function

ggplot(data = penguins_no_miss) +
  geom_point(
    mapping =
      aes(x = bill_length_mm,
          y = flipper_length_mm),
    color = "dodgerblue") + # outside
  labs_bill_vs_flippper

Mapping vs. setting (2 of 2)


From the ggplot2 book



If you want appearance to be governed by a variable, put the specification inside aes(); if you want to override the default size or colour, put the value outside of aes().

graph 12: Setting graph aesthetics


Change the code below to make the points "firebrick" red


Create labels

labs_body_mass_vs_bill_depth <- ggplot2::labs(
  title = "Body mass and bill depth of Palmer penguins",
  subtitle = "Size measurements for adult foraging penguins",
  x = "Body mass (g)",
  y = "Bill depth (mm)")

What color will the points be on this graph?

ggplot(data = penguins) +
  geom_point(
    mapping = aes(
      x = body_mass_g,
      y = bill_depth_mm,
      color = "firebrick")) +
    labs_body_mass_vs_bill_depth

TIP: the legend shows that ggplot2 treated "firebrick" as a data value mapped to color (a variable with the single value "firebrick"), not as the color to draw with. To set the color, move it outside aes().
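
One way to get firebrick points, following the mapping vs. setting rule, is sketched below. It assumes the ggplot2 and palmerpenguins packages are installed; the name p_set is made up, and the workshop's own solution file may be written differently:

```r
library(ggplot2)
library(palmerpenguins)

p_set <- ggplot(data = penguins) +
  geom_point(
    mapping = aes(
      x = body_mass_g,
      y = bill_depth_mm),
    color = "firebrick", # set outside aes(): a fixed color, no legend
    na.rm = TRUE)        # drop missing values without a warning

p_set
```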

Combining layers

graph 13: New layer, new data, no problem


Each geom_*() function also has a data argument, so we can supply new data at each layer


Create a dataset of the max bill length and depth, body mass and flipper length (big_penguins):

big_penguins <- bind_rows(
  slice_max(penguins, bill_length_mm, n = 1),
  slice_max(penguins, bill_depth_mm, n = 1),
  slice_max(penguins, flipper_length_mm, n = 1),
  slice_max(penguins, body_mass_g, n = 1)
)

Create the label and source columns

big_penguins <- mutate(big_penguins,
 label = case_when(
  bill_length_mm == 59.6 ~ paste0("long bill = ", bill_length_mm),
  bill_depth_mm == 21.5 ~ paste0("deep bill = ", bill_depth_mm),
  flipper_length_mm == 231 ~ paste0("big flipper = ", flipper_length_mm),
  body_mass_g == 6300 ~ paste0("big bird = ", body_mass_g)),
 source = case_when(
  bill_length_mm == 59.6 ~ "max bill length",
  bill_depth_mm == 21.5 ~ "max bill depth",
  flipper_length_mm == 231 ~ "max flipper length",
  body_mass_g == 6300 ~ "max body mass"))

Our label dataset


Objective: Create a scatter-plot to show the relationship between body mass, flipper length, and bill depth.

label              source
long bill = 59.6   max bill length
deep bill = 21.5   max bill depth
big flipper = 231  max flipper length
big bird = 6300    max body mass
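
The case_when() conditions hard-code the four maximum values. A quick check that they match the data is sketched below, assuming the palmerpenguins package is installed; the name max_values is made up for illustration:

```r
library(palmerpenguins)

# maximum of each measurement, ignoring missing values
max_values <- c(
  bill_length_mm    = max(penguins$bill_length_mm, na.rm = TRUE),
  bill_depth_mm     = max(penguins$bill_depth_mm, na.rm = TRUE),
  flipper_length_mm = max(penguins$flipper_length_mm, na.rm = TRUE),
  body_mass_g       = max(penguins$body_mass_g, na.rm = TRUE))

# compare with the values used in case_when(): 59.6, 21.5, 231, 6300
max_values
```

If the data ever change, hard-coded values like these silently stop matching, so a check of this kind is worth keeping next to the labels.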

graph 13: Layer 1


Create layer 1 with penguins_no_miss data and geom_point()

Create labels

labs_bodymass_bill_depth_flipper_length <- labs(
  title = "Body mass, flipper length & bill depth",
  subtitle = "Size measurements for Palmer penguins",
  x = "Bill depth (mm)",
  y = "Flipper length (mm)",
  size = "Body mass (g)")

Assign x, y, size, and alpha

ggp_13 <- ggplot(data = penguins_no_miss) +
  # layer 1
  geom_point(
    mapping = 
      aes(x = bill_depth_mm,
          y = flipper_length_mm,
          size = body_mass_g),
      alpha = 1/2)
ggp_13 +
    # labels
    labs_bodymass_bill_depth_flipper_length

graph 14: Layer 2


Create layer 2 with another geom_point() using color and size

Use scale_size() to adjust point scaling

ggp14 <- ggp_13 +
  # layer 2
  geom_point(
    data = big_penguins,
    mapping = aes(
      x = bill_depth_mm,
      y = flipper_length_mm,
      # color by source
      color = source,
      size = body_mass_g)) +
  # re-scale
  scale_size(range = c(1, 5)) 
ggp14 +
  # labels
  labs_bodymass_bill_depth_flipper_length

graph 15: Layer 3 (max values)


Add layer 3 with the geom_label_repel() function from ggrepel


Add layer for labels in big_penguins

library(ggrepel)
ggp15 <- ggp14 +
  # layer 3
  ggrepel::geom_label_repel(
    data = big_penguins,
    mapping = aes(x = bill_depth_mm,
      y = flipper_length_mm,
      label = label)) 
ggp15 +
  # labels
  labs_bodymass_bill_depth_flipper_length

Facets

Small multiples


Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different.

Facets = small multiples


In the previous graph, we used multiple aesthetics (position, color, size)

Can we explore these relationships by sex or species?

Store the first two layers of graph 15 in ggp_penguin_measures

ggp15_l1 <- ggplot(data = penguins_no_miss) +
  geom_point(
    mapping = aes(x = bill_depth_mm,
      y = flipper_length_mm,
      size = body_mass_g),
    alpha = 1 / 3) 


ggp_penguin_measures <- ggp15_l1 +
  geom_point(data = big_penguins,
    mapping = aes(
      x = bill_depth_mm,
      y = flipper_length_mm,
      color = source,
      size = body_mass_g), 
    show.legend = FALSE) +
  scale_size(range = c(1, 5))

graph 16: Facet by sex


Use facet_wrap() to view our previous graph by sex


facet_wrap() takes a formula: . ~ <VARIABLE> (or just ~ <VARIABLE>)

ggp_penguin_measures +
  ggrepel::geom_label_repel(
    data = big_penguins,
    mapping = aes(
      x = bill_depth_mm,
      y = flipper_length_mm,
      label = label),
    size = 2) + # adjust size
  facet_wrap(. ~ sex) + # facet by sex
  # labels
  labs_bodymass_bill_depth_flipper_length

graph 17: Facet by species


Change facet_wrap() to build graphs by species and add theme


Change facet_wrap() to ~ species
Add theme_minimal() and labels

ggp_penguin_measures +
  ggrepel::geom_label_repel(
    data = big_penguins,
    mapping = aes(x = bill_depth_mm,
      y = flipper_length_mm,
      label = label),
    size = 2) +
  # change to species
  facet_wrap(. ~ species) +
  # add theme
  theme_minimal() +
  # labels
  labs_bodymass_bill_depth_flipper_length
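
To facet by sex and species at the same time, facet_grid() arranges the panels in rows and columns. This is a minimal sketch on a plain scatter-plot (without the label layers above), assuming the ggplot2, palmerpenguins, and tidyr packages are installed; the name p_grid is made up:

```r
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins_no_miss <- drop_na(data = penguins)

p_grid <- ggplot(data = penguins_no_miss,
  mapping = aes(
    x = bill_depth_mm,
    y = flipper_length_mm)) +
  geom_point(alpha = 1 / 3) +
  # rows by sex, columns by species
  facet_grid(sex ~ species) +
  theme_minimal()

p_grid
```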

Recap

What we’ve covered



  1. Build labels (set your expectations)
  2. View data before building any graphs
  3. Build graphs layer-by-layer (data, mapping, geoms)
  4. Map variables to graph elements (color, position, size, etc.)
  5. Extend graphs by combining layers
  6. Use facets to explore relationships

Thanks!

Twitter @mjfrigaard

GitHub @mjfrigaard

Email @mjfrigaard