class: center, middle, inverse, title-slide

.title[
# ggplot2 Graph Gallery
]
.subtitle[
## Categories and distributions: distributions
]
.author[
### Martin Frigaard
]
.date[
### 2022-05-22
]

---

### Load data packages

<br>

```r
library(palmerpenguins)
library(fivethirtyeight)
library(ggplot2movies)
```

---
class: left, top
background-image: url(https://allisonhorst.github.io/palmerpenguins/reference/figures/logo.png)
background-position: 95% 8%
background-size: 6%

## `palmerpenguins`

[package website](https://allisonhorst.github.io/palmerpenguins/)

```r
penguins <- palmerpenguins::penguins
penguins
```

.small[
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

## `fivethirtyeight`

[package website](https://fivethirtyeight-r.netlify.app/)

*All datasets are listed below with descriptions*

.small[
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

## `ggplot2movies`

[package website](https://github.com/hadley/ggplot2movies)

*We're using `movies_data` (a derived version of `ggplot2movies::movies`)*

```r
movies_data
```

.small[
]

---
class: inverse, center, top
background-image: url(images/ggplot2.png)
background-position: 50% 50%
background-size: 20%

# Comparing Categories and Distributions

<br><br><br><br><br><br><br><br><br><br><br>

# Distributions

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: Histograms

<br>

.large[*Histograms use bars, but the `x` axis is divided into 'bins' that cover the range of the variable. The default number of bins is `30` (but you should experiment to see how many bins fit your variable's distribution). In `ggplot2`, the geom for creating histograms is `geom_histogram()`.*]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: Histograms

*Map `flipper_length_mm` to the `x` axis, then add the `geom_histogram()` layer and the labels*

.panelset[
.panel[.panel-name[R Code]

```r
labs_histogram <- labs(
  x = "Flipper length (millimeters)",
  title = "Adult foraging penguins")
ggplot(data = penguins,
       aes(x = flipper_length_mm)) +
  geom_histogram() +
  labs_histogram
```

]

.panel[.panel-name[Plot]

<img src="ggp2-distributions_files/figure-html/unnamed-chunk-1-1.png" width="972.288" style="display: block; margin: auto;" />

]
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: Frequency Polygon

<br><br>

.large[
*Frequency polygons (`geom_freqpoly()`) are similar to histograms, but use lines instead of bars to represent the variable distribution.*
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: Frequency Polygon

*Map `flipper_length_mm` to the `x` axis, then add the `geom_freqpoly()` layer and the labels*

.panelset[
.panel[.panel-name[R Code]

```r
labs_freqpoly <- labs(
  x = "Flipper length (millimeters)",
  title = "Adult foraging penguins")
ggplot(data = penguins,
       aes(x = flipper_length_mm)) +
  geom_freqpoly() +
  labs_freqpoly
```

]

.panel[.panel-name[Plot]

<img src="ggp2-distributions_files/figure-html/unnamed-chunk-2-1.png" width="972.288" style="display: block; margin: auto;" />

]
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: More Frequency Polygons

<br><br>

.large[
*Frequency polygons are helpful when we want to look at a continuous variable across the levels of a categorical variable*
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: More Frequency Polygons

*Add the `color` and `group` aesthetics to see multiple polygons.*

.panelset[
.panel[.panel-name[R Code]

```r
labs_freqpoly_2 <- labs(
  x = "Flipper length (millimeters)",
  color = "Penguin species",
  title = "Adult foraging penguins")
ggplot(data = penguins,
       aes(x = flipper_length_mm)) +
  geom_freqpoly(
*   aes(color = species,
*       group = species)) +
  labs_freqpoly_2
```

]

.panel[.panel-name[Plot]

<img src="ggp2-distributions_files/figure-html/unnamed-chunk-3-1.png" width="972.288" style="display: block; margin: auto;" />

]
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: Dot-Plots

<br>

.large[*Dot-plots (`geom_dotplot()`) are similar to histograms and frequency polygons, except instead of using bars or lines, they use dots to represent the values of a given variable.*]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: Dot-Plots

*Map `flipper_length_mm` to the `x` axis, then add the `geom_dotplot()` layer (adjusting the `dotsize`) and the labels*

.panelset[
.panel[.panel-name[R Code]

```r
labs_dotplot <- labs(
  x = "Flipper length (millimeters)",
  title = "Adult foraging penguins")
ggplot(data = penguins,
       aes(x = flipper_length_mm)) +
  geom_dotplot(dotsize = 0.5) +
  labs_dotplot
```

]

.panel[.panel-name[Plot]

<img src="ggp2-distributions_files/figure-html/unnamed-chunk-4-1.png" width="972.288" style="display: block; margin: auto;" />

]
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: More Dot-Plots

.panelset[
.panel[.panel-name[R Code]

```r
# remove missing sex (dplyr:: avoids picking up stats::filter)
penguins_histodot <- dplyr::filter(penguins, !is.na(sex))
```

]

.panel[.panel-name[Data]
]
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: More Dot-Plots

<br>

.large[
*We can also use dot-plots to look at the range of a continuous (numerical) variable across the levels of a categorical (character) variable (like `sex` below).*

*By default, the bin width (which sets the size of the dots) is 1/30 of the range of the data. We can adjust it with `binwidth`, and use `method = "histodot"` to get fixed-width bins like a histogram.*
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: More Dot-Plots

.panelset[
.panel[.panel-name[R Code]

```r
labs_histodot <- labs(
  x = "Flipper length (millimeters)",
  fill = "Sex",
  title = "Adult foraging penguins")
ggplot(data = penguins_histodot,
       aes(x = flipper_length_mm,
*          fill = factor(sex))) +
  geom_dotplot(
    stackgroups = TRUE,
*   binwidth = 1,
*   method = "histodot") +
  labs_histodot
```

]

.panel[.panel-name[Plot]

<img src="ggp2-distributions_files/figure-html/unnamed-chunk-6-1.png" width="972.288" style="display: block; margin: auto;" />

]
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: Bee-swarm Plots

<br>

.large[*We can also use dots to show the spread of values for a particular variable with [`bee-swarm`](https://github.com/eclarke/ggbeeswarm) plots.
These display the distribution of numeric values across the levels of a categorical variable.*]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: Bee-swarm Plots

*Map `island` to the `x` axis and `color`, map `body_mass_g` to the `y` axis, then add the `geom_beeswarm()` layer (with `alpha`) and the labels*

.panelset[
.panel[.panel-name[R Code]

```r
labs_beeswarm <- labs(
  x = "Island in Palmer Archipelago, Antarctica",
  y = "Body mass (grams)",
  # color is mapped to island, so label the legend accordingly
  color = "Island",
  title = "Adult foraging penguins")
ggplot(data = penguins,
       aes(x = island,
           y = body_mass_g,
           color = island)) +
  ggbeeswarm::geom_beeswarm(
    alpha = 1/2) +
  labs_beeswarm
```

]

.panel[.panel-name[Plot]

<img src="ggp2-distributions_files/figure-html/unnamed-chunk-7-1.png" width="972.288" style="display: block; margin: auto;" />

]
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: Density Plots

<br>

.large[*Density plots are similar to frequency polygons and histograms, except the line has been 'smoothed.'
Instead of dividing the `x` axis into discrete 'bins' to create groups for the variable values, density plots draw a smoothed (kernel density) estimate of the distribution, with the amount of smoothing controlled by a 'bandwidth' parameter.*]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: Density Plots

*Map `flipper_length_mm` to the `x` axis, then add the `geom_density()` layer and the labels*

.panelset[
.panel[.panel-name[R Code]

```r
labs_density <- labs(
  x = "Flipper length (millimeters)",
  title = "Adult foraging penguins")
ggplot(data = penguins,
       aes(x = flipper_length_mm)) +
  geom_density() +
  labs_density
```

]

.panel[.panel-name[Plot]

<img src="ggp2-distributions_files/figure-html/unnamed-chunk-8-1.png" width="972.288" style="display: block; margin: auto;" />

]
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: More Density Plots

<br><br>

.large[*Similar to frequency polygons, `geom_density()` is useful when we want to look at the distribution of a continuous variable across the levels of a categorical variable*]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: More Density Plots

*We can map the `fill` aesthetic to a categorical variable and use `alpha` to handle the overlapping areas.*

.panelset[
.panel[.panel-name[R Code]

```r
labs_density_alpha <- labs(
  x = "Flipper length (millimeters)",
  fill = "Penguin sex (female, male)",
  title = "Adult foraging penguins")
# remove missing sex
penguins_density <- dplyr::filter(penguins, !is.na(sex))
ggplot(data = penguins_density,
       aes(x = flipper_length_mm,
*          fill = sex)) +
* geom_density(alpha = 1/3) +
  labs_density_alpha
```

]

.panel[.panel-name[Plot]

<img src="ggp2-distributions_files/figure-html/unnamed-chunk-9-1.png" width="972.288" style="display: block; margin: auto;" />

]
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: Ridgeline Plots

<br>

.large[
*If we want to plot density curves but retain the interpretability of the axes, consider comparing multiple distributions using [`ridgeline plots`](https://wilkelab.org/ggridges/) (`geom_density_ridges()`)*
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: Ridgeline Plots

*Map `bill_length_mm` to the `x` axis, map `island` to the `y` axis and `fill`, then add the `geom_density_ridges()` layer (with `alpha`) and the labels*

.panelset[
.panel[.panel-name[R Code]

.code80[

```r
labs_density_ridges <- labs(
  x = "Bill length (millimeters)",
  y = "Island in Palmer Archipelago, Antarctica",
  title = "Adult foraging penguins")
# remove missing island
penguins_density_ridges <- dplyr::filter(penguins, !is.na(island))
ggplot(data = penguins_density_ridges,
       aes(x = bill_length_mm,
           y = island,
           fill = island)) +
  # adjust alpha
  ggridges::geom_density_ridges(alpha = 2/3) +
  labs_density_ridges
```

]
]

.panel[.panel-name[Plot]

<img src="ggp2-distributions_files/figure-html/unnamed-chunk-10-1.png" width="972.288" style="display: block; margin: auto;" />

]
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: Box-plots

<br><br>

.large[
*Box-plots (sometimes called box-and-whisker plots) are great because they display a collection of statistics in a single graph.
We're going to build a box-plot of a single numeric variable and review its contents.*
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: Box-plots

*Map a blank character string (`" "`) to the `x` axis and `length` to the `y` axis, then add the `geom_boxplot()` layer and the labels*

.panelset[
.panel[.panel-name[R Code]

```r
labs_boxplot <- labs(
  y = "length",
  title = "IMDB Movie information and user ratings")
ggplot(data = movies_data,
       # place a blank string on the x axis
       aes(x = " ",
           # place the length on the y axis
           y = length)) +
  geom_boxplot() +
  labs_boxplot
```

]

.panel[.panel-name[Plot]

<img src="ggp2-distributions_files/figure-html/unnamed-chunk-11-1.png" width="972.288" style="display: block; margin: auto;" />

]
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: More Box-plots

.large[*The table below shows the 25th percentile, the median, the 75th percentile, the IQR, and a histogram of the `length` column from the `movies_data` dataset.*]

<div data-pagedtable="false">
<script data-pagedtable-source type="application/json">
{"columns":[{"label":["25th"],"name":[1],"type":["dbl"],"align":["right"]},{"label":["Median"],"name":[2],"type":["dbl"],"align":["right"]},{"label":["75th"],"name":[3],"type":["dbl"],"align":["right"]},{"label":["IQR"],"name":[4],"type":["dbl"],"align":["right"]},{"label":["Histogram"],"name":[5],"type":["chr"],"align":["left"]}],"data":[{"1":"92","2":"100","3":"113","4":"21","5":"▁▇▅▁▁"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}}
</script>
</div>

.large[*In the box-plot, the 25th percentile, the median, and the 75th percentile are drawn as three horizontal lines that give us a picture of the 'spread' of the data.
If there is equal distance on either side of the middle (`Median`) line, this tells us the distribution is symmetrical.*]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: More Box-plots

.large[
*The numbers from the table can help you interpret the structure of the box-plot.*

<img src="images/boxplot-diagram.png" width="75%" height="75%" style="display: block; margin: auto;" />
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: More Box-plots

<br>

.leftcol[
<img src="images/boxplot-diagram.png" width="100%" height="100%" style="display: block; margin: auto;" />
]

--

.rightcol[

+ *As we can see, the box-plot combines multiple summary statistics.*

+ *The 25th percentile (first quartile), the median (50th percentile or second quartile), and the 75th percentile (third quartile) values are common to all box-plots.*

+ *In `ggplot2`, values that fall more than 1.5 times the IQR beyond the edges of the box (the first and third quartiles) are displayed as individual points (aka .green[*outliers*]).
The lines (whiskers) extending from the bottom and top of the main box reach the last non-outlier values in the distribution.*

]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: More Box-plots

*Because box-plots provide so many helpful statistical measures, they are also useful for viewing how a continuous variable varies across the levels of a categorical variable*

.panelset[
.panel[.panel-name[R Code]

```r
labs_boxplots <- labs(
  x = "mpaa",
  y = "length",
  title = "IMDB Movie information and user ratings")
ggplot(data = movies_data,
       # place the mpaa rating on the x axis
       aes(x = mpaa,
           # place the length on the y axis
           y = length)) +
  geom_boxplot() +
  labs_boxplots
```

]

.panel[.panel-name[Plot]

<img src="ggp2-distributions_files/figure-html/unnamed-chunk-12-1.png" width="972.288" style="display: block; margin: auto;" />

]
]

---
class: left, top
background-image: url(images/pdg-hex.png)
background-position: 95% 8%
background-size: 7%

# Distributions: More Box-plots

*Compare the four graphs of `length` from `movies_data` below to the box-plot:*

<img src="images/boxplot-comparisons.png" width="85%" height="85%" style="display: block; margin: auto;" />

---
class: inverse, center, bottom
background-image: url(images/pdg-hex.png)
background-position: 50% 50%
background-size: 20%

# Thanks!