This section provides advanced exercises to improve your understanding of R Markdown.
View the slides for this section here.
YAML parameters give us the ability to add variables we can later refer to in our document. We will add some parameters to our report to see how these can be used. Add the following code to the YAML header (at the bottom):
params:
  data_dir: !r file.path("data/starwars.rds")
  list_vars: !r c("films", "vehicles", "starships")
These parameters will give us global control over the data we will be importing (even if that file changes in the future).
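When testing chunks outside of a knit, the `params` object may not exist yet. A minimal sketch, assuming the two parameters defined above, is to build a stand-in list by hand. The commented `rmarkdown::render()` call shows how a value could be overridden at render time; the file names `report.Rmd` and `data/other.rds` are hypothetical.

```r
# Stand-in for the `params` object that knitr builds from the YAML header
params <- list(
  data_dir  = file.path("data/starwars.rds"),
  list_vars = c("films", "vehicles", "starships")
)

# Parameters are read like elements of a named list
params$data_dir
params$list_vars[1]

# Override a parameter at render time (hypothetical file names):
# rmarkdown::render("report.Rmd", params = list(data_dir = "data/other.rds"))
```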
It’s hard to do any analysis without data! We will load a toy dataset from the Star Wars API. Add the code below to your .Rmd file to import the `StarWars` data. We will also name the code chunk `StarWars`, because that is the object this code creates.
```{r StarWars}
StarWars <- readr::read_rds(file = params$data_dir)
```
Note that we’ve loaded these data using the parameters we’ve defined above.
Details about the variables in the `StarWars` dataset are accessible in RStudio’s help files, which we can search using `??starwars`:
```{r StarWars-help}
??starwars
```
When we read the help file, we find there are three variables that are lists: `films`, `vehicles`, and `starships`. We have list-columns because the Star Wars API exports data as a JSON file, which, unlike a spreadsheet, is not tabular.
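To see what a list-column looks like in practice, here is a small sketch. It assumes the built-in `dplyr::starwars` dataset has the same columns as the imported `StarWars` data, which is not confirmed by the file itself.

```r
# Assumption: dplyr::starwars matches the columns of the imported StarWars data
sw <- dplyr::starwars

# Flag the columns that are stored as lists
is_list_col <- vapply(sw, is.list, logical(1))
names(sw)[is_list_col]

# Each cell of a list-column holds a vector with its own length
lengths(sw$films)[1:3]
sw$films[[1]]
```

Each character has a different number of films, which is why these values cannot sit in an ordinary atomic column.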
glimpse()
We can see a basic transposed display of the `StarWars` data with `dplyr`’s `glimpse()` function.
dplyr::glimpse(StarWars)
## Rows: 87
## Columns: 14
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
## $ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
## $ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
## $ films <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
## $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
`glimpse()` shows us the format and first few values of each variable in `StarWars`.
skim()
Below is a `skimr::skim()` view of the `StarWars` data. We can see each variable broken down by type, along with some summary information.
skimr::skim(StarWars)
Name | StarWars |
Number of rows | 87 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 8 |
list | 3 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 3 | 21 | 0 | 87 | 0 |
hair_color | 5 | 0.94 | 4 | 13 | 0 | 12 | 0 |
skin_color | 0 | 1.00 | 3 | 19 | 0 | 31 | 0 |
eye_color | 0 | 1.00 | 3 | 13 | 0 | 15 | 0 |
sex | 4 | 0.95 | 4 | 14 | 0 | 4 | 0 |
gender | 4 | 0.95 | 8 | 9 | 0 | 2 | 0 |
homeworld | 10 | 0.89 | 4 | 14 | 0 | 48 | 0 |
species | 4 | 0.95 | 3 | 14 | 0 | 37 | 0 |
Variable type: list
skim_variable | n_missing | complete_rate | n_unique | min_length | max_length |
---|---|---|---|---|---|
films | 0 | 1 | 24 | 1 | 7 |
vehicles | 0 | 1 | 11 | 0 | 2 |
starships | 0 | 1 | 17 | 0 | 5 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
height | 6 | 0.93 | 174.36 | 34.77 | 66 | 167.0 | 180 | 191.0 | 264 | ▁▁▇▅▁ |
mass | 28 | 0.68 | 97.31 | 169.46 | 15 | 55.6 | 79 | 84.5 | 1358 | ▇▁▁▁▁ |
birth_year | 44 | 0.49 | 87.57 | 154.69 | 8 | 35.0 | 52 | 72.0 | 896 | ▇▁▁▁▁ |
The `skimr` package is great for looking at large data summaries. Read more here.
jsonedit()
If you have JSON or lists (non-rectangular data) in R, these objects can be hard to visualize. The `jsonedit()` function from `listviewer` makes this easier by giving us an interactive display to click through.
library(listviewer)
listviewer::jsonedit(listdata = StarWars, mode = "view")
When we are analyzing large datasets that take a while to load, it might make sense to cache the data when it’s loaded in a code chunk. We can do this by including `cache=TRUE` in the previous `StarWars` code chunk.
```{r StarWars, cache=TRUE}
StarWars <- readr::read_rds(file = params$data_dir)
```
Re-knit the document with the new `cache` option.
We can determine the size of our dataset using `object.size()` from the `utils` package (which is loaded by default).
object.size(StarWars)
## 57520 bytes
Another option is using the `inspect_mem()` function from the `inspectdf` package.
library(inspectdf)
inspectdf::inspect_mem(df1 = StarWars) %>%
  inspectdf::show_plot(text_labels = TRUE,
                       col_palette = 1)
We can see from the data visualization that the list variables are accounting for most of the memory.
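The same comparison can be made without a plot. The sketch below uses only `object.size()` and, as an assumption, the built-in `dplyr::starwars` data in place of the imported `StarWars` object.

```r
# Assumption: dplyr::starwars stands in for the imported StarWars data
sw <- dplyr::starwars

# Size of every column in bytes
col_bytes <- vapply(sw, function(col) as.numeric(object.size(col)), numeric(1))

# Largest columns first
sort(col_bytes, decreasing = TRUE)

# Total size in a friendlier unit
format(object.size(sw), units = "Kb")
```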
Including the `cache=TRUE` option tells knitr to save the objects created in the `StarWars` chunk to disk and to reload them on later knits instead of re-running the code, until the `StarWars` import chunk is changed. Sometimes we will only want to analyze a subset of a dataset, so it makes sense to cache the larger dataset import chunk.
```{r StarWars, cache=TRUE}
StarWars <- readr::read_rds(file = params$data_dir)
```
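knitr decides whether a cache is still valid from the chunk’s code and options, not from the contents of the file being read. One way to also refresh the cache when the data file changes is to add a chunk option whose value depends on the file. This sketch follows the `cache.extra` convention and assumes the file’s modification time is a good enough signal:

```{r StarWars, cache=TRUE, cache.extra=file.mtime(params$data_dir)}
# The cache is rebuilt when the code, the options, or the file's
# modification time changes
StarWars <- readr::read_rds(file = params$data_dir)
```

The name `cache.extra` is a convention rather than a special option; any chunk option whose value changes will invalidate the cache.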
With the `StarWars` data cached, we can remove the list variables from `StarWars` and create a `StarWarsSmall` dataset. We saved the names of the list-columns in `params$list_vars`.
```{r StarWarsSmall}
# all_of() makes it explicit that the column names come from a character vector
StarWarsSmall <- StarWars %>% dplyr::select(-dplyr::all_of(params$list_vars))
```
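A quick way to confirm the list-columns are gone is to check the remaining column types. This sketch assumes `dplyr::starwars` stands in for `StarWars`, and uses a local `list_vars` vector in place of `params$list_vars`.

```r
# Stand-ins (assumptions) for params$list_vars and the StarWars data
list_vars <- c("films", "vehicles", "starships")
sw_small  <- dplyr::select(dplyr::starwars, -dplyr::all_of(list_vars))

# Rows and columns that remain
dim(sw_small)

# TRUE would mean at least one list-column is still present
any(vapply(sw_small, is.list, logical(1)))
```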
Let’s check the size of the new `StarWarsSmall` data by comparing it to the original `StarWars` dataset. This code chunk should look like this:
```{r inspect_mem-StarWars-StarWarsSmall}
inspectdf::inspect_mem(df1 = StarWars, df2 = StarWarsSmall) %>%
  inspectdf::show_plot(text_labels = TRUE, col_palette = 1)
```
When we cache data, a new folder named `your-file-name` + `_cache/html/` is created in the same directory as our R Markdown file.
We can see the `wk5-01_rmarkdown-in-practice_cache/html/` folder contents below:
## wk5-01_rmarkdown-in-practice_cache/html/
## ├── StarWars_e537726d085871e162ff24245c53a9c1.RData
## ├── StarWars_e537726d085871e162ff24245c53a9c1.rdb
## ├── StarWars_e537726d085871e162ff24245c53a9c1.rdx
## └── __packages
We can change the location of the data cache by specifying `cache.path` either in the code chunk, or in the `setup` chunk.
```{r StarWars, cache=TRUE, cache.path='data/'}
StarWars <- readr::read_rds(file = params$data_dir)
```
Note: you will need to make sure the `cache.path` folder exists, which we can do by calling `dir.create()` in a code chunk above the `StarWars` chunk. I like using `fs::dir_create()`, because it checks whether the folder exists and only creates it if it doesn’t.
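If the `fs` package is not available, the same check can be written in base R. This is a minimal sketch; the folder is created inside a temporary directory purely for illustration, and `cache_dir` is a hypothetical name.

```r
# Hypothetical cache folder inside a temporary directory
cache_dir <- file.path(tempdir(), "data")

# Create the folder only when it is missing
if (!dir.exists(cache_dir)) {
  dir.create(cache_dir, recursive = TRUE)
}

dir.exists(cache_dir)
```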
If we want to add cache options to the `setup` chunk, it would look like this:
```{r setup, include=FALSE}
# create data folder
fs::dir_create(path = "data/")
# set chunk options
knitr::opts_chunk$set(cache = TRUE,
                      cache.path = "data/")
```
Data analysis and exploration typically moves along in a (somewhat) linear fashion, which means our code chunks should be run sequentially. Sometimes this isn’t true, and we need some code chunks to depend on other, specific code chunks. In this case, we can use the `dependson` option in our code chunk.
In the Caching Data tab, we compared the `StarWars` and `StarWarsSmall` datasets using `inspect_mem()` in a code chunk named `inspect_mem-StarWars-StarWarsSmall`. Running this code is only possible after running the code in the `StarWars` chunk.
We can make the `inspect_mem-StarWars-StarWarsSmall` chunk dependent on `StarWars` by adding `dependson` and the code chunk name.
```{r inspect_mem-StarWars-StarWarsSmall, dependson = "StarWars"}
inspectdf::inspect_mem(df1 = StarWars, df2 = StarWarsSmall) %>%
  inspectdf::show_plot(text_labels = TRUE, col_palette = 1)
```
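Now, whenever the cached `StarWars` chunk changes, the cache for `inspect_mem-StarWars-StarWarsSmall` is invalidated and that chunk is re-run. Note that `dependson` only applies to cached chunks; it does not change the order in which chunks are executed.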
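Since the comparison uses both objects, a fuller version of the chunk might list both upstream chunks. This is a sketch under the assumption that the `StarWars` and `StarWarsSmall` chunks are themselves cached:

```{r inspect_mem-StarWars-StarWarsSmall, cache=TRUE, dependson=c("StarWars", "StarWarsSmall")}
# Re-run when either cached upstream chunk changes
inspectdf::inspect_mem(df1 = StarWars, df2 = StarWarsSmall) %>%
  inspectdf::show_plot(text_labels = TRUE, col_palette = 1)
```

knitr also has an `autodep` chunk option that tries to work out these dependencies automatically; listing them by hand keeps the relationship explicit.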
Graphs and figures are great tools for communicating results, and we want to keep track of all the visualizations we create in our report. R Markdown comes with multiple options for controlling the size, location, and quality of images in our reports.
We can adjust the size of our figures with `fig.height=` or `fig.width=`. Both take numeric values and set the dimensions of the figure in inches. We can also scale the figure in the output document with `out.width=` and `out.height=` (for example, as a percentage).
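Rather than repeating these options in every chunk, they can be set once as defaults. A minimal sketch, assuming the values used in this section suit every figure in the report:

```r
# Set figure defaults for all later chunks (values are illustrative)
knitr::opts_chunk$set(
  fig.width  = 8,
  fig.height = 5.5,
  out.width  = "100%"
)

# Inspect a default that is now in effect
knitr::opts_chunk$get("fig.width")
```

Options written in an individual chunk header still override these defaults.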
Below we visualize the average BMI by `species` and `gender` in the Star Wars universe. We also load the `hrbrthemes` package to give us more control over the aesthetics in our plot.
```{r gg_avg_bmi_spec_gend, fig.height=5.5, fig.width=8, out.width='100%', out.height='100%'}
library(hrbrthemes)
StarWars %>%
  # keep rows with the values needed to compute and group BMI
  dplyr::filter(!is.na(mass) & !is.na(height) & !is.na(species)) %>%
  # height is in centimeters, so convert to meters first
  dplyr::mutate(bmi = mass / ((height / 100) ^ 2)) %>%
  dplyr::group_by(species, gender) %>%
  dplyr::summarize(mean_bmi = mean(bmi, na.rm = TRUE)) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(dplyr::desc(mean_bmi)) %>%
  dplyr::mutate(species = reorder(species, mean_bmi)) %>%
  ggplot2::ggplot(ggplot2::aes(x = mean_bmi, y = species,
                               color = as.factor(species),
                               group = gender)) +
  ggplot2::geom_point(show.legend = FALSE) +
  ggplot2::facet_wrap(. ~ gender, scales = "free") +
  ggplot2::labs(title = "Average BMI in Star Wars Universe",
                subtitle = "Grouped by species and gender",
                caption = "source = https://swapi.dev/",
                x = "Mean BMI", y = "Species") +
  hrbrthemes::theme_ipsum_rc(axis_text_size = 9,
                             axis_title_size = 13,
                             strip_text_size = 13) -> gg_avg_bmi_spec_gend
gg_avg_bmi_spec_gend
```
This figure fits the page well because we set its height and width explicitly.
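The BMI calculation in the chunk above divides mass in kilograms by the square of height in meters. As a worked check, here is the same formula wrapped in a small helper (the function name `bmi` is hypothetical) and applied to the first row of the data, where `mass` is 77 and `height` is 172:

```r
# BMI = mass (kg) / height (m)^2; height arrives in centimeters
bmi <- function(mass_kg, height_cm) {
  mass_kg / (height_cm / 100)^2
}

# First row of the data: 77 kg and 172 cm, which gives about 26.03
bmi(mass_kg = 77, height_cm = 172)
```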
Now that we’ve created a few figures, we can see how these get stored to be used in the final .html file. Much like the default `cache` settings, when we create graphs in R Markdown, a default folder is created that is `your-file-name` + `_files`, and a subfolder `figure-html` contains the images for the document.
We can see the figure we created above, `gg_avg_bmi_spec_gend`, in the `figure-html` folder below:
## wk5-01_rmarkdown-in-practice_files/figure-html
## └── gg_avg_bmi_spec_gend-1.png
We can also manually specify where we want the figures saved with `fig.path=`. If we’re setting a folder for the figures, we can do it in the code chunk:
```{r figure-title, fig.path="img/"}
# code to create figure...
```
Or in the `setup` chunk (but we need to make sure the folder exists!):
```{r setup, include=FALSE}
# create image folder
fs::dir_create(path = "img/")
knitr::opts_chunk$set(fig.path = "img/")
```
Most graphs also have options for saving, which we will demonstrate using the `dm` and `starwarsdb` packages to show how the Star Wars data are related to one another.
The `starwarsdb` package comes with a data model function (`starwars_dm()`), which we will pass to `dm_draw()` from the `dm` package. `dm` stands for ‘data model’, and this package is great for visualizing relational data.
```{r StarWarsDataModel}
library(dm)
library(starwarsdb)
StarWarsDataModel <- dm_draw(dm = starwars_dm(),
                             graph_name = "StarWarsDataModel")
StarWarsDataModel
```
We can see the individual data tables, and which keys link them together.
This graph requires some additional steps to save as a .png, but the `rsvg::rsvg_png()` function lets us specify the file and folder path.
# packages to export
library(DiagrammeR)
library(DiagrammeRsvg)
library(rsvg)
# export file
StarWarsDataModel %>%
  DiagrammeRsvg::export_svg() %>%
  base::charToRaw() %>%
  rsvg::rsvg_png(height = 1440,
                 file = "../img/StarWarsDataModel.png")
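The export above is specific to this kind of diagram. For `ggplot2` figures, a more common route is `ggplot2::ggsave()`. The sketch below uses a small stand-in plot built from `dplyr::starwars` (an assumption, in place of the `StarWars` data) and writes to a temporary folder; the file name is hypothetical.

```r
library(ggplot2)

# Small stand-in plot (assumption: dplyr::starwars matches StarWars)
p <- ggplot(dplyr::starwars, aes(x = height, y = mass)) +
  geom_point(na.rm = TRUE)

# Hypothetical output location in a temporary folder
out_file <- file.path(tempdir(), "height-mass.png")

# Width and height are in inches, matching fig.width and fig.height
ggsave(filename = out_file, plot = p, width = 8, height = 5.5, dpi = 300)

file.exists(out_file)
```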
One of the biggest benefits of HTML output is the ability to create interactive graphs. One example comes from the `plotly` package.
We can convert a `ggplot2` graph to `plotly` with `ggplotly()`, and `toWebGL()` switches the result to WebGL rendering. We also remove the legend with `plotly::hide_legend()` so the plot looks like the version above.
library(plotly)
plotly::toWebGL(plotly::ggplotly(gg_avg_bmi_spec_gend)) %>%
  # remove legend
  plotly::hide_legend()