1 Objectives

This document outlines some advanced data wrangling with dplyr, tidyr, and stringr. Read more in the Data Transformation chapter of R for Data Science.

1.1 Load packages

These are all part of the tidyverse, so we install and load this meta-package below.

# install.packages("tidyverse")
library(tidyverse)

2 Pharmaceutical Companies

In the slides we had well-formatted dataset. In these exercises, we’re going to import some data from the wild (Wikipedia), because most data aren’t ready to visualize and model when we first get them.

2.1 Import the data

I Googled “largest pharmaceutical companies” and found this wikipedia page.

These packages will help us scrape the data in the table and manipulate it in R.

library(xml2)
library(rvest)

2.1.1 Read HTML

The first function I’ll use comes from the xml2 package. xml2::read_html() loads the html from the wikipedia page into an R object I call wiki_html.

wiki_html <- xml2::read_html("https://en.wikipedia.org/wiki/List_of_largest_biotechnology_and_pharmaceutical_companies")

We should check the structure of new objects, so we know what we’re working with.

wiki_html %>% str()

## List of 2
##  $ node:<externalptr> 
##  $ doc :<externalptr> 
##  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

I can see this is a list of two objects (a node and a doc).

2.1.2 HTML Nodes (css)

I want the html node, so I will use the html_nodes() function from the rvest package.

The css argument is set to "table". Once again I check the structure of the output object.

wiki_html_tables <- wiki_html %>% rvest::html_nodes(css = "table")
wiki_html_tables %>% str()

## List of 4
##  $ :List of 2
##   ..$ node:<externalptr> 
##   ..$ doc :<externalptr> 
##   ..- attr(*, "class")= chr "xml_node"
##  $ :List of 2
##   ..$ node:<externalptr> 
##   ..$ doc :<externalptr> 
##   ..- attr(*, "class")= chr "xml_node"
##  $ :List of 2
##   ..$ node:<externalptr> 
##   ..$ doc :<externalptr> 
##   ..- attr(*, "class")= chr "xml_node"
##  $ :List of 2
##   ..$ node:<externalptr> 
##   ..$ doc :<externalptr> 
##   ..- attr(*, "class")= chr "xml_node"
##  - attr(*, "class")= chr "xml_nodeset"

This is a list of three lists, each of them xml_nodes.

In order to find the relevant tables in the wiki_html_tables object, I need to be able to search on something. Fortunately, the base::grep() function can be used in combination with sub-setting to extract the relevant_tables from wiki_html_tables.

Get the relevant tables from the xml_nodeset in wiki_html_tables.

relevant_tables <- wiki_html_tables[base::grep("rank", wiki_html_tables)]
relevant_tables %>% str()

## List of 1
##  $ :List of 2
##   ..$ node:<externalptr> 
##   ..$ doc :<externalptr> 
##   ..- attr(*, "class")= chr "xml_node"
##  - attr(*, "class")= chr "xml_nodeset"

This returned yet another list of lists! (another set of node and doc).

2.1.3 Extract HTML table

Now I need to use rvest::html_table() function (combined with some bracket sub-setting) to explore this object and learn about it’s contents (read more about how this works here).

I will start with position [[1]] and set fill = TRUE.

rvest::html_table(relevant_tables[[1]], 
                  fill = TRUE) %>% utils::head()

This looks like the table I want! Let’s assign it to a data frame and get wrangling!

TopPharmCompRaw <- rvest::html_table(relevant_tables[[1]], 
                  fill = TRUE)

2.2 Outline before you start

We have a raw dataset now, and we should make a ‘to-do’ list of what we want the data to look like. I start with data tidying, then move into changes for the individual variables.

3 Wrangle Data

These data are in a wide format, with each market_cap_in_ spread across columns.

TopPharmCompRaw

3.1 Pivot data

Ideally, we want two variables, year and market_cap_us_bil.

3.1.1 `pivot_longer`

We can re-shape TopPharmComp with tidyr::pivot_longer(), but first we need all the variables in the same format. We can do this with mutate_if().

TopPharmCompRaw %>% 
  mutate_if(is.numeric, as.character) %>% 
  tidyr::pivot_longer(cols = starts_with("Market Cap in"), 
                      names_to = "year", 
                      values_to = "market_cap_us_bil")

This looks correct–let’s assign it to TidyTopPharmComp (because it’s a new data structure).

TidyTopPharmComp <- TopPharmCompRaw %>% 
  mutate_if(is.numeric, as.character) %>% 
  tidyr::pivot_longer(cols = starts_with("Market Cap in"), 
                      names_to = "year", 
                      values_to = "market_cap_us_bil")

3.2 Format Columns

Follow along on each tab for the steps to wrangle the column names:

Each variable in it’s own column:
- i.e. type of company in it’s own column (B – Biotechnology company, P – Pharmaceutical company)
- stock exchange identifier in it’s own column (i.e. NYSE, NASDAQ, etc.)
Properly formatted values (numeric, factor, etc.)
Missing variables formatted correctly (replace - with NA)

3.2.1 String manipulations

We can use janitor::clean_names(), but first we should do some string manipulation to remove the extra characters from the column names.

TidyTopPharmComp %>% 
    # get a vector of 'dirty' names
    names() %>% 
    # remove citations in brackets [ ]
    stringr::str_remove_all(string = ., pattern = "\\[[0-9]\\]") %>%
    # replace (USD billions)
    stringr::str_replace_all(string = ., pattern = "\\s*\\([^\\)]+\\)", 
                             replacement = "_us_bil") %>% 
    # remove bracket from Rank
    stringr::str_remove_all(string = ., pattern = "\\[|\\]") %>% 
    # remove alpha numeric from end of Rank
    stringr::str_remove_all(string = ., pattern = "N 1$") %>% 
    # make all lowercase
    stringr::str_to_lower() -> cleaned_names
cleaned_names

## [1] "rank"                      "company"                  
## [3] "largest market cap_us_bil" "year"                     
## [5] "market_cap_us_bil"

3.2.2 `janitor::clean_names`

Now we can apply the new names (cleaned_names) to TidyTopPharmComp and use janitor::clean_names() to standardize them.

TidyTopPharmComp %>% 
    set_names(nm = cleaned_names) %>% 
    janitor::clean_names()

3.2.3 Assign to data frame

If this looks acceptable, we can assign to TopPharmComp.

TopPharmComp <- TidyTopPharmComp %>% 
    set_names(nm = cleaned_names) %>% 
    janitor::clean_names()

3.3 Define Variables

We want to split up the following variables into their own columns:

Year as a four-number digit year
company_type = [P] for Pharmaceutical or [B] for Biotechnology
stock_exch = NSYE (New York Stock Exchange), NASDAQ (National Association of Securities Dealers Automated Quotations), FWB (Frankfurt Stock Exchange), TYO (Australian Securities Exchange), TSX (Toronto Stock Exchange), and SIX (Swiss Exchange).
stock_id = acronym for each company on stock_exch
largest_market_cap_date = date from largest_market_cap_us_bil
company_name a variable containing only the company name
a ranking variable with a numerically coded rank

3.3.1 `year`

We can also wrangle the market_cap_year variable so it only contains the four-number year.

TopPharmComp %>% 
  # remove characters
  dplyr::mutate(year = str_remove_all(string = year, pattern = "\\D"),
                year = str_sub(string = year, start = 1L, end = 4L),
                # make numeric
                year = as.integer(year))

Let’s assign year to TopPharmComp

TopPharmComp %>% 
  # remove characters
  dplyr::mutate(year = str_remove_all(string = year, pattern = "\\D"),
                year = str_sub(string = year, start = 1L, end = 4L),
                # make numeric
                year = as.integer(year)) -> TopPharmComp

3.3.2 `company_type`

Here we separate the company column into two new columns, then use the company_name_etc to create company_type

TopPharmComp %>% 
  # separate company into 2 columns
  separate(col = company, into = c("company_name_etc", "company_etc"), 
           sep = "\\]", 
           remove = FALSE) %>% 
  # create categories
  mutate(company_type = case_when(
    str_detect(string = company_name_etc, pattern = "\\[P$") ~ "Pharma", 
    str_detect(string = company_name_etc, pattern = "\\[B$") ~ "Biotech")) %>% 
  # check our work
  dplyr::select(contains("comp"))

These categories look are correct, so we can assign company_type to TopPharmComp.

TopPharmComp <- TopPharmComp %>% 
  # separate company into 2 columns
  separate(col = company, into = c("company_name_etc", "company_etc"), 
           sep = "\\]", 
           remove = FALSE) %>% 
  # create categories
  mutate(company_type = case_when(
    str_detect(string = company_name_etc, pattern = "\\[P$") ~ "Pharma", 
    str_detect(string = company_name_etc, pattern = "\\[B$") ~ "Biotech")) %>% 
  # remove company_name_etc
  select(-company_name_etc)
TopPharmComp

3.3.3 `stock_exch` & `stock_id`

We will use the company_etc column to create stock_exch and stock_id. We need to make sure we trim the whitespace from stock_id (after the )

TopPharmComp %>% 
  # first separate the columns
    separate(col = company_etc, 
             into = c("stock_exch", "stock_id"), 
             sep = ":", remove = FALSE) %>% 
  # check our work
  select(company,
         company_etc, 
         contains("stock"))

The sep argument helps us a lot here, so we can create two new variables with minimal steps! We’ll assign to TopPharmComp.

TopPharmComp %>% 
  # first separate the columns
    separate(col = company_etc, 
             into = c("stock_exch", "stock_id"), 
             sep = ":", remove = FALSE) %>% 
  select(rank, 
         contains("comp"),
         contains("stock"),
         everything(), 
         -company_etc) -> TopPharmComp
TopPharmComp

3.3.4 `largest_market_cap_date`

Here we’ll use dplyr::rename() to change largest_market_cap_us_bil to _original so we can check our work with remove = FALSE.

TopPharmComp %>% 
  # rename column to largest_market_cap_original
  dplyr::rename(largest_market_cap_original = largest_market_cap_us_bil) %>% 
  # separate into two columns
  separate(col = largest_market_cap_original, 
           into = c('largest_market_cap_us_bil', 'largest_market_cap_date'),
           sep = "\\(", 
           remove = FALSE) %>% 
  # remove parenthesis
  mutate(largest_market_cap_date = 
           str_remove_all(largest_market_cap_date, pattern = "\\)$")) %>% 
  # check our work
  select(contains("large"))

Now we have the month and year separated into largest_market_cap_date–we can assign to TopPharmComp

 TopPharmComp %>% 
  # rename column to largest_market_cap_original
  dplyr::rename(largest_market_cap_original = largest_market_cap_us_bil) %>% 
  # separate into two columns
  separate(col = largest_market_cap_original, 
           into = c('largest_market_cap_us_bil', 'largest_market_cap_date'),
           sep = "\\(", 
           remove = FALSE) %>% 
  # remove parenthesis
  mutate(largest_market_cap_date = 
           str_remove_all(largest_market_cap_date, pattern = "\\)$")) %>% 
  # reorganize
  select(rank, 
         contains("comp"),
         contains("stock"),
         contains("large"),
         everything(), 
         -largest_market_cap_original) -> TopPharmComp
TopPharmComp

3.3.5 `company_name`

We will create a company_name variable using a different sep argument here, and we clean the company_name variable with str_remove_all()

TopPharmComp %>% 
  # separate company into 2 columns
  separate(col = company, into = c("company_name", "company_junk"), 
           sep = "\\[",
           remove = FALSE) %>% 
  # remove preceding \
  mutate(company_name = 
           str_remove_all(string = company_name, pattern = "^/")) %>% 
  # check our work
  select(company, company_name, company_junk)

This gives us only the name in company_name. We can assign and remove company_junk.

TopPharmComp %>% 
  # separate company into 2 columns
  separate(col = company, into = c("company_name", "company_junk"), 
           sep = "\\[",
           remove = FALSE) %>% 
  # remove preceding \
  mutate(company_name = 
           str_remove_all(string = company_name, pattern = "^/")) %>% 
  # reorganize
  select(rank, 
         contains("comp"),
         contains("stock"),
         contains("large"),
         everything(),
         -c(company, company_junk)) -> TopPharmComp
TopPharmComp

3.3.6 `ranking`

Here we can use str_remove_all() to remove the extra numbers in rank and create the ranking variable.

TopPharmComp %>% 
  dplyr::mutate(ranking = 
          stringr::str_remove_all(string = rank, pattern = " \\d$")) %>% 
  select(contains("rank"))

We assign this to TopPharmComp and move onto the missing values.

TopPharmComp <- TopPharmComp %>% 
  dplyr::mutate(ranking = 
          stringr::str_remove_all(string = rank, pattern = " \\d$")) %>% 
  select(ranking, 
         contains("comp"),
         contains("stock"),
         contains("large"),
         everything(),
         # remove rank
         -rank)
TopPharmComp

3.3.7 Missing Values

We can see there is a missing value in market_cap_us_bil formatted as -. We can change these with dplyr::na_if()

TopPharmComp %>% 
  na_if(x = ., y = "-") %>% 
  select(market_cap_us_bil)

We can assign this and move onto formatting the variables.

TopPharmComp <- TopPharmComp %>% 
  na_if(x = ., y = "-")

3.4 Format Values

Let’s take a look with glimpse() to see how these are formatted.

# set width
options(width = 60)
# view transposed data
TopPharmComp %>% glimpse(60)

## Rows: 231
## Columns: 9
## $ ranking                   <chr> "1", "1", "1", "1", "1"…
## $ company_name              <chr> "Johnson & Johnson", "J…
## $ company_type              <chr> "Pharma", "Pharma", "Ph…
## $ stock_exch                <chr> "NYSE", "NYSE", "NYSE",…
## $ stock_id                  <chr> " JNJ", " JNJ", " JNJ",…
## $ largest_market_cap_us_bil <chr> "397.4 ", "397.4 ", "39…
## $ largest_market_cap_date   <chr> "Jan 2018", "Jan 2018",…
## $ year                      <int> 2019, 2018, 2017, 2016,…
## $ market_cap_us_bil         <chr> "385", "346.1", "375.4"…

We can see the market_cap_us_bil and largest_market_cap_us_bil are formatted as characters (but they should be double).

3.4.1 Remove whitespace

Remove the whitespace from largest_market_cap_us_bil with stringr::str_trim().

TopPharmComp %>% 
  mutate(
    largest_market_cap_us_bil = str_trim(string = largest_market_cap_us_bil, 
                                         side = "both"),
    largest_market_cap_us_bil = as.numeric(largest_market_cap_us_bil))

This is working on the largest_market_cap_us_bil. We will also format market_cap_us_bil as a numeric value.

TopPharmComp %>% 
  mutate(
    largest_market_cap_us_bil = str_trim(string = largest_market_cap_us_bil, 
                                         side = "both"),
    largest_market_cap_us_bil = as.numeric(largest_market_cap_us_bil), 
    market_cap_us_bil = as.numeric(market_cap_us_bil))

Assign this to TopPharmComp.

TopPharmComp <- TopPharmComp %>% 
  mutate(
    largest_market_cap_us_bil = str_trim(string = largest_market_cap_us_bil, 
                                         side = "both"),
    largest_market_cap_us_bil = as.numeric(largest_market_cap_us_bil), 
    market_cap_us_bil = as.numeric(market_cap_us_bil))

3.4.2 Dates

We will use the _original suffix again to check our work.

TopPharmComp %>% 
  rename(cap_date_original = largest_market_cap_date) %>% 
  mutate(largest_market_cap_date = lubridate::my(cap_date_original)) %>% 
  dplyr::select(contains("date"))

This is giving us the date in largest_market_cap_date, but each day is rounded to the first day of the relative month (which is OK). Assign to TopPharmComp (but keep cap_date_original).

TopPharmComp <- TopPharmComp %>% 
  # rename
  rename(cap_date_original = largest_market_cap_date) %>% 
  # change format 
  mutate(largest_market_cap_date = lubridate::my(cap_date_original)) %>% 
  # reorganize
  select(ranking, 
         contains("comp"),
         contains("stock"),
         contains("date"),
         year,
         contains("us_bil"))
TopPharmComp

3.4.3 Factors

We want to keep the order of ranking, so we will format this as an ordered factor. We don’t need to supply any levels because they are already numeric.

TopPharmComp <- TopPharmComp %>% 
  mutate(
    ranking = as.integer(ranking),
    ranking = factor(ranking, ordered = TRUE))
# check
TopPharmComp %>% arrange(ranking)

4 Export data

We now have a dataset we can use for modeling or visualizations! We’ll export this to the data folder and use it in the data visualization section.

readr::write_csv(x = TopPharmComp, file = "data/TopPharmComp.csv")

Scraping Wikipedia Tables