This document outlines some advanced data wrangling with dplyr
, tidyr
, and stringr
. Read more in the Data Transformation chapter of R for Data Science.
These are all part of the tidyverse
, so we install and load this meta-package below.
# install.packages("tidyverse")
library(tidyverse)
In the slides we had well-formatted dataset. In these exercises, we’re going to import some data from the wild (Wikipedia), because most data aren’t ready to visualize and model when we first get them.
I Googled “largest pharmaceutical companies” and found this wikipedia page.
These packages will help us scrape the data in the table and manipulate it in R.
library(xml2)
library(rvest)
The first function I’ll use comes from the xml2
package. xml2::read_html()
loads the html from the wikipedia page into an R object I call wiki_html
.
wiki_html <- xml2::read_html("https://en.wikipedia.org/wiki/List_of_largest_biotechnology_and_pharmaceutical_companies")
We should check the structure of new objects, so we know what we’re working with.
wiki_html %>% str()
## List of 2
## $ node:<externalptr>
## $ doc :<externalptr>
## - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
I can see this is a list of two objects (a node
and a doc
).
I want the html node
, so I will use the html_nodes()
function from the rvest
package.
The css
argument is set to "table"
. Once again I check the structure of the output object.
wiki_html_tables <- wiki_html %>% rvest::html_nodes(css = "table")
wiki_html_tables %>% str()
## List of 4
## $ :List of 2
## ..$ node:<externalptr>
## ..$ doc :<externalptr>
## ..- attr(*, "class")= chr "xml_node"
## $ :List of 2
## ..$ node:<externalptr>
## ..$ doc :<externalptr>
## ..- attr(*, "class")= chr "xml_node"
## $ :List of 2
## ..$ node:<externalptr>
## ..$ doc :<externalptr>
## ..- attr(*, "class")= chr "xml_node"
## $ :List of 2
## ..$ node:<externalptr>
## ..$ doc :<externalptr>
## ..- attr(*, "class")= chr "xml_node"
## - attr(*, "class")= chr "xml_nodeset"
This is a list of three lists, each of them xml_node
s.
In order to find the relevant tables in the wiki_html_tables
object, I need to be able to search on something. Fortunately, the base::grep()
function can be used in combination with sub-setting to extract the relevant_tables
from wiki_html_tables
.
Get the relevant tables from the xml_nodeset
in wiki_html_tables
.
relevant_tables <- wiki_html_tables[base::grep("rank", wiki_html_tables)]
relevant_tables %>% str()
## List of 1
## $ :List of 2
## ..$ node:<externalptr>
## ..$ doc :<externalptr>
## ..- attr(*, "class")= chr "xml_node"
## - attr(*, "class")= chr "xml_nodeset"
This returned yet another list of lists! (another set of node
and doc
).
Now I need to use rvest::html_table()
function (combined with some bracket sub-setting) to explore this object and learn about it’s contents (read more about how this works here).
I will start with position [[1]]
and set fill = TRUE
.
rvest::html_table(relevant_tables[[1]],
fill = TRUE) %>% utils::head()
This looks like the table I want! Let’s assign it to a data frame and get wrangling!
TopPharmCompRaw <- rvest::html_table(relevant_tables[[1]],
fill = TRUE)
We have a raw dataset now, and we should make a ‘to-do’ list of what we want the data to look like. I start with data tidying, then move into changes for the individual variables.
These data are in a wide format, with each market_cap_in_
spread across columns.
TopPharmCompRaw
Ideally, we want two variables, year
and market_cap_us_bil
.
pivot_longer
We can re-shape TopPharmComp
with tidyr::pivot_longer()
, but first we need all the variables in the same format. We can do this with mutate_if()
.
TopPharmCompRaw %>%
mutate_if(is.numeric, as.character) %>%
tidyr::pivot_longer(cols = starts_with("Market Cap in"),
names_to = "year",
values_to = "market_cap_us_bil")
This looks correct–let’s assign it to TidyTopPharmComp
(because it’s a new data structure).
TidyTopPharmComp <- TopPharmCompRaw %>%
mutate_if(is.numeric, as.character) %>%
tidyr::pivot_longer(cols = starts_with("Market Cap in"),
names_to = "year",
values_to = "market_cap_us_bil")
Follow along on each tab for the steps to wrangle the column names:
B
– Biotechnology company, P
– Pharmaceutical company)NYSE
, NASDAQ
, etc.)-
with NA
)We can use janitor::clean_names()
, but first we should do some string manipulation to remove the extra characters from the column names.
TidyTopPharmComp %>%
# get a vector of 'dirty' names
names() %>%
# remove citations in brackets [ ]
stringr::str_remove_all(string = ., pattern = "\\[[0-9]\\]") %>%
# replace (USD billions)
stringr::str_replace_all(string = ., pattern = "\\s*\\([^\\)]+\\)",
replacement = "_us_bil") %>%
# remove bracket from Rank
stringr::str_remove_all(string = ., pattern = "\\[|\\]") %>%
# remove alpha numeric from end of Rank
stringr::str_remove_all(string = ., pattern = "N 1$") %>%
# make all lowercase
stringr::str_to_lower() -> cleaned_names
cleaned_names
## [1] "rank" "company"
## [3] "largest market cap_us_bil" "year"
## [5] "market_cap_us_bil"
janitor::clean_names
Now we can apply the new names (cleaned_names
) to TidyTopPharmComp
and use janitor::clean_names()
to standardize them.
TidyTopPharmComp %>%
set_names(nm = cleaned_names) %>%
janitor::clean_names()
If this looks acceptable, we can assign to TopPharmComp
.
TopPharmComp <- TidyTopPharmComp %>%
set_names(nm = cleaned_names) %>%
janitor::clean_names()
We want to split up the following variables into their own columns:
year
company_type
= [P]
for Pharmaceutical
or [B]
for Biotechnology
stock_exch
= NSYE
(New York Stock Exchange), NASDAQ
(National Association of Securities Dealers Automated Quotations), FWB
(Frankfurt Stock Exchange), TYO
(Australian Securities Exchange), TSX
(Toronto Stock Exchange), and SIX
(Swiss Exchange).stock_id
= acronym for each company on stock_exch
largest_market_cap_date
= date from largest_market_cap_us_bil
company_name
a variable containing only the company nameranking
variable with a numerically coded rankyear
We can also wrangle the market_cap_year
variable so it only contains the four-number year.
TopPharmComp %>%
# remove characters
dplyr::mutate(year = str_remove_all(string = year, pattern = "\\D"),
year = str_sub(string = year, start = 1L, end = 4L),
# make numeric
year = as.integer(year))
Let’s assign year
to TopPharmComp
TopPharmComp %>%
# remove characters
dplyr::mutate(year = str_remove_all(string = year, pattern = "\\D"),
year = str_sub(string = year, start = 1L, end = 4L),
# make numeric
year = as.integer(year)) -> TopPharmComp
company_type
Here we separate
the company
column into two new columns, then use the company_name_etc
to create company_type
TopPharmComp %>%
# separate company into 2 columns
separate(col = company, into = c("company_name_etc", "company_etc"),
sep = "\\]",
remove = FALSE) %>%
# create categories
mutate(company_type = case_when(
str_detect(string = company_name_etc, pattern = "\\[P$") ~ "Pharma",
str_detect(string = company_name_etc, pattern = "\\[B$") ~ "Biotech")) %>%
# check our work
dplyr::select(contains("comp"))
These categories look are correct, so we can assign company_type
to TopPharmComp
.
TopPharmComp <- TopPharmComp %>%
# separate company into 2 columns
separate(col = company, into = c("company_name_etc", "company_etc"),
sep = "\\]",
remove = FALSE) %>%
# create categories
mutate(company_type = case_when(
str_detect(string = company_name_etc, pattern = "\\[P$") ~ "Pharma",
str_detect(string = company_name_etc, pattern = "\\[B$") ~ "Biotech")) %>%
# remove company_name_etc
select(-company_name_etc)
TopPharmComp
stock_exch
& stock_id
We will use the company_etc
column to create stock_exch
and stock_id
. We need to make sure we trim the whitespace from stock_id
(after the )
TopPharmComp %>%
# first separate the columns
separate(col = company_etc,
into = c("stock_exch", "stock_id"),
sep = ":", remove = FALSE) %>%
# check our work
select(company,
company_etc,
contains("stock"))
The sep
argument helps us a lot here, so we can create two new variables with minimal steps! We’ll assign to TopPharmComp
.
TopPharmComp %>%
# first separate the columns
separate(col = company_etc,
into = c("stock_exch", "stock_id"),
sep = ":", remove = FALSE) %>%
select(rank,
contains("comp"),
contains("stock"),
everything(),
-company_etc) -> TopPharmComp
TopPharmComp
largest_market_cap_date
Here we’ll use dplyr::rename()
to change largest_market_cap_us_bil
to _original
so we can check our work with remove = FALSE
.
TopPharmComp %>%
# rename column to largest_market_cap_original
dplyr::rename(largest_market_cap_original = largest_market_cap_us_bil) %>%
# separate into two columns
separate(col = largest_market_cap_original,
into = c('largest_market_cap_us_bil', 'largest_market_cap_date'),
sep = "\\(",
remove = FALSE) %>%
# remove parenthesis
mutate(largest_market_cap_date =
str_remove_all(largest_market_cap_date, pattern = "\\)$")) %>%
# check our work
select(contains("large"))
Now we have the month and year separated into largest_market_cap_date
–we can assign to TopPharmComp
TopPharmComp %>%
# rename column to largest_market_cap_original
dplyr::rename(largest_market_cap_original = largest_market_cap_us_bil) %>%
# separate into two columns
separate(col = largest_market_cap_original,
into = c('largest_market_cap_us_bil', 'largest_market_cap_date'),
sep = "\\(",
remove = FALSE) %>%
# remove parenthesis
mutate(largest_market_cap_date =
str_remove_all(largest_market_cap_date, pattern = "\\)$")) %>%
# reorganize
select(rank,
contains("comp"),
contains("stock"),
contains("large"),
everything(),
-largest_market_cap_original) -> TopPharmComp
TopPharmComp
company_name
We will create a company_name
variable using a different sep
argument here, and we clean the company_name
variable with str_remove_all()
TopPharmComp %>%
# separate company into 2 columns
separate(col = company, into = c("company_name", "company_junk"),
sep = "\\[",
remove = FALSE) %>%
# remove preceding \
mutate(company_name =
str_remove_all(string = company_name, pattern = "^/")) %>%
# check our work
select(company, company_name, company_junk)
This gives us only the name in company_name
. We can assign and remove company_junk
.
TopPharmComp %>%
# separate company into 2 columns
separate(col = company, into = c("company_name", "company_junk"),
sep = "\\[",
remove = FALSE) %>%
# remove preceding \
mutate(company_name =
str_remove_all(string = company_name, pattern = "^/")) %>%
# reorganize
select(rank,
contains("comp"),
contains("stock"),
contains("large"),
everything(),
-c(company, company_junk)) -> TopPharmComp
TopPharmComp
ranking
Here we can use str_remove_all()
to remove the extra numbers in rank
and create the ranking
variable.
TopPharmComp %>%
dplyr::mutate(ranking =
stringr::str_remove_all(string = rank, pattern = " \\d$")) %>%
select(contains("rank"))
We assign this to TopPharmComp
and move onto the missing values.
TopPharmComp <- TopPharmComp %>%
dplyr::mutate(ranking =
stringr::str_remove_all(string = rank, pattern = " \\d$")) %>%
select(ranking,
contains("comp"),
contains("stock"),
contains("large"),
everything(),
# remove rank
-rank)
TopPharmComp
We can see there is a missing value in market_cap_us_bil
formatted as -
. We can change these with dplyr::na_if()
TopPharmComp %>%
na_if(x = ., y = "-") %>%
select(market_cap_us_bil)
We can assign this and move onto formatting the variables.
TopPharmComp <- TopPharmComp %>%
na_if(x = ., y = "-")
Let’s take a look with glimpse()
to see how these are formatted.
# set width
options(width = 60)
# view transposed data
TopPharmComp %>% glimpse(60)
## Rows: 231
## Columns: 9
## $ ranking <chr> "1", "1", "1", "1", "1"…
## $ company_name <chr> "Johnson & Johnson", "J…
## $ company_type <chr> "Pharma", "Pharma", "Ph…
## $ stock_exch <chr> "NYSE", "NYSE", "NYSE",…
## $ stock_id <chr> " JNJ", " JNJ", " JNJ",…
## $ largest_market_cap_us_bil <chr> "397.4 ", "397.4 ", "39…
## $ largest_market_cap_date <chr> "Jan 2018", "Jan 2018",…
## $ year <int> 2019, 2018, 2017, 2016,…
## $ market_cap_us_bil <chr> "385", "346.1", "375.4"…
We can see the market_cap_us_bil
and largest_market_cap_us_bil
are formatted as character
s (but they should be double
).
Remove the whitespace from largest_market_cap_us_bil
with stringr::str_trim()
.
TopPharmComp %>%
mutate(
largest_market_cap_us_bil = str_trim(string = largest_market_cap_us_bil,
side = "both"),
largest_market_cap_us_bil = as.numeric(largest_market_cap_us_bil))
This is working on the largest_market_cap_us_bil
. We will also format market_cap_us_bil
as a numeric value.
TopPharmComp %>%
mutate(
largest_market_cap_us_bil = str_trim(string = largest_market_cap_us_bil,
side = "both"),
largest_market_cap_us_bil = as.numeric(largest_market_cap_us_bil),
market_cap_us_bil = as.numeric(market_cap_us_bil))
Assign this to TopPharmComp
.
TopPharmComp <- TopPharmComp %>%
mutate(
largest_market_cap_us_bil = str_trim(string = largest_market_cap_us_bil,
side = "both"),
largest_market_cap_us_bil = as.numeric(largest_market_cap_us_bil),
market_cap_us_bil = as.numeric(market_cap_us_bil))
We will use the _original
suffix again to check our work.
TopPharmComp %>%
rename(cap_date_original = largest_market_cap_date) %>%
mutate(largest_market_cap_date = lubridate::my(cap_date_original)) %>%
dplyr::select(contains("date"))
This is giving us the date in largest_market_cap_date
, but each day is rounded to the first day of the relative month (which is OK). Assign to TopPharmComp
(but keep cap_date_original
).
TopPharmComp <- TopPharmComp %>%
# rename
rename(cap_date_original = largest_market_cap_date) %>%
# change format
mutate(largest_market_cap_date = lubridate::my(cap_date_original)) %>%
# reorganize
select(ranking,
contains("comp"),
contains("stock"),
contains("date"),
year,
contains("us_bil"))
TopPharmComp
We want to keep the order of ranking
, so we will format this as an ordered factor. We don’t need to supply any levels
because they are already numeric.
TopPharmComp <- TopPharmComp %>%
mutate(
ranking = as.integer(ranking),
ranking = factor(ranking, ordered = TRUE))
# check
TopPharmComp %>% arrange(ranking)
We now have a dataset we can use for modeling or visualizations! We’ll export this to the data
folder and use it in the data visualization section.
readr::write_csv(x = TopPharmComp, file = "data/TopPharmComp.csv")