multiple-comparisons
multiple-comparisons.Rmd
Motivation
The goal of the dfdiffs
is to answer the following
questions:
- What rows are here now that weren’t here before?
- What rows were here before that aren’t here now?
- What values have been changed?
This vignette covers how to apply the functions in
dfdiffs
to multiple datasets (separated by folder)
Packages
Our package
Packages for import/export, iteration, wrangling, etc.
Packages for tables.
Similar packages for comparisons.
Folder structure
Below is a folder structure of datasets separated by year.
fs::dir_tree("../inst/extdata/csv/by-year")
#> ../inst/extdata/csv/by-year
#> ├── 20
#> │ ├── PlayerBirth.csv
#> │ ├── PlayerDebut.csv
#> │ └── PlayerName.csv
#> └── 21
#> ├── PlayerBirth.csv
#> ├── PlayerDebut.csv
#> └── PlayerName.csv
We start by storing these files in a vector with
list.files()
, then we clean up the names a bit, and pass
the list of paths to readr::read_csv
with
purrr::map()
:
Import base dfs
Below we import the compare data tables (from 2020)
base_files <- list.files(path = "../inst/extdata/csv/by-year/20",
pattern = ".csv", recursive = TRUE, full.names = TRUE)
base_files_nms <- purrr::map_chr(base_files, base::basename)
base_dfs <- base_files |>
purrr::set_names(base_files_nms) |>
purrr::map(readr::read_csv)
base_dfs |> str()
#> List of 3
#> $ PlayerBirth.csv: spc_tbl_ [20,673 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> ..$ playerID : chr [1:20673] "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
#> ..$ birthYear : num [1:20673] 1981 1934 1939 1954 1972 ...
#> ..$ birthMonth: num [1:20673] 12 2 8 9 8 12 11 4 11 10 ...
#> ..$ birthDay : num [1:20673] 27 5 5 8 25 17 4 15 11 14 ...
#> ..- attr(*, "spec")=
#> .. .. cols(
#> .. .. playerID = col_character(),
#> .. .. birthYear = col_double(),
#> .. .. birthMonth = col_double(),
#> .. .. birthDay = col_double()
#> .. .. )
#> ..- attr(*, "problems")=<externalptr>
#> $ PlayerDebut.csv: spc_tbl_ [20,673 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> ..$ playerID: chr [1:20673] "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
#> ..$ debut : Date[1:20673], format: "2004-04-06" "1954-04-13" ...
#> ..- attr(*, "spec")=
#> .. .. cols(
#> .. .. playerID = col_character(),
#> .. .. debut = col_date(format = "")
#> .. .. )
#> ..- attr(*, "problems")=<externalptr>
#> $ PlayerName.csv : spc_tbl_ [20,673 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> ..$ playerID : chr [1:20673] "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
#> ..$ nameGiven: chr [1:20673] "David Allan" "Henry Louis" "Tommie Lee" "Donald William" ...
#> ..- attr(*, "spec")=
#> .. .. cols(
#> .. .. playerID = col_character(),
#> .. .. nameGiven = col_character()
#> .. .. )
#> ..- attr(*, "problems")=<externalptr>
Import compare dfs
Below we import the compare data tables (from 2021)
compare_files <- list.files(path = "../inst/extdata/csv/by-year/21",
pattern = ".csv", recursive = TRUE, full.names = TRUE)
compare_files_nms <- purrr::map_chr(compare_files, base::basename)
compare_dfs <- compare_files |>
purrr::set_names(compare_files_nms) |>
purrr::map(readr::read_csv)
compare_dfs |> str()
#> List of 3
#> $ PlayerBirth.csv: spc_tbl_ [20,370 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> ..$ playerID : chr [1:20370] "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
#> ..$ birthYear : num [1:20370] 1981 1934 1939 1954 1972 ...
#> ..$ birthMonth: num [1:20370] 12 2 8 9 8 12 11 4 11 10 ...
#> ..$ birthDay : num [1:20370] 27 5 5 8 25 17 4 15 11 14 ...
#> ..- attr(*, "spec")=
#> .. .. cols(
#> .. .. playerID = col_character(),
#> .. .. birthYear = col_double(),
#> .. .. birthMonth = col_double(),
#> .. .. birthDay = col_double()
#> .. .. )
#> ..- attr(*, "problems")=<externalptr>
#> $ PlayerDebut.csv: spc_tbl_ [20,370 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> ..$ playerID: chr [1:20370] "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
#> ..$ debut : Date[1:20370], format: "2004-04-06" "1954-04-13" ...
#> ..- attr(*, "spec")=
#> .. .. cols(
#> .. .. playerID = col_character(),
#> .. .. debut = col_date(format = "")
#> .. .. )
#> ..- attr(*, "problems")=<externalptr>
#> $ PlayerName.csv : spc_tbl_ [20,370 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> ..$ playerID : chr [1:20370] "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
#> ..$ nameGiven: chr [1:20370] "David Allan" "Henry Louis" "Tommie Lee" "Donald William" ...
#> ..- attr(*, "spec")=
#> .. .. cols(
#> .. .. playerID = col_character(),
#> .. .. nameGiven = col_character()
#> .. .. )
#> ..- attr(*, "problems")=<externalptr>
Iteration
We’re going to follow the iteration steps from Charlotte Wickham’s Happy R Users Purrr Tutorial
- DO IT FOR ONE
- TURN IT INTO A RECIPE
- DO IT FOR ALL!
1) Do it for one (or two)
We’re going to use the arsenal::comparedf()
function
first. Remember that this function requires an additional
summary()
call!
arsenal::comparedf
by_var <- "playerID"
cdf_people <- arsenal::comparedf(
x = base_dfs$PlayerBirth.csv,
y = compare_dfs$PlayerBirth.csv,
by = by_var)
sumcdf_people <- summary(cdf_people)
names(sumcdf_people)
#> [1] "frame.summary.table" "comparison.summary.table"
#> [3] "vars.ns.table" "vars.nc.table"
#> [5] "obs.table" "diffs.byvar.table"
#> [7] "diffs.table" "attrs.table"
#> [9] "control"
diffdf::diffdf
Next we’ll use the diffdf::diffdf()
function.
diffdf_people <- diffdf::diffdf(
compare = compare_dfs$PlayerBirth.csv,
base = base_dfs$PlayerBirth.csv,
keys = "playerID")
diffdf_people |> names()
#> [1] "ExtRowsBase" "NumDiff" "VarDiff_birthYear"
#> [4] "VarDiff_birthMonth" "VarDiff_birthDay"
Just to ensure we’re getting the same display in the output from both
functions, we’re going to use the
diffs.byvar.table
/NumDiff
tables, and the
diffs.table
/ VarDiff_
tables.
diffs.byvar.table
sumcdf_people[["diffs.byvar.table"]] |>
select(Variable = var.x,
`No of Differences` = n) |>
knitr::kable(caption = "diffs.byvar.table") |>
kableExtra::kable_paper()
Variable | No of Differences |
---|---|
birthYear | 19 |
birthMonth | 3 |
birthDay | 6 |
NumDiff
diffdf_people[["NumDiff"]] |>
knitr::kable(caption = "NumDiff") |>
kableExtra::kable_paper()
Variable | No of Differences |
---|---|
birthYear | 19 |
birthMonth | 3 |
birthDay | 6 |
diffs.table
sumcdf_people[["diffs.table"]] |>
unnest(values.x) |>
unnest(values.y) |>
select(
VARIABLE = var.x,
all_of(by_var),
BASE = values.x,
COMPARE = values.y
) |>
knitr::kable(caption = "diffs.table") |>
kableExtra::kable_paper()
VARIABLE | playerID | BASE | COMPARE |
---|---|---|---|
birthYear | bearnla01 | 1940 | 1941 |
birthYear | candito01 | 1956 | 1957 |
birthYear | childbi01 | 1867 | NA |
birthYear | coopean99 | 1897 | 1898 |
birthYear | herremi01 | 1892 | 1897 |
birthYear | jimenma01 | 1936 | 1938 |
birthYear | jonesco02 | 1905 | 1907 |
birthYear | kelletr01 | 1992 | 1993 |
birthYear | mcfaror01 | 1935 | 1938 |
birthYear | mendomi01 | 1934 | 1933 |
birthYear | minosmi01 | 1923 | 1925 |
birthYear | naranch01 | 1933 | 1934 |
birthYear | olivaed01 | 1937 | 1938 |
birthYear | olivoch01 | 1926 | 1928 |
birthYear | posadle01 | 1934 | 1936 |
birthYear | quirkar01 | 1937 | 1938 |
birthYear | senerso01 | 1929 | 1931 |
birthYear | willida02 | 1879 | 1880 |
birthYear | zamoros01 | 1943 | 1944 |
birthMonth | hickmch01 | 3 | 5 |
birthMonth | mendomi01 | 12 | 11 |
birthMonth | willida02 | 7 | 2 |
birthDay | barnesk01 | 7 | 3 |
birthDay | jamesbo01 | 15 | 18 |
birthDay | mendomi01 | 3 | 16 |
birthDay | posadle01 | 1 | 15 |
birthDay | walshed01 | 19 | 14 |
birthDay | willida02 | NA | 7 |
VarDiff_
tables
tbl_names <- names(diffdf_people)
vardiff_tbls <- tbl_names[stringr::str_detect(tbl_names, "VarDiff_")]
bind_rows(diffdf_people[vardiff_tbls]) |>
knitr::kable(caption = "'VarDiff_' tables") |>
kableExtra::kable_paper()
VARIABLE | playerID | BASE | COMPARE |
---|---|---|---|
birthYear | bearnla01 | 1940 | 1941 |
birthYear | candito01 | 1956 | 1957 |
birthYear | childbi01 | 1867 | NA |
birthYear | coopean99 | 1897 | 1898 |
birthYear | herremi01 | 1892 | 1897 |
birthYear | jimenma01 | 1936 | 1938 |
birthYear | jonesco02 | 1905 | 1907 |
birthYear | kelletr01 | 1992 | 1993 |
birthYear | mcfaror01 | 1935 | 1938 |
birthYear | mendomi01 | 1934 | 1933 |
birthYear | minosmi01 | 1923 | 1925 |
birthYear | naranch01 | 1933 | 1934 |
birthYear | olivaed01 | 1937 | 1938 |
birthYear | olivoch01 | 1926 | 1928 |
birthYear | posadle01 | 1934 | 1936 |
birthYear | quirkar01 | 1937 | 1938 |
birthYear | senerso01 | 1929 | 1931 |
birthYear | willida02 | 1879 | 1880 |
birthYear | zamoros01 | 1943 | 1944 |
birthMonth | hickmch01 | 3 | 5 |
birthMonth | mendomi01 | 12 | 11 |
birthMonth | willida02 | 7 | 2 |
birthDay | barnesk01 | 7 | 3 |
birthDay | jamesbo01 | 15 | 18 |
birthDay | mendomi01 | 3 | 16 |
birthDay | posadle01 | 1 | 15 |
birthDay | walshed01 | 19 | 14 |
birthDay | willida02 | NA | 7 |
2) Turn it into a recipe
We have two vectors with file paths (.x
and
.y
)
map2(.x = base_dfs, .y = compare_files, )
And the .f
is our arsenal::comparedf
or
diffdf::diffdf
functions.
map2(.x = base_dfs, .y = compare_files, .f = arsenal::comparedf)
map2(.x = compare_files, .y = base_dfs, .f = diffdf::diffdf)
We can pass the by
and keys
arguments to
...
.
3) Do it for all
map2()
+ arsenal::comparedf
Now we use the purrr::map2()
function to compare all of
the items in base_dfs
to compare_dfs
.
all_cdfs <- purrr::map2(
.x = base_dfs,
.y = compare_dfs,
.f = arsenal::comparedf,
by = "playerID")
all_scdfs <- purrr::map(all_cdfs, .f = summary)
all_scdfs[["PlayerName.csv"]] |> names()
#> [1] "frame.summary.table" "comparison.summary.table"
#> [3] "vars.ns.table" "vars.nc.table"
#> [5] "obs.table" "diffs.byvar.table"
#> [7] "diffs.table" "attrs.table"
#> [9] "control"