create-new-data
create-new-data.Rmd
Motivation
The goal of the dfdiffs
is to answer the following
questions:
- What rows are here now that weren’t here before?
- What rows were here before that aren’t here now?
- What values have been changed?
This vignette takes us through the create_new_data()
function, which answers the “What rows are here now that weren’t
here before?”
What rows are here now that weren’t here before?
We’re going to use two test datasets to demonstrate the
create_new_data()
function.
Base Data
The T1Data
below contains six rows and eight
variables.
T1Data <- dfdiffs::T1Data
glimpse(T1Data)
#> Rows: 6
#> Columns: 7
#> $ subject <chr> "A", "A", "B", "C", "D", "D"
#> $ record <int> 1, 2, 3, 4, 5, 6
#> $ start_date <date> 2022-01-28, 2022-01-25, 2022-01-26, 2022-01-29, 2022-01-30,…
#> $ mid_date <date> 2022-03-20, 2022-03-15, 2022-03-19, 2022-03-18, 2022-03-16,…
#> $ end_date <date> 2022-03-30, 2022-03-29, 2022-03-25, 2022-03-27, 2022-03-26…
#> $ text_var <chr> "The birch canoe slid on the smooth planks.", "Glue the sh…
#> $ factor_var <chr> "food", "most", "park", "between", "regard", "law"
The unique identifier in this dataset is the combination of subject and record (which we can see below):
Compare Data
We will be comparing the T2Data
to the original data,
and we can see this dataset has the same six rows as
T1Data
, but includes three additional rows.
T2Data <- dfdiffs::T2Data
glimpse(T2Data)
#> Rows: 9
#> Columns: 7
#> $ subject <chr> "D", "D", "D", "C", "B", "B", "A", "A", "A"
#> $ record <int> 5, 6, 5, 4, 3, 4, 1, 2, 2
#> $ start_date <date> 2022-01-30, 2022-01-27, 2022-04-04, 2022-01-29, 2022-01-26,…
#> $ mid_date <date> 2022-03-16, 2022-03-17, 2022-04-13, 2022-03-18, 2022-03-19,…
#> $ end_date <date> 2022-03-26, 2022-03-31, 2022-04-22, 2022-03-27, 2022-03-25…
#> $ text_var <chr> "Rice is often served in round bowls.", "The juice of lemo…
#> $ factor_var <chr> "regard", "law", "associate", "between", "park", "encourag…
The unique identifier in this dataset is the combination of subject and record (which we can see below):
Creating a unique identifier
We also need a function that allows users to specify joining
variables, which creates a new join_var
given the supplied
columns that constitute a unique identifier in each dataset.
For example, in T1Data
and T2Data
, we can
create a unique identifier named join_var
T1DataJoin <- mutate(T1Data,
join_var = as.character(row_number())) %>%
dplyr::relocate(join_var, everything())
T2DataJoin <- mutate(T2Data,
join_var = as.character(row_number())) %>%
dplyr::relocate(join_var, everything())
T1DataJoin
join_var | subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|---|
1 | A | 1 | 2022-01-28 | 2022-03-20 | 2022-03-30 | The birch canoe slid on the smooth planks. | food |
2 | A | 2 | 2022-01-25 | 2022-03-15 | 2022-03-29 | Glue the sheet to the dark blue background. | most |
3 | B | 3 | 2022-01-26 | 2022-03-19 | 2022-03-25 | It’s easy to tell the depth of a well. | park |
4 | C | 4 | 2022-01-29 | 2022-03-18 | 2022-03-27 | These days a chicken leg is a rare dish. | between |
5 | D | 5 | 2022-01-30 | 2022-03-16 | 2022-03-26 | Rice is often served in round bowls. | regard |
6 | D | 6 | 2022-01-27 | 2022-03-17 | 2022-03-31 | The juice of lemons makes fine punch. | law |
T2DataJoin
join_var | subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|---|
1 | D | 5 | 2022-01-30 | 2022-03-16 | 2022-03-26 | Rice is often served in round bowls. | regard |
2 | D | 6 | 2022-01-27 | 2022-03-17 | 2022-03-31 | The juice of lemons makes fine punch. | law |
3 | D | 5 | 2022-04-04 | 2022-04-13 | 2022-04-22 | Four hours of steady work faced us. | associate |
4 | C | 4 | 2022-01-29 | 2022-03-18 | 2022-03-27 | These days a chicken leg is a rare dish. | between |
5 | B | 3 | 2022-01-26 | 2022-03-19 | 2022-03-25 | It’s easy to tell the depth of a well. | park |
6 | B | 4 | 2022-04-02 | 2022-04-14 | 2022-04-20 | The hogs were fed chopped corn and garbage. | encourage |
7 | A | 1 | 2022-01-28 | 2022-03-20 | 2022-03-30 | The birch canoe slid on the smooth planks. | food |
8 | A | 2 | 2022-01-25 | 2022-03-15 | 2022-03-29 | Glue the sheet to the dark blue background. | most |
9 | A | 2 | 2022-04-04 | 2022-04-15 | 2022-04-21 | The box was thrown beside the parked truck. | pension |
Conditions
Each function in the dfdiffs
package assumes the
following conditions:
Two datasets
Multiple columns to compare (
cols
)Single by column
Single
by
column, new column name (by_col
)Single
by
column, multiple compare columns (cols
)Single
by column, new column name (by_col
), multiple compare columns (cols
)Multiple
by
columnsMultiple
by
columns, new column name (by_col
)Multiple
by
columns, multiple compare columns (cols
)Multiple
by
columns, a newby_col
, andcols
Single by column conditions
-
Two datasets, compare all columns:
create_new_data( compare = T2Data, base = T1Data)
-
Multiple columns to compare (
cols
):create_new_data( compare = T2Data, base = T1Data, cols = c("text_var", "factor_var"))
-
Single
by
column, no new column namecreate_new_data( compare = T2DataJoin, base = T1DataJoin, by = "join_var")
-
Single by column, new column name (
by_col
)create_new_data( compare = T2DataJoin, base = T1DataJoin, by = "join_var", by_col = 'new_join_var')
-
Single
by
column, multiple compare columnscols
create_new_data( compare = T2DataJoin, base = T1DataJoin, by = "join_var", cols = c("text_var", "factor_var", "subject", "record"))
-
Single
by
column, new column name (by_col
), multiple compare columns (cols
)create_new_data( compare = T2DataJoin, base = T1DataJoin, # unique id by = "join_var", # new name for id by_col = 'new_join_var', # cols to compare cols = c("subject", "record", "text_var", "factor_var"))
Multiple by column conditions
-
Multiple
by
columnscreate_new_data( compare = T2Data, base = T1Data, by = c('subject', 'record'))
-
Multiple
by
columns, new column name (by_col
)create_new_data( compare = T2Data, base = T1Data, by = c('subject', 'record'), by_col = "new_join_col")
-
Multiple
by
columns, multiple compare columns (cols
)create_new_data( compare = T2Data, base = T1Data, by = c('subject', 'record'), cols = c("subject", "record", "factor_var", "text_var"))
-
Multiple
by
columns, a newby_col
, andcols
create_new_data( compare = T2Data, base = T1Data, by = c('subject', 'record'), by_col = "new_join_col", cols = c("subject", "record", "text_var", "factor_var"))
create_new_data()
Below is the create_new_data()
function:
create_new_data(compare, base, by = NULL, by_col = NULL, cols = NULL)
compare
= The current or new dataset in the
comparison
base
= The previous or old dataset in the comparison
by
= the unique identifier for joining the two
tables
by_col
= a new name for the joining column
cols
= the columns to compare (if none are provided, all
columns are compared)
Single by
column conditions
The function should also be able to handle multiple conditions. Below
we cover the conditions for a single by
columns (assuming
there is an existing unique identifier in each dataset). But first,
we’ll cover a few uncommon conditions, like a missing by
column, or a missing by
column and specific columns
selected for comparison.
1) Two datasets
- No
by
columns (only two datasets)
create_new_data(
compare = T2Data,
base = T1Data)
subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|
D | 5 | 2022-04-04 | 2022-04-13 | 2022-04-22 | Four hours of steady work faced us. | associate |
B | 4 | 2022-04-02 | 2022-04-14 | 2022-04-20 | The hogs were fed chopped corn and garbage. | encourage |
A | 2 | 2022-04-04 | 2022-04-15 | 2022-04-21 | The box was thrown beside the parked truck. | pension |
create_new_data(
compare = T2DataJoin,
base = T1DataJoin)
join_var | subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|---|
1 | D | 5 | 2022-01-30 | 2022-03-16 | 2022-03-26 | Rice is often served in round bowls. | regard |
2 | D | 6 | 2022-01-27 | 2022-03-17 | 2022-03-31 | The juice of lemons makes fine punch. | law |
3 | D | 5 | 2022-04-04 | 2022-04-13 | 2022-04-22 | Four hours of steady work faced us. | associate |
5 | B | 3 | 2022-01-26 | 2022-03-19 | 2022-03-25 | It’s easy to tell the depth of a well. | park |
6 | B | 4 | 2022-04-02 | 2022-04-14 | 2022-04-20 | The hogs were fed chopped corn and garbage. | encourage |
7 | A | 1 | 2022-01-28 | 2022-03-20 | 2022-03-30 | The birch canoe slid on the smooth planks. | food |
8 | A | 2 | 2022-01-25 | 2022-03-15 | 2022-03-29 | Glue the sheet to the dark blue background. | most |
9 | A | 2 | 2022-04-04 | 2022-04-15 | 2022-04-21 | The box was thrown beside the parked truck. | pension |
We can see this performs a row-by-row comparison.
2) Multiple columns to compare (cols
)
- No
by
columns (only two datasets) and multiple compare (cols
)
create_new_data(
compare = T2Data,
base = T1Data,
cols = c("text_var", "factor_var"))
text_var | factor_var |
---|---|
Four hours of steady work faced us. | associate |
The hogs were fed chopped corn and garbage. | encourage |
The box was thrown beside the parked truck. | pension |
create_new_data(
compare = T2DataJoin,
base = T1DataJoin,
cols = c("text_var", "factor_var"))
text_var | factor_var |
---|---|
Four hours of steady work faced us. | associate |
The hogs were fed chopped corn and garbage. | encourage |
The box was thrown beside the parked truck. | pension |
3) Single by
column
- We can provide a single
by
column (using ourT1DataJoin
andT2DataJoin
) datasets we created above.
create_new_data(
compare = T2DataJoin,
base = T1DataJoin,
by = "join_var")
join_var | subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|---|
7 | A | 1 | 2022-01-28 | 2022-03-20 | 2022-03-30 | The birch canoe slid on the smooth planks. | food |
8 | A | 2 | 2022-01-25 | 2022-03-15 | 2022-03-29 | Glue the sheet to the dark blue background. | most |
9 | A | 2 | 2022-04-04 | 2022-04-15 | 2022-04-21 | The box was thrown beside the parked truck. | pension |
4) Single by
column, new column name
(by_col
)
- We can also provide a single
by
column (for unique identifiers) and a new name for theby_col
create_new_data(
compare = T2DataJoin,
base = T1DataJoin,
by = "join_var",
by_col = 'new_join_var')
new_join_var | subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|---|
7 | A | 1 | 2022-01-28 | 2022-03-20 | 2022-03-30 | The birch canoe slid on the smooth planks. | food |
8 | A | 2 | 2022-01-25 | 2022-03-15 | 2022-03-29 | Glue the sheet to the dark blue background. | most |
9 | A | 2 | 2022-04-04 | 2022-04-15 | 2022-04-21 | The box was thrown beside the parked truck. | pension |
5) Single by
column, multiple compare columns
cols
- Single
by
column and multiple compare columns (cols
)
create_new_data(
compare = T2DataJoin,
base = T1DataJoin,
by = "join_var",
cols = c("text_var", "factor_var", "subject", "record"))
join_var | text_var | factor_var | subject | record |
---|---|---|---|---|
7 | The birch canoe slid on the smooth planks. | food | A | 1 |
8 | Glue the sheet to the dark blue background. | most | A | 2 |
9 | The box was thrown beside the parked truck. | pension | A | 2 |
6) Single by
column, new column name
(by_col
), multiple compare columns (cols
)
- Single
by
column, a new name for the by column (by_col
), and multiple compare columns (cols
)
create_new_data(
# data
compare = T2DataJoin,
base = T1DataJoin,
# unique id
by = "join_var",
# new name for id
by_col = 'new_join_var',
# cols to compare
cols = c("subject", "record", "text_var", "factor_var"))
new_join_var | subject | record | text_var | factor_var |
---|---|---|---|---|
7 | A | 1 | The birch canoe slid on the smooth planks. | food |
8 | A | 2 | Glue the sheet to the dark blue background. | most |
9 | A | 2 | The box was thrown beside the parked truck. | pension |
Multiple by
column conditions
Now we’re going to test conditions in which there are multiple columns used to create a unique identifier.
7) Multiple by
columns
- Multiple
by
columns (assuming the columns create a unique identifier)
create_new_data(
compare = T2Data,
base = T1Data,
by = c('subject', 'record'))
join | subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|---|
D-5 | D | 5 | 2022-04-04 | 2022-04-13 | 2022-04-22 | Four hours of steady work faced us. | associate |
B-4 | B | 4 | 2022-04-02 | 2022-04-14 | 2022-04-20 | The hogs were fed chopped corn and garbage. | encourage |
A-2 | A | 2 | 2022-04-04 | 2022-04-15 | 2022-04-21 | The box was thrown beside the parked truck. | pension |
This creates a new join
column and it’s a combination of
subject
and record
.
8) Multiple by
columns, new column name
(by_col
)
We can provide multiple by
columns, a new
by_col
, and no cols
create_new_data(
compare = T2Data,
base = T1Data,
by = c('subject', 'record'),
by_col = "new_join_col")
new_join_col | subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|---|
D-5 | D | 5 | 2022-04-04 | 2022-04-13 | 2022-04-22 | Four hours of steady work faced us. | associate |
B-4 | B | 4 | 2022-04-02 | 2022-04-14 | 2022-04-20 | The hogs were fed chopped corn and garbage. | encourage |
A-2 | A | 2 | 2022-04-04 | 2022-04-15 | 2022-04-21 | The box was thrown beside the parked truck. | pension |
9) Multiple by
columns, multiple compare columns
(cols
)
- Multiple
by
columns and multiple compare columns (cols
), and no newby_col
.
create_new_data(
compare = T2Data,
base = T1Data,
by = c('subject', 'record'),
cols = c("subject", "record", "factor_var", "text_var"))
join | subject | record | factor_var | text_var |
---|---|---|---|---|
D-5 | D | 5 | associate | Four hours of steady work faced us. |
B-4 | B | 4 | encourage | The hogs were fed chopped corn and garbage. |
A-2 | A | 2 | pension | The box was thrown beside the parked truck. |
10) Multiple by
columns, a new by_col
, and
cols
We can provide multiple by
columns, new
by_col
, and multiple cols
create_new_data(
compare = T2Data,
base = T1Data,
by = c('subject', 'record'),
by_col = "new_join_col",
cols = c("subject", "record", "text_var", "factor_var"))
new_join_col | subject | record | text_var | factor_var |
---|---|---|---|---|
D-5 | D | 5 | Four hours of steady work faced us. | associate |
B-4 | B | 4 | The hogs were fed chopped corn and garbage. | encourage |
A-2 | A | 2 | The box was thrown beside the parked truck. | pension |