getting-started
Motivation
This vignette is an introduction to the dfdiffs package. The goal of dfdiffs is to answer the following questions:
- What rows are here now that weren’t here before?
- What rows were here before that aren’t here now?
- What values have been changed?
Package functions
We have functions for answering each of the questions posed above. Each function is demonstrated with a pair of example datasets (covered below).
What rows are here now that weren’t here before?
To check for new data, we’re going to use T1Data and T2Data.
Timepoint 1 data (original)
These data represent measurements taken at time point 1 (T1).
T1Data: Simulated 'time-point 1' data
subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|
A | 1 | 2022-01-28 | 2022-03-20 | 2022-03-30 | The birch canoe slid on the smooth planks. | food |
A | 2 | 2022-01-25 | 2022-03-15 | 2022-03-29 | Glue the sheet to the dark blue background. | most |
B | 3 | 2022-01-26 | 2022-03-19 | 2022-03-25 | It's easy to tell the depth of a well. | park |
C | 4 | 2022-01-29 | 2022-03-18 | 2022-03-27 | These days a chicken leg is a rare dish. | between |
D | 5 | 2022-01-30 | 2022-03-16 | 2022-03-26 | Rice is often served in round bowls. | regard |
D | 6 | 2022-01-27 | 2022-03-17 | 2022-03-31 | The juice of lemons makes fine punch. | law |
Timepoint 2 data (new)
This is a ‘new’ dataset representing T2.
T2Data: Simulated 'time-point 2' data
subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|
D | 5 | 2022-01-30 | 2022-03-16 | 2022-03-26 | Rice is often served in round bowls. | regard |
D | 6 | 2022-01-27 | 2022-03-17 | 2022-03-31 | The juice of lemons makes fine punch. | law |
D | 5 | 2022-04-04 | 2022-04-13 | 2022-04-22 | Four hours of steady work faced us. | associate |
C | 4 | 2022-01-29 | 2022-03-18 | 2022-03-27 | These days a chicken leg is a rare dish. | between |
B | 3 | 2022-01-26 | 2022-03-19 | 2022-03-25 | It's easy to tell the depth of a well. | park |
B | 4 | 2022-04-02 | 2022-04-14 | 2022-04-20 | The hogs were fed chopped corn and garbage. | encourage |
A | 1 | 2022-01-28 | 2022-03-20 | 2022-03-30 | The birch canoe slid on the smooth planks. | food |
A | 2 | 2022-01-25 | 2022-03-15 | 2022-03-29 | Glue the sheet to the dark blue background. | most |
A | 2 | 2022-04-04 | 2022-04-15 | 2022-04-21 | The box was thrown beside the parked truck. | pension |
create_new_data()
The create_new_data() function shows us the ‘new data’ (i.e., what is here now that wasn’t here before).
create_new_data(
compare = T2Data,
base = T1Data)
Output from create_new_data(): Difference between 'time-point 1' & 'time-point 2'
subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|
D | 5 | 2022-04-04 | 2022-04-13 | 2022-04-22 | Four hours of steady work faced us. | associate |
B | 4 | 2022-04-02 | 2022-04-14 | 2022-04-20 | The hogs were fed chopped corn and garbage. | encourage |
A | 2 | 2022-04-04 | 2022-04-15 | 2022-04-21 | The box was thrown beside the parked truck. | pension |
We can check this against the NewData dataset (which should match the output from create_new_data()).
NewData: stored differences from create_new_data()
subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|
D | 5 | 2022-04-04 | 2022-04-13 | 2022-04-22 | Four hours of steady work faced us. | associate |
B | 4 | 2022-04-02 | 2022-04-14 | 2022-04-20 | The hogs were fed chopped corn and garbage. | encourage |
A | 2 | 2022-04-04 | 2022-04-15 | 2022-04-21 | The box was thrown beside the parked truck. | pension |
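Conceptually, the result above behaves like an anti-join: keep the rows of the newer data that have no exact match in the older data. Note that a row can count as new even when its subject and record already exist (for example D / 5), because the dates and text differ. The sketch below illustrates that idea in base R on a small subset of the values shown above. The helpers `row_key()` and `new_rows()` are hypothetical names used only for illustration; this is not necessarily how create_new_data() is implemented.

```r
# Hypothetical helper (not part of dfdiffs): one string key per row,
# built from the values of the chosen columns.
row_key <- function(df, cols) {
  do.call(paste, c(unname(as.list(df[cols])), sep = "\x1f"))
}

# Hypothetical helper: rows of `compare` with no exact match in `base`.
# Assumes both data frames share the columns of `base`.
new_rows <- function(compare, base) {
  cols <- names(base)
  is_new <- !(row_key(compare, cols) %in% row_key(base, cols))
  out <- compare[is_new, , drop = FALSE]
  rownames(out) <- NULL
  out
}

# A small subset of the T1/T2 values shown above
t1 <- data.frame(
  subject    = c("A", "A"),
  record     = c(1, 2),
  start_date = c("2022-01-28", "2022-01-25"),
  stringsAsFactors = FALSE
)
t2 <- data.frame(
  subject    = c("A", "A", "A"),
  record     = c(1, 2, 2),
  start_date = c("2022-01-28", "2022-01-25", "2022-04-04"),
  stringsAsFactors = FALSE
)

found <- new_rows(compare = t2, base = t1)
found
```

Because the key uses every column, the second A / 2 row is reported as new even though its subject and record match an existing row.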
What rows were here before that aren’t here now?
To test for deleted data, we use CompleteData and IncompleteData, and check the result against DeletedData.
CompleteData <- dfdiffs::CompleteData
IncompleteData <- dfdiffs::IncompleteData
DeletedData <- dfdiffs::DeletedData
A complete dataset
CompleteData represents a ‘complete’ set of data.
CompleteData: simulated data for checking 'deleted data'
subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|
A | 1 | 2021-12-28 | 2022-01-27 | 2022-02-26 | The copper bowl shone in the sun's rays. | interest |
A | 2 | 2021-12-28 | 2022-01-27 | 2022-02-26 | Mark the spot with a sign painted red. | state |
B | 1 | 2021-12-26 | 2022-01-25 | 2022-02-24 | Take a chance and win a china doll. | sure |
B | 2 | 2021-12-26 | 2022-01-25 | 2022-02-24 | A cramp is no small danger on a swim. | white |
C | 1 | 2021-12-30 | 2022-01-29 | 2022-02-28 | It's easy to tell the depth of a well. | grant |
D | 1 | 2021-12-27 | 2022-01-26 | 2022-02-25 | The sky that morning was clear and bright blue. | tape |
A | 3 | 2021-12-28 | 2022-01-27 | 2022-02-26 | Wake and rise, and step into the green outdoors. | situate |
B | 3 | 2021-12-26 | 2022-01-25 | 2022-02-24 | A blue crane is a tall wading bird. | shut |
D | 2 | 2021-12-27 | 2022-01-26 | 2022-02-25 | Say it slow!y but make it ring clear. | document |
An incomplete dataset
This is a dataset with rows removed from CompleteData.
IncompleteData: simulated data for checking 'deleted data'
subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|
A | 1 | 2021-12-28 | 2022-01-27 | 2022-02-26 | The copper bowl shone in the sun's rays. | interest |
B | 1 | 2021-12-26 | 2022-01-25 | 2022-02-24 | Take a chance and win a china doll. | sure |
B | 2 | 2021-12-26 | 2022-01-25 | 2022-02-24 | A cramp is no small danger on a swim. | white |
A | 3 | 2021-12-28 | 2022-01-27 | 2022-02-26 | Wake and rise, and step into the green outdoors. | situate |
D | 2 | 2021-12-27 | 2022-01-26 | 2022-02-25 | Say it slow!y but make it ring clear. | document |
create_deleted_data()
When we run create_deleted_data(), we check for rows that are present in CompleteData but missing from IncompleteData.
create_deleted_data(
compare = IncompleteData,
base = CompleteData)
Output from create_deleted_data(): Differences between CompleteData and IncompleteData
subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|
A | 2 | 2021-12-28 | 2022-01-27 | 2022-02-26 | Mark the spot with a sign painted red. | state |
B | 3 | 2021-12-26 | 2022-01-25 | 2022-02-24 | A blue crane is a tall wading bird. | shut |
C | 1 | 2021-12-30 | 2022-01-29 | 2022-02-28 | It's easy to tell the depth of a well. | grant |
D | 1 | 2021-12-27 | 2022-01-26 | 2022-02-25 | The sky that morning was clear and bright blue. | tape |
The deleted data
The output above is identical to the data stored in DeletedData.
DeletedData: Output from create_deleted_data()
subject | record | start_date | mid_date | end_date | text_var | factor_var |
---|---|---|---|---|---|---|
A | 2 | 2021-12-28 | 2022-01-27 | 2022-02-26 | Mark the spot with a sign painted red. | state |
B | 3 | 2021-12-26 | 2022-01-25 | 2022-02-24 | A blue crane is a tall wading bird. | shut |
C | 1 | 2021-12-30 | 2022-01-29 | 2022-02-28 | It's easy to tell the depth of a well. | grant |
D | 1 | 2021-12-27 | 2022-01-26 | 2022-02-25 | The sky that morning was clear and bright blue. | tape |
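Finding deleted rows is the mirror image of finding new rows: keep the rows of the base data that no longer appear in the comparison data. The base R sketch below shows that logic on a few of the values from the tables above. `deleted_rows()` is a hypothetical helper written for this illustration, and it assumes both data frames have the same columns; the internals of create_deleted_data() may differ.

```r
# Hypothetical helper (not part of dfdiffs): rows of `base` that have
# no exact match in `compare`. Assumes both share the columns of `base`.
deleted_rows <- function(compare, base) {
  cols <- names(base)
  key <- function(df) {
    do.call(paste, c(unname(as.list(df[cols])), sep = "\x1f"))
  }
  out <- base[!(key(base) %in% key(compare)), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# A few rows taken from the CompleteData / IncompleteData tables above
complete <- data.frame(
  subject    = c("A", "A", "B"),
  record     = c(1, 2, 1),
  factor_var = c("interest", "state", "sure"),
  stringsAsFactors = FALSE
)
incomplete <- complete[c(1, 3), ]

removed <- deleted_rows(compare = incomplete, base = complete)
removed
```

Here the only row missing from `incomplete` is A / 2, so that is the row returned.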
What values have been changed?
To answer this question, we have two options: create_changed_data() and create_modified_data().
- create_changed_data() relies on the diffdf() function from the diffdf package.
- create_modified_data() relies on the comparedf() function from the arsenal package.
To check for changes between two datasets, we use InitialData and ChangedData.
InitialData <- dfdiffs::InitialData
ChangedData <- dfdiffs::ChangedData
Initial data
InitialData: simulated data for checking 'changed/modified data'
subject_id | record | text_value_a | text_value_b | created_date | updated_date | entered_date |
---|---|---|---|---|---|---|
A | 1 | Issue unresolved | Fatigue | 2021-07-29 | 2021-09-29 | 2021-09-29 |
A | 2 | Issue unresolved | Fatigue | 2021-07-29 | 2021-10-03 | 2021-10-29 |
B | 3 | Issue resolved | Fever | 2021-07-16 | 2021-09-02 | 2021-08-18 |
C | 4 | Issue resolved | Joint pain | 2021-08-24 | 2021-10-03 | 2021-10-03 |
C | 5 | Issue resolved | Joint pain | 2021-08-24 | 2021-09-20 | 2021-10-20 |
Changed data
ChangedData: simulated data for checking 'modified data'
subject_id | record | text_value_a | text_value_b | created_date | updated_date | entered_date |
---|---|---|---|---|---|---|
A | 1 | Issue resolved | Fatigue | 2021-07-29 | 2021-10-03 | 2021-11-30 |
A | 2 | Issue resolved | Fatigue | 2021-07-29 | 2021-11-27 | 2021-11-30 |
B | 3 | Issue resolved | Fever | 2021-07-16 | 2021-10-20 | 2021-11-21 |
C | 4 | Issue resolved | Joint pain, stiffness and swelling | 2021-08-24 | 2021-10-13 | 2021-11-11 |
C | 5 | Issue resolved | Joint pain | 2021-08-24 | 2021-10-14 | 2021-11-16 |
create_changed_data()
create_changed_data() creates a list of tables.
changed <- create_changed_data(
compare = ChangedData,
base = InitialData)
names(changed)
#> [1] "num_diffs" "var_diffs"
Counts of changes (num_diffs)
The counts of changes by variable are stored in num_diffs.
num_diffs: counts of changes for 'changed data'
variable | no_of_differences |
---|---|
text_value_a | 2 |
text_value_b | 1 |
updated_date | 5 |
entered_date | 5 |
Changes by row (var_diffs)
The changes by row are stored in var_diffs.
var_diffs: Row-by-row of changes for 'changed data'
variable | rownumber | base | compare |
---|---|---|---|
text_value_a | 1 | Issue unresolved | Issue resolved |
text_value_a | 2 | Issue unresolved | Issue resolved |
text_value_b | 4 | Joint pain | Joint pain, stiffness and swelling |
updated_date | 1 | 2021-09-29 | 2021-10-03 |
updated_date | 2 | 2021-10-03 | 2021-11-27 |
updated_date | 3 | 2021-09-02 | 2021-10-20 |
updated_date | 4 | 2021-10-03 | 2021-10-13 |
updated_date | 5 | 2021-09-20 | 2021-10-14 |
entered_date | 1 | 2021-09-29 | 2021-11-30 |
entered_date | 2 | 2021-10-29 | 2021-11-30 |
entered_date | 3 | 2021-08-18 | 2021-11-21 |
entered_date | 4 | 2021-10-03 | 2021-11-11 |
entered_date | 5 | 2021-10-20 | 2021-11-16 |
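To make the shape of these two tables concrete, the sketch below compares two data frames cell by cell, assuming they have the same columns and the same row order. It returns a long table of differences plus a count per variable, mirroring the var_diffs and num_diffs layout shown above. `cell_changes()` is a hypothetical helper for illustration only; create_changed_data() relies on diffdf(), which this sketch does not call.

```r
# Hypothetical helper (not part of dfdiffs): position-based comparison.
# Assumes identical column names and row order; NA cells are ignored.
cell_changes <- function(compare, base) {
  stopifnot(
    identical(names(base), names(compare)),
    nrow(base) == nrow(compare)
  )
  pieces <- lapply(names(base), function(col) {
    old <- as.character(base[[col]])
    new <- as.character(compare[[col]])
    rows <- which(old != new)
    data.frame(
      variable  = rep(col, length(rows)),
      rownumber = rows,
      base      = old[rows],
      compare   = new[rows],
      stringsAsFactors = FALSE
    )
  })
  var_diffs <- do.call(rbind, pieces)
  counts <- table(factor(var_diffs$variable,
                         levels = unique(var_diffs$variable)))
  num_diffs <- data.frame(
    variable          = as.character(names(counts)),
    no_of_differences = as.integer(counts),
    stringsAsFactors  = FALSE
  )
  list(num_diffs = num_diffs, var_diffs = var_diffs)
}

# First three rows and three columns of the tables above
initial <- data.frame(
  record       = c(1, 2, 3),
  text_value_a = c("Issue unresolved", "Issue unresolved", "Issue resolved"),
  updated_date = c("2021-09-29", "2021-10-03", "2021-09-02"),
  stringsAsFactors = FALSE
)
updated <- data.frame(
  record       = c(1, 2, 3),
  text_value_a = c("Issue resolved", "Issue resolved", "Issue resolved"),
  updated_date = c("2021-10-03", "2021-11-27", "2021-10-20"),
  stringsAsFactors = FALSE
)

result <- cell_changes(compare = updated, base = initial)
result$num_diffs
result$var_diffs
```

Variables with no differences (here, record) are left out of the counts, as in the num_diffs table above.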
create_modified_data()
The create_modified_data() function also creates a list of tables.
modified <- create_modified_data(
compare = ChangedData,
base = InitialData)
names(modified)
#> [1] "diffs" "diffs_byvar"
Counts of changes (diffs_byvar)
The counts of changes by variable are stored in diffs_byvar.
diffs_byvar: counts of changes for 'modified data'
Variable name | Modified Values | Missing Values |
---|---|---|
subject_id | 0 | 0 |
record | 0 | 0 |
text_value_a | 2 | 0 |
text_value_b | 1 | 0 |
created_date | 0 | 0 |
updated_date | 5 | 0 |
entered_date | 5 | 0 |
Changes by row
The changes by row are stored in diffs.
diffs: Row-by-row changes of 'modified data'
Variable name | Current Value | Previous Value |
---|---|---|
text_value_a | Issue resolved | Issue unresolved |
text_value_a | Issue resolved | Issue unresolved |
text_value_b | Joint pain, stiffness and swelling | Joint pain |
updated_date | 2021-10-03 | 2021-09-29 |
updated_date | 2021-11-27 | 2021-10-03 |
updated_date | 2021-10-20 | 2021-09-02 |
updated_date | 2021-10-13 | 2021-10-03 |
updated_date | 2021-10-14 | 2021-09-20 |
entered_date | 2021-11-30 | 2021-09-29 |
entered_date | 2021-11-30 | 2021-10-29 |
entered_date | 2021-11-21 | 2021-08-18 |
entered_date | 2021-11-11 | 2021-10-03 |
entered_date | 2021-11-16 | 2021-10-20 |
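The table above lists current and previous values without a row identifier. When rows may arrive in a different order, one option is to match records on identifying columns before comparing values. The sketch below does this in base R with merge(), assuming that subject_id and record together identify a row. `keyed_changes()` and its `keys` argument are hypothetical and shown for illustration only; the vignette does not state whether create_modified_data() matches rows this way.

```r
# Hypothetical helper (not part of dfdiffs): match rows on key columns,
# then report values that differ. NA cells are ignored.
keyed_changes <- function(compare, base, keys) {
  value_cols <- setdiff(intersect(names(base), names(compare)), keys)
  both <- merge(base, compare, by = keys,
                suffixes = c(".previous", ".current"))
  pieces <- lapply(value_cols, function(col) {
    previous <- as.character(both[[paste0(col, ".previous")]])
    current  <- as.character(both[[paste0(col, ".current")]])
    rows <- which(previous != current)
    data.frame(
      both[rows, keys, drop = FALSE],
      variable = rep(col, length(rows)),
      current  = current[rows],
      previous = previous[rows],
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Three records from the tables above; `later` is deliberately reordered
earlier <- data.frame(
  subject_id   = c("A", "A", "B"),
  record       = c(1, 2, 3),
  text_value_a = c("Issue unresolved", "Issue unresolved", "Issue resolved"),
  stringsAsFactors = FALSE
)
later <- data.frame(
  subject_id   = c("B", "A", "A"),
  record       = c(3, 1, 2),
  text_value_a = c("Issue resolved", "Issue resolved", "Issue resolved"),
  stringsAsFactors = FALSE
)

keyed <- keyed_changes(compare = later, base = earlier,
                       keys = c("subject_id", "record"))
keyed
```

Matching on keys reports the two changed A records and leaves B / 3 out, even though the rows of `later` are in a different order.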