Skip to contents

Motivation

The goal of the dfdiffs is to answer the following questions:

  1. What rows are here now that weren’t here before?
  2. What rows were here before that aren’t here now?
    3. What values have been changed?

This vignette takes us through two functions:

  1. create_changed_data(): which uses the arsenal::comparedf() function

  2. create_modified_data(): which uses the diffdf::diffdf() function

Both functions answers the “What values have been changed?

Changed data

We have two datasets to test what’s been modified:

Initial Data

InitialData <- dfdiffs::InitialData
InitialData
subject_id record text_value_a text_value_b created_date updated_date entered_date
A 1 Issue unresolved Fatigue 2021-07-29 2021-09-29 2021-09-29
A 2 Issue unresolved Fatigue 2021-07-29 2021-10-03 2021-10-29
B 3 Issue resolved Fever 2021-07-16 2021-09-02 2021-08-18
C 4 Issue resolved Joint pain 2021-08-24 2021-10-03 2021-10-03
C 5 Issue resolved Joint pain 2021-08-24 2021-09-20 2021-10-20

Changed Data

ChangedData <- dfdiffs::ChangedData
ChangedData
subject_id record text_value_a text_value_b created_date updated_date entered_date
A 1 Issue resolved Fatigue 2021-07-29 2021-10-03 2021-11-30
A 2 Issue resolved Fatigue 2021-07-29 2021-11-27 2021-11-30
B 3 Issue resolved Fever 2021-07-16 2021-10-20 2021-11-21
C 4 Issue resolved Joint pain, stiffness and swelling 2021-08-24 2021-10-13 2021-11-11
C 5 Issue resolved Joint pain 2021-08-24 2021-10-14 2021-11-16

Creating join columns

We will use our create_new_column() function to create join_var from subject_id and record

ChangedDataJoin <- dfdiffs::create_new_column(
  data = ChangedData, 
  cols = c("subject_id", "record"), 
  new_name = "join_var")
ChangedDataJoin
ChangedDataJoin
join_var subject_id record text_value_a text_value_b created_date updated_date entered_date
A-1 A 1 Issue resolved Fatigue 2021-07-29 2021-10-03 2021-11-30
A-2 A 2 Issue resolved Fatigue 2021-07-29 2021-11-27 2021-11-30
B-3 B 3 Issue resolved Fever 2021-07-16 2021-10-20 2021-11-21
C-4 C 4 Issue resolved Joint pain, stiffness and swelling 2021-08-24 2021-10-13 2021-11-11
C-5 C 5 Issue resolved Joint pain 2021-08-24 2021-10-14 2021-11-16
InitialDataJoin <- dfdiffs::create_new_column(
  data = InitialData, 
  cols = c("subject_id", "record"), 
  new_name = "join_var")
InitialDataJoin
InitialDataJoin
join_var subject_id record text_value_a text_value_b created_date updated_date entered_date
A-1 A 1 Issue unresolved Fatigue 2021-07-29 2021-09-29 2021-09-29
A-2 A 2 Issue unresolved Fatigue 2021-07-29 2021-10-03 2021-10-29
B-3 B 3 Issue resolved Fever 2021-07-16 2021-09-02 2021-08-18
C-4 C 4 Issue resolved Joint pain 2021-08-24 2021-10-03 2021-10-03
C-5 C 5 Issue resolved Joint pain 2021-08-24 2021-09-20 2021-10-20

create_changed_data()

Each comparison function in the dfdiffs package assumes base and compare datasets might have one of the following conditions:

  1. Two datasets

  2. Multiple columns to compare (cols)

  3. Single by column

  4. Single by column, new column name (by_col)

  5. Single by column, multiple compare columns (cols)

  6. Single by column, new column name (by_col), multiple compare columns (cols)

  7. Multiple by columns

  8. Multiple by columns, new column name (by_col)

  9. Multiple by columns, multiple compare columns (cols)

  10. Multiple by columns, a new by_col, and cols

Single by column conditions

1) Two datasets

compare all columns:

create_changed_data(
  compare = ChangedData, 
  base = InitialData)
variable no_of_differences
text_value_a 2
text_value_b 1
updated_date 5
entered_date 5
variable rownumber base compare
text_value_a 1 Issue unresolved Issue resolved
text_value_a 2 Issue unresolved Issue resolved
text_value_b 4 Joint pain Joint pain, stiffness and swelling
updated_date 1 2021-09-29 2021-10-03
updated_date 2 2021-10-03 2021-11-27
updated_date 3 2021-09-02 2021-10-20
updated_date 4 2021-10-03 2021-10-13
updated_date 5 2021-09-20 2021-10-14
entered_date 1 2021-09-29 2021-11-30
entered_date 2 2021-10-29 2021-11-30
entered_date 3 2021-08-18 2021-11-21
entered_date 4 2021-10-03 2021-11-11
entered_date 5 2021-10-20 2021-11-16

2) Multiple columns to compare (cols)

create_changed_data(
  compare = ChangedData, 
  base = InitialData, 
  cols = c("text_value_a", "text_value_b"))
variable no_of_differences
text_value_a 2
text_value_b 1
variable rownumber base compare
text_value_a 1 Issue unresolved Issue resolved
text_value_a 2 Issue unresolved Issue resolved
text_value_b 4 Joint pain Joint pain, stiffness and swelling

3) Single by column

No new column name

create_changed_data(
  compare = ChangedDataJoin, 
  base = InitialDataJoin, 
  by = "join_var")
variable no_of_differences
text_value_a 2
text_value_b 1
updated_date 5
entered_date 5
variable join_var base compare
text_value_a A-1 Issue unresolved Issue resolved
text_value_a A-2 Issue unresolved Issue resolved
text_value_b C-4 Joint pain Joint pain, stiffness and swelling
updated_date A-1 2021-09-29 2021-10-03
updated_date A-2 2021-10-03 2021-11-27
updated_date B-3 2021-09-02 2021-10-20
updated_date C-4 2021-10-03 2021-10-13
updated_date C-5 2021-09-20 2021-10-14
entered_date A-1 2021-09-29 2021-11-30
entered_date A-2 2021-10-29 2021-11-30
entered_date B-3 2021-08-18 2021-11-21
entered_date C-4 2021-10-03 2021-11-11
entered_date C-5 2021-10-20 2021-11-16

4) Single by column, new column name (by_col)

create_changed_data(
  compare = ChangedDataJoin, 
  base = InitialDataJoin, 
  by = "join_var",
  by_col = "join")
variable no_of_differences
text_value_a 2
text_value_b 1
updated_date 5
entered_date 5
variable join base compare
text_value_a A-1 Issue unresolved Issue resolved
text_value_a A-2 Issue unresolved Issue resolved
text_value_b C-4 Joint pain Joint pain, stiffness and swelling
updated_date A-1 2021-09-29 2021-10-03
updated_date A-2 2021-10-03 2021-11-27
updated_date B-3 2021-09-02 2021-10-20
updated_date C-4 2021-10-03 2021-10-13
updated_date C-5 2021-09-20 2021-10-14
entered_date A-1 2021-09-29 2021-11-30
entered_date A-2 2021-10-29 2021-11-30
entered_date B-3 2021-08-18 2021-11-21
entered_date C-4 2021-10-03 2021-11-11
entered_date C-5 2021-10-20 2021-11-16

5) Single by column, multiple compare columns (cols)

create_changed_data(
  compare = ChangedDataJoin, 
  base = InitialDataJoin, 
  by = "join_var", 
  cols = c("text_value_a", "text_value_b"))
variable no_of_differences
text_value_a 2
text_value_b 1
variable join_var base compare
text_value_a A-1 Issue unresolved Issue resolved
text_value_a A-2 Issue unresolved Issue resolved
text_value_b C-4 Joint pain Joint pain, stiffness and swelling

6) Single by column, new column name (by_col), multiple compare columns (cols)

create_changed_data(
  compare = ChangedDataJoin, 
  base = InitialDataJoin, 
  by = "join_var", 
  by_col = "join",
  cols = c("text_value_a", "text_value_b"))
variable no_of_differences
text_value_a 2
text_value_b 1
variable join base compare
text_value_a A-1 Issue unresolved Issue resolved
text_value_a A-2 Issue unresolved Issue resolved
text_value_b C-4 Joint pain Joint pain, stiffness and swelling

Multiple by column conditions

7) Multiple by columns

create_changed_data(
  compare = ChangedData, 
  base = InitialData, 
  by = c("subject_id", "record"))
variable no_of_differences
text_value_a 2
text_value_b 1
updated_date 5
entered_date 5
variable join base compare
text_value_a A-1 Issue unresolved Issue resolved
text_value_a A-2 Issue unresolved Issue resolved
text_value_b C-4 Joint pain Joint pain, stiffness and swelling
updated_date A-1 2021-09-29 2021-10-03
updated_date A-2 2021-10-03 2021-11-27
updated_date B-3 2021-09-02 2021-10-20
updated_date C-4 2021-10-03 2021-10-13
updated_date C-5 2021-09-20 2021-10-14
entered_date A-1 2021-09-29 2021-11-30
entered_date A-2 2021-10-29 2021-11-30
entered_date B-3 2021-08-18 2021-11-21
entered_date C-4 2021-10-03 2021-11-11
entered_date C-5 2021-10-20 2021-11-16

8) Multiple by columns, new column name (by_col)

create_changed_data(
  compare = ChangedData, 
  base = InitialData, 
  by = c("subject_id", "record"), 
  by_col = "new_join_var")
variable no_of_differences
text_value_a 2
text_value_b 1
updated_date 5
entered_date 5
variable new_join_var base compare
text_value_a A-1 Issue unresolved Issue resolved
text_value_a A-2 Issue unresolved Issue resolved
text_value_b C-4 Joint pain Joint pain, stiffness and swelling
updated_date A-1 2021-09-29 2021-10-03
updated_date A-2 2021-10-03 2021-11-27
updated_date B-3 2021-09-02 2021-10-20
updated_date C-4 2021-10-03 2021-10-13
updated_date C-5 2021-09-20 2021-10-14
entered_date A-1 2021-09-29 2021-11-30
entered_date A-2 2021-10-29 2021-11-30
entered_date B-3 2021-08-18 2021-11-21
entered_date C-4 2021-10-03 2021-11-11
entered_date C-5 2021-10-20 2021-11-16

9) Multiple by columns, multiple compare columns (cols)

create_changed_data(
  compare = ChangedData, 
  base = InitialData, 
  by = c("subject_id", "record"), 
  cols = c("text_value_a", "text_value_b"))
variable no_of_differences
text_value_a 2
text_value_b 1
variable join base compare
text_value_a A-1 Issue unresolved Issue resolved
text_value_a A-2 Issue unresolved Issue resolved
text_value_b C-4 Joint pain Joint pain, stiffness and swelling

10) Multiple by columns, a new by_col, and cols

create_changed_data(
  compare = ChangedData, 
  base = InitialData, 
  by = c("subject_id", "record"), 
  by_col = "join",
  cols = c("text_value_a", "text_value_b"))
variable no_of_differences
text_value_a 2
text_value_b 1
variable join base compare
text_value_a A-1 Issue unresolved Issue resolved
text_value_a A-2 Issue unresolved Issue resolved
text_value_b C-4 Joint pain Joint pain, stiffness and swelling

create_modified_data()

Below we have the create_modified_data() function, which takes two data frames (compare and base), a by column, and the cols to compare.

create_modified_data(compare, base, by = NULL, by_col = NULL, cols = NULL)

We’ll test this function below on all ten possible conditions for base and compare.

Single joining (by) column conditions

1) Two datasets

  • No by columns (only two datasets)
compare_list <- create_modified_data(
  compare = ChangedData, 
  base = InitialData)
Variable name Modified Values Missing Values
subject_id 0 0
record 0 0
text_value_a 2 0
text_value_b 1 0
created_date 0 0
updated_date 5 0
entered_date 5 0
Variable name Current Value Previous Value
text_value_a Issue resolved Issue unresolved
text_value_a Issue resolved Issue unresolved
text_value_b Joint pain, stiffness and swelling Joint pain
updated_date 2021-10-03 2021-09-29
updated_date 2021-11-27 2021-10-03
updated_date 2021-10-20 2021-09-02
updated_date 2021-10-13 2021-10-03
updated_date 2021-10-14 2021-09-20
entered_date 2021-11-30 2021-09-29
entered_date 2021-11-30 2021-10-29
entered_date 2021-11-21 2021-08-18
entered_date 2021-11-11 2021-10-03
entered_date 2021-11-16 2021-10-20

2) Multiple columns to compare (cols)

  • No by columns (only two datasets) and multiple compare (cols)
compare_list <- create_modified_data(
  compare = ChangedData, 
  base = InitialData, 
  cols = c("text_value_a", "text_value_b"))
Variable name Modified Values Missing Values
text_value_a 2 0
text_value_b 1 0
Variable name Current Value Previous Value
text_value_a Issue resolved Issue unresolved
text_value_a Issue resolved Issue unresolved
text_value_b Joint pain, stiffness and swelling Joint pain

3) Single by column

  • We can provide a single by column (using our InitialDataJoin and ChangedDataJoin) datasets we created above.
compare_list <- create_modified_data(
  compare = ChangedDataJoin, 
  base = InitialDataJoin, 
  by = "join_var")
Variable name Modified Values Missing Values
subject_id 0 0
record 0 0
text_value_a 2 0
text_value_b 1 0
created_date 0 0
updated_date 5 0
entered_date 5 0
Variable name join_var Current Value Previous Value
text_value_a A-1 Issue resolved Issue unresolved
text_value_a A-2 Issue resolved Issue unresolved
text_value_b C-4 Joint pain, stiffness and swelling Joint pain
updated_date A-1 2021-10-03 2021-09-29
updated_date A-2 2021-11-27 2021-10-03
updated_date B-3 2021-10-20 2021-09-02
updated_date C-4 2021-10-13 2021-10-03
updated_date C-5 2021-10-14 2021-09-20
entered_date A-1 2021-11-30 2021-09-29
entered_date A-2 2021-11-30 2021-10-29
entered_date B-3 2021-11-21 2021-08-18
entered_date C-4 2021-11-11 2021-10-03
entered_date C-5 2021-11-16 2021-10-20

4) Single by column, new column name (by_col)

  • We can also provide a single by column (for unique identifiers) and a new name for the by_col
compare_list <- create_modified_data(
  compare = ChangedDataJoin, 
  base = InitialDataJoin, 
  by = "join_var",
  by_col = "new_join_var")
Variable name Modified Values Missing Values
subject_id 0 0
record 0 0
text_value_a 2 0
text_value_b 1 0
created_date 0 0
updated_date 5 0
entered_date 5 0
Variable name new_join_var Current Value Previous Value
text_value_a A-1 Issue resolved Issue unresolved
text_value_a A-2 Issue resolved Issue unresolved
text_value_b C-4 Joint pain, stiffness and swelling Joint pain
updated_date A-1 2021-10-03 2021-09-29
updated_date A-2 2021-11-27 2021-10-03
updated_date B-3 2021-10-20 2021-09-02
updated_date C-4 2021-10-13 2021-10-03
updated_date C-5 2021-10-14 2021-09-20
entered_date A-1 2021-11-30 2021-09-29
entered_date A-2 2021-11-30 2021-10-29
entered_date B-3 2021-11-21 2021-08-18
entered_date C-4 2021-11-11 2021-10-03
entered_date C-5 2021-11-16 2021-10-20

5) Single by column, multiple compare columns cols

  • Single by column and multiple compare columns (cols)
compare_list <- create_modified_data(
  compare = ChangedDataJoin, 
  base = InitialDataJoin, 
  by = "join_var", 
  cols = c("text_value_a", "text_value_b"))
Variable name Modified Values Missing Values
text_value_a 2 0
text_value_b 1 0
Variable name join_var Current Value Previous Value
text_value_a A-1 Issue resolved Issue unresolved
text_value_a A-2 Issue resolved Issue unresolved
text_value_b C-4 Joint pain, stiffness and swelling Joint pain

6) Single by column, new column name (by_col), multiple compare columns (cols)

  • Single by column, a new name for the by column (by_col), and multiple compare columns (cols)
compare_list <- create_modified_data(
  compare = ChangedDataJoin, 
  base = InitialDataJoin, 
  by = "join_var", 
  by_col = "new_join_var",
  cols = c("text_value_a", "text_value_b"))
Variable name Modified Values Missing Values
text_value_a 2 0
text_value_b 1 0
Variable name new_join_var Current Value Previous Value
text_value_a A-1 Issue resolved Issue unresolved
text_value_a A-2 Issue resolved Issue unresolved
text_value_b C-4 Joint pain, stiffness and swelling Joint pain

Multiple by column conditions

The next conditions cover when two existing columns are used to create a unique identifier between the two datasets.

7) Multiple by columns

  • Multiple by columns (assuming the columns create a unique identifier).
compare_list <- create_modified_data(
  compare = ChangedData, 
  base = InitialData, 
  by = c("subject_id", "record"))
Variable name Modified Values Missing Values
subject_id 0 0
record 0 0
text_value_a 2 0
text_value_b 1 0
created_date 0 0
updated_date 5 0
entered_date 5 0
Variable name join Current Value Previous Value
text_value_a A-1 Issue resolved Issue unresolved
text_value_a A-2 Issue resolved Issue unresolved
text_value_b C-4 Joint pain, stiffness and swelling Joint pain
updated_date A-1 2021-10-03 2021-09-29
updated_date A-2 2021-11-27 2021-10-03
updated_date B-3 2021-10-20 2021-09-02
updated_date C-4 2021-10-13 2021-10-03
updated_date C-5 2021-10-14 2021-09-20
entered_date A-1 2021-11-30 2021-09-29
entered_date A-2 2021-11-30 2021-10-29
entered_date B-3 2021-11-21 2021-08-18
entered_date C-4 2021-11-11 2021-10-03
entered_date C-5 2021-11-16 2021-10-20

This creates a join column from the by columns.

8) Multiple by columns, new column name (by_col)

We can provide multiple by columns, a new by_col, and no cols

compare_list <- create_modified_data(
  compare = ChangedData, 
  base = InitialData, 
  by = c("subject_id", "record"), 
  by_col = "new_join_var")
Variable name Modified Values Missing Values
subject_id 0 0
record 0 0
text_value_a 2 0
text_value_b 1 0
created_date 0 0
updated_date 5 0
entered_date 5 0
Variable name new_join_var Current Value Previous Value
text_value_a A-1 Issue resolved Issue unresolved
text_value_a A-2 Issue resolved Issue unresolved
text_value_b C-4 Joint pain, stiffness and swelling Joint pain
updated_date A-1 2021-10-03 2021-09-29
updated_date A-2 2021-11-27 2021-10-03
updated_date B-3 2021-10-20 2021-09-02
updated_date C-4 2021-10-13 2021-10-03
updated_date C-5 2021-10-14 2021-09-20
entered_date A-1 2021-11-30 2021-09-29
entered_date A-2 2021-11-30 2021-10-29
entered_date B-3 2021-11-21 2021-08-18
entered_date C-4 2021-11-11 2021-10-03
entered_date C-5 2021-11-16 2021-10-20

9) Multiple by columns, multiple compare columns (cols)

  • Multiple by columns and multiple compare columns (cols), and no new by_col.
compare_list <- create_modified_data(
  compare = ChangedData, 
  base = InitialData, 
  by = c("subject_id", "record"), 
  cols = c("text_value_a", "text_value_b"))
Variable name Modified Values Missing Values
text_value_a 2 0
text_value_b 1 0
Variable name join Current Value Previous Value
text_value_a A-1 Issue resolved Issue unresolved
text_value_a A-2 Issue resolved Issue unresolved
text_value_b C-4 Joint pain, stiffness and swelling Joint pain

This creates a join column from the by columns.

10) Multiple by columns, a new by_col, and cols

  • We can provide multiple by columns, a new by_col, and multiple cols.
compare_list <- create_modified_data(
  compare = ChangedData, 
  base = InitialData, 
  by = c("subject_id", "record"), 
  by_col = "new_join_var",
  cols = c("text_value_a", "text_value_b"))
Variable name Modified Values Missing Values
text_value_a 2 0
text_value_b 1 0
Variable name new_join_var Current Value Previous Value
text_value_a A-1 Issue resolved Issue unresolved
text_value_a A-2 Issue resolved Issue unresolved
text_value_b C-4 Joint pain, stiffness and swelling Joint pain