Skip to contents

Motivation

The goal of the dfdiffs is to answer the following questions:

  1. What rows are here now that weren’t here before?
  2. What rows were here before that aren’t here now?
  3. What values have been changed?

This vignette takes us through the create_deleted_data() function, which answers the “What rows were here before that aren’t here now?

What rows were here before that aren’t here now?

We will need three datasets to test for deleted data: CompleteData, IncompleteData, and DeletedData

CompleteData

The CompleteData has 9 rows and 7 column. Unique rows are identified by a combination of subject and record:

CompleteData <- dfdiffs::CompleteData
flextable::qflextable(CompleteData)

IncompleteData

IncompeleteData has 5 rows (4 have been removed)

IncompleteData <- dfdiffs::IncompleteData
flextable::qflextable(IncompleteData) |> 
  flextable::set_table_properties(layout = "autofit")

DeletedData

DeletedData contains the 4 rows of data removed from CompleteData to create IncompleteData

DeletedData <- dfdiffs::DeletedData
flextable::qflextable(DeletedData) |>
  flextable::set_table_properties(layout = "autofit")

If we check, the combination of IncompleteData and DeletedData create CompleteData.

dplyr::all_equal(target = bind_rows(IncompleteData, DeletedData), 
                 current = CompleteData)
#> [1] TRUE

Conditions

Each function in the dfdiffs package assumes the following conditions:

  1. Two datasets

  2. Multiple columns to compare (cols)

  3. Single by column

  4. Single by column, new column name (by_col)

  5. Single by column, multiple compare columns (cols)

  6. Single by column, new column name (by_col), multiple compare columns (cols)

  7. Multiple by columns

  8. Multiple by columns, new column name (by_col)

  9. Multiple by columns, multiple compare columns (cols)

  10. Multiple by columns, a new by_col, and cols

Single by column conditions

  1. Two datasets, compare all columns:

    create_deleted_data(
      compare = IncompleteData, 
      base = CompleteData)
  2. Multiple columns to compare (cols):

    create_deleted_data(
      compare = IncompleteDataJoin, 
      base = CompleteDataJoin,
      cols = c("text_var", "factor_var"))
  3. Single by column, no new column name

    create_deleted_data(
      compare = IncompleteDataJoin, 
      base = CompleteDataJoin, 
      by = "join_var")
  4. Single by column, new column name (by_col)

    create_deleted_data(
      compare = IncompleteDataJoin, 
      base = CompleteDataJoin, 
      by = "join_var", 
      by_col = 'new_join_var')
  5. Single by column, multiple compare columns cols

    create_deleted_data(
      compare = IncompleteDataJoin, 
      base = CompleteDataJoin, 
      by = "join_var", 
      cols = c("subject", "record", "factor_var", "text_var"))
  6. Single by column, new column name (by_col), multiple compare columns (cols)

    create_deleted_data(
      # data 
      compare = IncompleteDataJoin, 
      base = CompleteDataJoin, 
      # unique id
      by = "join_var", 
      # new name for id
      by_col = 'new_join_var', 
      # cols to compare
      cols = c("subject", "record", "text_var", "factor_var"))

Multiple by column conditions

  1. Multiple by columns

    create_deleted_data(
      compare = IncompleteData, 
      base = CompleteData, 
      by = c('subject', 'record'))
  2. Multiple by columns, new column name (by_col)

    create_deleted_data(
      compare = IncompleteData, 
      base = CompleteData,
      by = c('subject', 'record'),
      by_col = "new_join_col")
  3. Multiple by columns, multiple compare columns (cols)

    create_deleted_data(
      compare = IncompleteData, 
      base = CompleteData, 
      by = c('subject', 'record'),
      cols = c("subject",  "record", "factor_var", "text_var"))
  4. Multiple by columns, a new by_col, and cols

    create_deleted_data(
      compare = IncompleteData, 
      base = CompleteData, 
      by = c('subject', 'record'),
      by_col = "new_join_col",
      cols = c("subject", "record", "text_var", "factor_var"))

create_new_column()

We have a small helper function to create the join variables, create_new_column():

create_new_column(data = , cols = , new_name = )

We can use create_new_column() with CompleteData and IncompleteData to create a joining variable with subject and record:

CompleteDataJoin <- create_new_column(data = CompleteData, 
  cols = c("subject", "record"), 
  new_name = "join_var")
CompleteDataJoin
IncompleteDataJoin <- create_new_column(data = IncompleteData, 
  cols = c("subject", "record"), 
  new_name = "join_var")
IncompleteDataJoin



create_deleted_data()

Below is our create_deleted_data() function, which returns a tibble of the deleted rows.

create_deleted_data <- function(compare, base, by = NULL, by_col = NULL, cols = NULL)

Single by column conditions

The function should also be able to handle multiple conditions. Below we cover the conditions for a single by columns (assuming there is an existing unique identifier in each dataset). But first, we’ll cover a few uncommon conditions, like a missing by column, or a missing by column and specific columns selected for comparison.

1) Two datasets

  • No by columns (only two datasets)
create_deleted_data(
  compare = IncompleteData, 
  base = CompleteData)

When we compare this DeletedData, we can see this performs a row-by-row comparison.

2) Multiple columns to compare (cols)

  • No by columns (only two datasets) and multiple compare (cols)
create_deleted_data(
  compare = IncompleteDataJoin, 
  base = CompleteDataJoin,
  cols = c("text_var", "factor_var"))

When we compare this DeletedData, we can see the text_var and factor_var are identical.

3) Single by column

  • If the tables have a joining column, like CompleteDataJoin and IncompleteDataJoin, we can supply the (by) joining column
create_deleted_data(
  compare = IncompleteDataJoin, 
  base = CompleteDataJoin, 
  by = "join_var")

When we compare this to DeletedData, we can see the rows are identical.

4) Single by column, new column name (by_col)

  • We can also provide a single by column (for unique identifiers) and a new name for the by_col
create_deleted_data(
  compare = IncompleteDataJoin, 
  base = CompleteDataJoin, 
  by = "join_var", 
  by_col = 'new_join_var')

When we compare this to DeletedData, we can see the rows are identical.

5) Single by column, multiple compare columns cols

  • Single by column and multiple compare columns (cols)
create_deleted_data(
  compare = IncompleteDataJoin, 
  base = CompleteDataJoin, 
  by = "join_var", 
  cols = c("subject", "record", "factor_var", "text_var"))

When we compare this to DeletedData, we can see the rows are identical.

6) Single by column, new column name (by_col), multiple compare columns (cols)

  • Single by column, a new name for the by column (by_col), and multiple compare columns (cols)
create_deleted_data(
  compare = IncompleteDataJoin, 
  base = CompleteDataJoin, 
  by = "join_var", 
  by_col = 'new_join_var', 
  cols = c("subject", "record", "text_var", "factor_var"))

When we compare this to DeletedData, we can see the rows are identical.

Multiple by column conditions

Now we’re going to test conditions in which there are multiple columns used to create a unique identifier.

7) Multiple by columns

  • Multiple by columns (assuming the columns create a unique identifier)
create_deleted_data(
  compare = IncompleteData, 
  base = CompleteData, 
  by = c('subject', 'record'))

This creates a new join column and it’s a combination of subject and record, and when we compare this to DeletedData, we can see the rows are identical.

8) Multiple by columns, new column name (by_col)

We can provide multiple by columns, a new by_col, and no cols

create_deleted_data(
  compare = IncompleteData, 
  base = CompleteData,
  by = c('subject', 'record'),
  by_col = "new_join_col")

This creates a new new_join_col column and it’s a combination of subject and record, and when we compare this to DeletedData, we can see the rows are identical.

9) Multiple by columns, multiple compare columns (cols)

  • Multiple by columns and multiple compare columns (cols), and no new by_col.
create_deleted_data(
  compare = IncompleteData, 
  base = CompleteData, 
  by = c('subject', 'record'),
  cols = c("subject",  "record", "factor_var", "text_var"))

This creates a new join column, and it’s a combination of subject and record, and when we compare this to DeletedData, we can see the rows are identical.

10) Multiple by columns, a new by_col, and cols

We can provide multiple by columns, new by_col, and multiple cols

create_deleted_data(
  compare = IncompleteData, 
  base = CompleteData, 
  by = c('subject', 'record'),
  by_col = "new_join_col",
  cols = c("subject", "record", "text_var", "factor_var"))

This creates a new join column, and it’s a combination of subject and record, and when we compare this to DeletedData, we can see the rows are identical.