Filtering isn't cleaning

@sharif.islam, recording individual annotations (and corrections) in a structured way is of course a good idea, but while you are thinking about it, please consider

  • suggested corrections that apply to multiple records (often thousands at a time)
  • suggested corrections that involve multiple fields at the same time
  • suggested corrections that involve multiple records at the same time (“These records contradict each other”)
  • suggested corrections that identify strict duplicate records
  • suggestions without explicit corrections (“This can’t be right, please check”; “There is a missing-but-expected value here”; “One or more of these values is incorrect, please check”; see here)
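
To make the list above concrete, here is a minimal sketch of a suggestion record that is not tied to one value in one field. The class name `Suggestion`, its attributes, the `kind` labels and the record IDs are all hypothetical, invented for illustration; they are not part of any TDWG or DiSSCo specification.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    """Hypothetical container for one auditor suggestion (illustrative only)."""

    record_ids: list[str]            # one record, or many thousands
    fields: list[str]                # one field, or several at once
    kind: str                        # e.g. "correction", "contradiction", "duplicate", "query"
    proposed: Optional[dict[str, str]] = None  # None when no explicit correction is offered
    note: str = ""                   # free-text message for the data publisher

    def is_atomic(self) -> bool:
        """True only for a single-record, single-field explicit correction."""
        return (
            len(self.record_ids) == 1
            and len(self.fields) == 1
            and self.proposed is not None
        )


# An explicit correction of one value in one field (the "atomic" case).
typo = Suggestion(["rec-2"], ["country"], "correction", {"country": "Australia"})

# Two records that contradict each other, with no explicit correction offered.
contradiction = Suggestion(
    ["rec-7", "rec-8"],
    ["eventDate", "recordedBy"],
    "contradiction",
    note="These records contradict each other",
)

# A query without a proposed value.
query = Suggestion(["rec-9"], ["eventDate"], "query", note="This can't be right, please check")
```

Only `typo` passes `is_atomic()`; the other two are the kinds of suggestion that a one-value, one-field annotation model has no natural place for.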

Atomic annotations for individual values in individual fields are suited to TDWG-type “assertions”, but not to the real-world checking that I and other data auditors do, and that DiSSCo and its participants will presumably also do when they begin their own checking.

An alternative type of record-keeping is to have an original dataset version, a corrected or queried dataset version and a diff that applies to the whole dataset. Each diff could be given a PID, and the new versions would be generated by both the data publisher (DiSSCo participant or DiSSCo staff) and data checkers.
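
As a rough sketch of the whole-dataset approach, the snippet below compares an original and a corrected version of a small table and produces one diff for the entire dataset. It assumes each record has a stable identifier column (here called `id`); the function names, the sample records and the shape of the diff are made up for illustration, and assigning a PID to the result is left out.

```python
import csv
import io

# Illustrative data only: record 2 has a misspelled country,
# and record 3b is a strict duplicate of record 3.
ORIGINAL = """
id,scientificName,country
1,Aus bus,Australia
2,Aus bus,Austrlia
3,Cus dus,Chile
3b,Cus dus,Chile
"""

CORRECTED = """
id,scientificName,country
1,Aus bus,Australia
2,Aus bus,Australia
3,Cus dus,Chile
"""


def load(text, key="id"):
    """Parse CSV text into a dict of rows keyed by the identifier column."""
    reader = csv.DictReader(io.StringIO(text.strip()))
    return {row[key]: row for row in reader}


def dataset_diff(original, corrected):
    """Return one diff covering removed, added and changed records."""
    diff = {"removed": [], "added": [], "changed": {}}
    for rid, old_row in original.items():
        if rid not in corrected:
            diff["removed"].append(rid)
            continue
        new_row = corrected[rid]
        # Keep (old, new) pairs only for fields whose values differ.
        changes = {
            name: (old_value, new_row.get(name))
            for name, old_value in old_row.items()
            if old_value != new_row.get(name)
        }
        if changes:
            diff["changed"][rid] = changes
    diff["added"] = [rid for rid in corrected if rid not in original]
    return diff


diff = dataset_diff(load(ORIGINAL), load(CORRECTED))
print(diff)
```

Here the single diff records both the removal of the duplicate `3b` and the corrected spelling in record `2`, without needing a separate annotation for each.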

Yes, this is information-dense and requires substantial storage, but it is less complicated than tracking individual annotations/corrections.
