Filtering isn't cleaning

datafixer · September 1, 2023, 8:43pm

@sharif.islam, recording individual annotations (and corrections) in a structured way is of course a good idea, but while you are thinking about it, please also consider:

suggested corrections that apply to multiple records (1000s or more)
suggested corrections that involve multiple fields at the same time
suggested corrections that involve multiple records at the same time (“These records contradict each other”)
suggested corrections that identify strict duplicate records
suggestions without explicit corrections (“This can’t be right, please check”; “There is a missing-but-expected value here”; “One or more of these values is incorrect, please check”; see here)
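
To make the cases above concrete, here is a rough sketch of what a non-atomic suggestion record might need to hold. Everything in it (the `Suggestion` class, its field names, the `kind` labels and the record IDs) is invented for illustration and is not a TDWG or DiSSCo structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    """One checker's suggestion. All names here are hypothetical."""
    record_ids: list[str]   # one record, or many thousands
    fields: list[str]       # one field, or several at the same time
    kind: str               # e.g. "correction", "contradiction", "duplicate", "query"
    note: str               # free-text explanation for the data publisher
    proposed: Optional[dict[str, str]] = None  # field -> new value; None means "please check"

    def is_atomic(self) -> bool:
        # True only for the simple case: one value in one field, with an explicit correction.
        return (
            len(self.record_ids) == 1
            and len(self.fields) == 1
            and self.proposed is not None
        )


examples = [
    # The atomic case: a single value corrected in a single record.
    Suggestion(["r1"], ["countryCode"], "correction",
               "Country code does not match country", {"countryCode": "AU"}),
    # Two records that contradict each other, with no explicit correction.
    Suggestion(["r7", "r8"], ["eventDate", "recordedBy"], "contradiction",
               "These records contradict each other"),
    # A query on one value, again without a proposed replacement.
    Suggestion(["r9"], ["decimalLatitude"], "query",
               "This can't be right, please check"),
]

atomic_flags = [s.is_atomic() for s in examples]
```

Only the first example fits the one-value, one-field model; the other two have either several records and fields or no proposed value at all.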

Atomic annotations for individual values in individual fields are suited to TDWG-type “assertions”, but not to the real-world checking that I and other data auditors do, and that DiSSCo and its participants will presumably also do when they begin their own checking.

An alternative type of record-keeping is to have an original dataset version, a corrected or queried dataset version and a diff that applies to the whole dataset. Each diff could be given a PID, and the new versions would be generated by both the data publisher (DiSSCo participant or DiSSCo staff) and data checkers.

Yes, this is information-dense and requires substantial storage, but it is less complicated than tracking individual annotations/corrections.
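
As a minimal sketch of the whole-dataset diff idea, assuming both versions are CSV tables sharing a stable record identifier: the sample rows, the helper names `read_records` and `dataset_diff`, and the choice of `occurrenceID` as the key are my assumptions for illustration. Minting a PID for the diff is not shown.

```python
import csv
import io

# Invented sample data: an original version and a checker's corrected version.
ORIGINAL = """\
occurrenceID,country,countryCode
r1,Australia,AT
r2,Austria,AT
r3,Austria,AT
"""

CORRECTED = """\
occurrenceID,country,countryCode
r1,Australia,AU
r2,Austria,AT
"""


def read_records(text, key="occurrenceID"):
    """Parse CSV text into a dict of rows keyed by the record identifier."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def dataset_diff(original, corrected):
    """List every added, removed or modified record between two versions."""
    changes = []
    for rec_id in sorted(original.keys() | corrected.keys()):
        old = original.get(rec_id)
        new = corrected.get(rec_id)
        if old is None:
            changes.append({"id": rec_id, "change": "added"})
        elif new is None:
            changes.append({"id": rec_id, "change": "removed"})
        else:
            # Compare field by field so multi-field corrections are all captured.
            for name in old:
                if old[name] != new.get(name):
                    changes.append({"id": rec_id, "change": "modified",
                                    "field": name,
                                    "old": old[name], "new": new.get(name)})
    return changes


diff = dataset_diff(read_records(ORIGINAL), read_records(CORRECTED))
for entry in diff:
    print(entry)
```

A single diff like this covers bulk edits, multi-field edits and record removals in one object, without a separate annotation per changed value.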
