Filtering isn't cleaning

@sharif.islam, of course there is a balance between what is known now and what might be known in future. But that should not be a reason to filter rather than to clean. The argument “I can’t clean all the records, so I can’t clean any of them” is not only logically indefensible, it’s bad science.

“What do you do when eventDate is not in place? If eventDate is missing, can I use automated algorithms that attempt to infer it from other available data within the record? In some cases yes; in most cases, no.”

I would disagree with both your “most cases, no” and your omission of human effort (“can I use human effort or automated algorithms…”), especially the latter. I’ve discussed that issue here: *Cleaning GBIF-mediated data: "manual" vs "automated" is a false dichotomy*, and I would be happy to talk over the details with you privately by email.
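
For readers wondering what such within-record inference can look like, here is a minimal sketch in Python. The field names (eventDate, year, month, day, verbatimEventDate) are standard Darwin Core terms, but the parsing rules and date formats are illustrative assumptions, not a complete cleaning routine:

```python
from datetime import date, datetime

def infer_event_date(record: dict) -> str | None:
    """Return an ISO 8601 eventDate inferred from other fields, or None."""
    if record.get("eventDate"):
        return record["eventDate"]  # already populated; nothing to infer

    # Case 1: the atomised Darwin Core year/month/day fields are usable.
    try:
        y, m, d = (int(record[k]) for k in ("year", "month", "day"))
        return date(y, m, d).isoformat()  # ValueError if the date is impossible
    except (KeyError, TypeError, ValueError):
        pass

    # Case 2: verbatimEventDate matches one of a few assumed formats.
    verbatim = (record.get("verbatimEventDate") or "").strip()
    for fmt in ("%d %B %Y", "%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(verbatim, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # automated inference failed; this is where human effort starts
```

The final `return None` is exactly where the false dichotomy dissolves: the algorithm hands its hard residue to a human, rather than the record being silently discarded.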

What you might like to discuss further in this thread is this question: Is it better to publish/cite/link low-quality data as soon as possible, or to publish/cite/link data when that data reaches best-possible quality after careful investigation? Filtering suits the first, cleaning the second.
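
To make that contrast concrete, here is a sketch of the two responses to a missing eventDate, reusing the hypothetical `infer_event_date` helper above. Filtering drops every failing record outright; cleaning repairs what it can and discards only the genuinely unrepairable:

```python
def filter_records(records: list[dict]) -> list[dict]:
    """Filtering: drop every record lacking an eventDate, repairable or not."""
    return [r for r in records if r.get("eventDate")]

def clean_records(records: list[dict]) -> list[dict]:
    """Cleaning: repair missing eventDates where possible, then keep them."""
    cleaned = []
    for r in records:
        r = dict(r)  # copy, so the source records are not mutated
        r["eventDate"] = infer_event_date(r)  # keeps an existing value as-is
        if r["eventDate"]:
            cleaned.append(r)
    return cleaned
```

Run over the same polluted dataset, the second function keeps every record the first one keeps, plus everything that could be repaired.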

It’s always easier, quicker and cheaper to clean data at source rather than downstream. Nevertheless, according to some of the collection managers I’ve corresponded with, the digitisation and mobilisation of collection records is and has been promoted (and funded) largely to “get the data out there ASAP”, with cleaning to be done “later” (no date specified), by persons unknown and with funding from sources unknown.

The result is that the ocean of collections-based biodiversity data is grossly polluted. Cleaning is left to users, who cannot easily return their cleaned data to that ocean. I’m not surprised that filtering has become popular (cf. ALA, rgbif), but it needs to be seen as a last resort when cleaning fails, not as a first choice for data use.