Filtering isn't cleaning

@sharif.islam, of course there is a balance between what is known now and what might be known in future. But that should not be a reason to filter rather than to clean. The argument “I can’t clean all the records, so I can’t clean any of them” is not only logically indefensible, it’s bad science.

“What do you do when eventDate is not in place? If eventDate is missing, can I use automated algorithms that attempt to infer it from other available data within the record? In some cases yes; in most cases, no.”

I would disagree with both your “most cases, no” and your omission of human effort (“can I use human effort or automated algorithms…”), especially the latter. I’ve discussed that issue here: *Cleaning GBIF-mediated data: "manual" vs "automated" is a false dichotomy*, and I would be happy to talk over the details with you privately by email.
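
For readers wondering what such within-record inference can look like, here is a minimal sketch in Python. The field names (eventDate, year, month, day, verbatimEventDate) are standard Darwin Core terms, but the parsing rules and date formats are illustrative assumptions, not a complete cleaning routine:

```python
from datetime import date, datetime

def infer_event_date(record: dict) -> str | None:
    """Return an ISO 8601 eventDate inferred from other fields, or None."""
    if record.get("eventDate"):
        return record["eventDate"]  # already populated; nothing to infer

    # Case 1: the atomised Darwin Core year/month/day fields are usable.
    try:
        y, m, d = (int(record[k]) for k in ("year", "month", "day"))
        return date(y, m, d).isoformat()  # ValueError if the date is impossible
    except (KeyError, TypeError, ValueError):
        pass

    # Case 2: verbatimEventDate matches one of a few assumed formats.
    verbatim = (record.get("verbatimEventDate") or "").strip()
    for fmt in ("%d %B %Y", "%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(verbatim, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # automated inference failed; this is where human effort starts
```

The final `return None` is exactly where the false dichotomy dissolves: the algorithm hands its hard residue to a human, rather than the record being silently discarded.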

What you might like to discuss further in this thread is this question: Is it better to publish/cite/link low-quality data as soon as possible, or to publish/cite/link data when that data reaches best-possible quality after careful investigation? Filtering suits the first, cleaning the second.
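
To make that contrast concrete, here is a sketch of the two responses to a missing eventDate, reusing the hypothetical `infer_event_date` helper above. Filtering drops every failing record outright; cleaning repairs what it can and discards only the genuinely unrepairable:

```python
def filter_records(records: list[dict]) -> list[dict]:
    """Filtering: drop every record lacking an eventDate, repairable or not."""
    return [r for r in records if r.get("eventDate")]

def clean_records(records: list[dict]) -> list[dict]:
    """Cleaning: repair missing eventDates where possible, then keep them."""
    cleaned = []
    for r in records:
        r = dict(r)  # copy, so the source records are not mutated
        r["eventDate"] = infer_event_date(r)  # keeps an existing value as-is
        if r["eventDate"]:
            cleaned.append(r)
    return cleaned
```

Run over the same polluted dataset, the second function keeps every record the first one keeps, plus everything that could be repaired.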

It’s always easier, quicker and cheaper to clean data at source rather than downstream. Nevertheless, according to some of the collection managers I’ve corresponded with, the digitisation and mobilisation of collection records is and has been promoted (and funded) largely to “get the data out there ASAP”, with cleaning to be done “later” (no date specified), by persons unknown and with funding from sources unknown.

The result is that the ocean of collections-based biodiversity data is grossly polluted. Cleaning is left to users, who cannot easily return their cleaned data to that ocean. I’m not surprised that filtering has become popular (cf. ALA, rgbif), but it needs to be seen as a last resort when cleaning fails, not as a first choice for data use.