Introducing Rule-based Annotations - GBIF Data Blog

Rule-based annotations is an experimental tool that will allow users to mark certain occurrence data as suspicious. The main goal of the project is to facilitate data cleaning and user feedback to publishers.

Nicely built and implemented, but the two goals are mistaken.

First, the “cleaning” shown in the blog post is not cleaning; it is filtering (Filtering isn’t cleaning). According to the blog post, the clean_download function:

Returns cleaned data with suspicious records removed or flagged

The removal of suspicious records does nothing to clean those records, and the flagging of suspicious records does not guarantee that the user of the results will do anything about the records.
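To make the distinction concrete, here is a minimal Python sketch. The records, the `is_suspicious` rule and the function names are all hypothetical, made up for illustration; this is not the implementation of clean_download and not real GBIF data.

```python
# Hypothetical toy records standing in for a publisher's source data.
records = [
    {"id": 1, "species": "Puma concolor", "lat": -33.4, "lon": -70.6},
    {"id": 2, "species": "Puma concolor", "lat": 0.0, "lon": 0.0},
]


def is_suspicious(rec):
    # Assumed rule for illustration only: 0, 0 coordinates look like a placeholder.
    return rec["lat"] == 0.0 and rec["lon"] == 0.0


def filter_records(recs):
    """Filtering: drop suspicious records from the user's working copy."""
    return [r for r in recs if not is_suspicious(r)]


def flag_records(recs):
    """Flagging: keep every record and add a marker to the user's copy."""
    return [{**r, "suspicious": is_suspicious(r)} for r in recs]


filtered = filter_records(records)
flagged = flag_records(records)

# The suspicious record is gone from `filtered` and marked in `flagged`,
# but its coordinates are still 0, 0 in the source records.
print(len(filtered), flagged[1]["suspicious"], records[1])
```

In neither case is record 2 corrected. Cleaning would mean the publisher replaces the placeholder coordinates with the right ones in the source dataset.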

The second mistaken goal is providing another means for end-users to contact data publishers in hopes that the publishers will investigate and correct, if required, any suspicious records.

I wrote “another” because GBIF already flags issues, and has been doing so for many years. The great majority of data publishers take no action whatsoever on GBIF-flagged issues. This can easily be demonstrated by tracking individual issues through successive versions of datasets shared with GBIF.
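The tracking could be sketched roughly as follows in Python. The occurrence IDs and the two version snapshots are invented for illustration, and `persistent_issues` is a hypothetical helper; a real comparison would start from the issue flags in two successive downloads of the same dataset.

```python
def persistent_issues(old_version, new_version):
    """Return {occurrence_id: issues flagged in both versions}."""
    unresolved = {}
    for occ_id, old_issues in old_version.items():
        # An issue persists if the same flag is on the same record in the newer version.
        still_flagged = old_issues & new_version.get(occ_id, set())
        if still_flagged:
            unresolved[occ_id] = still_flagged
    return unresolved


# Hypothetical snapshots: occurrence ID -> set of issue flags.
version_1 = {
    "occ-1": {"ZERO_COORDINATE"},
    "occ-2": {"RECORDED_DATE_INVALID"},
    "occ-3": {"COUNTRY_COORDINATE_MISMATCH"},
}
version_2 = {
    "occ-1": {"ZERO_COORDINATE"},
    "occ-2": set(),  # the publisher fixed this record
    "occ-3": {"COUNTRY_COORDINATE_MISMATCH"},
}

unresolved = persistent_issues(version_1, version_2)
print(unresolved)
```

The share of records that stay in `unresolved` from one version to the next is a rough measure of how often publishers act on flagged issues.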

GBIF annotations are primarily for data consumers, not publishers, and their primary use is data filtering, not data cleaning. I suggest editing the post’s summary to:

Rule-based annotations is an experimental tool that will allow users to mark certain occurrence data as suspicious. The main goal of the project is to expand data filtering capabilities.