Introducing Rule-based Annotations - GBIF Data Blog

NOTE: This is experimental, and the implementation may change. Please see the GBIF technical documentation for the latest updates on the tool. GBIF makes no guarantees about the availability or stability of this tool.


This is a companion discussion topic for the original entry at https://data-blog.gbif.org/post/2026-01-21-rule-based-annotations

Rule-based annotations is an experimental tool that will allow users to mark certain occurrence data as suspicious. The main goal of the project is to facilitate data cleaning and user feedback to publishers.

Nicely built and implemented, but the two goals are mistaken.

First, the “cleaning” shown in the blog post is not cleaning; it is filtering (see “Filtering isn’t cleaning”). The clean_download function

Returns cleaned data with suspicious records removed or flagged

The removal of suspicious records does nothing to clean those records, and the flagging of suspicious records does not guarantee that the user of the results will do anything about the records.
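To make the distinction concrete, here is a minimal sketch in Python. It is not the blog post's clean_download function; the records, the `suspicious` field and the three helper functions are all invented for illustration, and the "correction" assumes the error is a known latitude/longitude swap.

```python
# Hypothetical records: field names follow Darwin Core, values are invented.
records = [
    {"id": 1, "decimalLatitude": 55.7, "decimalLongitude": 12.6, "suspicious": False},
    # Record 2 has latitude and longitude swapped (assumed for this example).
    {"id": 2, "decimalLatitude": 12.6, "decimalLongitude": 55.7, "suspicious": True},
]


def filter_records(recs):
    """Filtering: drop suspicious records. The error itself is never fixed."""
    return [r for r in recs if not r["suspicious"]]


def flag_records(recs):
    """Flagging: keep every record and attach a marker. Nothing is corrected."""
    return [{**r, "flag": "SUSPICIOUS" if r["suspicious"] else None} for r in recs]


def clean_records(recs):
    """Cleaning: correct the error (here, undo the assumed coordinate swap)."""
    cleaned = []
    for r in recs:
        if r["suspicious"]:
            r = {
                **r,
                "decimalLatitude": r["decimalLongitude"],
                "decimalLongitude": r["decimalLatitude"],
                "suspicious": False,
            }
        cleaned.append(r)
    return cleaned
```

Only the third function changes the content of a record. The first two change which records the user sees, or how they are labelled.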

The second mistaken goal is providing another means for end-users to contact data publishers in hopes that the publishers will investigate and correct, if required, any suspicious records.

I wrote “another” because GBIF already flags issues, and has been doing so for many years. The great majority of data publishers take no action whatsoever on GBIF-flagged issues. This can easily be demonstrated by tracking individual issues through successive versions of datasets shared with GBIF.
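A rough sketch of that kind of tracking, in Python. The two dictionaries stand in for the issue flags attached to each record in two successive versions of a dataset; the occurrence IDs and the data are invented, and how the flags are obtained from GBIF is left out.

```python
# Hypothetical input: occurrenceID -> set of issue flags, for two versions
# of the same dataset. IDs and contents are invented for illustration.
version_1 = {
    "occ-001": {"COUNTRY_COORDINATE_MISMATCH", "COORDINATE_ROUNDED"},
    "occ-002": {"RECORDED_DATE_INVALID"},
    "occ-003": {"ZERO_COORDINATE"},
}
version_2 = {
    "occ-001": {"COUNTRY_COORDINATE_MISMATCH", "COORDINATE_ROUNDED"},
    "occ-002": set(),  # the publisher fixed the date
    "occ-003": {"ZERO_COORDINATE"},
}


def persisting_issues(old, new):
    """Return {occurrenceID: issues present in both versions}."""
    out = {}
    for occ_id, old_issues in old.items():
        still_there = old_issues & new.get(occ_id, set())
        if still_there:
            out[occ_id] = still_there
    return out


unresolved = persisting_issues(version_1, version_2)
```

In this invented example two of the three records carry the same flags in both versions, which is the pattern described above.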

GBIF annotations are primarily for data consumers, not publishers, and the primary use of the annotations is for data filtering, not data cleaning. I suggest editing the post’s summary to

Rule-based annotations is an experimental tool that will allow users to mark certain occurrence data as suspicious. The main goal of the project is to expand data filtering capabilities.

Great! I can already think of several useful cases for drawing inverted polygons.
By the way, as mentioned in the post, a species may occur outside of its expected range because it is housed in a botanical garden or a zoo (but it still occurs there). Will occurrences based on specimens from living collections still be marked as suspicious? Perhaps such occurrences could be excluded from rule-based annotations (relying on users’ knowledge of how to filter on dwc:basisOfRecord)?

I agree that sometimes rules will overlap with other filters. What I was aiming for was something that would be broadly useful.

I assume that if you are truly interested in fossil records, zoo records, or anything else unfiltered, such as living specimens, you shouldn’t really be filtering your records based on user-created rules.

Currently there are ways to create a complex rule that would not flag a certain basisOfRecord value, but it is unlikely that users will always make good use of this feature. Similarly, it is unlikely that all publishers will fill out the basisOfRecord field “correctly”, marking fossils, living specimens, etc. as such. There are a lot of zoo records marked as preserved specimens.
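As a rough illustration of the idea, and not of the actual rule syntax, here is a Python sketch. The bounding box, the exempt set and both function names are assumptions made up for this example; a real rule would use a polygon rather than a box.

```python
# Hypothetical "expected range" as (min_lon, min_lat, max_lon, max_lat).
EXPECTED_RANGE = (-10.0, 35.0, 30.0, 60.0)

# basisOfRecord values that the (hypothetical) rule should never flag.
EXEMPT_BASIS = {"LIVING_SPECIMEN", "FOSSIL_SPECIMEN"}


def outside_box(rec, box):
    """True if the record's coordinates fall outside the bounding box."""
    min_lon, min_lat, max_lon, max_lat = box
    inside = (
        min_lon <= rec["decimalLongitude"] <= max_lon
        and min_lat <= rec["decimalLatitude"] <= max_lat
    )
    return not inside


def is_suspicious(rec, box=EXPECTED_RANGE, exempt=EXEMPT_BASIS):
    """Flag out-of-range records unless their basisOfRecord is exempt."""
    if rec.get("basisOfRecord") in exempt:
        return False
    return outside_box(rec, box)


# Two zoo animals at the same out-of-range location (invented data).
zoo_labelled_living = {
    "decimalLatitude": 1.4, "decimalLongitude": 103.8,
    "basisOfRecord": "LIVING_SPECIMEN",
}
zoo_labelled_preserved = {
    "decimalLatitude": 1.4, "decimalLongitude": 103.8,
    "basisOfRecord": "PRESERVED_SPECIMEN",  # mislabelled by the publisher
}
```

With these inputs the first record is skipped by the rule and the second is flagged, so the exemption only helps when the publisher has filled in basisOfRecord as expected.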

Currently rules don’t do anything on GBIF.org, so it is very much opt-in.

I don’t think there is an agreed technical definition of “data cleaning”, and the term “cleaning” is used quite a lot by the community. I hope people will be able to understand the intended purpose.

@jwaller, I am very surprised by your comment, and wonder what you mean by “technical”. Just two examples:

Data cleaning, also called data cleansing or data scrubbing, is the process of identifying and correcting errors and inconsistencies in raw data sets to improve data quality.

from IBM advice on data cleaning

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

from Tableau on data cleaning

Fixing and correcting are key elements of data cleaning in data science. Filtering out or discarding data and calling that “cleaning” seems to be a usage restricted to some, but not all, members of the biodiversity data community.
