Determining if Occurrences Have Been Deleted

I’m reviving an old thread of @MatDillen’s that went unanswered:

My situation is the same as in their post: I have a fairly large, periodic query that I’m running on the download request API. This sync is working well, and I’m able to filter to records that were modified after my last sync, which avoids running a huge query every time. However, the case of deleted records is worrying me.

It’s my impression that deleted records do not continue to exist with a “deleted” attribute of some sort, although I would love to be wrong! Is there any way to see, from a download, that a record has disappeared? If not, does anyone have any ideas for how to avoid downloading every record in a large query, every single time, just to check for deleted records?

Thanks in advance!

The only method I found was to set up predicate queries for all the records that were present in the previous download but are missing from the most recent one. Then you’ll get a result like https://doi.org/10.15468/dl.n6rrdm where you’ll only get 44,765 out of the queried batch of 100,000. The records missing from this download have effectively been deleted from the index and will only show a tombstone record on their /occurrence/[GBIFID] endpoint. The 44,765 that were found still exist, so they can be presumed to no longer match (possibly temporarily) your recurring query’s conditions.
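The bookkeeping for this approach can be sketched in a few lines of Python. This is only a sketch under assumptions: the function names are made up for illustration, the predicate layout follows the general `in` predicate shape of the download API, and the key name `GBIF_ID` is an assumption that should be checked against the current API documentation before use. The code only builds the predicates and does not submit anything.

```python
from typing import Iterable, Iterator


def missing_gbif_ids(previous: Iterable[int], current: Iterable[int]) -> list[int]:
    """Return gbifIDs seen in the previous sync but absent from the latest one."""
    return sorted(set(previous) - set(current))


def id_predicates(ids: list[int], batch_size: int = 100_000) -> Iterator[dict]:
    """Yield one 'in' predicate per batch of gbifIDs."""
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        yield {
            "type": "in",
            "key": "GBIF_ID",  # assumed key name, verify against the API docs
            "values": [str(gbif_id) for gbif_id in batch],
        }


def classify(missing: Iterable[int], found_in_followup: Iterable[int]) -> dict[str, list[int]]:
    """Split missing IDs by whether the follow-up download still returned them."""
    missing_set = set(missing)
    found = missing_set & set(found_in_followup)
    return {
        # Still in the index, so they just stopped matching the recurring query.
        "still_indexed": sorted(found),
        # Not returned at all, so presumably deleted (tombstoned).
        "deleted": sorted(missing_set - found),
    }
```

Each yielded predicate would go into the body of a separate download request. Once those downloads finish, `classify` separates the records that merely dropped out of the recurring query from the ones that are gone from the index.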

However, it is also possible for records to go missing intermittently, reappearing later with a new gbifID. When I ran my process over a year ago, I ran into the example of https://www.gbif.org/occurrence/5004364847 and https://www.gbif.org/occurrence/4910179219 . The latter is tombstoned and thus was no longer present in my most recent download. But a new record with the same data (catalog number, occurrence id, cetaf id…) had appeared in the most recent download.

My theory at the time was that this record had been omitted in one published version at the (BioCASe) source, and then re-added in a later one. I think this can break the process that preserves the link between source IDs and the integer gbifIDs.

It is not trivial to identify and troubleshoot these kinds of glitches because, as far as I know, GBIF does not preserve the GBIF-processed version of a deleted record, only the raw source data on the tombstone page. So you’ll have to do some mapping and converting yourself to enable pairwise comparison or clustering to flag these kinds of “duplicates”. It is also difficult because source record identifiers are diverse in protocol and not always stable themselves.
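As a rough illustration of such a pairwise comparison, the sketch below matches deleted records against newly appeared ones on a normalised tuple of source identifiers. It assumes both sets have already been mapped to plain dictionaries with a `gbifID` and Darwin Core style field names; the choice of fields, the lowercase/trim normalisation, and the function names are all assumptions for illustration, not something GBIF prescribes.

```python
from collections import defaultdict

# Assumed identifying fields; adjust to whatever the source actually provides.
KEY_FIELDS = ("institutionCode", "catalogNumber", "occurrenceID")


def source_key(record: dict) -> tuple:
    """Build a normalised comparison key from the source identifiers."""
    return tuple(str(record.get(field) or "").strip().lower() for field in KEY_FIELDS)


def find_reappeared(deleted: list[dict], new: list[dict]) -> list[tuple[int, int]]:
    """Return (old gbifID, new gbifID) pairs that share the same source key."""
    by_key = defaultdict(list)
    for record in new:
        key = source_key(record)
        if any(key):  # skip records with no usable identifiers at all
            by_key[key].append(record["gbifID"])

    pairs = []
    for record in deleted:
        for new_id in by_key.get(source_key(record), []):
            pairs.append((record["gbifID"], new_id))
    return pairs
```

Exact matching like this will miss records whose identifiers changed between versions, so it is only a first pass before any fuzzier clustering.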


Deleted records aren’t indexed, so they can’t be searched. @MatDillen’s solution is probably the easiest for you.

As Mathias mentioned, some records disappear but may reappear later. This is partly due to data providers changing their occurrenceID values. If you want to learn more on the topic, you can read this blog post or watch this video.

Would comparing parquet dumps be an option?