OccurrenceID stability (GBIF technical support hour for Nodes)

The video is available here: GBIF and occurenceID stability on Vimeo

Here is a transcript of the questions asked during the session. Please note that one question isn’t included here because we are still looking into solving the issue. The post will be updated later.

How did you choose the 2 to 3 month time limit you mentioned? (The question refers to the part of the presentation/video where we say that if a dataset has been paused because of an occurrenceID change for more than a few months, we let it through.)

We are still experimenting with the time threshold. We picked a duration (2 months) and are trying it out. If we get feedback that it is too short and doesn’t leave enough time for a response, we will extend it.

If I provide values in both catalogue number field and the occurrenceID field, what happens if I change the values in one of the fields?

The priority goes to the occurrenceID field. So if you have occurrenceIDs, you can change the catalogue numbers and the occurrences will keep the same GBIFID and URLs. If you change the occurrenceIDs, the GBIFIDs and occurrenceURLs will be changed, even if your catalogue numbers remain the same.
The catalogue number will only be taken into account for making GBIFIDs if there is no occurrenceID provided.
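
The precedence described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this answer, not GBIF’s actual implementation: the function name `identity_key` and the dictionary-based record are hypothetical, and the field names follow the Darwin Core terms `occurrenceID` and `catalogNumber`.

```python
from typing import Optional, Tuple


def identity_key(record: dict) -> Optional[Tuple[str, str]]:
    """Return the (field, value) pair that would identify a record.

    Hypothetical helper: occurrenceID wins whenever it is present;
    catalogNumber is only used as a fallback.
    """
    occurrence_id = (record.get("occurrenceID") or "").strip()
    if occurrence_id:
        return ("occurrenceID", occurrence_id)
    catalog_number = (record.get("catalogNumber") or "").strip()
    if catalog_number:
        return ("catalogNumber", catalog_number)
    return None  # nothing to build a stable identity from


# Changing only the catalogue number leaves the identity untouched.
before = {"occurrenceID": "urn:uuid:1234", "catalogNumber": "A-001"}
after = {"occurrenceID": "urn:uuid:1234", "catalogNumber": "B-77"}
same_identity = identity_key(before) == identity_key(after)
```

In this sketch, changing `occurrenceID` would produce a different key even if `catalogNumber` stayed the same, which mirrors the behaviour described in the answer.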

Note that if you decide to change catalogue numbers, you can always put the former catalogue number in the otherCatalogueNumbers field. This field is also used by the data clustering algorithm (see this blogpost for more information).

Our Node gets data from national aggregator(s) which don’t keep track of the identifiers. Do other Nodes encounter similar issues? How do you handle identifier stability?

Unfortunately, we couldn’t provide an answer for that question. If any Node is reading this, please add your comments.

If the publishers maintain globally unique identifiers, will the GBIFIDs be stable?

Yes. Note that the identifiers don’t have to be globally unique, as long as they are unique within the dataset. That being said, we have encountered more issues with datasets whose identifiers encode information. For example, our system caught many datasets where occurrenceIDs contained collection and institution codes: when the publisher updated the codes, all the occurrenceIDs changed.
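
For publishers who want to check this before republishing, here is a minimal Python sketch. It is an assumption about what a useful pre-publication check could look like, not a description of GBIF’s monitoring: the helper names `duplicate_ids` and `fraction_lost` and the example identifiers are hypothetical.

```python
from collections import Counter


def duplicate_ids(occurrence_ids):
    """Return the occurrenceIDs that appear more than once in a dataset."""
    counts = Counter(occurrence_ids)
    return sorted(i for i, n in counts.items() if n > 1)


def fraction_lost(previous_ids, new_ids):
    """Share of previously published occurrenceIDs missing from the new version."""
    previous = set(previous_ids)
    if not previous:
        return 0.0
    return len(previous - set(new_ids)) / len(previous)


# Identifiers that encode an institution code all change when the code changes.
previous = ["INST:COLL:1", "INST:COLL:2", "INST:COLL:3"]
renamed = ["INST2:COLL:1", "INST2:COLL:2", "INST2:COLL:3"]

lost = fraction_lost(previous, renamed)   # every old identifier is gone
dupes = duplicate_ids(["a", "b", "a"])    # "a" is not unique within the dataset
```

A high value from `fraction_lost` is a hint that identifiers were regenerated rather than that records were genuinely replaced.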

Does GBIF use a specific code for international waters?

We use ZZ (unknown) for international waters. If the coordinates fall within an Exclusive Economic Zone (EEZ), we recommend using the corresponding country code.

Would you advise to leave the country code field empty when coordinates are provided so GBIF can infer the country?

We run a number of automated checks that compare the coordinates provided with the country code provided. Providing both values makes error detection easier: we can at least flag records where the two disagree. If you provide only coordinates, we will infer and index the country from those coordinates, but we won’t be able to detect possible issues.
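
To illustrate why supplying both values helps, here is a small Python sketch of such a comparison. Everything in it is an assumption made for illustration: the bounding boxes are rough toy values rather than real borders, and `toy_country_lookup` and `check_country` are hypothetical names that do not reflect how GBIF actually infers countries.

```python
# Toy lookup: rough illustrative bounding boxes, NOT real borders.
# Each entry is (min_lat, max_lat, min_lon, max_lon).
TOY_BOXES = {
    "DK": (54.5, 57.8, 8.0, 12.7),
    "PT": (36.9, 42.2, -9.5, -6.2),
}


def toy_country_lookup(lat, lon):
    """Return a country code for the coordinates, or None if outside all boxes."""
    for code, (min_lat, max_lat, min_lon, max_lon) in TOY_BOXES.items():
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return code
    return None


def check_country(record, lookup=toy_country_lookup):
    """Compare the supplied country code with the one inferred from coordinates."""
    inferred = lookup(record["decimalLatitude"], record["decimalLongitude"])
    supplied = (record.get("countryCode") or "").upper() or None
    # A mismatch can only be detected when both values are available.
    mismatch = supplied is not None and inferred is not None and supplied != inferred
    return {"supplied": supplied, "inferred": inferred, "mismatch": mismatch}


both = check_country(
    {"decimalLatitude": 55.68, "decimalLongitude": 12.57, "countryCode": "PT"}
)
coords_only = check_country({"decimalLatitude": 55.68, "decimalLongitude": 12.57})
```

With both values, the disagreement is detectable (`both["mismatch"]` is true). With coordinates alone, the country can still be inferred, but there is nothing to compare it against.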

How can we teach publishers about those occurrenceID changes? Right now, it is a black box for a lot of publishers. Should we (Nodes) create and/or translate material and teach the publishers?

Here are the slides used for that presentation. You are very welcome to reuse and translate them. You are also very welcome to share and translate the video. In general, please contact us (helpdesk@gbif.org) if you would like to reuse and translate slides or presentations from previous Technical Support Hours for Nodes.

Is there any blogpost on the topic of occurrenceID stability?

We now have a blogpost published on the topic: GBIF attempts to improve identifier stability by monitoring changes of occurrenceIDs - GBIF Data Blog

On the forum (https://discourse.gbif.org), inactive threads are closed. Would it be possible to increase the threshold for inactivity so relevant topics remain open?

It is possible, but we need to know which threads to keep open. Please notify us at helpdesk@gbif.org.
