OccurrenceID stability (GBIF technical support hour for Nodes)

In the October issue of the support hour, the Data Products team will explain the current routine for improving the stability of occurrence URLs (IDs). We will go over what our system checks and what can be done when changes are detected. See this news item for some background. We will also answer any helpdesk questions, whether or not they relate to the topic of the presentation.

The next session is on October the 4th at 4pm CEST (UTC+2).

Where can I find details for that session?

@jegelewicz This is an event for Nodes and Node staff. They were sent the invitation and are free to invite external people as well.
The edited recording as well as a transcription of the Q&A will be made publicly available afterwards on this thread.

The video is available here: GBIF and occurenceID stability on Vimeo

Here is the transcript of the questions during the session. Please note that one question isn’t mentioned here because we are still looking into solving the issue. The post will be updated later.

How did you choose the 2-to-3-month time limit you mentioned? (The question refers to the part of the presentation/video where we say that if a dataset has been paused because of an occurrenceID change for more than a few months, we let it through.)

We are still experimenting with the time threshold. We picked a time (2 months) and are trying it out. If we get feedback that it is too short and doesn't leave enough time for a response, we will extend it.
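To make the threshold concrete, here is a minimal sketch of such a check. It is not GBIF's actual code: the function name `can_be_let_through` and the 60-day approximation of "2 months" are assumptions made for illustration only.

```python
from datetime import date, timedelta

# Assumption: "2 months" is approximated as 60 days for illustration.
PAUSE_LIMIT = timedelta(days=60)


def can_be_let_through(paused_on: date, today: date,
                       limit: timedelta = PAUSE_LIMIT) -> bool:
    """Return True once a dataset has been paused for longer than the limit."""
    return today - paused_on > limit
```

Keeping the limit as a parameter reflects the answer above: the value is still being tried out and may be extended.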

If I provide values in both catalogue number field and the occurrenceID field, what happens if I change the values in one of the fields?

The priority goes to the occurrenceID field. So if you have occurrenceIDs, you can change the catalogue numbers and the occurrences will keep the same GBIFIDs and URLs. If you change the occurrenceIDs, the GBIFIDs and occurrence URLs will change, even if your catalogue numbers remain the same.
The catalogue number will only be taken into account for making GBIFIDs if there is no occurrenceID provided.
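The priority rule can be sketched as follows. This is a simplified illustration, not GBIF's implementation: the function `identity_key` is hypothetical, and scoping the key by dataset is an assumption based on identifiers only needing to be unique within a dataset. The field names `occurrenceID` and `catalogNumber` are the Darwin Core terms.

```python
from typing import Optional, Tuple


def identity_key(dataset_key: str, record: dict) -> Optional[Tuple[str, str, str]]:
    """Pick the value that identifies a record (hypothetical helper).

    occurrenceID wins whenever it is present; catalogNumber is only
    considered when no occurrenceID is provided.
    """
    occurrence_id = (record.get("occurrenceID") or "").strip()
    if occurrence_id:
        return (dataset_key, "occurrenceID", occurrence_id)
    catalog_number = (record.get("catalogNumber") or "").strip()
    if catalog_number:
        return (dataset_key, "catalogNumber", catalog_number)
    return None  # nothing to build a stable key from
```

With this sketch, two versions of a record that share an occurrenceID produce the same key even if the catalogue number differs, while changing the occurrenceID produces a new key even if the catalogue number is unchanged.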

Note that if you decide to change catalogue numbers, you can always put the former catalogue number in the otherCatalogNumbers field. This field is also used in the context of the data clustering algorithm (see this blogpost for more information).
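As a publisher-side illustration of that advice, the sketch below renumbers a record while keeping the former catalogue number. The helper `renumber` is hypothetical, and joining multiple values with " | " is assumed here as the list separator.

```python
def renumber(record: dict, new_catalog_number: str) -> dict:
    """Return a copy of the record with a new catalogNumber, keeping the
    former value in otherCatalogNumbers (hypothetical helper)."""
    updated = dict(record)
    old = (record.get("catalogNumber") or "").strip()
    # Existing former numbers, if any, split on the assumed "|" separator.
    previous = [v.strip()
                for v in (record.get("otherCatalogNumbers") or "").split("|")
                if v.strip()]
    if old and old != new_catalog_number and old not in previous:
        previous.append(old)
    updated["catalogNumber"] = new_catalog_number
    updated["otherCatalogNumbers"] = " | ".join(previous)
    return updated
```

The occurrenceID is deliberately left untouched, which is what keeps the GBIFID and URL the same.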

Our Node gets data from national aggregator(s) which don’t keep track of the identifiers. Do other Nodes encounter similar issues? How do you handle identifier stability?

Unfortunately, we couldn’t provide an answer for that question. If any Node is reading this, please add your comments.

If the publishers maintain globally unique identifiers, will the GBIFIDs be stable?

Yes. Of course, the identifiers don’t have to be globally unique, as long as they are unique within the dataset. That being said, we have encountered more issues with datasets using identifiers that encode information. For example, our system caught many datasets where occurrenceIDs contained collection and institution codes, because when the publisher updated the codes, all the occurrenceIDs changed.
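The difference between an identifier that encodes information and an opaque one can be illustrated with a short sketch. Both helpers are hypothetical, and using a UUID is just one possible choice of opaque identifier, not a GBIF requirement.

```python
import uuid


def encoded_id(record: dict) -> str:
    """Identifier rebuilt from codes: it changes whenever a code changes."""
    return ":".join([record["institutionCode"],
                     record["collectionCode"],
                     record["catalogNumber"]])


def assign_opaque_id(record: dict) -> str:
    """Mint an identifier once, store it on the record, and never recompute it."""
    if not record.get("occurrenceID"):
        record["occurrenceID"] = f"urn:uuid:{uuid.uuid4()}"
    return record["occurrenceID"]


# Illustrative record; the codes are made up.
record = {"institutionCode": "ABC", "collectionCode": "Herb", "catalogNumber": "42"}
id_before = encoded_id(record)
opaque_before = assign_opaque_id(record)

record["institutionCode"] = "XYZ"  # the publisher updates its institution code

encoded_changed = encoded_id(record) != id_before           # the encoded ID is now different
opaque_stable = assign_opaque_id(record) == opaque_before   # the stored ID is unchanged
```

The key point is that the opaque identifier is stored with the record rather than derived from fields that may later be edited.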

Does GBIF use a specific code for international water?

We use ZZ (unknown) for international waters. If the coordinates fall into an EEZ, we recommend using the corresponding country code.

Would you advise to leave the country code field empty when coordinates are provided so GBIF can infer the country?

We run a number of automated checks where we compare the coordinates provided to the country code provided. Providing both values makes error detection a bit easier: we can at least flag the records with mismatched information. If you provide only coordinates, we will infer and index the country based on those coordinates, but we won’t be able to detect possible issues.
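A rough sketch of why both values help is shown below. It is not GBIF's actual check: `check_country` is a hypothetical helper, and `toy_lookup` is a crude bounding box standing in for a real coordinate-to-country lookup. The field names are the Darwin Core terms `decimalLatitude`, `decimalLongitude` and `countryCode`.

```python
from typing import Callable, Optional


def toy_lookup(lat: float, lon: float) -> Optional[str]:
    """Crude stand-in for a real reverse-geocoder (illustration only)."""
    if 54.5 <= lat <= 57.8 and 8.0 <= lon <= 13.0:
        return "DK"
    return None


def check_country(record: dict,
                  lookup: Callable[[float, float], Optional[str]]) -> dict:
    """Compare the supplied country code with the one inferred from coordinates."""
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    supplied = (record.get("countryCode") or "").strip().upper() or None
    inferred = lookup(lat, lon) if lat is not None and lon is not None else None
    if supplied and inferred:
        # Both values are available, so a disagreement can be flagged.
        return {"countryCode": supplied, "mismatch": supplied != inferred}
    if inferred:
        # Only coordinates: the country is inferred, nothing to compare against.
        return {"countryCode": inferred, "mismatch": None}
    return {"countryCode": supplied, "mismatch": None}
```

For example, a record with coordinates in Copenhagen and country code SE would be flagged as a mismatch, whereas the same record without a country code would simply be assigned DK with no flag.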

How can we teach publishers about those occurrenceID changes? Right now, it is a black box for a lot of publishers. Should we (Nodes) create and/or translate material and teach the publishers?

Here are the slides used for that presentation. You are very welcome to reuse and translate them. You are also very welcome to share and translate the video. In general, please contact us (helpdesk@gbif.org) if you would like to reuse and translate slides or presentations from previous Technical Support Hours for Nodes.

Is there any blogpost on the topic of occurrenceID stability?

We now have a blogpost published on the topic: GBIF attempts to improve identifier stability by monitoring changes of occurrenceIDs - GBIF Data Blog

On the forum (https://discourse.gbif.org), inactive threads are closed. Would it be possible to increase the threshold for inactivity so relevant topics remain open?

It is possible, but we need to know which threads to keep open. Please notify us at helpdesk@gbif.org.

Thanks for the post and recording; this really helps data providers get a deeper sense of the implications of altering occurrenceID at the source. Because gbifID is also tightly coupled to datasetKey, I wonder if this too needs to be stressed. I have seen many instances of deleted datasets being recreated, either by a different publisher or with a slightly altered branding/title, while their occurrenceIDs remain the same. Under these circumstances, the gbifID also changes. I’d also like to see the logic for creating/reviving gbifIDs spelled out in a white paper with flow logic diagrams. This too would help publishers better understand the implications of changing occurrenceIDs and/or hosting a dataset elsewhere, deleting it, or recreating it.
