How to publish data via the GBIF API (GBIF technical support hour for Nodes)

In this August session of the Technical Support Hour for Nodes, we will show you how to publish data on GBIF via the registry API. This is particularly relevant for data providers who wish to automate the publishing process. The Data Products team will go through everything you need to know to get started.
You can read this blog post for some background. We will be happy to answer any questions, whether related to the topic or not.

The event will be on the 7th of August 2024 at 4pm CEST (UTC+2). The invitation with registration link will be sent to the GBIF Nodes. If you are interested in attending, you can reach out to your local node.

The edited recording and the transcript of the questions will be made available here.


The video recording is available here: How to publish data on GBIF via the registry API on Vimeo

Here is the transcript of the Q&A:

What is the role of the participant nodes in the context of publishing data via the API?

Nodes were mentioned in the presentation in order to cover all the types of entities available on GBIF. In the presentation, we focused on publishing on behalf of one (or several) organizations. However, it is also possible for nodes to ask for permission to publish datasets via the API. In these cases, a node can publish data on behalf of the organizations that it endorses. This is, for example, the case for some of the Living Atlases publishing on GBIF.
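As a rough sketch of what API-based publishing looks like, the snippet below assembles the JSON body for registering a new occurrence dataset and prepares an authenticated POST to the registry. The organization and installation UUIDs, the credentials, and the dataset details are all placeholders; you would use the keys and the GBIF account obtained through the endorsement and registration process, and it is wise to try against the test environment (api.gbif-uat.org) first.

```python
import base64
import json
from urllib import request

API_BASE = "https://api.gbif.org/v1"  # use https://api.gbif-uat.org/v1 for testing
ORGANIZATION_KEY = "00000000-0000-0000-0000-000000000000"  # placeholder UUID
INSTALLATION_KEY = "11111111-1111-1111-1111-111111111111"  # placeholder UUID

def build_dataset_payload(title, description):
    """Assemble a minimal JSON body for registering an occurrence dataset."""
    return {
        "title": title,
        "description": description,
        "type": "OCCURRENCE",
        "language": "eng",
        "publishingOrganizationKey": ORGANIZATION_KEY,
        "installationKey": INSTALLATION_KEY,
    }

def build_register_request(payload, username, password):
    """Prepare (but do not send) the authenticated POST request."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return request.Request(
        f"{API_BASE}/dataset",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )

payload = build_dataset_payload("My automated dataset",
                                "Registered via the registry API")
req = build_register_request(payload, "your_gbif_user", "your_password")
print(req.get_method(), req.full_url)
# To actually register the dataset: response = request.urlopen(req)
```

The request is deliberately left unsent so the sketch can be run safely; uncommenting the final call would create a real dataset under the given organization.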

How can nodes ensure the quality of the data content published via the API?

The endorsement process is the same regardless of the mode of publishing (IPT, API, etc.). Publishers register via the online form, and the endorsement request is forwarded to the node manager. A publisher is only given permission to publish data on GBIF after it has been endorsed. Nodes have the option of using the API to automate the registration process, but most Nodes don’t.

Can you please go through how fields are mapped to various vocabularies?

Right now, we are transitioning from Java enums to the GBIF Vocabulary server. The interpretation also depends on the field being interpreted and on the vocabulary mapping. Which field did you have in mind?

The question was in the context of the work of the TDWG Biodiversity Data Quality Interest Group: if an occurrence record contains the value m in the sex field, does it get interpreted as male or mixed?

In this example, m is interpreted as male. You can see in the mapping we have here that m is mapped to male and not mixed. Values like m/f are mapped to mixed. Note that we created the concept of mixed because we found so many occurrences where both male and female individuals were mentioned in the same occurrence (for example, a jar in a museum). TDWG didn’t originally recommend a mixed value for the sex field. So by default, we assume that data providers mean male when they provide the value m.
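To make the behaviour concrete, here is a deliberately simplified lookup table, not GBIF's actual implementation, that mirrors the mapping described above: verbatim values are normalised and resolved against a list of known labels, with unmatched values left uninterpreted.

```python
# Illustrative only: a toy version of the sex vocabulary mapping.
# The real mapping lives on the GBIF Vocabulary server and covers
# many more verbatim labels.
SEX_CONCEPTS = {
    "m": "male",
    "male": "male",
    "f": "female",
    "female": "female",
    "m/f": "mixed",
    "male/female": "mixed",
}

def interpret_sex(raw_value):
    """Normalise the verbatim value and look it up; return None if unmatched."""
    key = raw_value.strip().lower()
    return SEX_CONCEPTS.get(key)

print(interpret_sex("M"))    # male, not mixed
print(interpret_sex("m/f"))  # mixed
print(interpret_sex("??"))   # None: left uninterpreted
```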

Note that we have a session on the topic of vocabularies at the TDWG/SPNHC 2024 conference. See the description here. In addition to that, the October 2024 session of the technical support hour for nodes will be on the topic of vocabularies.

How does the vocabulary concept mapping handle multiple languages?

Vocabularies need to be translated explicitly. For some vocabularies, like the GeoTime vocabulary, many languages are integrated, but this isn’t the case for most of the other vocabularies.

Is there also an API system for publishing metadata to link to a dataset published via the API?

Some metadata can be posted directly via the API, but most metadata is inside the DwC archive, in the EML file, and is ingested along with the data (occurrence, taxon, event, etc.).
In other words, if there is an EML file in your DwC Archive, it will overwrite the metadata provided via the API.

Note that if you are crafting your own archive, you can go on a test IPT and enter metadata manually in the interface. Then you can download the EML file generated by the IPT and use it as a template for future EML files. It can be really helpful to have a handful of generated examples when crafting EML files.

One of our publishers has migrated to Specify 7, which, I understand, can generate Darwin Core Archives. Can publishers use the GBIF API to share data from their Specify installation directly on GBIF?

As far as I know, Specify 7 generates and puts DwC archives online but doesn’t use the GBIF API to publish them. If the datasets are already published on GBIF, the endpoint can be updated or replaced so the data come from the Specify-generated archive instead of the IPT. You can change an endpoint by using the GBIF API, or you can email helpdesk@gbif.org and we can do it for you.

You might want to consider deleting the datasets from the IPT, as publishing updates in the IPT will switch the endpoint back to the IPT one. You can also create and publish new datasets via the API and use the Specify endpoints.
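Swapping a dataset's endpoint boils down to deleting the old (IPT) endpoint and posting a new DWC_ARCHIVE endpoint pointing at the Specify-generated archive. The sketch below prepares both requests against the registry API without sending them; the dataset UUID, endpoint key, archive URL, and credentials are placeholders you would replace with your own.

```python
import base64
import json
from urllib import request

API_BASE = "https://api.gbif.org/v1"
DATASET_KEY = "22222222-2222-2222-2222-222222222222"  # placeholder dataset UUID

def auth_header(username, password):
    """Basic auth header for a GBIF account with editor rights."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

def delete_endpoint_request(endpoint_key, username, password):
    """DELETE the existing (e.g. IPT) endpoint of the dataset."""
    return request.Request(
        f"{API_BASE}/dataset/{DATASET_KEY}/endpoint/{endpoint_key}",
        method="DELETE",
        headers=auth_header(username, password),
    )

def add_endpoint_request(archive_url, username, password):
    """POST a new DWC_ARCHIVE endpoint pointing at the new archive."""
    body = {"type": "DWC_ARCHIVE", "url": archive_url}
    return request.Request(
        f"{API_BASE}/dataset/{DATASET_KEY}/endpoint",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json",
                 **auth_header(username, password)},
    )

new_req = add_endpoint_request("https://example.org/specify/dwca.zip",
                               "user", "password")
print(new_req.get_method(), new_req.full_url)
# To apply: request.urlopen(delete_endpoint_request(...)); request.urlopen(new_req)
```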

Is there any way that we can contribute to a particular vocabulary (e.g. eventType)?

Yes, please contact us.

We (nodes) sometimes get data from GBIF for our own reports. We never cite the data as these reports are for internal use. As such, they don’t appear in citation metrics. Does anyone cite data for internal reports? Other Nodes? The GBIF Secretariat?

In general, citing data is good practice, and publishers appreciate the citations showing on their publisher pages. It might be less relevant for internal documents which don’t have DOIs. We don’t have specific provisions for this in our citation guidelines: Citation guidelines.

Could you make your slides from today available?

Yes you can access them here: Box. In general, if you would like the slides from any of the presentations in the Technical Support Hour for Nodes, please let us know.

As a reminder, all the previous edited recordings are available here: Technical support hour for GBIF Nodes on Vimeo

I submitted an abstract for the TDWG/SPNHC 2024 conference about occurrenceIDs (titled “What matters for an occurrenceID and what is an occurrenceID that matters?”, SYM25). A related question would be: why would you want occurrenceIDs to be stable?

We had a session on occurrenceID stability: GBIF and occurenceID stability on Vimeo, and a related blog post: GBIF attempts to improve identifier stability by monitoring changes of occurrenceIDs - GBIF Data Blog. Briefly, occurrenceIDs are used to identify the same occurrences between updates. This allows our system to identify previously existing occurrences and update them (as opposed to creating new URLs for these occurrences).

The problem that we have is that most of the datasets our node receives are not occurrence datasets. They are usually a combination of events and counts. I transform these into occurrences. When we get updates for these datasets, I have to recreate everything again. In this context, it becomes very difficult to use and maintain meaningless identifiers.

The main issue with meaningful identifiers is that when the information encoded in the identifier changes, the identifier also changes.

You can also think about what to keep stable and what can change. For example, in the context of digitised specimens, it might be particularly important that a record remains stable so people can refer to it. In the context of monitoring, it would be particularly relevant to keep the events stable, but the details of each occurrence might not be as important (especially if the occurrences are aggregated on a grid cell).

The work on improving occurrence stability was prompted by the community giving feedback on specific records or citing specific specimen records. In this context, if the occurrence disappears, it becomes challenging to keep track of the information associated with it. In some contexts it makes more sense than in others. If there is no way to know from one version to the next which occurrences are the same, then the occurrences should get new identifiers.
Stability is of course the goal, but sometimes it isn’t possible.

Our node advises publishers to use meaningful occurrenceIDs. The problem with meaningless occurrenceIDs is that they are difficult to use. Many publishers accidentally create new UUIDs in their CMS. There is a better chance of connecting occurrences from a previous version if you have meaningful occurrenceIDs. Also, small publishers that use Excel to manage their data have an easier time creating meaningful IDs than UUIDs.
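For the regeneration problem described above, one possible middle ground between opaque UUIDs and fully meaningful IDs is to derive the occurrenceID deterministically from fields that identify the record, so rebuilding occurrences from the same events and counts reproduces the same IDs. This is just an illustrative sketch: the choice of fields (here an eventID and a scientific name) and the prefix are assumptions, and you would pick whatever is genuinely stable in your source data.

```python
import hashlib

def stable_occurrence_id(event_id, scientific_name, prefix="mydataset"):
    """Hash the identifying fields into a short, reproducible identifier.

    Normalising the inputs (strip, lowercase) keeps trivial formatting
    changes in the source data from producing a new identifier.
    """
    raw = f"{event_id.strip()}|{scientific_name.strip().lower()}"
    digest = hashlib.sha1(raw.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}:occ:{digest}"

occ_id = stable_occurrence_id("event-2023-001", "Parus major")
print(occ_id)
# Regenerating the occurrence from the same source row yields the same ID:
assert occ_id == stable_occurrence_id("event-2023-001", "Parus major")
```

The trade-off is that if the identifying fields themselves change (a taxonomic re-determination, say), the ID changes too, which is exactly the weakness of meaningful identifiers noted above.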