GBIF's vocabulary server (GBIF technical support hour for nodes)

Join us for the next session of the technical support hour for GBIF nodes on October 2nd, 2024, at 4 pm CEST (UTC+2). The topic is GBIF’s vocabulary server. Please note there will be no support hour in September due to the SPNHC-TDWG conference.

The GBIF Secretariat is transitioning its interpretation pipeline for controlled occurrence core fields from Java enums to a more dynamic vocabulary server. This change will facilitate the translation of vocabulary concepts into multiple languages and shift the management of vocabularies from a rigid, hard-coded system to a flexible, community-driven platform. During this session, we will give an overview of the role and benefits of vocabularies within the GBIF framework.

We will be happy to answer any questions, whether or not they relate to the topic. Please feel free to post questions in advance in this thread or write to helpdesk@gbif.org.

Here is the video recording for the session: GBIF Vocabulary server on Vimeo

Here is the transcript of the Q&A:

Would the controlled vocabulary for dataset category (Add category to dataset · Issue #247 · gbif/registry · GitHub) be redundant with EML categories?

There are no dataset type categories defined in the EML schema and no field to share category information in the EML. To learn more, please check: DatasetType in the eml profile? · Issue #11 · gbif/eml-profile · GitHub

Our node was wondering how to convey that eDNA datasets are eDNA datasets on GBIF and thinking about EML keywords and other options?

We don’t yet have clear guidelines on how to do it. For now, it would be great if you could keep track of these datasets in a spreadsheet or similar (until we have a better way of tagging them). It also helps us to have examples and use cases for this work on categorisation.

It is similar to a question we had some time ago about how to convey/specify datasets published in the context of projects that are not GBIF funded. Do you have any recommendation?

Here is what we currently have:

  • You can always use the project data fields of the dataset metadata. These fields can be used to emphasise any project (not just GBIF-funded ones).
    • The project ID field is indexed, and datasets, occurrences and citations can be searched by projectID.
    • Since IPT version 3.1.0 (which is the current latest version), more than one project can be associated with a dataset (multiple “related projects” can be added).
  • You can use the datasetID and datasetName fields, which are multi-value fields for occurrences. These fields are indexed and can be searched both via the API and via the interface. They aren’t exactly meant for projects but can be used for selecting and aggregating a group of occurrences.

We are working on making projectIDs multi-valued and sharable at the occurrence level (ProjectIDs on individual records, rather than a dataset as a whole · Issue #836 · gbif/pipelines · GitHub), but some challenges remain, so it isn’t possible yet.
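As a rough illustration of the search options listed above, here is a minimal Python sketch that builds occurrence search URLs. It assumes the public occurrence search endpoint and its `projectId`, `datasetId` and `datasetName` parameters; the helper name and the example values are hypothetical placeholders, so please check the API documentation before relying on it.

```python
from urllib.parse import urlencode

API = "https://api.gbif.org/v1"


def occurrence_search_url(**filters):
    """Build an occurrence search URL from keyword filters (hypothetical helper)."""
    # Drop unset filters and sort so the parameter order is stable.
    params = sorted((k, v) for k, v in filters.items() if v is not None)
    return f"{API}/occurrence/search?{urlencode(params)}"


# Placeholder values, not real project or dataset identifiers.
print(occurrence_search_url(projectId="EXAMPLE-PROJECT-1", limit=5))
print(occurrence_search_url(datasetName="Example survey 2024"))
```

The same pattern should work for selecting a group of occurrences by datasetID, by passing `datasetId=...` instead.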

Are dataset keywords currently indexed by GBIF?

Datasets can be searched by keywords with the API, see for example: https://api.gbif.org/v1/dataset/search?keyword=Wildlfe%20Disease (and the relevant documentation here: Registry API :: Technical Documentation). However, occurrences aren’t indexed based on keywords associated with datasets.
For dataset categories to be helpful, they would have to be used to filter occurrences.
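For anyone who wants to script the dataset keyword search, here is a minimal sketch using only the Python standard library. The function names are hypothetical, the keyword value is only an example, and the code makes no assumption about the shape of the JSON that comes back.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

DATASET_SEARCH = "https://api.gbif.org/v1/dataset/search"


def dataset_keyword_url(keyword):
    """Return the dataset search URL for one keyword (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20, as in the URL quoted above.
    return f"{DATASET_SEARCH}?{urlencode({'keyword': keyword}, quote_via=quote)}"


def fetch_json(url, timeout=30):
    """Download a URL and decode the body as JSON (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(dataset_keyword_url("Wildlife Disease"))
# To run the search: data = fetch_json(dataset_keyword_url("Wildlife Disease"))
```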

How will the translation of labels be taken care of? Can the community of translators help with that?

Yes, the community can help with translations. Right now, the vocabulary server isn’t integrated with Crowdin, so the only way to add translations to the vocabularies is through spreadsheets, which we have to upload to the vocabulary server.
Right now, the main example of multilingual vocabularies in use is the GRSciColl portal, where you can see Spanish concept translations when available. See this example: Datos - GRSciColl
If you would like to contribute to the vocabularies translation, please email vocabularies@gbif.org. Thank you!

Is there an interface to browse vocabularies?

Yes, you can consult it here: GBIF Registry

Is there a way to look up concepts?

Yes, you can find our lookup here: Vocabulary API :: Technical Documentation. The lookup will give you a concept for any value you enter, based on our mapping of GBIF values. This can be used to programmatically clean up (normalise) data. For example, if you look up the value m,f in the sex vocabulary, it will return the concept mixed: https://api.gbif.org/v1/vocabularies/Sex/concepts/lookup?q=m,f
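If you would like to call the lookup from a script, a minimal Python sketch could look like the following. The URL pattern is taken from the example above; the function name is hypothetical, and the code only builds the URL, so it assumes nothing about the structure of the response.

```python
from urllib.parse import quote

VOCABULARIES = "https://api.gbif.org/v1/vocabularies"


def concept_lookup_url(vocabulary, value):
    """Build the concept lookup URL for a raw value (hypothetical helper)."""
    # safe="," leaves the comma unescaped, matching the URL quoted above.
    query = quote(value, safe=",")
    return f"{VOCABULARIES}/{quote(vocabulary)}/concepts/lookup?q={query}"


print(concept_lookup_url("Sex", "m,f"))
```

The resulting URL can be opened in a browser or passed to any HTTP client to retrieve the matching concept.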
Here are two other examples with values from the life stage vocabulary:

Are the concepts of controlled vocabularies mandatory values (implemented in the IPT as drop-down values) or more guidelines for best practices?

Right now, the vocabulary concepts are suggestions. The IPT drop-down values are handled here: GBIF Resources. The vocabularies in the GBIF vocabulary server aren’t used in the IPT drop-downs. We don’t know yet if the concepts in vocabularies from the vocabulary server will become drop-down values in the IPT. Right now, the vocabulary API is available for anyone interested in accessing and using the vocabulary values programmatically.

Note that we aren’t trying to invent new things. We are aggregating different external sources to build vocabularies that make sense in the GBIF context. Many good external sources can be very group-specific. For example, there might be very good vocabularies for marine life, but since GBIF doesn’t only have marine species, we need to combine them with other sources.
We have a “sameAsURI” field so we can link concepts to their original sources. If you are aware of sources that could be used to build the GBIF vocabularies, please let us know by emailing vocabularies@gbif.org or adding a GitHub issue to GitHub · Where software is built.

If you are interested in the eventType vocabulary, please let us know, we have monthly calls to work on it. It is part of supporting the Humboldt Extension.

If you are interested in seeing a vocabulary for a specific field, please let us know. Here are some current suggestions: Suggestions for vocabularies to do: waterBody, island and island group and sampleSizeUnit and organismQuantityType · Issue #145 · gbif/vocabulary · GitHub

I’m interested in term use for AI-suggested species identification (e.g. image recognition or audio AI recognition) where the species ID is probabilistic, i.e. 70% likely. Is that type of data being handled by GBIF?

In the Darwin Core Standard, there doesn’t seem to be any specific field dedicated to sharing the uncertainty of a given identification based on eDNA or image recognition. For example, identificationRemarks is used in the context of eDNA-based identification here: Occurrence Detail 4850142066. The identificationVerificationStatus term (Darwin Core Quick Reference Guide - Darwin Core) can be used (perhaps the definition could be modified to accommodate other types of records, not just specimen identification). It might be worth opening an issue for the Darwin Core maintenance group.
There is also the identification history extension: https://rs.gbif.org/extension/identification_history_2024-02-19.xml.
Perhaps this is a way to share that information?

In the EML.xml file, we couldn’t find where the dataset type is defined?

The type of a dataset is explicitly declared when the dataset is registered on GBIF. This is what the IPT does, and there is a specific field in the API (type) used to convey the dataset type. So the only way to be certain of a dataset type is to see under which type it is registered on GBIF.org. In practice, you can guess a dataset’s type by looking at the core used for the mapping. This information is available in the meta.xml file.
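To illustrate the meta.xml approach, here is a minimal Python sketch that reads the core rowType of a Darwin Core Archive descriptor and maps it to a likely dataset type. The mapping table, the function name and the sample descriptor are assumptions made for illustration; as noted above, the type registered on GBIF.org remains the only authoritative answer.

```python
import xml.etree.ElementTree as ET

# Namespace of the Darwin Core text guide used by meta.xml.
NS = {"dwca": "http://rs.tdwg.org/dwc/text/"}

# Assumed mapping from core rowType to the likely registered dataset type.
CORE_TO_TYPE = {
    "http://rs.tdwg.org/dwc/terms/Occurrence": "OCCURRENCE",
    "http://rs.tdwg.org/dwc/terms/Taxon": "CHECKLIST",
    "http://rs.tdwg.org/dwc/terms/Event": "SAMPLING_EVENT",
}


def guess_dataset_type(meta_xml):
    """Guess a dataset type from meta.xml content; None if it cannot be guessed."""
    root = ET.fromstring(meta_xml)
    core = root.find("dwca:core", NS)
    if core is None:
        return None
    return CORE_TO_TYPE.get(core.get("rowType"))


# Illustrative descriptor, not taken from a real dataset.
SAMPLE_META = """<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence" encoding="UTF-8">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
  </core>
</archive>"""

print(guess_dataset_type(SAMPLE_META))
```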

When will the EML 2.2.0 changes be implemented in the IPT?

They are now available in the latest IPT version (as I write this transcript).

Could one of the future sessions be about CameraTrap and other DP in the IPT3?

This is somewhat covered in the video recording of the December 2023 Technical support for Nodes: New features of the Integrated Publishing Toolkit version 3.0 (IPT3). Please watch it and let us know if this already answers your question. We can always have a session with more details.

There is a suggestion to add new terms to capture more information on project-related data in DwC:

Please support the idea and comment if it would be helpful for your work.