GBIF's vocabulary server (GBIF technical support hour for nodes)

cecsve · August 26, 2024, 12:47pm

Join us for the next session of the technical support hour for GBIF nodes on October 2nd, 2024, at 4 pm CEST (UTC+2), where the topic is GBIF’s vocabulary server. Please note there will be no support hour in September due to the SPNHC-TDWG conference.

The GBIF Secretariat is transitioning its interpretation pipeline for controlled occurrence core fields from Java enums to a more dynamic vocabulary server. This change will facilitate the translation of vocabulary concepts into multiple languages and shift the management of vocabularies from a rigid, hard-coded system to a flexible, community-driven platform. During this session, we will give an overview of the role and benefits of vocabularies within the GBIF framework.

We will be happy to answer any questions, whether or not they relate to the topic. Please feel free to post questions in advance in this thread or write to helpdesk@gbif.org.

mgrosjean · October 25, 2024, 1:42pm

Here is the video recording for the session: GBIF Vocabulary server on Vimeo

Here is the transcript of the Q&A:

Would the controlled vocabulary for dataset category (Add category to dataset · Issue #247 · gbif/registry · GitHub) be redundant with EML categories?

There are no dataset type categories defined in the EML schema and no field to share the category information in the EML. To learn more, please check: DatasetType in the eml profile? · Issue #11 · gbif/eml-profile · GitHub

Our node was wondering how to convey that eDNA datasets are eDNA datasets on GBIF and thinking about EML keywords and other options?

We don’t yet have clear guidelines on how to do it. For now, it would be great if you could keep track of these datasets in a spreadsheet or similar (until we have a better way of tagging them). Having examples and use cases also helps us with this work on categorisation.

It is similar to a question we had some time ago about how to convey/specify datasets published in the context of projects that are not GBIF funded. Do you have any recommendation?

Here is what we currently have:

You can always use the project data fields of the dataset metadata. These fields can be used to emphasise any project (not just GBIF-funded ones).
- The project ID field is indexed, and datasets, occurrences and citations can be searched by projectID.
- Since IPT version 3.1.0 (currently the latest version), more than one project can be associated with a dataset (multiple “related projects” can be added).
You can use the datasetID and the datasetName fields, which are multi-value fields for occurrences. These fields are indexed and can be searched both via the API and via the interface. They aren’t exactly meant for projects but can be used for selecting and aggregating a group of occurrences.
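As a rough illustration of the searches described above, here is a minimal Python sketch that builds search URLs for the GBIF API. The parameter names `projectId`, `datasetId` and `datasetName`, as well as the project identifier `MY-PROJECT-001`, are assumptions made for this example; please check them against the API documentation before relying on them.

```python
from urllib.parse import urlencode

API = "https://api.gbif.org/v1"


def search_url(resource: str, **filters: str) -> str:
    """Build a GBIF search URL for a resource such as 'occurrence' or 'dataset'.

    The filter names are passed through unchanged, so they must match the
    parameter names accepted by the API (assumed here, not verified).
    """
    return f"{API}/{resource}/search?{urlencode(filters)}"


# "MY-PROJECT-001" is a hypothetical project identifier used for illustration.
occurrences_by_project = search_url("occurrence", projectId="MY-PROJECT-001")
datasets_by_project = search_url("dataset", projectId="MY-PROJECT-001")

# Grouping occurrences by datasetName instead of by project.
occurrences_by_name = search_url("occurrence", datasetName="Arctic Deep")

print(occurrences_by_project)
print(datasets_by_project)
print(occurrences_by_name)
```

The URLs can be opened in a browser or passed to any HTTP client.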

We are working on making projectIDs multi-valued and sharable at the occurrence level (ProjectIDs on individual records, rather than a dataset as a whole · Issue #836 · gbif/pipelines · GitHub), but some challenges remain, so it isn’t possible yet.

Are dataset keywords currently indexed by GBIF?

Datasets can be searched by keywords with the API, see for example: https://api.gbif.org/v1/dataset/search?keyword=Wildlfe%20Disease (and the relevant documentation here: Registry API :: Technical Documentation). However, occurrences aren’t indexed based on keywords associated with datasets.
For dataset categories to be helpful, they would have to be used to filter occurrences.
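For anyone who wants to script the keyword search, here is a minimal Python sketch using only the standard library. It assumes that the search response is a JSON object with a `results` list whose items carry `key` and `title` fields; the function names are made up for this example.

```python
import json
from typing import List, Tuple
from urllib.parse import quote
from urllib.request import urlopen


def keyword_search_url(keyword: str) -> str:
    """Build the dataset search URL for one keyword."""
    return "https://api.gbif.org/v1/dataset/search?keyword=" + quote(keyword)


def dataset_titles(page: dict) -> List[Tuple[str, str]]:
    """Extract (key, title) pairs from one page of search results.

    ASSUMPTION: the page has a "results" list of objects with "key" and "title".
    """
    return [(d.get("key", ""), d.get("title", "")) for d in page.get("results", [])]


def search_datasets(keyword: str) -> List[Tuple[str, str]]:
    """Fetch the first page of datasets tagged with the keyword (needs network access)."""
    with urlopen(keyword_search_url(keyword), timeout=30) as response:
        return dataset_titles(json.load(response))
```

Calling `search_datasets("Wildlife Disease")` would return the first page of matching datasets. Note that this only finds datasets, not the occurrences they contain.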

How will the translation of labels be taken care of? Can the community of translators help with that?

Yes, the community can help with translations. Right now, the vocabulary server isn’t integrated with Crowdin, so the only way to add translations to the vocabularies is through spreadsheets, which we have to upload to the vocabulary server.
Right now, the main example of multilingual vocabularies in use is the GRSciColl portal, where you can see Spanish concept translations when available. See this example: Datos - GRSciColl
If you would like to contribute to the vocabularies translation, please email vocabularies@gbif.org. Thank you!

Is there an interface to browse vocabularies?

Yes, you can consult it here: GBIF Registry

Is there a way to look up concepts?

Yes, you can find our lookup here: Vocabulary API :: Technical Documentation. The lookup will give you a concept for any value you enter, based on our mapping of GBIF values. This can be used to programmatically clean up (normalise) data. For example, if you look for the value m,f in the sex vocabulary, it will return the concept mixed: https://api.gbif.org/v1/vocabularies/Sex/concepts/lookup?q=m,f
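To show how the lookup could be used to normalise a column of raw values, here is a minimal Python sketch. The URL pattern is the one given above. The shape of the response (a JSON list of objects with a `conceptName` key) is an assumption, and the function names are made up for this example.

```python
import json
from typing import Callable, Dict, Iterable, Optional
from urllib.parse import quote, urlencode
from urllib.request import urlopen

API = "https://api.gbif.org/v1/vocabularies"


def lookup_url(vocabulary: str, value: str) -> str:
    """Build the lookup URL for one raw value in one vocabulary."""
    return f"{API}/{quote(vocabulary)}/concepts/lookup?{urlencode({'q': value})}"


def fetch_concept(vocabulary: str, value: str) -> Optional[str]:
    """Call the lookup endpoint and return the first concept name, if any.

    ASSUMPTION: the response is a JSON list of objects that carry a
    "conceptName" key. Inspect a real response before relying on this.
    """
    with urlopen(lookup_url(vocabulary, value), timeout=30) as response:
        matches = json.load(response)
    return matches[0].get("conceptName") if matches else None


def normalise(
    values: Iterable[str], lookup: Callable[[str], Optional[str]]
) -> Dict[str, Optional[str]]:
    """Map each distinct raw value to a concept, calling `lookup` once per value."""
    return {value: lookup(value) for value in dict.fromkeys(values)}
```

For example, `normalise(raw_values, lambda v: fetch_concept("Sex", v))` would query the sex vocabulary once for each distinct value. Values without a match map to `None`, so they can be reviewed by hand.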
Here are two other examples with values from the life stage vocabulary:

Are the concepts of controlled vocabularies mandatory values (implemented in the IPT as drop-down values) or more guidelines for best practices?

Right now, the vocabulary concepts are suggestions. The IPT drop-down values are handled here: GBIF Resources. The vocabularies in the GBIF vocabulary server aren’t used in the IPT drop-downs. We don’t know yet if the concepts in vocabularies from the vocabulary server will become drop-down values in the IPT. Right now, the vocabulary API is available for anyone interested in accessing and using the vocabulary values programmatically.

Note that we aren’t trying to invent new things. We are aggregating different external sources to build vocabularies that make sense in the GBIF context. Many good external sources can be very group-specific. For example, there might be very good vocabularies for marine life, but since GBIF doesn’t only have marine species, we need to combine them with other sources.
We have a “sameAsURI” field so we can link concepts to their original sources. If you are aware of sources that could be used to build the GBIF vocabularies, please let us know by emailing vocabularies@gbif.org or adding a GitHub issue to GitHub · Where software is built.

If you are interested in the eventType vocabulary, please let us know, we have monthly calls to work on it. It is part of supporting the Humboldt Extension.

If you are interested in seeing a vocabulary for a specific field, please let us know. Here are some current suggestions: Suggestions for vocabularies to do: waterBody, island and island group and sampleSizeUnit and organismQuantityType · Issue #145 · gbif/vocabulary · GitHub

I’m interested in term use for AI suggested species identification (e.g. image recognition or audio AI recognition) where species ID is probabilistic ie. 70% likely. Is that type of data being handled by GBIF?

In the Darwin Core Standard, there doesn’t seem to be any specific field dedicated to sharing the uncertainty of a given identification based on eDNA or image recognition. For example, the identificationRemarks field is used in the context of eDNA-based identification here: Occurrence Detail 4850142066. The identificationVerificationStatus field (Darwin Core Quick Reference Guide - Darwin Core) can be used (perhaps the definition could be modified to accommodate other types of records, not just specimen identification). It might be worth opening an issue for the Darwin Core maintenance group.
There is also the identification history extension: https://rs.gbif.org/extension/identification_history_2024-02-19.xml.
Perhaps this is a way to share that information?

In the EML.xml file, we couldn’t find where the dataset type is defined?

The type of a dataset is explicitly declared when the dataset is registered on GBIF. This is what the IPT does, and there is a specific field in the API (type) used to convey the dataset type. So the only way to be certain of a dataset type is to see under which type it is registered on GBIF.org. In practice, you can guess at a dataset’s type by looking at the core used for the mapping. This information would be available in the meta.xml file.
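As an illustration of guessing the type from the core, here is a minimal Python sketch that reads the core rowType from a meta.xml file. The mapping from rowType to dataset type is an assumption made for this example, and the result is only a guess: the type registered on GBIF.org remains the authoritative one.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace used by Darwin Core Archive descriptor files (meta.xml).
DWCA_NS = "{http://rs.tdwg.org/dwc/text/}"

# ASSUMPTION: illustrative mapping from core rowType to likely dataset type.
CORE_TO_TYPE = {
    "http://rs.tdwg.org/dwc/terms/Occurrence": "OCCURRENCE",
    "http://rs.tdwg.org/dwc/terms/Taxon": "CHECKLIST",
    "http://rs.tdwg.org/dwc/terms/Event": "SAMPLING_EVENT",
}


def guess_dataset_type(meta_xml: str) -> Optional[str]:
    """Guess the dataset type from the <core rowType="..."> element of meta.xml."""
    root = ET.fromstring(meta_xml)
    core = root.find(f"{DWCA_NS}core")
    if core is None:
        core = root.find("core")  # tolerate files written without the namespace
    if core is None:
        return None
    return CORE_TO_TYPE.get(core.get("rowType"))


# A trimmed-down, made-up meta.xml for demonstration.
SAMPLE = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Event">
    <files><location>event.txt</location></files>
  </core>
</archive>"""

print(guess_dataset_type(SAMPLE))
```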

When will the EML 2.2.0 changes be implemented in the IPT?

It is now available in the latest IPT version (as I write this transcript).

Could one of the future sessions be about CameraTrap and other DP in the IPT3?

This is somewhat covered in the video recording of the December 2023 Technical support for Nodes: New features of the Integrated Publishing Toolkit version 3.0 (IPT3). Please watch it and let us know if this already answers your question. We can always have a session with more details.

cecsve · November 13, 2024, 8:03am

mgrosjean:

It is similar to a question we had some time ago about how to convey/specify datasets published in the context of projects that are not GBIF funded. Do you have any recommendation?


There is a suggestion to add new terms to capture more information on project-related data in DwC:

github.com/tdwg/dwc

New Terms - projectTitle; projectID; fundingBodyName; fundingBodyID

opened 07:59AM - 28 Oct 24 UTC

aaltenburger2

Term - add

New terms

Submitter: Andreas Altenburger (GBIF Norway)

Efficacy Justification: I work at a university museum that publishes its collections on GBIF as datasets. We constantly receive requests from contributors to the museum collection, asking to be able to track "their" contributions at the record level on GBIF. This relates to private funders such as Ocean Census (https://oceancensus.org/) or the Mohn Foundation (https://mohnfoundation.no/), governmental funding from Artsdatabanken or the Research Council of Norway, and institutional internal funding. We need to be able to attribute the specimens to their respective projects and funders.

Demand Justification: Record-level attribution has been requested several times previously. See discussions https://github.com/tdwg/dwc-qa/issues/37 https://github.com/tdwg/dwc-qa/issues/83 https://github.com/tdwg/dwc-qa/issues/100 https://github.com/gbif/pipelines/issues/836 https://github.com/gbif/ipt/issues/1780 for more details.

Stability Justification: New terms for record-level attribution are unlikely to negatively impact existing implementations because these terms would be additional, optional fields that enhance the granularity of data attribution without altering existing data structures. Current users and systems can continue to operate without adopting these new terms immediately, allowing for a gradual integration. Moreover, these terms are designed to be backward-compatible, ensuring that they do not disrupt existing workflows or data integrity. The addition of these terms would primarily provide a means for more detailed tracking and reporting, which is a growing requirement from funders and contributors. This enhancement would improve the transparency and traceability of data contributions without imposing changes on those who do not require this level of detail.

Implications for dwciri: The introduction of the proposed new terms - projectTitle, projectID, fundingBodyName, and fundingBodyID - does not necessitate changes to existing dwciri term versions. The new terms would be added as properties within the Darwin Core namespace but would not alter the definitions or functionalities of existing dwciri terms. They are designed to be complementary and to integrate seamlessly with the current structure, ensuring that they do not disrupt existing implementations. These additions would simply expand the capability of the Darwin Core standard to convey more detailed project and funding information, which is increasingly important for data transparency and traceability in biodiversity research.

Proposed attributes of the new term:

Term names: projectTitle; projectID; fundingBodyName; fundingBodyID
Term labels: Project Title; Project ID; Funding Body Name; Funding Body ID
Organized in Class: Occurrence

Definition of the terms (normative):
- projectTitle: The title or name of the project under which the data was collected or the specimen was acquired.
- projectID: A list (concatenated and separated) of unique identifiers for the project(s) that contributed to the original dwc:Occurrence. The projectID can link multiple occurrence records associated with the same project but may be shared in different datasets. The nature of the association can be described in the metadata project description element.
- fundingBodyName: The name of the organization or agency that provided funding for the project.
- fundingBodyID: A unique identifier for the funding organization or agency that supported the project.

Usage comments:
- projectTitle: Use this term to provide the official or commonly recognized title or name of the project. This should be the title under which the project is known and cited. Avoid abbreviations unless they are widely understood. The recommended best practice is to separate the values in a list with space vertical bar space ( | ).
- projectID: This term should be used to provide a globally unique identifier (GUID) for the project, if available. This could be a DOI, URI, or any other persistent identifier that ensures the project can be uniquely distinguished from others. The recommended best practice is to separate the values in a list with space vertical bar space ( | ).
- fundingBodyName: Specify the full official name of the funding body. This should include the complete name without abbreviations, unless the abbreviation is an official and commonly recognized form (e.g., NSF for the National Science Foundation). The recommended best practice is to separate the values in a list with space vertical bar space ( | ).
- fundingBodyID: Provide a unique identifier for the funding body, such as an identifier used in governmental or international databases. If no official identifier exists, use a persistent and unique identifier within your organization or dataset. The recommended best practice is to separate the values in a list with space vertical bar space ( | ).

Examples:
- projectTitle: The Nansen Legacy; Scalidophora i Noreg; Arctic Deep
- projectID: RCN276730; Artsproject_7-24; OC202405
- fundingBodyName: Norges forskningsråd; Artsdatabanken; Ocean Census | Nippon Foundation
- fundingBodyID: https://ror.org/00epmv149; https://ror.org/04jnzhb65; https://ror.org/05wszs827

Refines: NA
Replaces: NA

ABCD 2.06: Here I am not sure and require support from the community
- projectTitle: /DataSets/DataSet/Units/Unit/Gathering/Project/ProjectTitle
- projectID: /DataSets/DataSet/Units/Unit/Gathering/Project/
- fundingBodyName: /DataSets/DataSet/Metadata/Owners/Owner/Organisation/Name (this is not a perfect fit as the funding body is not necessarily the owner)
- fundingBodyID: /DataSets/DataSet/Metadata/Owners/Owner/Organisation/ID (this is not a perfect fit as the funding body is not necessarily the owner)
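The proposal recommends separating list values with space, vertical bar, space ( | ). As a small illustration, here is a Python sketch for splitting and joining such multi-valued fields. The function names are made up for this example and are not part of the proposal.

```python
from typing import Iterable, List

SEPARATOR = " | "


def split_values(field: str) -> List[str]:
    """Split a multi-valued field on the vertical bar, trimming surrounding spaces."""
    return [part.strip() for part in field.split("|") if part.strip()]


def join_values(values: Iterable[str]) -> str:
    """Join values with the recommended separator (space, vertical bar, space)."""
    return SEPARATOR.join(values)


# Value taken from the fundingBodyName example in the proposal.
funders = split_values("Ocean Census | Nippon Foundation")
print(funders)
print(join_values(funders))
```

Splitting on the bar alone and trimming afterwards makes the parser tolerant of values that were written without the surrounding spaces.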

Please support the idea and comment if it would be helpful for your work.
