The Humboldt extension to Darwin Core

cecsve · May 19, 2025, 1:03pm

The Humboldt extension to Darwin Core (https://eco.tdwg.org/) provides a means by which to more explicitly capture the sampling context of biodiversity survey data. The ratification of this extension reinvigorated efforts to expand support for Event data in GBIF and increase the quantity of survey data shared through GBIF implementing Event core with the Humboldt extension.

In the upcoming GBIF technical support hour for nodes on 4 June 2025, we will talk about these efforts to increase biodiversity survey data in GBIF. Specifically, we will present a data example implementing the Humboldt extension to Darwin Core (https://eco.tdwg.org/), introduce existing resources, and provide an update on the status of the Survey and Monitoring Data Guide.

We will be happy to answer any questions, whether or not they relate to the topic. Please feel free to post questions in advance in this thread or write to helpdesk@gbif.org.

mgrosjean · July 1, 2025, 8:51am

The video recording is available here: Humboldt Extension to Darwin Core on Vimeo

Here is a transcript of the Q&A:

I did a PhD over 30 years ago and published a sampling event dataset using my PhD data. Your presentation inspires me to republish my data. Should I make a new major version of the dataset?

Yes, you will potentially capture a lot more information as structured data with the Humboldt extension (HE, https://eco.tdwg.org/). If you choose to update your dataset using terms from the extension, it will definitely qualify as a new major version (not a new dataset).

Can you change the major version of the dataset in the IPT?

Yes, if your organization is associated with a DataCite account, the IPT allows you to reserve DOIs which will automatically change the major version number of your dataset. See also this part of the documentation: https://ipt.gbif.org/manual/en/ipt/latest/doi-workflow. Note that if you don’t have a DataCite account, you cannot reserve a DOI.

Is there a way to change the dataset type from occurrence dataset to event dataset?

Yes, you will need to delete the current mappings, upload the new data tables, and remap your data. Some older IPT versions will not allow this. If that’s the case, please update your IPT.

When I initially published, I was not able to capture the location of bee traps along transects. Is it better captured now with the Humboldt extension?

Yes. Providing this information in a more explicit and structured manner with a nested dataset and implementing additional terms from the Humboldt extension will make the dataset easier to interpret and easier for potential users to assess fitness for use.

Is it possible/recommended to use this extension in addition to other extensions such as DNA-derived data for eDNA protocols/sites?

Yes, but it is important to remember that the star schema limits what information can be fully ingested and interpreted by GBIF. The DNA-derived extension is currently structured for use with the Occurrence core, meaning that if it is implemented with an Event core dataset, nothing from it will be ingested or interpreted (although the original published dataset can still be accessed through the Darwin Core Archive and EML Endpoint links on the dataset page, see an example screenshot below).

In your own system, however, you are welcome, and encouraged, to use the Humboldt extension (an extension to the Event core) together with an occurrence extension (such as the DNA-derived extension) to capture the information associated with your dataset more comprehensively.

I have a question about the term isLeastSpecificTargetCategoryQuantityInclusive. Did I understand correctly that if you have one row with one count, the value is true?

Yes. If all observations of a specific taxon observed during a unique sampling Event are reported in a single row in the occurrence table (with a single, unique dwc:occurrenceID), the value for eco:isLeastSpecificTargetCategoryQuantityInclusive should be ‘true’. If the observations are broken down into multiple rows with multiple unique dwc:occurrenceIDs (for example, one row per life stage and/or sex), then eco:isLeastSpecificTargetCategoryQuantityInclusive would be ‘false’.

The term lets data users know if they need to add up values across rows for that taxon.

Additional documentation: isLeastSpecificTargetCategoryQuantityInclusive Guidelines - Humboldt Extension for Ecological Inventories.
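As a rough illustration of what the flag means for a data user, here is a minimal Python sketch. It is not an official GBIF tool: the rows, the identifiers and the function name `totals_per_taxon` are invented for the example, and it assumes the flag is read once per sampling event.

```python
from collections import defaultdict

# Hypothetical occurrence rows for a single sampling event (all values invented).
occurrences = [
    {"occurrenceID": "occ-1", "eventID": "ev-1",
     "scientificName": "Bombus terrestris", "sex": "female", "individualCount": 4},
    {"occurrenceID": "occ-2", "eventID": "ev-1",
     "scientificName": "Bombus terrestris", "sex": "male", "individualCount": 3},
    {"occurrenceID": "occ-3", "eventID": "ev-1",
     "scientificName": "Apis mellifera", "sex": "", "individualCount": 10},
]


def totals_per_taxon(rows, quantity_inclusive):
    """Total individualCount per (eventID, scientificName).

    quantity_inclusive mirrors eco:isLeastSpecificTargetCategoryQuantityInclusive:
    True means a single row per taxon already holds the full count,
    False means the rows for a taxon have to be added up.
    """
    totals = defaultdict(int)
    for row in rows:
        key = (row["eventID"], row["scientificName"])
        # With the flag set to true, a second row for the same taxon is unexpected.
        if quantity_inclusive and key in totals:
            raise ValueError(f"flag is true but {key} spans several rows")
        totals[key] += row["individualCount"]
    return dict(totals)


# The rows above are split by sex, so the flag would be 'false' and we sum.
summed = totals_per_taxon(occurrences, quantity_inclusive=False)
```

Here `summed` holds one total per taxon for the event. Calling the same function with `quantity_inclusive=True` on these rows raises an error, which is the kind of mismatch a publisher would want to catch before publishing.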

Is it only for numeric counts?

No, it isn’t just about counts; it is so users know if they have all the comprehensive information reported about a particular organism during a unique sampling Event in one row.

I have been trying to map to the Humboldt extension in the context of the new data model, and the concept of survey scope was new to me. I believe a lot of the GBIF Nodes wouldn’t be provided with survey scope.

This is not surprising. Explicit survey scopes are often not reported, and if they are, they are often buried in verbatim metadata alongside sampling design information. Explicit scopes are often included with an inventory. An inventory is a specific type of survey which aims to capture all taxa within an intended scope in a specific area.

Nearly all the terms in the Humboldt extension were designed to capture data: “what was done, how was it done, what happened?”. The scope terms (specifically taxonomic, organismal, and habitat scopes) often capture intent (What taxa do we aim to sample? What taxa are we deliberately excluding from our survey efforts? What habitats will be sampled?). The content of these terms needs to be explicit (for example, to convey that a goal was to sample flying insects at night, so CO2 traps were used).

Reported scope is needed to assess completeness of a survey. If you don’t know what you are looking for, you don’t know if you have found everything.

For those cases where you have a scope but make opportunistic samples (bycatch) during the survey, how do you model that information?

There are four terms in the Humboldt extension which help capture bycatch. You can only report bycatch if you have an explicitly stated scope.

For example, you have a survey aiming to catch fish of a certain size range with nets. Those nets might also capture cephalopods or other organisms of similar size. These other non-fish organisms caught are bycatch. You can record that using the Humboldt extension terms: eco:hasNonTargetTaxa, eco:nonTargetTaxa, eco:areNonTargetTaxaFullyReported, eco:hasNonTargetOrganisms.
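To make the net example concrete, here is a small Python sketch of one Humboldt extension row with a basic consistency check. It is only an illustration: the values and the helper `bycatch_issues` are invented, the use of eco:targetTaxonomicScope to hold the stated scope is an assumption on my part, and the " | " separator for multiple values follows the usual Darwin Core list convention.

```python
def bycatch_issues(row):
    """Return a list of consistency problems in one Humboldt extension row."""
    issues = []
    has_bycatch = row.get("eco:hasNonTargetTaxa", "").lower() == "true"
    listed = [t.strip() for t in row.get("eco:nonTargetTaxa", "").split("|") if t.strip()]
    # Bycatch only makes sense relative to an explicitly stated scope.
    if has_bycatch and not row.get("eco:targetTaxonomicScope", "").strip():
        issues.append("bycatch reported without an explicit taxonomic scope")
    if listed and not has_bycatch:
        issues.append("eco:nonTargetTaxa is filled but eco:hasNonTargetTaxa is not true")
    return issues


# Hypothetical row for the net survey: fish are targeted, cephalopods are bycatch.
net_survey = {
    "eventID": "net-haul-01",
    "eco:targetTaxonomicScope": "Actinopterygii",  # assumed scope term and value
    "eco:hasNonTargetTaxa": "true",
    "eco:nonTargetTaxa": "Cephalopoda",
    "eco:areNonTargetTaxaFullyReported": "true",
    "eco:hasNonTargetOrganisms": "false",  # assumed for this example
}

problems = bycatch_issues(net_survey)
```

For this row `problems` is empty. Blanking the scope while leaving the bycatch terms filled makes the check report an issue, which reflects the point above that bycatch needs a stated scope.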

This is really good information to share, some people study surveys that explicitly report bycatch.

In France we have a reference list of protocols, techniques and methods to acquire data (see here: https://inpn.mnhn.fr/programme/campanule?lg=en). Is there a similar list (maybe not a controlled vocabulary yet) in the Humboldt extension to try and harmonize the possible values for protocols?

No, there is no such list. We started the discussion, but creating a controlled vocabulary or even a list of existing protocols requires more capacity than is currently available. For now, we recommend having the protocol defined online, ideally in a published document or via https://www.protocols.io/, and providing the link to the protocol in the data (eco:protocolReferences).

You are welcome to reference the protocol in the list you mentioned.

I have a question about inferring absences. Take the example of Cecilie’s project (which she described in the latest webinar about the new data model: https://www.gbif.org/event/7oJVaWQZ2wlknRfM7w4OXE/introducing-the-darwin-core-data-package), where they caught, sequenced and identified insects by driving around with nets on cars. What is the scope? Flying insects? How do you infer absences? Do you have a list of flying insects expected?

In this context, there was no explicit taxonomic or organismal scope stated for the survey. As such, survey completeness cannot be interpreted and thus absence cannot be inferred.

What if you take all the species detected in all the samples and assume that if you detected it in one sample, you should be able to detect it in another sample?

I didn’t want to do that because there is too much bias in the processing of samples. Depending on the sample, some species may happen to be detected more easily because more DNA is exuded, or the amplification may work better in some samples. There are so many bottlenecks in the processing of the samples that I couldn’t comfortably infer absences in my study. If I had sample replicates and additional quality control, I would have been more confident in inferring absences.

I know that some people have different views on the topic and have developed pipelines to infer such absences without sample replicates.

So, what would be the advice in such cases? Should absences not be inferred?

It depends on the study design. If inferring absences is built into the design from the start, then it makes sense to include them.

It sounds like we might have two cases: open-ended studies or not. For example, the scope of eBird is “all birds”, while in some other studies (notably on alien and invasive species), the list of target taxa is fixed.

Yes, if you use, for example, qPCR to detect species with a specific primer (in the context of a survey on invasive species, for example), it makes sense to report absences. But if you use a universal primer, you can have all kinds of results. It might be a bit difficult to share and format data if you aren’t familiar with the methodologies.

Should you not report absences when in doubt?

If you are in doubt, the best would be not to report absences. You can’t know if something is absent if you aren’t looking for it.

I would like to say a few words about protocols. There are a few layers of protocols:

1. Field protocols (as defined in INPN - Programs CAMPanule or https://www.protocols.io/). There is often a publication available that describes the protocol.
2. Lab protocols. Many of those are controlled as well, for example DNA extraction kits, DNA purification kits, etc. Morphological protocols are also protocols, for example requiring a minimum number of spores to conclude anything about a fungal identification.
3. Post-lab protocols, for example bioinformatics pipelines used to analyze sequences. This can be captured in the code.

In this protocol discussion, all three layers should be discussed.

I have a question regarding permanent sample plots for vegetation. Our institute does a lot of permanent sample plots in dry forest (where you might come back to the same plot every few months).

How can the information be conveyed in the Humboldt extension?
Do you have an example that we could share with publishers?
Sometimes surveyors measure abundance differently, for example number of bamboos vs tree branch sizes. How can they share the information? Should the tree sizes be in the extended measurements or facts (eMOF) and the number of bamboos in the occurrence file?

I have a dataset that I am looking into (but I don’t know if it is as complex as the example you are providing). For each locality (permanent plot), we would have a parent event (with a unique locationID specific to the permanent plot), then each survey conducted at the plot would be a sub-event. Each event will be associated with data in the Humboldt extension. If appropriate, you would attach a relevé extension and/or a measurementOrFact extension.

If you have example data, please email me (kingenloff@gbif.org) to follow up. We can ensure that the survey design is properly captured by the dataset nesting structure.
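The parent event and sub-event structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers, dates and the helper `visits_by_plot` are invented, and a real dataset would carry many more terms per event.

```python
from collections import defaultdict

# Hypothetical event table: one parent event per permanent plot,
# one sub-event per visit. All identifiers and dates are invented.
events = [
    {"eventID": "plot-A", "parentEventID": "",
     "locationID": "loc-plot-A", "eventDate": "2024-03-12/2024-09-20"},
    {"eventID": "plot-A:2024-03", "parentEventID": "plot-A",
     "locationID": "loc-plot-A", "eventDate": "2024-03-12"},
    {"eventID": "plot-A:2024-09", "parentEventID": "plot-A",
     "locationID": "loc-plot-A", "eventDate": "2024-09-20"},
]


def visits_by_plot(rows):
    """Group sub-event IDs under their parent event, checking that each parent exists."""
    known = {row["eventID"] for row in rows}
    children = defaultdict(list)
    for row in rows:
        parent = row["parentEventID"]
        if not parent:
            continue  # top-level event (the plot itself)
        if parent not in known:
            raise ValueError(f"{row['eventID']} points to unknown parent {parent}")
        children[parent].append(row["eventID"])
    return dict(children)


nesting = visits_by_plot(events)
```

Each sub-event keeps the same locationID as its parent, so repeated visits to a plot stay linked both through the nesting and through the location.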

Is it better to have measurements in the eMOF extension or organism quantity?

Maybe both. The DwC-DP will help structure that a bit.

A lot of branch measurements would go in the eMOF.

Note that the fields were created so that publishers can share measurements and units as they were generated, to minimize possible errors in data entry and data interpretation.
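As an illustration of how several branch measurements for one tree could be laid out as eMOF rows, here is a minimal Python sketch. The helper `to_emof_rows`, the identifiers and the measurement values are invented for the example; values and units are passed through as recorded rather than converted.

```python
def to_emof_rows(occurrence_id, measurements):
    """Turn (type, value, unit) tuples into one eMOF-style row each."""
    rows = []
    for i, (mtype, value, unit) in enumerate(measurements, start=1):
        rows.append({
            "measurementID": f"{occurrence_id}:m{i}",  # invented ID pattern
            "occurrenceID": occurrence_id,
            "measurementType": mtype,
            "measurementValue": str(value),  # kept as recorded, no conversion
            "measurementUnit": unit,
        })
    return rows


# Hypothetical measurements taken on a single tree occurrence.
branch_rows = to_emof_rows(
    "tree-017",
    [("branch diameter", 4.2, "cm"),
     ("branch diameter", 3.1, "cm"),
     ("tree height", 12.5, "m")],
)
```

One tree occurrence then links to as many measurement rows as needed, while a simple count (such as the number of bamboos) could stay in the occurrence row itself.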

Would you recommend using a controlled vocabulary for measurement units? Is there one that exists already?

Yes. Use a controlled vocabulary when available.

OBIS has controlled vocabularies for units (which are shared in the eMOF extension).

Making such a controlled vocabulary should also be possible for GBIF, but we would have to look at the data first.

I see two options to indicate absences:

1. Uploading occurrences with a zero in the individual count or organism quantity and a link to the Humboldt extension where eco:isAbsenceReported = ‘true’.
2. There is also a specific Humboldt extension term to report absent taxa.

How do you recommend reporting the absences?

Reporting the list of absent taxa in the Humboldt extension (eco:absentTaxa) is nice, but the absences can also be inferred by users when comparing the occurrences and scope of the survey. If populating the list of absent taxa isn’t too much extra work, it would be appreciated, but it isn’t the most critical field. The potentially more useful field is the Boolean term eco:isAbsenceReported.
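For a survey with a fixed list of target taxa, the comparison between scope and occurrences could look like the Python sketch below. It is a simplification under explicit assumptions: the scope is an exact list of names, names match exactly, and the survey design supports inferring absences at all. The taxa, the helper `inferred_absences` and the " | " separator are chosen for illustration.

```python
def inferred_absences(target_taxa, detected_taxa):
    """Target taxa that were not detected, in alphabetical order."""
    return sorted(set(target_taxa) - set(detected_taxa))


# Hypothetical invasive-species survey with a fixed target list.
target = ["Procambarus clarkii", "Pacifastacus leniusculus", "Faxonius limosus"]
detected = ["Pacifastacus leniusculus"]

absent = inferred_absences(target, detected)
# A possible value for eco:absentTaxa, using " | " to separate names.
absent_taxa_value = " | ".join(absent)
```

This only works when the scope is explicit. With an open-ended scope there is no list to subtract from, which is why absences cannot be inferred in that case.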

When I read through the guide, I noticed that it mentioned somewhere that if the Boolean value is filled and that the taxonomic scope is filled, you don’t need to share the list of absent taxa. But should occurrences be shared?

If you have any absences, I suggest recording them in the occurrence table (ensure that dwc:occurrenceStatus is populated). Note that none of the Humboldt extension terms are required. We are still observing how people use the data.

What about just using occurrenceStatus? Would you also recommend that the individualCount field show zero as well?

It is up to your individual preference: occurrenceStatus or individualCount (populating both would be ideal), but don’t forget to fill the eco:isAbsenceReported field as well to help searches and indexing.
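When both fields are populated, it helps to check that they agree. The Python sketch below shows one way to do that. It is an assumption-laden illustration rather than GBIF's own interpretation logic: the helper `is_absence` and the example row are invented, and values are treated as plain strings as they would appear in a text file.

```python
def is_absence(row):
    """Decide whether an occurrence row is an absence, rejecting contradictory rows."""
    status = row.get("occurrenceStatus", "").strip().lower()
    count = row.get("individualCount", "").strip()
    if status == "absent" and count not in ("", "0"):
        raise ValueError("occurrenceStatus is absent but individualCount is not zero")
    if status == "present" and count == "0":
        raise ValueError("occurrenceStatus is present but individualCount is zero")
    # Fall back on a zero count only when no status was given.
    return status == "absent" or (status == "" and count == "0")


# Hypothetical absence record with both fields populated.
absence_row = {
    "occurrenceID": "occ-42",
    "eventID": "ev-1",
    "scientificName": "Procambarus clarkii",
    "occurrenceStatus": "absent",
    "individualCount": "0",
}

flagged = is_absence(absence_row)
```

If any row of an event is an absence, the matching Humboldt extension row for that event would carry eco:isAbsenceReported = ‘true’.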
