Investigating taxonomic issues on GBIF.org

The next technical support hour for GBIF nodes will be on February 5th, 2025, at 4 pm CET, and the topic is investigating taxonomic issues on GBIF.org.

The taxonomy of the records published to GBIF is matched to GBIF’s taxonomic backbone to normalize the data and make records easier for GBIF data users to find. This means that a taxon name shared by a publisher is sometimes changed when it appears on GBIF.org. For example, a beetle record appears as a butterfly record, or the authorship is changed.

We will show examples of how the taxonomy is changed, how to investigate the source of the change, and how you can provide feedback directly to the source or to the GBIF helpdesk.

We will be happy to answer any questions, whether or not they relate to the topic. Please feel free to post questions in advance in this thread or write to helpdesk@gbif.org.


Greetings @cecsve,

Looking forward to this session! I’ve put it on the Species File Group calendar. I’m especially heartened to see the focus on being explicit about agency when you write:

… how [we] can provide feedback directly to the source OR GBIF

Hoping this is a trend, that is, a move toward a community best practice that includes thinking about our workflows, software, user interfaces, and networks that consider facilitating this type of interaction from the start. #roundtripping matters #agency #provenance #policy #standards #transparency #credit

Some questions that come to mind:

  1. How many taxonomy-related tickets does GBIF encounter in say, a week, or month, or year?
  2. How many can GBIF address directly? How many staff and how much time from GBIF staff is dedicated to these matters?
  3. What percentage of these issues need help from the experts who are providing these data?
  4. How often does it happen that the providers are not reachable or responsive? How can we all help to improve responsiveness (from UI to software and community development)?
  5. What changes could we / you suggest that are needed in other related taxonomic related resources with regard to making agency (more) possible and transparent?
  6. For names that need community work to improve them (or even provide them at all), might there be ways to call attention to these organismal groups (kindly, of course) and help convey these gaps to policy makers and funders for relevance to the bioeconomy?

Thanks for the questions @Debbie! We will answer them to the best of our knowledge when we make a summary after the meeting. With the switch to the xRelease from Catalogue of Life as the new GBIF backbone, we expect that source updates will be integrated faster in the backbone and shown on GBIF.org.

We have a video from a previous session in which Camila from COL explains the xRelease and its process, which you may find relevant: Switching GBIF’s taxonomic backbone to the Catalogue of Life extended release (x-release).

The video recording of the presentation is available here: Investigating taxonomic issues on GBIF.org.

Links mentioned in the presentation:

Q&A transcript:

How to report several issues? For example, if hundreds of records are identified with associated taxonomic issues, should we log separate issues? Should we share a report?

If you have identified that many issues, perhaps you can identify a pattern. For example, specific families or genera might all relate to the same taxonomic source. Sometimes issues are due to higher taxonomy (this can happen in some records coming from literature treatments). So it would be best to log one feedback message or GitHub issue. If you (or we) can identify a pattern, we will handle it as one issue, and if not, we will split it accordingly.

How can scientificNameID values help unambiguous matching to the GBIF Backbone taxonomy? For example, federal employees in the US often use ITIS as their taxonomic reference, would it help if they provide such identifiers?

Only the WoRMS reference database is currently used for ID matching. In addition, to be used, the identifiers must be integrated in the GBIF Backbone taxonomy. As the backbone hasn’t been updated in more than a year, you would still encounter some challenges if you use the identifiers for more recent species, as they might not be in the taxonomy.
We have opened a GitHub issue to investigate adding more sources for scientificNameID matching: Resolve more taxonID suppliers than WoRMS · Issue #1119 · gbif/pipelines · GitHub. You are welcome to suggest the integration of identifiers from specific taxonomic references if they aren’t used yet.
Note that sometimes the scientificName and classification on the one hand and the scientificNameID on the other don’t match at all. In that case, our system prioritizes the identifier match and flags the occurrences.

I reported a taxonomic GitHub issue two years ago and it was closed as solved, but the taxonomic issue remains. What happened?

It looks like it was our mistake. We have now reopened the issue. If you have suggestions for open-licence taxonomic sources that we could use for the Backbone taxonomy (especially for algae), please let us know.

What happens to the taxonKeys/taxonIDs if a name changes (for example if a misspelling was corrected)? Is the updated name associated with a new key/id?

The new name would get a new identifier in most cases (the exception would be very minor corrections). All the scientific names associated with occurrence records are reinterpreted after the GBIF Backbone taxonomy is updated. They will be linked to the new taxon keys/identifiers. That means that if you use a taxonKey to query records on GBIF, you should make sure that it is still relevant after a GBIF Backbone update.
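
As a rough illustration of the check described above, the sketch below looks up a stored taxonKey through the public GBIF species endpoint (`https://api.gbif.org/v1/species/{key}`) and compares the name it now points to with the name you expected. The helper names are hypothetical, and the assumption that a removed usage carries a non-empty `deleted` field should be verified against the API documentation.

```python
import json
import urllib.request

GBIF_API = "https://api.gbif.org/v1"


def species_url(taxon_key: int) -> str:
    """Build the URL of the GBIF species record for a taxonKey."""
    return f"{GBIF_API}/species/{taxon_key}"


def key_still_matches(record: dict, expected_name: str) -> bool:
    """Return True if the record is not marked deleted and still carries
    the expected name (with or without authorship).

    Assumption: deleted usages expose a non-empty "deleted" field.
    """
    if record.get("deleted"):
        return False
    names = {record.get("scientificName"), record.get("canonicalName")}
    return expected_name in names


def fetch_species(taxon_key: int) -> dict:
    """Fetch the species record as a dict (requires network access)."""
    with urllib.request.urlopen(species_url(taxon_key), timeout=30) as resp:
        return json.load(resp)
```

After a backbone update, you could run `key_still_matches(fetch_species(my_key), "My expected name")` for each key you rely on (both arguments here are placeholders) and re-match any name for which it returns `False`.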

If we have an issue with (for example) a marine species, can we log it in GBIF, or should we log it with the Catalogue of Life or with WoRMS directly?

You are welcome to contact the taxonomic source directly. In most cases, we forward the feedback to these sources anyway. Contacting them directly will likely result in a faster update. We receive the updates from these sources.
Note that some marine species names come from sources other than WoRMS. Don’t forget to check the source of the name.
You are welcome to contact us if the source is unresponsive, or if you aren’t sure what the source of the name is.

Which tools are worth educating data publishers about? If you are going to advise someone about publishing on GBIF, which tools would you recommend they use to make sure the taxonomic information provided is of good quality?

Publishers are very welcome to use the GBIF Species matching tool (Species name matching), which is based on the Species match API: Species API :: Technical Documentation.
There are two limitations to using the species matching tool:

  1. the names are only matched to the GBIF Backbone taxonomy (you can’t choose another reference)
  2. there is a limit to the number of names that you can match to the taxonomy
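
For a handful of names, the Species match API behind the tool can also be called directly. The following is a minimal sketch, assuming the v1 endpoint `https://api.gbif.org/v1/species/match` with its `name` and `kingdom` query parameters; the function names are hypothetical and error handling is left out.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match"


def match_url(name: str, kingdom: Optional[str] = None) -> str:
    """Build a species match request URL for one scientific name."""
    params = {"name": name}
    if kingdom:
        # A higher-rank hint can help separate homonyms.
        params["kingdom"] = kingdom
    return MATCH_ENDPOINT + "?" + urllib.parse.urlencode(params)


def match_name(name: str, kingdom: Optional[str] = None) -> dict:
    """Call the match endpoint and return the parsed JSON response
    (requires network access)."""
    with urllib.request.urlopen(match_url(name, kingdom), timeout=30) as resp:
        return json.load(resp)
```

Calling `match_name("Puma concolor", kingdom="Animalia")` should return a dictionary describing the backbone usage the name was matched to, which you can inspect before publishing.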

An alternative would be to use the ChecklistBank asynchronous matching tool: ChecklistBank. It has no limit on the number of names that can be matched, and you can choose any checklist available on ChecklistBank as the reference. Please also check this tutorial: ChecklistBank tutorial (other ChecklistBank tutorials are available from this page: Data Use Club Practical Session: accessing and downloading species information - #2 by mgrosjean).

You can always publish your data on the GBIF test website: https://www.gbif-uat.org from TEST IPTs to see how they would be interpreted by GBIF.

Will hybrid names be integrated in the extended Catalogue of Life release (XR COL)? They are currently not included in the Catalogue of Life.

It is true that the Catalogue of Life doesn’t have hybrid names but the extended release will have them. You can learn more about this extended release here: Switching GBIF’s taxonomic backbone to the Catalogue of Life extended release (x-release). There are currently 3785 hybrid names in XR COL vs 5767 hybrids in the backbone. We need to check first where those missing names come from before we are able to assess whether they can be integrated in the XR COL.

We would like to publish our GBIF datasets to OBIS as well, and we need to provide the AphiaIDs. Is there a way for GBIF to infer the AphiaIDs based on the names?

You could use the ChecklistBank name matching function to match your names to WoRMS.

Could the name matching be automated?

You can use the API call /dataset/{key}/match/nameusage/job, where the key is the ChecklistBank key of the reference dataset you want (documented here: COL ChecklistBank API).
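
As a small sketch of how that path fits together, the snippet below only builds the job URL for a given reference dataset. The base URL `https://api.checklistbank.org` is an assumption, as is the helper name; the HTTP method, authentication, and upload format are not covered here and should be taken from the ChecklistBank API documentation.

```python
from typing import Union

# Assumption: public ChecklistBank API base URL.
CLB_API = "https://api.checklistbank.org"


def match_job_url(dataset_key: Union[int, str]) -> str:
    """Build the asynchronous name matching job URL for the
    reference dataset identified by its ChecklistBank key."""
    return f"{CLB_API}/dataset/{dataset_key}/match/nameusage/job"
```

You would pass the key of the checklist you want to match against, for example the key of the WoRMS dataset as listed on ChecklistBank.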

Ideally, the system could infer the AphiaIDs directly from the names provided. I have logged the idea in the IPT GitHub repository: Would it be possible to add AphiaIDs to species records on datasets on the IPT · Issue #2649 · gbif/ipt · GitHub

Can the name matching system work for any identifiers?

Potentially, you can match your names to any checklist available in ChecklistBank.

Can the RGBIF functions give a good estimate of how the names would be interpreted by GBIF?

The RGBIF package is a wrapper for the GBIF API (GBIF API Reference :: Technical Documentation). You need to use the function that corresponds to match (Species API :: Technical Documentation), not search. The match function is what is used to match the scientific names of occurrences to the GBIF backbone taxonomy. See also this blogpost: (Almost) everything you want to know about the GBIF Species API - GBIF Data Blog.
Keep in mind that you can only query up to 100,000 records with the API.
Using the matching tool from ChecklistBank would let you match as many names as you want.
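
Once you have match responses, a simple triage can point to the names most likely to be interpreted differently from what you intended. The sketch below assumes each response is a dictionary with the `matchType` (for example `EXACT`, `FUZZY`, `HIGHERRANK`, `NONE`) and `confidence` fields returned by the match endpoint; the threshold of 90 and the function names are arbitrary illustrations, not GBIF recommendations.

```python
from typing import Dict, List


def needs_review(match: dict, min_confidence: int = 90) -> bool:
    """Flag a match response that is not an exact match or that has
    a confidence below the chosen threshold."""
    if match.get("matchType", "NONE") != "EXACT":
        return True
    return match.get("confidence", 0) < min_confidence


def names_to_review(matches: Dict[str, dict], min_confidence: int = 90) -> List[str]:
    """Return the submitted names whose match responses need a closer look,
    in the order they were provided."""
    return [
        name
        for name, match in matches.items()
        if needs_review(match, min_confidence)
    ]
```

Here `matches` maps each name you submitted to the response you received for it, whether you obtained it through rgbif, the API, or another client.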

About hybrid names: any chance to cover cultivar names (not the names in the ICNCP code, but cultivars used by plant breeders)?

It depends on whether there is a checklist available for those names. If not, the challenge is to first assemble and publish such a checklist.

We would like to have a session about translating Darwin Core into other languages at the conference Datos Vivos 2025 in Bogota. If you are interested in participating, please contact @EstebanMH-SiB.


@Debbie to answer some of your questions:

  1. How many taxonomy-related tickets does GBIF encounter in say, a week, or month, or year?

You can find all of them logged here: GitHub · Where software is built. Usually there are a few each week, but they accumulate.

  2. How many can GBIF address directly? How many staff and how much time from GBIF staff is dedicated to these matters?

We are lucky to be able to work with the Catalogue of Life staff, who also address a lot of these issues. We are only able to dedicate a small percentage of our time to investigating and forwarding these issues. I don’t think any of us has tracked the time spent, but there are seven of us who check this repository.

  3. What percentage of these issues need help from the experts who are providing these data?

That’s another hard question, maybe best answered by @camisilver and Diana Hernández. My estimate is that most of the issues require the eye of an expert at some point (if only to approve the final proposed change).

Your other questions might be best answered by the Catalogue of Life as they handle a lot of the communication with the taxonomic references.

Thank you @mgrosjean! I have a follow-up question on this. When the scientific names associated with occurrences are reinterpreted after the GBIF Backbone taxonomy is updated, will the scientific names of the occurrences be matched to the old name, with the old name pointing to the new (accepted) name? Or will the scientific names of the occurrences be matched to the new name directly?

The scientific name will be matched to the new name directly, and there will be no history to track the old interpreted name unless you can match the record to an entry in, for example, our snapshot data.
