GBIF's data quality workflow (GBIF technical support hour for nodes)

The video is available here: Data quality workflow on Vimeo

Here is the transcript of the questions during the session.

Links shown in the presentation:

In the presentation, you showed the example of an iNaturalist observation where users who report feedback are redirected directly to the relevant iNaturalist page. How can publishers get the same setup?
This is something that we have to set up per publisher. The main requirement is that a link to the page where feedback should be reported must be provided for each occurrence. If you are interested, please log a GitHub issue on our portal.

I have a question about the data validator. Do you provide those data quality checks through an API or is it only by upload on the web interface?
The validator has an API which is documented here: Validator API :: Technical Documentation. You are welcome to use it to check your datasets programmatically. Otherwise, you are also welcome to publish datasets on our TEST website https://www.gbif-uat.org either via the API or an IPT in TEST mode. The advantage of publishing on the test website is that you will be able to browse your flagged records while the validator only gives a summary of the issues and flags.

Is there an easy way to download the tables and lists that are displayed in the metrics tab of the occurrence search?
Right now, the only way to access those values is by using the occurrence search API: Occurrence API :: Technical Documentation (using facets, metrics or inventories). Follow this GitHub issue for more information: Add the possibility to download summary tables from the UI · Issue #5227 · gbif/portal-feedback · GitHub
Note that you can already download lists of datasets from the dataset search interface: Search
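Until such a download option exists, the metrics-tab counts can be reproduced with faceted occurrence search queries. A minimal sketch (the endpoint and `facet`/`limit` parameters are from the occurrence search API; the choice of `basisOfRecord` and the dataset key are just illustrations, and `facet_rows` is our own helper, not part of the API):

```python
from urllib.parse import urlencode

# Build a faceted occurrence search query; limit=0 skips the
# occurrence records themselves and returns only the facet counts.
base = "https://api.gbif.org/v1/occurrence/search"
params = {
    "datasetKey": "c33ce2f2-c3cc-43a5-a380-fe4526d63650",  # example key
    "facet": "basisOfRecord",
    "limit": 0,
}
url = f"{base}?{urlencode(params)}"

# A facet response contains a "facets" list; each entry holds "counts"
# with name/count pairs. This helper flattens one facet into rows.
def facet_rows(response: dict, field: str):
    for facet in response.get("facets", []):
        if facet.get("field") == field:
            return [(c["name"], c["count"]) for c in facet["counts"]]
    return []

# A hand-written sample response, standing in for the real JSON reply.
sample = {"facets": [{"field": "BASIS_OF_RECORD",
                      "counts": [{"name": "PRESERVED_SPECIMEN", "count": 120},
                                 {"name": "HUMAN_OBSERVATION", "count": 30}]}]}
rows = facet_rows(sample, "BASIS_OF_RECORD")
```

Fetching `url` and passing the parsed JSON to `facet_rows` gives you the same name/count table the metrics tab displays, ready to write out as CSV.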

Right now, the feedback email templates generated from the GBIF feedback interface add the Node manager in CC. Is there a way to have the email CC a different email address?
The email addresses CCed in those email templates are based on the Node manager email addresses in the GBIF directory. If you would like a different email address there, the address associated with the Node manager in the directory needs to be updated.

Should the technical documentation be translated?
Some parts of the technical documentation (like the API documentation) are auto-generated and won’t be translatable. It is technically feasible to translate the rest of the documentation (like the description of issues and flags), but that work needs to be balanced between overall user needs and the volunteers who provide this valuable service. The priority is to first mature the current documentation.

Is there a way to be notified when the GBIF Backbone Taxonomy is updated?
At the moment, we have Backbone taxonomy updates at most twice a year. They are always mentioned in the release notes: Release notes. We don’t have an automated notification system but you can keep an eye on the release notes.

It would be really helpful if you could explain which references are used to assess the quality of the fields. For instance, “depth unlikely records”: does GBIF use any bathymetry data, or the same min/max values applied to all records?
The only check we do is that the value should be between 0 and 11000 (the depth of the Mariana Trench in metres). No additional references are used. See also the documentation of the flag here.
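The check amounts to a simple range test. A sketch of that logic (the function name and constant are ours, not GBIF's):

```python
# GBIF's depth check is a plain range test: a depth in metres is
# plausible between 0 and 11000 (roughly the Mariana Trench).
MAX_DEPTH_M = 11000

def depth_unlikely(depth_m: float) -> bool:
    """Return True when a depth value would be flagged as unlikely."""
    return depth_m < 0 or depth_m > MAX_DEPTH_M
```

So `depth_unlikely(12000)` is `True` and `depth_unlikely(20)` is `False`, regardless of where on the globe the record was collected.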

I am often confused about how to record interval and duration. For example, a trawl event that happens around midnight: 2023-02-10T23:58Z/2023-02-11T00:04Z. The Darwin Core Quick Reference Guide eventDate example suggests that this should be interpreted as some time during the interval between 10 Feb 2023 11:58pm UTC and 11 Feb 2023 00:04am UTC. It doesn’t mean that it is a 6-minute trawl, does it?
Each data provider might interpret the definition of the field differently. Some assume that an event can take place at any time during the range provided, while others use this field to convey the length of the event. We suggest checking the samplingEffort field to see how the event was conducted. For publishers, you can also add information in the other sampling-related fields.
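If you want to inspect such intervals yourself, the span of an ISO 8601 interval is easy to compute with the standard library; a sketch (the function is ours; the `Z` suffix is rewritten because older `datetime.fromisoformat` versions don't accept it — and whether that span is the actual sampling duration depends on the publisher, as noted above):

```python
from datetime import datetime

def interval_span_minutes(event_date: str) -> float:
    """Span in minutes of an ISO 8601 interval such as
    '2023-02-10T23:58Z/2023-02-11T00:04Z'."""
    start_s, end_s = event_date.split("/")
    # fromisoformat accepts 'Z' only from Python 3.11; normalise it.
    parse = lambda s: datetime.fromisoformat(s.replace("Z", "+00:00"))
    return (parse(end_s) - parse(start_s)).total_seconds() / 60

span = interval_span_minutes("2023-02-10T23:58Z/2023-02-11T00:04Z")  # 6.0
```

The midnight-straddling trawl above spans 6 minutes, which is consistent with either reading: a 6-minute tow, or an event recorded at some point within those 6 minutes.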

How does filtering for records with a date range work in the API? For example, will the request eventDate=*,2023-01 return records with eventDate=2022-12/2023-01 or eventDate=2023-01/02?
The general documentation is available here: Date and time interpretation :: Technical Documentation. The example query would return only eventDate=2022-12/2023-01.
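In other words, a record matches a range filter only when its entire eventDate interval falls inside the filter range. A sketch of that containment logic at month precision (this mirrors the behaviour described above; the functions themselves are ours, not GBIF's):

```python
from datetime import date, timedelta

def month_end(year: int, month: int) -> date:
    # Last day of a month: first day of the next month minus one day.
    if month == 12:
        return date(year, 12, 31)
    return date(year, month + 1, 1) - timedelta(days=1)

def matches_filter(record_start, record_end, filter_start, filter_end):
    # The record's whole interval must lie inside the filter range;
    # an open bound ('*') is represented here as None.
    lower_ok = filter_start is None or record_start >= filter_start
    upper_ok = filter_end is None or record_end <= filter_end
    return lower_ok and upper_ok

# eventDate=*,2023-01  ->  open start, filter end = 2023-01-31
f_end = month_end(2023, 1)
a = matches_filter(date(2022, 12, 1), month_end(2023, 1), None, f_end)  # True
b = matches_filter(date(2023, 1, 1), month_end(2023, 2), None, f_end)   # False
```

The record spanning 2022-12/2023-01 ends within January 2023 and matches, while 2023-01/02 runs past the filter's upper bound and does not.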

What is the difference between data use and literature?
Literature corresponds to any publication mentioning GBIF (that we could find); you can also watch and read this previous Technical Support Hour for Nodes on the topic. Data use entries are hand-picked, curated examples of how GBIF data is used. They will often be included in the science review as well.

How to filter for extinct species in GBIF? When is isExtinct = true?
The information about the extinction status is shared in checklists by using the Species Profile extension (http://rs.gbif.org/extension/gbif/1.0/speciesprofile_2019-01-29.xml). What you get from https://api.gbif.org/v1/species/search?isExtinct=true is every taxon from any checklist published on GBIF that has the value isExtinct = true. The information is shared by each publisher (with their own interpretation of the field) and we don’t have one reference list of extinct species. You might be able to use the source checklists to give you more context. For example, you can try filtering for extinct species in the Paleobiology Database checklist: https://api.gbif.org/v1/species/search?isExtinct=true&dataset_key=c33ce2f2-c3cc-43a5-a380-fe4526d63650. As a side note, ChecklistBank has a button in the interface to filter for extinct species, see example here.
Generally, accessing information about species on GBIF isn’t always straightforward, which is why it is the topic of the next Data Use Club practical session. You can check out the event page and register here: Data Use Club practical sessions: accessing and downloading species information. Note that the occurrence search also allows you to filter for occurrences of species in the IUCN “Extinct” category. See this example: Search
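To restrict the isExtinct search to one source checklist programmatically, you can build the query with the dataset key as a parameter. A sketch (the key is the Paleobiology Database key quoted above; we assume the camelCase `datasetKey` parameter name used elsewhere in the GBIF API, whereas the link above spells it `dataset_key`):

```python
from urllib.parse import urlencode

# Species search restricted to extinct taxa from one source checklist.
base = "https://api.gbif.org/v1/species/search"
params = {
    "isExtinct": "true",
    "datasetKey": "c33ce2f2-c3cc-43a5-a380-fe4526d63650",  # Paleobiology Database
    "limit": 20,   # results are paged; raise "offset" to fetch more
}
url = f"{base}?{urlencode(params)}"
```

Because each publisher interprets isExtinct in their own way, querying checklist by checklist like this keeps the interpretation consistent within each result set.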

When you look at a single occurrence, there are three columns: remarks (which contain the flags), the original data and the interpreted data. Is there a way to download an occurrence selection and have the original and interpreted data side by side?
You could choose the DARWIN CORE ARCHIVE download format, which is a zip file that contains a file called verbatim.txt (containing the original data) and one called occurrence.txt containing the interpreted data (you can read more about it here). The interpreted data will have a column with all the GBIF flags. You will then need to join those two files to get the data side by side, using the gbifID field provided in both files (see also the documentation here).
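The join itself is a plain key match on gbifID. A stdlib-only sketch, with two toy tab-separated strings standing in for verbatim.txt and occurrence.txt (the depth values and flag are made up for illustration):

```python
import csv
import io

# Toy stand-ins for the two files inside a Darwin Core Archive download.
verbatim_txt = "gbifID\tdepth\n1\tca. 20 m\n2\t15\n"
occurrence_txt = "gbifID\tdepth\tissue\n1\t\tDEPTH_UNLIKELY\n2\t15.0\t\n"

def rows_by_id(tsv: str) -> dict:
    """Index tab-separated rows by their gbifID column."""
    return {r["gbifID"]: r for r in csv.DictReader(io.StringIO(tsv), delimiter="\t")}

verbatim = rows_by_id(verbatim_txt)
interpreted = rows_by_id(occurrence_txt)

# Put the original and interpreted values side by side per record.
side_by_side = [
    {"gbifID": k,
     "verbatim_depth": verbatim[k]["depth"],
     "interpreted_depth": interpreted[k]["depth"],
     "issue": interpreted[k]["issue"]}
    for k in verbatim if k in interpreted
]
```

With real downloads you would point `csv.DictReader` at the extracted files instead of strings; tools like pandas (`merge` on gbifID) do the same join for larger archives.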
Note that you can get the raw data for a given occurrence with the API like this: https://api.gbif.org/v1/occurrence/4512323354/fragment.

Right now, the IUCN Global threat statuses are available for GBIF occurrences. Is there a way to add regional Red List categories to occurrences?
There is no easy way for us to do that. You are welcome to publish a checklist with the relevant extensions, but it doesn’t mean that we would necessarily be able to process occurrences based on that checklist. We suggest logging your ideas on GitHub (Issues · gbif/portal-feedback · GitHub) so we can keep track of the interest in such a feature.
