GBIF's data quality workflow (GBIF technical support hour for nodes)

The theme for the March 2024 session of the Technical Support Hour for Nodes is GBIF’s data quality workflow. The session will take place on 6 March 2024 at 4pm CET (UTC+1). We will go through how published data is processed regarding quality checks, show how you can get an overview of the flags and issues of datasets, explain how users provide publicly accessible feedback, and show how you can navigate the feedback yourself.

We will be happy to answer any question, related to the topic or not, either in this thread or by email at helpdesk@gbif.org.


Thank you so much for organizing this again! Here are my questions:

  • It would be really helpful if you could explain which references are used to assess the quality of the fields. For instance, for “depth unlikely” records, does GBIF use any bathymetry data, or the same min/max values applied to all records?

Questions not related to data quality workflow:

  • I am often confused about how to record interval and duration. For example, a trawl event that happens around midnight: 2023-02-10T23:58Z/2023-02-11T00:04Z. The Darwin Core Quick Reference Guide eventDate example suggests that this should be interpreted as some time during the interval between 10 Feb 2023 11:58pm UTC and 11 Feb 2023 00:04 UTC. It doesn’t mean that it is a 6-minute trawl, does it?
  • How does filtering for records by date range work in the API? For example, will the request eventDate=*,2023-01 return records with eventDate=2022-12/2023-01 or eventDate=2023-01/02?
  • What is the difference between data use and literature?

Thank you so much!!


How does filtering for records by date range work in the API? For example, will the request eventDate=*,2023-01 return records with eventDate=2022-12/2023-01 or eventDate=2023-01/02?

I am finally following up on this - sorry for the delay! I am a bit unsure about whether you mean the format returned or which eventDates will be included so if my answer does not cover your question then please let me know.

I tested it, and it provides eventDates in an assortment of different formats:
2009-08-13T00:00
20-06-1967
1996-04
1818-06-20/1999-02-04

So it returns all occurrences until January 31st 2023 in all supported formats - see Table 1 in this part of the technical documentation.

Does this answer your question please?

Thank you so much @cecsve !!

It was just a curious question, please don’t spend too much time on it! I believe occ_001 should be returned. My question was whether occ_002 in the figure above will be returned by the query, because its interval overlaps with the query range but not completely. The idea could be extended to intervals in days.


Hi Ming, occ 2 wouldn’t be returned. For example, querying this: https://www.gbif.org/occurrence/search?dataset_key=377be098-626f-4cc2-b4b5-35700050669a&event_date=2005-08-02,2005-08-03&taxon_key=2434813&advanced=1 doesn’t return this occurrence: Occurrence Detail 897099134


The video is available here: Data quality workflow on Vimeo

Here is the transcript of the questions during the session.

Links shown in the presentation :

In the presentation, you showed the example of an iNaturalist observation where users who report feedback are redirected directly to the relevant iNaturalist page. How can publishers get the same set up?
This is something that we have to set up per publisher. The main requirement is that the link to the page to report the feedback must be provided for each occurrence. If you are interested, please log a GitHub issue on our portal.

I have a question about the data validator. Do you provide those data quality checks through an API or is it only by upload on the web interface?
The validator has an API which is documented here: Validator API :: Technical Documentation. You are welcome to use it to check your datasets programmatically. Otherwise, you are also welcome to publish datasets on our TEST website https://www.gbif-uat.org either via the API or an IPT in TEST mode. The advantage of publishing on the test website is that you will be able to browse your flagged records while the validator only gives a summary of the issues and flags.

Is there an easy way to download the tables and lists that are displayed in the metrics tab of the occurrence search?
Right now, the only way to access those values is by using the occurrence search API: Occurrence API :: Technical Documentation (either by using facets or metrics or inventories). Follow this GitHub issue for more information: Add the possibility to download summary tables from the UI · Issue #5227 · gbif/portal-feedback · GitHub
Note that you can already download lists of datasets from the dataset search interface: Search
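
As a rough illustration of the facet route mentioned above, the sketch below builds an occurrence search request that asks only for facet counts and flattens the result into a name/count table. The helper names (`facet_url`, `facet_counts`, `fetch_facet_counts`) are hypothetical, and the response layout (`facets` → `counts` → `name`/`count`) is an assumption about the search API's output; please check it against the Occurrence API documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://api.gbif.org/v1/occurrence/search"


def facet_url(facet, **filters):
    """Build a search URL that returns facet counts and no records (limit=0)."""
    params = {**filters, "facet": facet, "limit": 0}
    return f"{API}?{urlencode(params)}"


def facet_counts(response):
    """Flatten the 'facets' part of a search response into {name: count}.

    Assumes the layout {"facets": [{"counts": [{"name": ..., "count": ...}]}]}.
    """
    return {
        entry["name"]: entry["count"]
        for facet in response.get("facets", [])
        for entry in facet["counts"]
    }


def fetch_facet_counts(facet, **filters):
    """Call the API (needs network access) and return the facet counts."""
    with urlopen(facet_url(facet, **filters)) as resp:
        return facet_counts(json.load(resp))
```

For example, `fetch_facet_counts("basisOfRecord", country="DK")` would request the breakdown by basis of record for one country; the resulting dictionary can then be written to a CSV file.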

Right now, the feedback email templates generated from the GBIF feedback interface add the Node manager in copy of the email. Is there a way to have the email CC a different email address?
The email addresses CC’d in those email templates are based on the Node manager email addresses in the GBIF directory. If you would like a different email address there, the address associated with the Node manager in the directory needs to be updated.

Should the technical documentation be translated?
Some parts of the technical documentation (like the API documentation) are auto-generated and won’t be translatable. It is technically feasible to translate the rest of the documentation (like the description of issues and flags), but that work needs to be balanced between overall user needs and the volunteers who provide this valuable service. The priority is to first mature the current documentation.

Is there a way to be notified when the GBIF Backbone Taxonomy is updated?
At the moment, we have Backbone taxonomy updates at most twice a year. They are always mentioned in the release notes: Release notes. We don’t have an automated notification system but you can keep an eye on the release notes.

It would be really helpful if you could explain which references are used to assess the quality of the fields. For instance, for “depth unlikely” records, does GBIF use any bathymetry data, or the same min/max values applied to all records?
The only check we do is that the value should be between 0 and 11000 (Mariana Trench depth in meters). No additional references are used. See also the documentation of the flag here.
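
To make the check concrete, here is a minimal sketch of a range test of this kind. It is not GBIF's actual code: the function name `depth_is_unlikely` is hypothetical, and treating both bounds as inclusive is an assumption.

```python
MAX_DEPTH_M = 11000  # upper bound quoted in the answer above, in meters


def depth_is_unlikely(depth_m):
    """Return True when a depth falls outside the 0 to 11000 m range.

    Assumption: both bounds are inclusive.
    """
    return not (0 <= depth_m <= MAX_DEPTH_M)
```

Because the same bounds apply everywhere, a depth of 5000 m recorded in a shallow coastal area would pass this kind of check; catching that would require bathymetry data, which is not used.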

I am often confused about how to record interval and duration. For example, a trawl event that happens around midnight: 2023-02-10T23:58Z/2023-02-11T00:04Z. The Darwin Core Quick Reference Guide eventDate example suggests that this should be interpreted as some time during the interval between 10 Feb 2023 11:58pm UTC and 11 Feb 2023 00:04 UTC. It doesn’t mean that it is a 6-minute trawl, does it?
Data providers might interpret the definition of the field differently. Some assume that an event can take place at any time during the range provided, while others use this field to convey the length of the event. We suggest checking the samplingEffort field to see how the event was conducted. Publishers can also add information in the sampling-related fields.
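
Whichever interpretation a publisher intends, the length of the interval itself is easy to compute. The sketch below does so for an interval running from 23:58 to 00:04 UTC across midnight; the function name `interval_minutes` is hypothetical, and the parser only handles the minute-precision UTC format used in this example.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%MZ"  # minute precision with a literal Z, as in the example


def interval_minutes(event_date):
    """Return the length in minutes of a 'start/end' eventDate interval."""
    start, end = (datetime.strptime(part, FMT) for part in event_date.split("/"))
    return (end - start).total_seconds() / 60


print(interval_minutes("2023-02-10T23:58Z/2023-02-11T00:04Z"))  # 6.0
```

The result only tells you how wide the interval is. Whether the trawl lasted the full six minutes or happened at some point within them still has to come from fields such as samplingEffort.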

How does filtering for records by date range work in the API? For example, will the request eventDate=*,2023-01 return records with eventDate=2022-12/2023-01 or eventDate=2023-01/02?
The general documentation is available here: Date and time interpretation :: Technical Documentation. The example query would return only eventDate=2022-12/2023-01.
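
The behaviour described in this thread (a record is returned only when its whole date interval falls inside the query range, not when it merely overlaps) can be sketched as follows. This is an illustration of that rule, not GBIF's implementation. The helper names `expand`, `to_range` and `is_returned` are hypothetical, only `YYYY`, `YYYY-MM` and `YYYY-MM-DD` values are handled, and the abbreviated interval `2023-01/02` has to be written in full as `2023-01/2023-02`.

```python
import calendar
from datetime import date


def expand(partial, end=False):
    """Expand 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' to the first or last day it covers."""
    parts = [int(p) for p in partial.split("-")]
    year = parts[0]
    month = parts[1] if len(parts) > 1 else (12 if end else 1)
    if len(parts) > 2:
        day = parts[2]
    else:
        day = calendar.monthrange(year, month)[1] if end else 1
    return date(year, month, day)


def to_range(value, sep):
    """Split 'start<sep>end' into (first_day, last_day); '*' means unbounded."""
    start, _, end = value.partition(sep)
    end = end or start  # a single value is its own start and end
    low = date.min if start == "*" else expand(start)
    high = date.max if end == "*" else expand(end, end=True)
    return low, high


def is_returned(event_date, query):
    """True when the eventDate interval lies entirely within the query range."""
    ev_low, ev_high = to_range(event_date, "/")
    q_low, q_high = to_range(query, ",")
    return q_low <= ev_low and ev_high <= q_high


print(is_returned("2022-12/2023-01", "*,2023-01"))  # True
print(is_returned("2023-01/2023-02", "*,2023-01"))  # False
```

The second interval ends on 28 February 2023, after the upper bound of 31 January 2023, so it is excluded even though it overlaps the query range.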

What is the difference between data use and literature?
Literature corresponds to any publication mentioning GBIF (that we could find). You can also watch and read this previous Technical Support Hour for Nodes on the topic. Data Use is a set of hand-picked, curated examples of how GBIF data is used. They will often be included in the science review as well.

How to filter for extinct species in GBIF? When is isExtinct = true?
The information about the extinction status is shared in checklists by using the Species Profile extension (http://rs.gbif.org/extension/gbif/1.0/speciesprofile_2019-01-29.xml). Mostly, what is available on GBIF when looking for https://api.gbif.org/v1/species/search?isExtinct=true is every taxon from any checklist published on GBIF that has the value “isExtinct” = True. The information is shared by each publisher (with their own interpretation of the field) and we don’t have one reference list of extinct species. You might be able to use the source checklists to give you more context. For example, you can try filtering for extinct species in The Paleobiology Database checklist: https://api.gbif.org/v1/species/search?isExtinct=true&dataset_key=c33ce2f2-c3cc-43a5-a380-fe4526d63650. As a side note, Checklistbank has a button in the interface to filter for extinct species, see example here.
Generally, accessing information about species on GBIF isn’t always straightforward, which is why it is the topic of the next Data Use Club practical session. You can check out the event page and register here: Data Use Club practical sessions: accessing and downloading species information. Note that the occurrence search also allows you to filter for occurrences of species that belong to the “Extinct” category of the IUCN. See this example: Search

When you look at a single occurrence, there are three columns: remarks (which contain the flags), the original data and the interpreted data. Is there a way to download an occurrence selection and have the original and interpreted data side by side?
You could choose the DARWIN CORE ARCHIVE download format which is a zip file that contains a file called verbatim.txt (containing the original data) and one called occurrence.txt containing the interpreted data (you can read more about it here). The interpreted data will have a column with all the GBIF flags. You will then need to join those two files to have the data side by side. You can do so by using the gbifid field provided in both files (see also the documentation here).
Note that you can get the raw data for a given occurrence with the API like this: https://api.gbif.org/v1/occurrence/4512323354/fragment.
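
The join described above can be sketched with the standard library alone. This is only an illustration: the function name `side_by_side` and the `verbatim_`/`interpreted_` column prefixes are hypothetical, and it assumes both files are tab-separated with a header row and share an identifier column named `gbifID` (adjust the `key` argument if your download spells it differently).

```python
import csv


def side_by_side(verbatim_file, interpreted_file, fields, key="gbifID"):
    """Yield one dict per interpreted record, pairing verbatim and interpreted values.

    Both arguments are open text files (or file-like objects) in
    tab-separated format with a header row.
    """
    def read(handle):
        return csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)

    # Index the verbatim rows by identifier so each lookup is cheap.
    verbatim = {row[key]: row for row in read(verbatim_file)}

    for row in read(interpreted_file):
        raw = verbatim.get(row[key], {})
        paired = {key: row[key]}
        for field in fields:
            paired[f"verbatim_{field}"] = raw.get(field, "")
            paired[f"interpreted_{field}"] = row.get(field, "")
        yield paired
```

After unzipping the archive, it could be used along these lines: open `verbatim.txt` and `occurrence.txt` with `open(path, newline="", encoding="utf-8")` and iterate over `side_by_side(verbatim, occurrence, ["eventDate", "scientificName"])`. Note that the verbatim index is held in memory, so very large downloads may call for a dataframe library or a database instead.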

Right now, the IUCN Global threat statuses are available for GBIF occurrences. Is there a way to add regional red list categories to occurrences?
There is no easy way for us to do that. You are welcome to publish a checklist with the relevant extensions, but it doesn’t mean that we would necessarily be able to process occurrences based on that checklist. We suggest logging your ideas on GitHub (Issues · gbif/portal-feedback · GitHub) so we can keep track of the interest in such a feature.
