Link issues to DwC terms

GBIF adds issues and flags to ingested occurrence data, as described in Occurrence issues and flags :: Technical Documentation.
These describe a problem with the value of a DwC term, and some flags can apply to multiple DwC terms: for example, both dwc:decimalLatitude and dwc:decimalLongitude are flagged with ZERO_COORDINATE if the value is 0. These issues can be retrieved as part of the occurrence record through the API as a list of issues:
"issues": ["ZERO_COORDINATE", "COUNTRY_COORDINATE_MISMATCH"]
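
As an illustration (not part of the original post), a minimal Python sketch for retrieving that issues list from the public GBIF occurrence API could look like this; the occurrence key used here is a placeholder.

```python
# Minimal sketch: fetch one occurrence record from the GBIF API and read its
# "issues" array. The occurrence key below is a placeholder, not a real record.
import requests

GBIF_OCCURRENCE_API = "https://api.gbif.org/v1/occurrence"

def get_occurrence_issues(occurrence_key: int) -> list[str]:
    """Return the issue flags GBIF attached to a single occurrence record."""
    response = requests.get(f"{GBIF_OCCURRENCE_API}/{occurrence_key}", timeout=30)
    response.raise_for_status()
    return response.json().get("issues", [])

print(get_occurrence_issues(1234567890))  # placeholder occurrence key
```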

To programmatically address these issues you need to know for which specific DwC term the issue was flagged, and that information is missing from the occurrence record. Is there a way to get this information from the API? The gbif.org GUI seems to know it, but I do not see it making a specific API call for it.

@waddink, which flagged issues would you programmatically address without inspecting the whole record?

@datafixer I am not sure; I have no plans to address issues myself, but I would like to add known issues to a digital specimen so they can be fixed at the source. I would leave it to the annotator or the builder of an annotation tool to decide what contextual information is needed. In some cases I could imagine this being geographic information, species distribution information, taxonomic information or expedition information that is not provided by the record.

@waddink, many thanks for the clarification. Re “so they can be fixed at the source”, see this forum post.

It sounds like DiSSCo is trying to make it easier for “source” to understand the issues to be fixed. Do you think that will increase the chances that “source” will do the fixes?

In other words, do you think that fixes are not happening because “source” does not understand the flags applied to records by GBIF?

There can be many reasons why flagged issues are not fixed: lack of capacity (some museums have only one staff member), lack of technical skills, lack of priority, lack of understanding, etc. Some of the flags are difficult to understand, I think mainly the ones related to GRSciColl linking, but the GBIF Secretariat has already taken steps in recent years to explain these better.

@waddink, yes, those are reasons why flagged issues are not fixed. For a more extended catalog see this 2016 post and this 2018 one.

My question was about your suggestion (above) that GBIF could/should relate flags to particular fields, and whether you thought that would help “source” do fixes.

If DiSSCo plans to report issues back to participating institutions (before GBIF sees the records), will DiSSCo be offering its own explanations of those issues and how they should be fixed?

@datafixer the aim here is to make the flags visible as annotations at the source for data that is already in GBIF. For that I think we should link to the descriptions as given in GBIF (and credit GBIF). A next step could be to report issues before data is published in GBIF (although it would be better to prevent these issues when the records are being created). That would go beyond the issues flagged in GBIF, as we encounter other issues as well, such as duplicate images, and we aim to have more controlled vocabularies and identifiers.

Wouter

@waddink, many thanks for clarifying what you want to do. It could be worthwhile to monitor the success rate for

(a) fixes resulting from annotations of records already in GBIF
(b) fixes resulting from reports by DiSSCo, before those records were first shared with GBIF

By “success rate” I mean a measurable decline in identified issues per record between dataset versions. You might consider subdividing that metric, for example by issue type.
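
To make that metric concrete, here is an illustrative sketch (the snapshot structure and the example records and flags are assumptions, not data from this discussion) that computes issues per record for two versions of the same record set and breaks the counts down by issue type.

```python
# Illustrative sketch only: "issues per record" and a per-issue-type breakdown
# for a snapshot of records, so two dataset versions can be compared.
# The data structure (record id -> list of issue flags) is an assumption.
from collections import Counter

def issues_per_record(snapshot: dict[str, list[str]]) -> float:
    """Mean number of flagged issues per record in one snapshot."""
    if not snapshot:
        return 0.0
    return sum(len(issues) for issues in snapshot.values()) / len(snapshot)

def issues_by_type(snapshot: dict[str, list[str]]) -> Counter:
    """Count how often each issue type occurs across the snapshot."""
    return Counter(issue for issues in snapshot.values() for issue in issues)

# Made-up example: the same two records in an earlier and a later dataset version.
v1 = {"rec-1": ["ZERO_COORDINATE"],
      "rec-2": ["COUNTRY_COORDINATE_MISMATCH", "ZERO_COORDINATE"]}
v2 = {"rec-1": [],
      "rec-2": ["COUNTRY_COORDINATE_MISMATCH"]}

print(issues_per_record(v1) - issues_per_record(v2))  # decline between versions
print(issues_by_type(v1), issues_by_type(v2))         # split by issue type
```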


That is an interesting idea @datafixer, we will take that into account when we start developing metrics.

Hi Wouter, I found this older blog post which links issues to specific fields. Maybe a local CSV (see the sketch below) will do if there is no specific API endpoint for it?

EDIT: and the official documentation of course: Occurrence issues and flags :: Technical Documentation

That last one is probably a better idea.
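
To make the local CSV idea concrete, here is a hypothetical sketch; the file name, column layout and the two example rows are assumptions, not an official GBIF mapping.

```python
# Hypothetical sketch of the "local CSV" idea: a hand-maintained mapping from
# GBIF issue codes to the DwC terms they relate to. File name, columns and the
# example rows are assumptions, not an official GBIF product.
import csv

def load_issue_term_map(path: str = "issue_to_dwc_terms.csv") -> dict[str, list[str]]:
    """Read rows of the form: issue,terms (terms separated by '|')."""
    mapping: dict[str, list[str]] = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            mapping[row["issue"]] = row["terms"].split("|")
    return mapping

# Example file contents (assumed):
# issue,terms
# ZERO_COORDINATE,dwc:decimalLatitude|dwc:decimalLongitude
# COUNTRY_COORDINATE_MISMATCH,dwc:country|dwc:decimalLatitude|dwc:decimalLongitude
```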

@waddink, I’m pleased you think a success metric for annotations and feedback reports would be an interesting idea.

I’ve used issues per record before to compare datasets at a single point in time. Here’s an example from 2020.

Issues-per-record tracking from time point to time point would need to be done, obviously, on the same set of records. If each record in the set has a persistent identifier this should be easy to implement.

The records to be tracked also need to be of the same kind, for example using MIDS categories if reports and annotations are applied only to MIDS levels 2 and 3.
