Occurrence records without their event records

In a previous [community forum post] 1, I reported that I had found GBIF datasets which failed a referential integrity test. I found another one today while auditing data for Pensoft Publishers.

The data publishers in each case had prepared an event.txt table for their sampling events and an occurrence.txt table for their occurrence records. The two tables were connected by an eventID field.

For every eventID entry in occurrence.txt there needs to be a corresponding eventID entry in event.txt. This wasn’t always the case. Some records in occurrence.txt had an eventID that was missing from event.txt, so the records were missing their sampling details.
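As a rough illustration of that test (a minimal Python sketch, not the script mentioned below), the orphaned eventID values can be found by comparing the two columns as sets. It assumes both files are tab-separated UTF-8 text with a header row containing an eventID column; the function names are hypothetical.

```python
import csv


def read_column(path, column):
    """Return the values of one named column from a tab-separated file with a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: treat quote characters as ordinary data, as in plain TSV
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [(row.get(column) or "").strip() for row in reader]


def orphaned_event_ids(event_path, occurrence_path, key="eventID"):
    """Return sorted eventID values used in the occurrence table but absent from the event table."""
    known = set(read_column(event_path, key))
    used = set(read_column(occurrence_path, key))
    used.discard("")  # blank eventIDs are a separate problem, not orphans
    return sorted(used - known)
```

Calling `orphaned_event_ids("event.txt", "occurrence.txt")` returns an empty list when every occurrence record has a matching sampling event.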

GBIF does not check for referential integrity failure. The “orphaned” occurrence records are silently dropped and do not appear either in the recommended, post-processing download from the dataset, or on the GBIF website as individual records. There is also no “issues” flag assigned to those orphaned records.

As with all data made available to GBIF, the primary responsibility for data quality lies with the data publisher. However, checking for errors of this kind can be difficult, especially if eventID is a long and complicated code.

If you can use a BASH shell, please see this [blog post] 2 in which I provide a shell script (“chkevoc”), which finds referential integrity problems with event.txt and occurrence.txt. The script also checks for blank and duplicate eventID entries in event.txt, and for blank and duplicate occurrenceID entries in occurrence.txt.
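If you cannot use a BASH shell, the blank and duplicate checks can be approximated in Python. This is only a sketch under assumptions and not a port of “chkevoc”: it assumes tab-separated UTF-8 files with a header row, and the function name is hypothetical.

```python
import csv
from collections import Counter


def blank_and_duplicate_ids(path, column):
    """Count blank entries and list duplicated values in one column of a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        values = [(row.get(column) or "").strip() for row in reader]
    blanks = sum(1 for value in values if value == "")
    counts = Counter(value for value in values if value)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return blanks, duplicates
```

Run it once as `blank_and_duplicate_ids("event.txt", "eventID")` and once as `blank_and_duplicate_ids("occurrence.txt", "occurrenceID")`. A clean table gives a blank count of zero and an empty list of duplicates.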

“chkevoc” works on tab-separated, plain-text files. If you are preparing event.txt or occurrence.txt in a spreadsheet, a quick way to convert the spreadsheet to tab-separated plain text is to copy the active cells to the clipboard, then paste into a good-quality text editor, such as [Notepad++] 3 for Windows, or [Geany] 4 for Windows, Mac and Linux. Individual spreadsheet cells will automatically become tab-separated in the resulting text file.

Thank you @datafixer.

Note that the GBIF data validator is also able to detect integrity violations. See this example: Data validator

We cannot flag records that cannot be ingested. I don’t know what would be the best solution here. Perhaps not ingest the records at all? I just logged the idea here: Not ingest dataset with integrity violation · Issue #3872 · gbif/portal-feedback · GitHub.

Alternatively, perhaps this should be part of the IPT checks?

Thanks, @mgrosjean.

I think it would be best to advise data publishers that there is a problem in their data that will cause some of their records to disappear when harvested by GBIF from an IPT. Whether that’s done at the IPT level or the GBIF level is less important, it seems to me, than getting that advice to the data publisher as quickly as possible.

All the cases I’m aware of were found during a Pensoft data check. Pensoft advised the data publisher by email, and the data publisher then fixed the referential integrity problem. Is an automated email send-out not possible from GBIF?

As far as I know, this type of email notification was considered but not adopted, because we wanted to limit the amount of email sent out automatically by systems.

I understand that there are plans to integrate the data validator into the IPT in the future. That would allow publishers to see this type of issue before the data reaches GBIF.

How would issues identified at IPT level be communicated to data publishers? Directly or through IPT administrators?

If through admins, then there is a good chance that the person-to-person email contact will result in fixes not only to this particular problem, but also to the many other issues that validation at IPT level can pick up.

If directly, I don’t see how the chances of getting this particular problem fixed will be increased - look at the number of issues already flagged in datasets but not fixed by publishers.

Is GBIF considering other ways to communicate with publishers, other than by automated email send-outs?
