In a previous [community forum post] 1, I reported that I had found GBIF datasets which failed a referential integrity test. I found another one today while auditing data for Pensoft Publishers.
The data publishers in each case had prepared an event.txt table for their sampling events and an occurrence.txt table for their occurrence records. The two tables were connected by an eventID field.
For every eventID entry in occurrence.txt there needs to be a corresponding eventID entry in event.txt. This wasn’t always the case. Some records in occurrence.txt had an eventID that was missing from event.txt, so the records were missing their sampling details.
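The check itself is simple to sketch in Python. The example below is only an illustration under stated assumptions, not the method used in any of the audits: it assumes both tables are tab-separated with a header row containing an `eventID` column, and the function names (`read_column`, `orphaned_event_ids`) and the sample data are hypothetical.

```python
import csv
import io


def read_column(handle, column):
    """Return the values of one named column from a tab-separated file handle."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Short rows yield None for missing cells, so fall back to an empty string.
    return [(row.get(column) or "").strip() for row in reader]


def orphaned_event_ids(event_handle, occurrence_handle):
    """List eventID values used in the occurrence table but absent from the event table."""
    event_ids = set(read_column(event_handle, "eventID"))
    occurrence_event_ids = read_column(occurrence_handle, "eventID")
    # Blank eventID entries are skipped here; they are a separate problem.
    return sorted({eid for eid in occurrence_event_ids if eid and eid not in event_ids})


# Made-up sample data standing in for event.txt and occurrence.txt.
EVENTS = "eventID\teventDate\nEV-001\t2020-01-05\nEV-002\t2020-01-06\n"
OCCURRENCES = (
    "occurrenceID\teventID\n"
    "OCC-1\tEV-001\n"
    "OCC-2\tEV-003\n"
    "OCC-3\tEV-002\n"
)

orphans = orphaned_event_ids(io.StringIO(EVENTS), io.StringIO(OCCURRENCES))
print(orphans)  # ['EV-003']
```

With real files, the `io.StringIO` objects would be replaced by handles from `open("event.txt", encoding="utf-8", newline="")` and the equivalent for occurrence.txt.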
GBIF does not check for referential integrity failures. The “orphaned” occurrence records are silently dropped: they appear neither in the recommended, post-processing download of the dataset nor on the GBIF website as individual records. No “issues” flag is assigned to those orphaned records, either.
As with all data made available to GBIF, the primary responsibility for data quality lies with the data publisher. However, checking for errors of this kind can be difficult, especially if eventID is a long and complicated code.
If you can use a BASH shell, please see this [blog post] 2 in which I provide a shell script (“chkevoc”), which finds referential integrity problems with event.txt and occurrence.txt. The script also checks for blank and duplicate eventID entries in event.txt, and for blank and duplicate occurrenceID entries in occurrence.txt.
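For readers without a BASH shell, the blank and duplicate checks can be sketched in Python. This is not the “chkevoc” script, just a rough illustration: it assumes a tab-separated file with a header row, and the function name `id_problems` and the sample data are hypothetical.

```python
import csv
import io
from collections import Counter


def id_problems(handle, column):
    """Return (number of blank entries, sorted list of duplicated values) for one ID column."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    values = [(row.get(column) or "").strip() for row in reader]
    blank_count = sum(1 for value in values if not value)
    counts = Counter(value for value in values if value)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return blank_count, duplicates


# Made-up sample standing in for occurrence.txt: one blank and one repeated occurrenceID.
SAMPLE = (
    "occurrenceID\teventID\n"
    "OCC-1\tEV-001\n"
    "\tEV-001\n"
    "OCC-2\tEV-002\n"
    "OCC-2\tEV-002\n"
)

blanks, dupes = id_problems(io.StringIO(SAMPLE), "occurrenceID")
print(blanks, dupes)  # 1 ['OCC-2']
```

The same function can be pointed at the `eventID` column of event.txt, where each value should appear exactly once.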
“chkevoc” works on tab-separated, plain-text files. If you are preparing event.txt or occurrence.txt in a spreadsheet, a quick way to convert the spreadsheet to tab-separated plain text is to copy the active cells to the clipboard, then paste into a good-quality text editor, such as [Notepad++] 3 for Windows, or [Geany] 4 for Windows, Mac and Linux. Individual spreadsheet cells will automatically become tab-separated in the resulting text file.
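If the spreadsheet has already been exported as CSV, the conversion can also be sketched in Python with the standard `csv` module. This is an alternative to the copy-and-paste route, offered as an assumption-laden example: the function name `csv_to_tsv` and the sample row are hypothetical, and a cell that itself contains a tab or a line break will be written inside quotes, so such cells should be cleaned up first.

```python
import csv
import io


def csv_to_tsv(csv_handle, tsv_handle):
    """Copy rows from a comma-separated handle to a tab-separated handle."""
    reader = csv.reader(csv_handle)
    writer = csv.writer(tsv_handle, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)


# Made-up sample: the quoted locality contains a comma, which is harmless once tabs separate fields.
CSV_TEXT = 'eventID,locality\nEV-001,"Hobart, Tasmania"\n'

out = io.StringIO()
csv_to_tsv(io.StringIO(CSV_TEXT), out)
tsv_text = out.getvalue()
print(tsv_text)
```

With real files, both handles would be opened with `newline=""` and an explicit encoding such as UTF-8.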