How I check Darwin Core datasets

This post describes what I do when auditing Darwin Core datasets: plain-text files with one record per line, each line divided into fields.

For working with this simple data structure I use text-processing tools on the command line, not spreadsheets, OpenRefine, R or Python. Command-line tools for processing plain text originated in the UNIX era and have up to 40 years of development behind them. They are fast, reliable, simple to use and easy to learn. They are particularly good at working with very large files (millions of records, hundreds of fields).

The checks I do overlap with the checks done by GBIF, but the overlap isn’t large. I look for many data problems that GBIF ignores, and unless asked I don’t check for spatial errors such as coordinates outside the specified country (“country coordinate mismatch”), or for misspellings or incorrect authorities in scientificName. My aim is to ensure that the dataset is tidy, not that the entries are all factually correct.

The checking methods I use are explained in A Data Cleaner’s Cookbook, with more details and examples in the companion blog BASHing data. Contact me directly by email if you would like training in these command-line methods.

The checks are numbered below in a more or less logical order, but I vary this order with the nature of the dataset and the results of successive checks. The time required to check a dataset is largely independent of the number of records: it takes about as long to check 1000 records as 100,000. A dataset with only a few problems takes 15-20 minutes for a complete check; if there are quite a few problems and the file is large, the check can take an hour or two.

For this post I’m assuming the dataset is a single occurrence.txt file in UTF-8 encoding with plain (not Windows) line endings. See this GBIF community forum post for a check on separate event.txt and occurrence.txt files in a Darwin Core archive.
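Encoding and line endings can be checked up front. A minimal sketch, using an invented two-record file (for real data, point the commands at occurrence.txt):

```shell
# Build a tiny demo file with one Windows-style (CRLF) line ending.
# All file names and values here are invented for illustration.
printf 'occurrenceID\teventDate\nocc-1\t2021-05-04\r\nocc-2\t2021-05-05\n' > demo.txt

# Count lines carrying a carriage return; non-zero means Windows endings.
grep -c $'\r' demo.txt    # prints 1

# Strip the carriage returns (GNU sed understands \r) for plain endings.
sed 's/\r$//' demo.txt > demo-clean.txt
grep -c $'\r' demo-clean.txt || true    # prints 0
```

The `file` command is also handy at this stage: it reports the character encoding and flags CRLF line terminators.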

(1) Check the list of Darwin Core fields for missing but expected fields, e.g. basisOfRecord, eventDate, occurrenceID.
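A sketch of this field-list check with core utilities; the header below is invented and much shorter than a real one:

```shell
# Demo header line only (real occurrence.txt files have many more fields).
printf 'occurrenceID\tscientificName\tdecimalLatitude\n' > demo.txt

# Number the fields in the header for easy reading.
head -n 1 demo.txt | tr '\t' '\n' | cat -n

# Report expected Darwin Core fields missing from the header.
for field in basisOfRecord eventDate occurrenceID; do
    head -n 1 demo.txt | tr '\t' '\n' | grep -qx "$field" || echo "missing: $field"
done    # prints: missing: basisOfRecord, missing: eventDate
```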

(2) Check for character problems: “?”, �, invisible control characters, mojibake, formatting characters (no-break spaces, soft hyphens), multiple versions of the same character, unmatched braces.
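Some of these character checks can be sketched with grep (GNU grep, for the -P option). The demo file below plants a replacement character, a control character and a no-break space:

```shell
# Demo file with three planted character problems (values invented).
printf 'id\tlocality\n1\tR\xef\xbf\xbdo Negro\n2\tba\x08d\n3\tNorth\xc2\xa0Head\n' > demo.txt

# Lines containing the Unicode replacement character U+FFFD.
grep -n $'\xef\xbf\xbd' demo.txt

# Lines with invisible control characters other than tab (GNU grep -P).
grep -nP '[\x00-\x08\x0b\x0c\x0e-\x1f]' demo.txt

# Lines with no-break spaces (UTF-8 bytes C2 A0).
grep -n $'\xc2\xa0' demo.txt
```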

(3) Check for within-field problems: invalid entries, incorrectly formatted entries, NITS, missing-but-expected entries, pseudo-duplication, incrementing fill-down errors, truncated entries, unneeded spaces (leading, trailing, internal) or quotes.
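Many within-field problems surface in a simple tally of a field's unique entries. A sketch on invented data, with each entry bracketed so leading and trailing spaces become visible:

```shell
# Demo file with a trailing space and a case variant in "country".
printf 'id\tcountry\n1\tAustralia\n2\tAustralia \n3\taustralia\n' > demo.txt

# Tally unique entries in field 2; variants of the same value sit
# side by side in the sorted output.
awk -F'\t' 'NR>1 {print "["$2"]"}' demo.txt | sort | uniq -c
```

Three "different" countries where there should be one.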

(4) Check for between-field problems as appropriate, such as valid entries in the wrong field, and disagreements between fields, e.g. year, month or day disagreeing with eventDate; genus or specificEpithet disagreeing with scientificName; scientificName disagreeing with taxonRank.
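Between-field checks are straightforward in AWK. A sketch of the year-versus-eventDate test on invented data (the field positions would need adjusting for a real file):

```shell
# Demo file where record 2 has a year disagreeing with eventDate.
printf 'id\tyear\teventDate\n1\t2021\t2021-05-04\n2\t2020\t2021-06-01\n' > demo.txt

# Print records whose "year" differs from the first 4 characters of eventDate.
awk -F'\t' 'NR>1 && $2 != substr($3,1,4)' demo.txt
```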

(5) Check for multiple unique entries where only one is expected, such as the same genus referred to two different family entries, or the same decimalLatitude-decimalLongitude pair referred to two different country or stateProvince entries.
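A sketch of the one-to-many check for genus and family, on invented entries:

```shell
# Demo file in which "Acacia" is linked to two different families.
printf 'genus\tfamily\nAcacia\tFabaceae\nAcacia\tMimosaceae\nEucalyptus\tMyrtaceae\n' > demo.txt

# Reduce to unique genus-family pairs, then flag genera that appear
# with more than one family.
tail -n +2 demo.txt | sort -u | cut -f1 | uniq -d    # prints Acacia
```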

(6) Check for exact and strict duplicate records.
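Exact duplicates fall out of sort and uniq. The sketch below also shows one reading of a "strict" duplicate check, namely records identical apart from a record-identifier field (data invented):

```shell
# Two demo records identical except for their occurrenceID.
printf 'occurrenceID\tname\tlocality\nocc-1\tAcacia dealbata\tHobart\nocc-2\tAcacia dealbata\tHobart\n' > demo.txt

# Exact duplicates: none here, because the identifiers differ.
tail -n +2 demo.txt | sort | uniq -d

# Duplicates ignoring field 1 (the identifier): one pair found.
tail -n +2 demo.txt | cut -f2- | sort | uniq -d
```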

(7) After the above problems have been fixed (by the data owner), check for anomalies in content.

Fixing data problems in a structured text file like a Darwin Core dataset isn’t my job (unless asked), but when repairing my own data I generally do deletions and replacements in a good text editor, and field- or record-specific replacements on the command line. As an alternative for general use on Windows, Mac and Linux systems I highly recommend Modern CSV, a free table editor with all the table-handling functions of a spreadsheet but without the spreadsheet hazards.

Robert Mesibov (“datafixer”)

