Field disagreements in Darwin Core

datafixer · June 23, 2022, 5:34am

When you’re preparing a Darwin Core table, checking individual fields can be pretty straightforward. As a minimum, you make sure that

entries are in the correct field
entries are valid for the field
entries are appropriate for the field
entries are formatted as recommended for Darwin Core
entries are consistently formatted

Examples:

(1) “holotype” belongs in typeStatus, not type

(2) “Beetle sp. 3” is not a valid entry in scientificName

(3) “F. Smith, Dec 1968” in identifiedBy should be split between “F. Smith” in identifiedBy and “1968-12” in dateIdentified

(4) “2012-10-05” in eventDate, not “10/5/12”

(5) “Forest Service workers” and “For. Ser. workers” in recordedBy should both be “Forest Service workers”
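
A field-level check like example (4) is easy to script. Here is a minimal sketch in Python. The function name and the simplified pattern (single dates only, no intervals or times) are my own assumptions for illustration and are not part of Darwin Core or any GBIF tool.

```python
import re
from datetime import date

# Simplified pattern: YYYY, YYYY-MM or YYYY-MM-DD only
# (assumption: no date intervals and no time component).
ISO_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")

def is_iso_event_date(value):
    """Return True if value is a valid YYYY, YYYY-MM or YYYY-MM-DD date."""
    m = ISO_DATE.fullmatch(value)
    if not m:
        return False
    year, month, day = m.groups()
    try:
        # A missing month or day defaults to 1 so partial dates still validate.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True
```

With this sketch, “2012-10-05” and “1968-12” pass, while “10/5/12” is rejected.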

The next step in checking isn’t so simple. Individual fields might be perfect on their own, but do they agree with each other in individual records?

For example, you might have a record with “Chrysomela scripta Fabricius, 1801” in scientificName but “Linnaeus, 1758” in scientificNameAuthorship. Or “France” in country and “Asia” in continent. Oops.

Below are some of the Darwin Core disagreements I’ve seen in my work as a data auditor.

Missing-but-expected

Missing-but-expected (MBE) disagreements can appear in any group of related fields. Blank entries are perfectly valid in an individual field, but they constitute errors if related fields have relevant entries. Here are some examples:

No. of records	decimalLatitude	decimalLongitude	stateProvince	country
14	52.1934	-105.4405	Saskatchewan	(blank)

No. of records	scientificName	genus	specificEpithet	infraspecificEpithet
26	Aus bus cus Smith, 1900	Aus	(blank)	cus

No. of records	minimumElevationInMeters	occurrenceRemarks
5	(blank)	found just below treeline at ca 2100’
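
The first two kinds of MBE above can be sketched as simple rules. This is an illustrative sketch only: it assumes each record is a dictionary keyed by Darwin Core term names, and the function name and the two rules are my own choices. An elevation buried in free-text remarks, as in the third example, is much harder to detect and is not covered.

```python
def missing_but_expected(record):
    """Return fields that are blank although related fields are filled."""
    def blank(field):
        return not (record.get(field) or "").strip()

    problems = []
    # Coordinates are given, so a country is expected.
    if not blank("decimalLatitude") and not blank("decimalLongitude") and blank("country"):
        problems.append("country")
    # An infraspecific epithet implies a specific epithet.
    if not blank("infraspecificEpithet") and blank("specificEpithet"):
        problems.append("specificEpithet")
    return problems
```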

That’s-not-right

That’s-not-right (TNR) means that the valid entry in one field simply does not agree with the valid entry in a related field. There are many possible TNRs. Some of the types I’ve seen include:

“Aus bus cus Smith, 1900” in scientificName but “species” instead of “subspecies” in taxonRank
“Aus bus cus Smith, 1900” in scientificName but “Jones, 1910” in scientificNameAuthorship
“Aus bus cus Smith, 1900” in scientificName and acceptedNameUsage but “Dus” in genus
“2012-10-05” in eventDate but “5.vii.2012” in verbatimEventDate
“2012-10-05” in eventDate but “2010” in year
“2012-10-05/30” in eventDate but “289” in startDayOfYear (should be 279)
“2012-10-05” in eventDate but “2009” in dateIdentified
“France” in country but “Asia” in continent
decimalLatitude and decimalLongitude are “63.6762” and “13.6875” but country is “Norway” (should be “Sweden”)
decimalLatitude and decimalLongitude are given to only 2 or 3 decimal places (a resolution of roughly 1.1 km or 110 m), but coordinateUncertaintyInMeters is “10” (see the precision table in Wikipedia’s “Decimal degrees” article)
“400” in minimumElevationInMeters but “300” in maximumElevationInMeters
“25” in individualCount but “absent” in occurrenceStatus
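
Several of the TNRs in this list can be tested record by record. The sketch below is one possible way to do it, not a definitive method: it assumes dictionary records keyed by Darwin Core term names, assumes the individual fields have already passed field-level checks, and only handles eventDate entries that start with a full YYYY-MM-DD date. The function name and the problem labels are hypothetical.

```python
from datetime import date

def tnr_problems(record):
    """Return labels for a few cross-field disagreements found in one record."""
    problems = []

    # Date checks use the start of eventDate and assume a full YYYY-MM-DD date.
    start = (record.get("eventDate") or "").split("/")[0]
    parts = start.split("-")
    if len(parts) == 3:
        start_date = date(int(parts[0]), int(parts[1]), int(parts[2]))
        if record.get("year") and int(record["year"]) != start_date.year:
            problems.append("year")
        day_of_year = start_date.timetuple().tm_yday
        if record.get("startDayOfYear") and int(record["startDayOfYear"]) != day_of_year:
            problems.append("startDayOfYear")

    # Minimum elevation should not exceed maximum elevation.
    low = record.get("minimumElevationInMeters")
    high = record.get("maximumElevationInMeters")
    if low and high and float(low) > float(high):
        problems.append("elevation")

    # A positive count contradicts an "absent" occurrence.
    count = record.get("individualCount")
    if count and int(count) > 0 and record.get("occurrenceStatus") == "absent":
        problems.append("occurrenceStatus")

    return problems
```

The day-of-year calculation accounts for leap years, which is why 5 October 2012 is day 279 and not 278.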

One-to-many

Overlapping with TNR, I very often see a valid entry in one field and more than one valid entry in another field. Which of the latter is correct? It’s usually not possible to answer that question from within the Darwin Core table, and either an external data reference or the data compiler needs to be consulted.

No. of records	genus	family
63	Aus	Improbabilidae
17	Aus	Probabilidae

No. of records	scientificName	taxonID
63	Aus bus Jones, 1910	167243
17	Aus bus Jones, 1910	158046

No. of records	decimalLatitude	decimalLongitude	stateProvince
63	50.2809	10.5383	Bavaria
17	50.2809	10.5383	Thuringia

No. of records	locality	decimalLatitude	decimalLongitude
63	5 km W of Babinda	-31.8535	146.4446
17	5 km W of Babinda	-31.6535	146.4446
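
Tallies like the ones above can be produced by grouping on one field and counting the distinct entries in the other. Here is a rough sketch, again assuming dictionary records keyed by Darwin Core term names. The function name is hypothetical.

```python
from collections import Counter, defaultdict

def one_to_many(records, key_field, value_field):
    """Return {key entry: {value entry: no. of records}} for every key entry
    that is paired with more than one distinct value entry."""
    tallies = defaultdict(Counter)
    for rec in records:
        tallies[rec[key_field]][rec[value_field]] += 1
    return {key: dict(counts) for key, counts in tallies.items() if len(counts) > 1}
```

For the genus and family example this would report “Aus” with 63 records in Improbabilidae and 17 in Probabilidae. It shows that there is a disagreement, but not which entry is correct.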

Finding disagreements

With small Darwin Core datasets (not too many records and a small number of fields), it’s possible to check for field disagreements in a spreadsheet by sorting on one field, putting the field to be checked next to that first field (by freezing panes or hiding columns) and reading carefully down the paired columns.

For both small datasets and very large ones, it’s easier and faster to do checking on the command line and you are less likely to miss a disagreement. A useful prerequisite, though, is to do field-by-field checks (above). In that last “one-to-many” example, it wouldn’t help if some of the locality entries were “Babinda, 5 km W of”.
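
As a scripted counterpart to reading down paired columns, the sketch below tallies every combination of two fields in a tab-delimited table. It uses Python purely for illustration (I am not suggesting it is the only or best command-line tool), and the function name and the file name in the comment are hypothetical.

```python
import csv
from collections import Counter

def tally_pairs(lines, field_a, field_b):
    """Count each (field_a, field_b) combination in tab-delimited lines
    whose first line is a header of Darwin Core field names."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter((row[field_a], row[field_b]) for row in reader)
    return sorted(counts.items())

# Possible usage with a hypothetical file name:
# with open("occurrence.txt", encoding="utf-8", newline="") as f:
#     for (a, b), n in tally_pairs(f, "locality", "decimalLatitude"):
#         print(n, a, b, sep="\t")
```

Note that “5 km W of Babinda” and “Babinda, 5 km W of” would be tallied as two different localities, which is why the field-by-field checks need to come first.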

Some field disagreements will be detected programmatically by the GBIF Data Validator and flagged as issues, such as “Country coordinate mismatch”, “Elevation min max swapped” and “Recorded date mismatch”. However, there are many more possible disagreements, and in the case of “one-to-many” disagreements GBIF does not provide context. The Validator is a good place to start, though!

Robert Mesibov (“datafixer”); robert.mesibov@gmail.com

mgrosjean · June 23, 2022, 11:47am

Thank you datafixer, this illustrates very well the different types of field disagreements that can be encountered.
I think this post will be useful for many publishers!
