Field disagreements in Darwin Core

When you’re preparing a Darwin Core table, checking individual fields can be pretty straightforward. As a minimum, you make sure that

  1. entries are in the correct field
  2. entries are valid for the field
  3. entries are appropriate for the field
  4. entries are formatted as recommended for Darwin Core
  5. entries are consistently formatted

Examples:

(1) “holotype” belongs in typeStatus, not type

(2) “Beetle sp. 3” is not a valid entry in scientificName

(3) “F. Smith, Dec 1968” in identifiedBy should be split between “F. Smith” in identifiedBy and “1968-12” in dateIdentified

(4) “2012-10-05” in eventDate, not “10/5/12”

(5) “Forest Service workers” and “For. Ser. workers” in recordedBy should both be “Forest Service workers”
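Check (4) can be scripted. The following is only a minimal sketch, assuming that eventDate holds a single date written as YYYY, YYYY-MM or YYYY-MM-DD. Real eventDate entries can also be intervals or include times, which this simplified pattern would reject. The function name is hypothetical.

```python
import re
from datetime import date

# Simplified assumption: accept only YYYY, YYYY-MM or YYYY-MM-DD.
# Intervals and date-times are valid in Darwin Core but not handled here.
ISO_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")

def looks_like_iso_date(value):
    """Return True if value is a plausible YYYY, YYYY-MM or YYYY-MM-DD date."""
    match = ISO_DATE.fullmatch(value)
    if not match:
        return False
    year, month, day = match.groups()
    try:
        # Missing month or day is filled with 1 so partial dates can be checked.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True

for entry in ["2012-10-05", "10/5/12", "1968-12", "2012-13-01"]:
    print(entry, looks_like_iso_date(entry))
```

Entries that fail this test are candidates for reformatting, not automatically errors.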

The next step in checking isn’t so simple. Individual fields might be perfect on their own, but do they agree with each other in individual records?

For example, you might have a record with “Chrysomela scripta Fabricius, 1801” in scientificName but “Linnaeus, 1758” in scientificNameAuthorship. Or “France” in country and “Asia” in continent. Oops.

Below are some of the Darwin Core disagreements I’ve seen in my work as a data auditor.

Missing-but-expected

Missing-but-expected (MBE) disagreements can appear in any group of related fields. Blank entries are perfectly valid in an individual field, but they constitute errors if related fields have relevant entries. Here are some examples:

 

| No. of records | decimalLatitude | decimalLongitude | stateProvince | country |
| 14 | 52.1934 | -105.4405 | Saskatchewan | |

 

| No. of records | scientificName | genus | specificEpithet | infraspecificEpithet |
| 26 | Aus bus cus Smith, 1900 | Aus | | cus |

 

| No. of records | minimumElevationInMeters | occurrenceRemarks |
| 5 | | found just below treeline at ca 2100’ |
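The first two MBE examples can be found with a simple rule: if certain fields are filled in, a related field should not be blank. Here is a minimal sketch of that idea, assuming records are held as dictionaries keyed by Darwin Core field name. The rule list and function name are hypothetical, and “Canada” in the third record is added only for illustration. The third example (an elevation mentioned only in occurrenceRemarks) needs text searching and is not covered.

```python
# Hypothetical rule table: if every "trigger" field is filled in,
# the "expected" field should not be blank.
MBE_RULES = [
    (("decimalLatitude", "decimalLongitude"), "country"),
    (("infraspecificEpithet",), "specificEpithet"),
]

def find_mbe(records, rules=MBE_RULES):
    """Yield (row_number, expected_field) for each missing-but-expected blank."""
    for row_number, record in enumerate(records, start=1):
        for triggers, expected in rules:
            has_triggers = all((record.get(f) or "").strip() for f in triggers)
            if has_triggers and not (record.get(expected) or "").strip():
                yield row_number, expected

records = [
    {"decimalLatitude": "52.1934", "decimalLongitude": "-105.4405",
     "stateProvince": "Saskatchewan", "country": ""},
    {"scientificName": "Aus bus cus Smith, 1900", "genus": "Aus",
     "specificEpithet": "", "infraspecificEpithet": "cus"},
    {"decimalLatitude": "52.1934", "decimalLongitude": "-105.4405",
     "stateProvince": "Saskatchewan", "country": "Canada"},
]

problems = list(find_mbe(records))
print(problems)  # [(1, 'country'), (2, 'specificEpithet')]
```

More rules can be added to the list as new groups of related fields turn up.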

 

That’s-not-right

That’s-not-right (TNR) means that the valid entry in one field simply does not agree with the valid entry in a related field. There are many possible TNRs. Some of the types I’ve seen include:

  • “Aus bus cus Smith, 1900” in scientificName but “species” instead of “subspecies” in taxonRank
  • “Aus bus cus Smith, 1900” in scientificName but “Jones, 1910” in scientificNameAuthorship
  • “Aus bus cus Smith, 1900” in scientificName and acceptedNameUsage but “Dus” in genus
  • “2012-10-05” in eventDate but “5.vii.2012” in verbatimEventDate
  • “2012-10-05” in eventDate but “2010” in year
  • “2012-10-05/30” in verbatimEventDate but “289” in startDayOfYear (should be 279)
  • “2012-10-05” in eventDate but “2009” in dateIdentified
  • “France” in country but “Asia” in continent
  • decimalLatitude and decimalLongitude are “63.6762” and “13.6875” but country is “Norway” (should be “Sweden”)
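  • decimalLatitude and decimalLongitude are given to 2 or 3 decimal places, but coordinateUncertaintyInMeters is “10” (2 decimal places resolve to roughly 1.1 km and 3 to roughly 110 m at the equator; see the Wikipedia table on decimal degrees precision)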
  • “400” in minimumElevationInMeters but “300” in maximumElevationInMeters
  • “25” in individualCount but “absent” in occurrenceStatus
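Several of the TNRs above come down to simple arithmetic or comparisons. The sketch below checks four of them for a single record. It rests on assumptions that are mine, not part of Darwin Core: the record is a dictionary of strings, numeric fields have already passed field-by-field checks, and the date (here placed in eventDate) starts with a full YYYY-MM-DD date. The function name is hypothetical.

```python
from datetime import date

def tnr_problems(record):
    """Return messages for a few simple disagreements within one record."""
    problems = []

    # Assumption: eventDate is a full YYYY-MM-DD date, or an interval that
    # starts with one; other forms are skipped rather than guessed at.
    start = record.get("eventDate", "").split("/")[0]
    try:
        start_date = date.fromisoformat(start)
    except ValueError:
        start_date = None

    if start_date is not None:
        year = record.get("year", "")
        if year and int(year) != start_date.year:
            problems.append(f"year disagrees with eventDate ({start_date.year})")
        day = record.get("startDayOfYear", "")
        expected_day = start_date.timetuple().tm_yday
        if day and int(day) != expected_day:
            problems.append(f"startDayOfYear disagrees with eventDate ({expected_day})")

    low = record.get("minimumElevationInMeters", "")
    high = record.get("maximumElevationInMeters", "")
    if low and high and float(low) > float(high):
        problems.append("minimum elevation exceeds maximum elevation")

    count = record.get("individualCount", "")
    status = record.get("occurrenceStatus", "")
    if count and int(count) > 0 and status == "absent":
        problems.append("individualCount is positive but occurrenceStatus is absent")

    return problems

record = {
    "eventDate": "2012-10-05/30",
    "year": "2010",
    "startDayOfYear": "289",
    "minimumElevationInMeters": "400",
    "maximumElevationInMeters": "300",
    "individualCount": "25",
    "occurrenceStatus": "absent",
}
for problem in tnr_problems(record):
    print(problem)
```

The day-of-year check reproduces the figure given above: 5 October 2012 is day 279 because 2012 is a leap year.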

One-to-many

Overlapping with TNR, I very often see a valid entry in one field and more than one valid entry in another field. Which of the latter is correct? It’s usually not possible to answer that question from within the Darwin Core table, and either an external data reference or the data compiler needs to be consulted.

 

| No. of records | genus | family |
| 63 | Aus | Improbabilidae |
| 17 | Aus | Probabilidae |

 

| No. of records | scientificName | taxonID |
| 63 | Aus bus Jones, 1910 | 167243 |
| 17 | Aus bus Jones, 1910 | 158046 |

 

| No. of records | decimalLatitude | decimalLongitude | stateProvince |
| 63 | 50.2809 | 10.5383 | Bavaria |
| 17 | 50.2809 | 10.5383 | Thuringia |

 

| No. of records | locality | decimalLatitude | decimalLongitude |
| 63 | 5 km W of Babinda | -31.8535 | 146.4446 |
| 17 | 5 km W of Babinda | -31.6535 | 146.4446 |
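Tallies like the ones above can be produced by grouping records on the “one” field(s) and counting the distinct entries in the “many” field(s). This is one possible sketch, assuming records are dictionaries keyed by field name. The function name and the four “Dus” records are hypothetical and are included only to show that consistent pairings are not reported.

```python
from collections import Counter, defaultdict

def one_to_many(records, key_fields, value_fields):
    """Tally value combinations for each key; keep keys with more than one."""
    tallies = defaultdict(Counter)
    for record in records:
        key = tuple(record.get(f, "") for f in key_fields)
        value = tuple(record.get(f, "") for f in value_fields)
        tallies[key][value] += 1
    return {k: dict(v) for k, v in tallies.items() if len(v) > 1}

records = (
    [{"genus": "Aus", "family": "Improbabilidae"}] * 63
    + [{"genus": "Aus", "family": "Probabilidae"}] * 17
    + [{"genus": "Dus", "family": "Probabilidae"}] * 4
)

conflicts = one_to_many(records, ["genus"], ["family"])
for key, values in conflicts.items():
    for value, count in values.items():
        print(count, *key, *value)
```

Passing several field names as the key, for example decimalLatitude and decimalLongitude with stateProvince as the value, covers the coordinate examples in the same way. The record counts hint at which entry is the likely outlier, but they do not show which one is correct.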

 

Finding disagreements

With small Darwin Core datasets (not too many records and a small number of fields), it’s possible to check for field disagreements in a spreadsheet by sorting on one field, putting the field to be checked next to that first field (by freezing panes or hiding columns) and reading carefully down the paired columns.

For both small datasets and very large ones, it’s easier and faster to check on the command line, and you are less likely to miss a disagreement. A useful prerequisite, though, is to do the field-by-field checks (above). In that last “one-to-many” example, it wouldn’t help if some of the locality entries were “Babinda, 5 km W of”.
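As a scripted stand-in for sorting on one field and reading down the paired column, the sketch below tallies every distinct combination of two fields in a tab-separated table. It is an illustration only, not a description of any particular command-line workflow. The sample data and function name are hypothetical, and the table is assumed to be tab-separated with a header row.

```python
import csv
import io
from collections import Counter

# Hypothetical tab-separated sample standing in for a Darwin Core file.
SAMPLE = (
    "country\tcontinent\n"
    "France\tEurope\n"
    "France\tEurope\n"
    "France\tAsia\n"
    "Sweden\tEurope\n"
)

def tally_pairs(handle, first, second):
    """Count each distinct (first, second) combination in a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter((row[first], row[second]) for row in reader)

pairs = tally_pairs(io.StringIO(SAMPLE), "country", "continent")
for (country, continent), count in sorted(pairs.items()):
    print(count, country, continent)
```

For a real file, the io.StringIO object would be replaced by a file opened with open(path, newline="", encoding="utf-8"). Any first-field entry that appears on more than one output line deserves a closer look.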

Some field disagreements will be detected programmatically by the GBIF Data Validator and flagged as issues, such as “Country coordinate mismatch”, “Elevation min max swapped” and “Recorded date mismatch”. However, there are many more possible disagreements, and in the case of “one-to-many” disagreements GBIF does not provide context. The Validator is a good place to start, though!

Robert Mesibov (“datafixer”); robert.mesibov@gmail.com


Thank you, datafixer. This illustrates very well the different types of field disagreements that can be encountered.
I think this post will be useful for many publishers!