When you’re preparing a Darwin Core table, checking individual fields can be pretty straightforward. As a minimum, you make sure that
- entries are in the correct field
- entries are valid for the field
- entries are appropriate for the field
- entries are formatted as recommended for Darwin Core
- entries are consistently formatted
Examples:
(1) “holotype” belongs in typeStatus, not type
(2) “Beetle sp. 3” is not a valid entry in scientificName
(3) “F. Smith, Dec 1968” in identifiedBy should be split between “F. Smith” in identifiedBy and “1968-12” in dateIdentified
(4) “2012-10-05” in eventDate, not “10/5/12”
(5) “Forest Service workers” and “For. Ser. workers” in recordedBy should both be “Forest Service workers”
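A check like example (4) is easy to script. The following is a minimal Python sketch, not a full validator. The function name `looks_like_iso_date` and the simplified pattern are assumptions made for illustration, and the pattern deliberately ignores the date intervals and times that Darwin Core also allows in eventDate.

```python
import re
from datetime import date

# Simplified pattern: YYYY, YYYY-MM or YYYY-MM-DD only (an assumption;
# intervals such as 2012-10-05/30 and times are not handled here).
ISO_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")

def looks_like_iso_date(value):
    """Return True if value is blank or a plausible ISO 8601 date."""
    if value == "":
        return True  # a blank is valid in an individual field
    if not ISO_DATE.match(value):
        return False
    parts = [int(p) for p in value.split("-")]
    if len(parts) >= 2 and not 1 <= parts[1] <= 12:
        return False  # month out of range
    if len(parts) == 3:
        try:
            date(*parts)  # rejects impossible days such as 2012-02-30
        except ValueError:
            return False
    return True

print(looks_like_iso_date("2012-10-05"))  # True
print(looks_like_iso_date("10/5/12"))     # False
```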
The next step in checking isn’t so simple. Individual fields might be perfect on their own, but do they agree with each other in individual records?
For example, you might have a record with “Chrysomela scripta Fabricius, 1801” in scientificName but “Linnaeus, 1758” in scientificNameAuthorship. Or “France” in country and “Asia” in continent. Oops.
Below are some of the Darwin Core disagreements I’ve seen in my work as a data auditor.
Missing-but-expected
Missing-but-expected (MBE) disagreements can appear in any group of related fields. Blank entries are perfectly valid in an individual field, but they constitute errors if related fields have relevant entries. Here are some examples:
No. of records | decimalLatitude | decimalLongitude | stateProvince | country |
---|---|---|---|---|
14 | 52.1934 | -105.4405 | Saskatchewan | |
No. of records | scientificName | genus | specificEpithet | infraspecificEpithet |
---|---|---|---|---|
26 | Aus bus cus Smith, 1900 | Aus | | cus |
No. of records | minimumElevationInMeters | occurrenceRemarks |
---|---|---|
5 | | found just below treeline at ca 2100’ |
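An MBE check of the first kind shown above can be sketched in a few lines of Python. This assumes the table has already been read into a list of dictionaries keyed by Darwin Core field name. The function name `missing_but_expected` and the sample records are made up for illustration.

```python
def missing_but_expected(records, present, expected):
    """Return records where every field in `present` is filled
    but the field `expected` is blank."""
    hits = []
    for rec in records:
        has_related = all(rec.get(f, "").strip() for f in present)
        if has_related and not rec.get(expected, "").strip():
            hits.append(rec)
    return hits

# Made-up sample records for illustration only
records = [
    {"decimalLatitude": "52.1934", "decimalLongitude": "-105.4405",
     "stateProvince": "Saskatchewan", "country": ""},
    {"decimalLatitude": "52.1934", "decimalLongitude": "-105.4405",
     "stateProvince": "Saskatchewan", "country": "Canada"},
    {"decimalLatitude": "", "decimalLongitude": "",
     "stateProvince": "", "country": ""},
]

hits = missing_but_expected(
    records, ["decimalLatitude", "decimalLongitude"], "country")
print(len(hits))  # 1
```

The third sample record is not reported, because a blank country is only a problem when the related coordinate fields are filled.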
That’s-not-right
That’s-not-right (TNR) means that the valid entry in one field simply does not agree with the valid entry in a related field. There are many possible TNRs. Some of the types I’ve seen include:
- “Aus bus cus Smith, 1900” in scientificName but “species” instead of “subspecies” in taxonRank
- “Aus bus cus Smith, 1900” in scientificName but “Jones, 1910” in scientificNameAuthorship
- “Aus bus cus Smith, 1900” in scientificName and acceptedNameUsage but “Dus” in genus
- “2012-10-05” in eventDate but “5.vii.2012” in verbatimEventDate
- “2012-10-05” in eventDate but “2010” in year
- “2012-10-05/30” in verbatimEventDate but “289” in startDayOfYear (should be 279)
- “2012-10-05” in eventDate but “2009” in dateIdentified
- “France” in country but “Asia” in continent
- decimalLatitude and decimalLongitude are “63.6762” and “13.6875” but country is “Norway” (should be “Sweden”)
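- decimalLatitude and decimalLongitude are given to 2 or 3 decimal places, but coordinateUncertaintyInMeters is “10” (2 and 3 decimal places resolve to roughly 1.1 km and 110 m at the equator; see the Wikipedia table on decimal degrees precision)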
- “400” in minimumElevationInMeters but “300” in maximumElevationInMeters
- “25” in individualCount but “absent” in occurrenceStatus
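Several of the TNRs listed above lend themselves to simple per-record tests. The sketch below covers four of them in Python. It assumes each record is a dictionary of strings keyed by Darwin Core field name and that eventDate holds a single full date. The function names (`parse_day`, `to_number`, `tnr_checks`) are hypothetical, and checks that need outside knowledge (country versus coordinates, for example) are not attempted.

```python
from datetime import date

def parse_day(value):
    """Return a date for a full ISO date string, else None."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None

def to_number(value):
    """Return a float for a numeric string, else None."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def tnr_checks(rec):
    """Return a list of disagreement messages for one record."""
    problems = []
    day = parse_day(rec.get("eventDate", ""))
    year = to_number(rec.get("year", ""))
    start = to_number(rec.get("startDayOfYear", ""))
    if day is not None:
        if year is not None and year != day.year:
            problems.append("year disagrees with eventDate")
        if start is not None and start != day.timetuple().tm_yday:
            problems.append("startDayOfYear disagrees with eventDate")
    low = to_number(rec.get("minimumElevationInMeters", ""))
    high = to_number(rec.get("maximumElevationInMeters", ""))
    if low is not None and high is not None and low > high:
        problems.append("minimum elevation exceeds maximum elevation")
    count = to_number(rec.get("individualCount", ""))
    status = rec.get("occurrenceStatus", "").strip().lower()
    if count is not None and count > 0 and status == "absent":
        problems.append("individualCount above 0 but occurrenceStatus is absent")
    return problems

# A made-up record combining four of the disagreements listed above
rec = {"eventDate": "2012-10-05", "year": "2010", "startDayOfYear": "289",
       "minimumElevationInMeters": "400", "maximumElevationInMeters": "300",
       "individualCount": "25", "occurrenceStatus": "absent"}
for problem in tnr_checks(rec):
    print(problem)
```

Blank or unparseable entries are skipped rather than reported, since those belong to the field-by-field checks and the MBE checks.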
One-to-many
This category overlaps with TNR: I very often see a valid entry in one field paired with more than one valid entry in another field. Which of the latter is correct? It’s usually not possible to answer that question from within the Darwin Core table, so either an external data reference or the data compiler needs to be consulted.
No. of records | genus | family |
---|---|---|
63 | Aus | Improbabilidae |
17 | Aus | Probabilidae |
No. of records | scientificName | taxonID |
---|---|---|
63 | Aus bus Jones, 1910 | 167243 |
17 | Aus bus Jones, 1910 | 158046 |
No. of records | decimalLatitude | decimalLongitude | stateProvince |
---|---|---|---|
63 | 50.2809 | 10.5383 | Bavaria |
17 | 50.2809 | 10.5383 | Thuringia |
No. of records | locality | decimalLatitude | decimalLongitude |
---|---|---|---|
63 | 5 km W of Babinda | -31.8535 | 146.4446 |
17 | 5 km W of Babinda | -31.6535 | 146.4446 |
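Tallies like the ones above can be produced by grouping records on one or more "key" fields and counting the distinct entries in a second field. Here is a minimal Python sketch of that idea. The function name `one_to_many` is hypothetical, the records are assumed to be dictionaries keyed by field name, and the "Dus" records are made up to show a genus that is not flagged.

```python
from collections import Counter, defaultdict

def one_to_many(records, key_fields, value_field):
    """Map each combination of key_fields entries to a Counter of its
    value_field entries, keeping only keys with more than one value."""
    tally = defaultdict(Counter)
    for rec in records:
        key = tuple(rec.get(f, "") for f in key_fields)
        if all(key):  # skip records with a blank key field
            tally[key][rec.get(value_field, "")] += 1
    return {k: c for k, c in tally.items() if len(c) > 1}

records = ([{"genus": "Aus", "family": "Improbabilidae"}] * 63
           + [{"genus": "Aus", "family": "Probabilidae"}] * 17
           + [{"genus": "Dus", "family": "Probabilidae"}] * 4)

for key, families in one_to_many(records, ["genus"], "family").items():
    for family, n in families.most_common():
        print(n, *key, family, sep=" | ")
# 63 | Aus | Improbabilidae
# 17 | Aus | Probabilidae
```

Passing `["decimalLatitude", "decimalLongitude"]` as the key fields and `"stateProvince"` as the value field would give the third kind of tally. Note that a blank in the value field counts as a distinct entry here, so this sketch also picks up some missing-but-expected cases.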
Finding disagreements
With small Darwin Core datasets (not too many records and a small number of fields), it’s possible to check for field disagreements in a spreadsheet by sorting on one field, putting the field to be checked next to that first field (by freezing panes or hiding columns) and reading carefully down the paired columns.
For both small datasets and very large ones, it’s easier and faster to do the checking on the command line, and you are less likely to miss a disagreement. A useful prerequisite, though, is to do the field-by-field checks described above. In that last “one-to-many” example, the check would miss disagreements if some of the locality entries were “Babinda, 5 km W of”.
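As one possible way to script this kind of check, the Python sketch below reads a table and tallies every combination of entries in two chosen fields, much like reading down paired columns in a sorted spreadsheet. It assumes a tab-separated file with a header line of Darwin Core field names. The function name `tally_pairs` is hypothetical, and the sample rows reuse the Babinda entries from the table above with made-up record counts.

```python
import csv
import io
from collections import Counter

def tally_pairs(handle, field_a, field_b):
    """Count each (field_a, field_b) combination in a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter((row[field_a], row[field_b]) for row in reader)

# An in-memory stand-in for an open file, for illustration only
sample = io.StringIO(
    "locality\tdecimalLatitude\tdecimalLongitude\n"
    "5 km W of Babinda\t-31.8535\t146.4446\n"
    "5 km W of Babinda\t-31.8535\t146.4446\n"
    "5 km W of Babinda\t-31.6535\t146.4446\n"
)

pairs = tally_pairs(sample, "locality", "decimalLatitude")
for (locality, latitude), n in sorted(pairs.items()):
    print(n, locality, latitude, sep=" | ")
```

With a real file, the `io.StringIO` object would be replaced by something like `open(path, encoding="utf-8", newline="")`, where `path` is a placeholder for the table's location.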
Some field disagreements will be detected programmatically by the GBIF Data Validator and flagged as issues, such as “Country coordinate mismatch”, “Elevation min max swapped” and “Recorded date mismatch”. However, there are many more possible disagreements, and in the case of “one-to-many” disagreements GBIF does not provide context. The Validator is a good place to start, though!
Robert Mesibov (“datafixer”); robert.mesibov@gmail.com