When working with GBIF-mediated datasets I sometimes see patterns indicating that something is seriously wrong at the data publisher end.
A particularly striking pattern is in the Extant Specimen Records dataset of the National Museum of Natural History, Smithsonian Institution, in the USA. Out of 9,170,963 occurrence records (2023-05-25), GBIF’s processing found 1,019,406 records with “Recorded date invalid”.
That’s 1 out of every 9 records, 11.1%. To put that in perspective, here are the “Recorded date invalid” proportions from some other major collections that publish occurrences through GBIF:
Natural History Museum (UK) - 36,237 out of 5,168,436 records (0.7%)
American Museum of Natural History - 5,628 out of 1,807,301 records (0.3%)
Australian Museum (OZCAM dataset) - 423 out of 1,560,995 records (0.03%)
French National Herbarium (through MNHN) - 1,411 out of 5,559,078 records (0.025%)
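The percentages above are plain ratios of flagged records to total records. Here is a minimal Python sketch that recomputes them from the counts quoted in this post. The `counts` dictionary and the `flag_rate` helper are illustrative names of my own, not part of any GBIF tool.

```python
# Counts of "Recorded date invalid" records and total records, as quoted above.
counts = {
    "NMNH Extant Specimen Records": (1_019_406, 9_170_963),
    "Natural History Museum (UK)": (36_237, 5_168_436),
    "American Museum of Natural History": (5_628, 1_807_301),
    "Australian Museum (OZCAM dataset)": (423, 1_560_995),
    "French National Herbarium (through MNHN)": (1_411, 5_559_078),
}


def flag_rate(flagged: int, total: int) -> float:
    """Return the percentage of records that carry the flag."""
    return 100 * flagged / total


for name, (flagged, total) in counts.items():
    print(f"{name}: {flag_rate(flagged, total):.3f}%")
```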
To see what was happening I downloaded the “Recorded date invalid” Smithsonian records (DOI) and looked at the eventDate, verbatimEventDate, year, month, day and occurrenceRemarks fields in the original dataset (the eventRemarks and fieldNotes fields were empty). Here are the problems I found:
1. None of the 1,019,406 records has an eventDate entry, although every record has a valid year entry.
2. Both the month and day entries include the invalid “0” and “99”:
No. of records | month
354911 | 0
42810 | 1
41344 | 2
44297 | 3
54392 | 4
69053 | 5
71115 | 6
89651 | 7
79608 | 8
49627 | 9
44140 | 10
42392 | 11
35668 | 12
398 | 99
No. of records | day
1018871 | 0
3 | 9
532 | 99
3. There are many thousands of puzzling disagreements among the date fields, without explanation in occurrenceRemarks. The disagreements are very diverse, and I’ll show just four examples:
This plant record has the verbatimEventDate “10.4.2021”, but “1964” in year and “0” in month and day. Going to the record through the Smithsonian data portal, the Date collected entry is “1964 (10.4.2021)”, although the herbarium sheet label has “10.4.2021”.
This animal record has the verbatimEventDate “-- — -----” (not reproduced on the GBIF occurrence webpage), but “1880” in year and “0” in month and day. The Smithsonian record has “1880” for Date collected.
This plant record has “1986” in year, “0” in month and “9” in day, plus “244” in startDayOfYear (startDOY), “273” in endDayOfYear (endDOY) and nothing in verbatimEventDate. The Smithsonian record says the sample was collected in September 1986, and in 1986, 1 September was day 244 and 30 September was day 273, suggesting that the “9” in day is really the month.
This animal record also has no verbatimEventDate entry, with “2023” in year, “3” in month and “0” in day, with “90” in both startDOY and endDOY. Day 90 in 2023 was 31 March. (Smithsonian record here.)
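Checks like the ones in the last two examples are easy to automate. The following is a rough Python sketch, not the method I actually used. It assumes each record is a dictionary of strings keyed by Darwin Core field names, and the helper names (`in_range`, `doy_to_date`, `check_record`) are hypothetical. It flags out-of-range month and day values, and reports any day-of-year value that falls in a different month from the one in the month field.

```python
from datetime import date, timedelta
from typing import Dict, List


def in_range(value: str, lo: int, hi: int) -> bool:
    """True if value is a plain integer string between lo and hi inclusive."""
    return value.isascii() and value.isdigit() and lo <= int(value) <= hi


def doy_to_date(year: int, doy: int) -> date:
    """Convert a day of year (1 = 1 January) to a calendar date."""
    return date(year, 1, 1) + timedelta(days=doy - 1)


def check_record(rec: Dict[str, str]) -> List[str]:
    """Return a list of date problems found in one occurrence record."""
    problems = []
    year, month, day = (rec.get(k, "") for k in ("year", "month", "day"))
    if not in_range(month, 1, 12):
        problems.append(f"invalid month {month!r}")
    if not in_range(day, 1, 31):
        problems.append(f"invalid day {day!r}")
    # Compare day-of-year fields with month only when year and month are usable.
    # The upper limit of 9998 keeps day 366 inside Python's date range.
    if in_range(year, 1, 9998) and in_range(month, 1, 12):
        for field in ("startDayOfYear", "endDayOfYear"):
            doy = rec.get(field, "")
            if in_range(doy, 1, 366):
                implied = doy_to_date(int(year), int(doy))
                if implied.month != int(month):
                    problems.append(
                        f"{field} {doy} falls in month {implied.month}, not {month}"
                    )
    return problems


# The two records with day-of-year values described above.
plant = {"year": "1986", "month": "0", "day": "9",
         "startDayOfYear": "244", "endDayOfYear": "273"}
animal = {"year": "2023", "month": "3", "day": "0",
          "startDayOfYear": "90", "endDayOfYear": "90"}

print(doy_to_date(1986, 244), doy_to_date(1986, 273))
print(check_record(plant))
print(check_record(animal))
```

Note that a range check alone accepts the “9” in day for the plant record. Only the comparison with the Smithsonian's own record shows that it is misplaced.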
In processing the 1,019,406 “Recorded date invalid” records, GBIF has discarded all the year, month and day entries in the original dataset, including valid entries. The startDOY and endDOY entries have passed through the processing untouched, but their distribution is peculiar:
Day-of-year fields filled | No. of records
neither | 391816
startDayOfYear only | 2332
endDayOfYear only | 6282
both | 618976
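In 3,816 records the startDOY is later than the endDOY. That could be legitimate if the collecting period ran across a year boundary (for example, December to January), but many of these records don’t span more than one year. Instead, the use of startDOY and endDOY is confusing or in error, e.g. verbatimEventDate “Mar 1924 to 13 Mar 1924” with start “91” and end “73” (day 73 of 1924 was 13 March, but day 91 was 31 March), and “21 Mar 1911 to 22 Mar 1911” with start “90” and end “81” (22 March 1911 was day 81, but 21 March was day 80, not 90).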
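To show how such reversed ranges can be detected and checked against the verbatim dates, here is a short Python sketch. It assumes the day-of-year values arrive as strings, as they would from a tab-separated download, and the function names `day_of_year` and `start_after_end` are illustrative only.

```python
from datetime import date


def day_of_year(d: date) -> int:
    """Return the day of year (1 = 1 January), allowing for leap years."""
    return d.timetuple().tm_yday


def start_after_end(start_doy: str, end_doy: str) -> bool:
    """True when both values are integers and the start is later than the end."""
    if not (start_doy.isdigit() and end_doy.isdigit()):
        return False
    return int(start_doy) > int(end_doy)


# Second example above: verbatimEventDate "21 Mar 1911 to 22 Mar 1911".
expected = (day_of_year(date(1911, 3, 21)), day_of_year(date(1911, 3, 22)))
recorded = ("90", "81")
print("expected:", expected, "recorded:", recorded)
print("start later than end:", start_after_end(*recorded))
```

A reversed pair is not automatically an error, since a collecting event can run from December into January, so flagged records still need to be compared with verbatimEventDate.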
What lessons can be learned from the Smithsonian’s date problems?
The first, obviously, is that digitisation of specimen data should be done carefully and the results checked for errors. Digitising without checking is only doing half the job.
The second is to fill the eventDate field carefully, since GBIF requires it for occurrence datasets, and to use the correct (ISO 8601) format, for example “1986-09” for a specimen known only to have been collected in September 1986.
The third is that your collection management system (CMS) might insist that certain fields be filled with something even when the data are missing, as apparently has happened with the “0” and “99” entries in month and day, above. Please remember that Darwin Core is not a CMS. It’s a framework for sharing biodiversity data. You are free to correct, normalise and supplement what you share.
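As an illustration of the second and third lessons, here is one possible way to turn CMS-style year, month and day values into an ISO 8601 eventDate before sharing. This is a sketch under stated assumptions, not a recommendation for any particular CMS export: the function names `parse_part` and `build_event_date` are hypothetical, and out-of-range values such as “0” and “99” are assumed to mean “unknown”. When a part is unknown, the date is written at reduced precision (year only, or year and month).

```python
from datetime import date
from typing import Optional


def parse_part(value: str, lo: int, hi: int) -> Optional[int]:
    """Return value as an int if it lies in lo..hi, else None.

    Placeholders such as "0" and "99" fall outside the valid ranges
    for month and day, so they are treated as missing.
    """
    value = value.strip()
    if value.isascii() and value.isdigit() and lo <= int(value) <= hi:
        return int(value)
    return None


def build_event_date(year: str, month: str, day: str) -> str:
    """Build an ISO 8601 date string at the best precision the parts support."""
    y = parse_part(year, 1, 9999)
    if y is None:
        return ""
    m = parse_part(month, 1, 12)
    if m is None:
        # Without a month, a day value cannot be used.
        return f"{y:04d}"
    d = parse_part(day, 1, 31)
    if d is None:
        return f"{y:04d}-{m:02d}"
    try:
        return date(y, m, d).isoformat()
    except ValueError:
        # Impossible calendar date such as 30 February: fall back to year-month.
        return f"{y:04d}-{m:02d}"


for parts in [("1880", "0", "0"), ("2023", "3", "0"), ("1911", "3", "22")]:
    print(parts, "->", build_event_date(*parts))
```

A function like this only normalises what is already in the fields. It cannot resolve disagreements such as “1964” in year against “10.4.2021” in verbatimEventDate, which need a person to check the label.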
Robert Mesibov (“datafixer”); robert.mesibov@gmail.com