Trouble in the Smithsonian "date-abase"

When working with GBIF-mediated datasets I sometimes see patterns indicating that something is seriously wrong at the data publisher end.

A particularly striking pattern is in the Extant Specimen Records dataset of the National Museum of Natural History, Smithsonian Institution, in the USA. Out of 9,170,963 occurrence records (2023-05-25), GBIF’s processing found 1,019,406 records with “Recorded date invalid”.

That’s 1 in every 9 records (11.1%). To put that in perspective, here are the “Recorded date invalid” proportions from some other major collections that publish occurrences through GBIF:

Natural History Museum (UK) - 36,237 out of 5,168,436 records (0.7%)

American Museum of Natural History - 5,628/1,807,301 (0.3%)

Australian Museum (OZCAM dataset) - 423/1,560,995 (0.03%)

French National Herbarium (through MNHN) - 1,411/5,559,078 (0.025%)

To see what was happening I downloaded the “Recorded date invalid” Smithsonian records (DOI) and looked at the eventDate, verbatimEventDate, year, month, day and occurrenceRemarks fields in the original dataset (the eventRemarks and fieldNotes fields were empty). Here are the problems I found:

1. None of the 1,019,406 records has an eventDate entry, although every record has a valid year entry.

2. Both the month and day entries include the invalid “0” and “99”:

No. of records | month
354911 | 0
42810 | 1
41344 | 2
44297 | 3
54392 | 4
69053 | 5
71115 | 6
89651 | 7
79608 | 8
49627 | 9
44140 | 10
42392 | 11
35668 | 12
398 | 99

No. of records | day
1018871 | 0
3 | 9
532 | 99
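Placeholder values like these are easy to catch before publication. Here is a minimal sketch (the function names are mine, and it assumes a tab-separated Darwin Core export with month and day columns):

```python
import csv
from collections import Counter

def is_valid_month(value):
    """True for "1" through "12"; rejects "0", "99", blanks and non-numbers."""
    return value.isdigit() and 1 <= int(value) <= 12

def is_valid_day(value):
    """True for "1" through "31" (month-specific lengths not checked here)."""
    return value.isdigit() and 1 <= int(value) <= 31

def tally_bad_parts(path):
    """Tally the invalid month and day values in a tab-separated DwC export."""
    bad = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if not is_valid_month(row.get("month", "")):
                bad["month=" + row.get("month", "")] += 1
            if not is_valid_day(row.get("day", "")):
                bad["day=" + row.get("day", "")] += 1
    return bad
```

A check this simple, run before each upload, would have surfaced the “0” and “99” entries immediately.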

3. There are many thousands of puzzling disagreements among the date fields, without explanation in occurrenceRemarks. The disagreements are very diverse, and I’ll show just four examples:

This plant record has the verbatimEventDate “10.4.2021”, but “1964” in year and “0” in month and day. In the Smithsonian data portal, the record’s Date collected entry is “1964 (10.4.2021)”, although the herbarium sheet label has “10.4.2021”.

This animal record has the verbatimEventDate “-- — -----” (not reproduced on the GBIF occurrence webpage), but “1880” in year and “0” in month and day. The Smithsonian record has “1880” for Date collected.

This plant record has “1986” in year, “0” in month and “9” in day, plus “244” in startDayOfYear, “273” in endDayOfYear and nothing in verbatimEventDate. The Smithsonian record says the sample was collected in September 1986; in 1986, 1 September was day 244 and 30 September was day 273.

This animal record also has no verbatimEventDate entry; it has “2023” in year, “3” in month and “0” in day, and “90” in both startDayOfYear and endDayOfYear. Day 90 in 2023 was 31 March. (Smithsonian record here.)
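The day-of-year arithmetic in these examples is simple to verify programmatically; a minimal sketch:

```python
from datetime import date, timedelta

def day_of_year_to_date(year, doy):
    """Convert a 1-based ordinal day of the year to a calendar date."""
    return date(year, 1, 1) + timedelta(days=doy - 1)

# Matches the examples above:
# day_of_year_to_date(1986, 244) -> date(1986, 9, 1)
# day_of_year_to_date(1986, 273) -> date(1986, 9, 30)
# day_of_year_to_date(2023, 90)  -> date(2023, 3, 31)
```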

In processing the 1,019,406 “Recorded date invalid” records, GBIF has discarded all the year, month and day entries in the original dataset, including valid entries. The startDayOfYear and endDayOfYear entries have passed through the processing untouched, but their distribution is peculiar:

Fields filled | No. of records
neither | 391816
startDayOfYear only | 2332
endDayOfYear only | 6282
both | 618976

and in 3,816 records the startDayOfYear is later than the endDayOfYear. Contrary to expectation, many of these records don’t span more than one year. Instead, the use of startDayOfYear and endDayOfYear is confusing or in error, e.g. verbatimEventDate “Mar 1924 to 13 Mar 1924” with start “91” and end “73”, and “21 Mar 1911 to 22 Mar 1911” with start “90” and end “81”.
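Both the presence breakdown and the reversed ranges can be checked mechanically. A sketch, assuming the export has been read into dictionaries keyed by Darwin Core field name (function names are mine):

```python
from collections import Counter

def classify_presence(rows):
    """Tally which of startDayOfYear / endDayOfYear are filled,
    reproducing the four-way breakdown above."""
    tally = Counter()
    for row in rows:
        s = bool(row.get("startDayOfYear", "").strip())
        e = bool(row.get("endDayOfYear", "").strip())
        tally[{(False, False): "neither",
               (True, False): "startDayOfYear only",
               (False, True): "endDayOfYear only",
               (True, True): "both"}[(s, e)]] += 1
    return tally

def reversed_range(row):
    """True when both fields are numeric and start comes after end.
    A genuine start > end would suggest an event crossing a year
    boundary, which the records above show is rarely the case here."""
    s = row.get("startDayOfYear", "")
    e = row.get("endDayOfYear", "")
    return s.isdigit() and e.isdigit() and int(s) > int(e)
```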

What lessons can be learned from the Smithsonian’s date problems?

The first, obviously, is that digitisation of specimen data should be done carefully and the results checked for errors. Digitising without checking is only doing half the job.

The second is that it’s a good idea to fill the eventDate field carefully and in the correct format, as required for GBIF occurrence datasets.

The third is that although your CMS might insist that certain fields be filled with something even when the data are missing (as apparently happened with the “0” and “99” entries in month and day, above), Darwin Core is not a CMS. It’s a framework for sharing biodiversity data, and you are free to correct, normalise and supplement what you share.
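One defensible approach, sketched here under the assumption that “0” and “99” are placeholders for missing values, is to drop unusable parts and emit a partial ISO 8601 eventDate:

```python
def clean_part(value, lo, hi):
    """Return the value as an int if it is a plausible date part, else None."""
    return int(value) if value.isdigit() and lo <= int(value) <= hi else None

def build_event_date(year, month, day):
    """Assemble an ISO 8601 eventDate from year/month/day strings,
    dropping placeholders such as "0" and "99" rather than emitting
    an invalid date. Returns "YYYY", "YYYY-MM", "YYYY-MM-DD" or ""."""
    y = clean_part(year, 1000, 2100)
    if y is None:
        return ""
    m = clean_part(month, 1, 12)
    if m is None:
        return f"{y:04d}"
    d = clean_part(day, 1, 31)
    if d is None:
        return f"{y:04d}-{m:02d}"
    return f"{y:04d}-{m:02d}-{d:02d}"
```

A production exporter would also validate the day against the month’s length (this sketch would accept 31 February), but even this much would clear the “Recorded date invalid” flags for records with a valid year.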

Robert Mesibov (“datafixer”)


Hi Robert,

Have you contacted any of the POCs for the NMNH Extant Specimen Records to alert them of your findings? Seems like that would be the logical next step for improving this data.


Hi @sformel. GBIF found 1M+ records with “Recorded date invalid”, not me.

That issue and many others in this dataset are listed on the “metrics” page for the dataset and in the “Issues and flags” section on the left of the dataset’s occurrences page. The next logical step for any publisher to improve their data is to act on the issues flagged by GBIF, but that is the publisher’s responsibility. According to GBIF’s activity record (log of downloads), this particular dataset has been available for download since October 2016.

Ah, sorry if I was unclear. By your findings, I meant the additional analysis and lessons you shared, not the existence of the GBIF flags. Since you’ve taken the time to dig into those flags, it seemed like it would be productive to target your insight to the people who can do something about the data, in addition to sharing it here.

@sformel, I understand your suggestion, and of course there was more to my analysis than what I summarised above. The question is, would the Smithsonian data managers find it helpful? Please excuse the length of my answer - if you prefer, you could copy/paste this discussion to a new community forum topic.

I ask that question about the Smithsonian because I have now done 21 “scoping audits” of publicly available datasets from museums and herbaria around the world. In each case I reported the analysis to the responsible data managers or curators. These audits were done without charge, as opposed to the fee-for-service audits offered to collections by Pensoft Publishers.

You might find the responses to those 21 audit reports interesting. In a few cases there was no response at all. In all the remaining cases I was thanked, but almost all the responses warned that for a range of reasons (I’ve summarised the reasons here, here and here) it was unlikely that the problems could be fixed in the near term.

You might also be interested in the responses I’ve gotten from collections data managers when I simply point to the flags attached to their datasets on GBIF, ALA or VertNet, and ask “Is there a policy or plan in place to fix the flagged problems in future uploads?”. Every response I’ve received (my question is sometimes ignored) is a variation on “No”.

My concern with regard to your suggestion, based on my experience, is that the Smithsonian data managers would probably not find my analysis helpful, but just an annoying reminder that there is data needing fixes and no scope to do anything about it.

It’s easy to be disappointed with the state of publicly shared biodiversity data, but there are bright spots in the picture. GBIF data publishers who submit data paper manuscripts to Pensoft have their datasets audited without charge, and the data paper will not be published until the dataset problems are fixed. As of this week, more than 250 datasets in GBIF have been substantially improved through this process.

It’s also encouraging that GBIF shares so much of the data from citizen science projects, because the data quality in occurrence records from those projects is usually very high.

@datafixer Thanks for the thoughtful reply, and for sharing your blog posts; they were very enlightening. I understand your explanation and hesitance to contact the Smithsonian data managers, so I’ll respect that decision. We could probably debate the likelihood of action for a long time, although your experience suggests it’s unlikely. I guess I would counter that several of the excuses you’ve summarized are based on resources that are subject to change: people, funding, time, IT boundaries. Our feedback will rarely singlehandedly change these things, but I have had my annoying reminders turn into action before, usually coinciding with a change of personnel, funding, or government priorities. These occasional successes are enough to keep me optimistic about sending a note to the provider.

Hi, I am the main technical contact for the Smithsonian Extant Dataset. I will look into your work and see if there are ways I can improve our output. We do appreciate your extensive analysis. We publish updates to our datasets monthly. I will let you know if improvements are made and when you might see them.

@gambleb, good to see you here. I didn’t do an extensive analysis, I just checked date fields to see why you had so many “Recorded date invalid” flags. There are many other issues spotted by GBIF (see below) and I found additional problems with a “field disagreement” check. Have the Smithsonian collections found the GBIF issue flags useful in the past?

Yes, we do find the analysis helpful to a point. However, with the size of our dataset and our current resource levels, it is impossible to investigate all the issues. We have extensive historical data that has many issues. I am looking at this from basically a technical standpoint only: are we extracting the right fields and mapping them correctly? If that isn’t the issue, then I cannot say how or when the issues will be resolved. Sorry I don’t have a better answer for you than that.

@gambleb, many thanks for your answer. Would this be a fair restatement, then?:

“With limited resources, we cannot do data cleaning either in the CMS or in the export to DwC. We can only work on mapping CMS fields to DwC fields. If the mapping is logical and there are data problems in the DwC, then the problems arise in the CMS and they cannot be fixed in the near term.”

I’ve restated it this way because in this data management scenario it doesn’t matter whether the CMS data problems are historical or current. Please correct me if the restatement doesn’t summarise the situation at the Smithsonian.

@datafixer We actually can only work on correcting existing mappings of CMS or DwC fields. We do hope to improve and expand our mapping sometime soon but are waiting on some resource increases to do that work. Regarding the date issue: we very much appreciate you bringing this to our attention, and sorry we didn’t catch it ourselves. It turned out that the 0 and 99 values were a coding element inserted into our process of moving data from our production CMS database to our public-facing search database. We have corrected this and are now passing null values where only partial date values are present. We are also adding eventDate into our dataset. The day and month corrections will be present in our next data upload on July 2nd. I can’t promise eventDate will be included in that update; at the latest it will be in our August update.

@gambleb Many thanks for the good news. However, I again point out that problems in your dataset were discovered by GBIF and flagged publicly as issues. I only looked at one of these issues. An additional date issue that would have been flagged if you had eventDate filled in is “recorded date mismatch” (see GBIF Issues & Flags - GBIF Data Blog).
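For the curious, a mismatch check of that kind is easy to emulate. This is a rough sketch of the idea, not GBIF’s actual implementation:

```python
from datetime import date

def recorded_date_mismatch(event_date, year, month, day):
    """True when a full ISO eventDate ("YYYY-MM-DD") disagrees with the
    separate year/month/day fields, analogous to GBIF's flag."""
    try:
        parsed = date.fromisoformat(event_date)
    except ValueError:
        return False  # unparseable dates belong under a different flag
    return ((year.isdigit() and int(year) != parsed.year)
            or (month.isdigit() and int(month) != parsed.month)
            or (day.isdigit() and int(day) != parsed.day))
```

Running something like this locally before each upload would catch disagreements between eventDate and the atomised date fields before GBIF does.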

I can only encourage you to look to the GBIF issue flags whenever you do an update, as guides to what should be done at DwC-export or CMS level to improve the quality of the data that the Smithsonian shares with the world. If data publishers don’t look at the flags, we can ask whether GBIF should bother generating them.

