A modest proposal for the NHM

The Essex (UK) naturalist John Ray began keeping biological records in the mid-1600s. The centuries-old tradition has been continued by the Natural History Museum (London), which shares 5.5 million specimen records with GBIF.

However, a biological record

is essentially a point on a map showing you that a certain species/organism was found at that location by someone on a certain date (here)

and not all of the NHM’s shared specimen entries have the “what”, “where”, “when” and “by whom” of a usable biological record. From a download of the Darwin Core source archive on 2024-09-13, I checked just the “what” and “where” fields provided in the occurrence.csv file, namely

[Image: list of the “what” and “where” fields checked]

The table below shows totals for records with nothing in those grouped fields in NHM’s botany, entomology, paleobiology and zoology collections.

| Collection | No “what” | No “where” | No “what” and no “where” |
|---|---|---|---|
| BOT | 27736 | 301987 | 26371 |
| ENT | 28668 | 796118 | 6609 |
| PAL | 206676 | 199030 | 155717 |
| ZOO | 185463 | 335728 | 150746 |
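For readers who want to run a similar tally on their own download, here is a minimal Python sketch. It is not necessarily the method used for the table above. The field groupings in `WHAT_FIELDS` and `WHERE_FIELDS` are assumptions chosen for illustration (the original list of fields is not reproduced here), as are the use of `collectionCode` to separate the collections and the comma-separated layout of occurrence.csv. “Blank” is taken strictly as an empty entry.

```python
import csv
import io
from collections import Counter

# ASSUMED field groupings, for illustration only; substitute the fields you care about.
WHAT_FIELDS = ["scientificName", "kingdom", "phylum", "class", "order", "family", "genus"]
WHERE_FIELDS = ["country", "locality", "decimalLatitude", "decimalLongitude"]


def all_blank(row, fields):
    """True if every listed field is missing from the row or is an empty string."""
    return all((row.get(f) or "") == "" for f in fields)


def tally_blanks(handle):
    """Count, per collectionCode, the records with no 'what', no 'where', and neither."""
    counts = {}
    for row in csv.DictReader(handle):
        c = counts.setdefault(row.get("collectionCode") or "", Counter())
        no_what = all_blank(row, WHAT_FIELDS)
        no_where = all_blank(row, WHERE_FIELDS)
        c["no_what"] += no_what            # True counts as 1, False as 0
        c["no_where"] += no_where
        c["no_both"] += no_what and no_where
    return counts


# Toy data standing in for occurrence.csv (made up, not real NHM records).
sample = io.StringIO(
    "collectionCode,scientificName,kingdom,country,locality\n"
    "BOT,,,,\n"
    "BOT,Quercus robur,Plantae,,\n"
    "ZOO,,,United Kingdom,Essex\n"
)
result = tally_blanks(sample)
```

With a real file, something like `with open("occurrence.csv", newline="", encoding="utf-8") as f: tally_blanks(f)` would take the place of the toy `sample`. A strict empty-string test keeps the counts conservative, in line with treating them as minimum estimates.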

By “nothing” I mean a blank, no entry at all, and the numbers above are minimum estimates for usability. For example, 3112 botany records have the invalid entry “Flowering plant” in the scientificName field and a blank in all the other “what” fields. Furthermore, many of the NHM records with no “what” or “where” are skeletal, like the zoology record shown below

(You can view this record at the NHM Data Portal here.)

According to the current definitions for the Minimum Information about a Digital Specimen (MIDS), a skeletal record is at MIDS level 0 and has no scientific value, but is useful for digital cataloguing:

A bare or skeletal record making the association between an identifier of a physical specimen and its digital representation, allowing for unambiguous attachment of all other information.

The no-“what” records tallied above don’t even seem to reach MIDS level 1, at which there should be a name:

A name given to the object. Any string of characters and/or numbers by which the object is referenced within a collection. For example, the name the specimen is stored under, its scientific or taxonomic name if known, how it is labelled, etc. This name is not necessarily its name according to an accepted scientific classification, identification, or taxonomic determination (i.e., scientific name) but it often can be the same as that.

As I noted in a previous forum post, NHM isn’t alone in publishing “what”-less records, and a comment after that post from a USA data publisher suggests that such records are “placeholders”. The argument is that more information will be added in future, and that users who aren’t interested in unusable records can just ignore them.

It’s hard not to conclude that publishers of these unusable records are sharing the Darwin Core version of whatever happens to be in their CMS, with no filtering for usability at the publisher end. There’s also no filtering by GBIF, although GBIF adds issue flags to assist end-users, such as “Taxon match none”.

Do end-users derive any benefit from unusable records? I doubt it. Publishers, on the other hand, gain a performance credit when they mobilise and share N records, whether the records are usable or not.

NHM has a programme to digitise 80 million items in its collections. I might modestly propose that they simply assign “placeholder” IDs to the next 74.5M items and share them with GBIF. This would complete the programme in the short term and missing information could be added to the records in future.


Robert Mesibov (“datafixer”); robert.mesibov@gmail.com

The title of this post is an homage to Jonathan Swift’s A Modest Proposal… (1729), a satirical essay. I don’t really think the NHM should share 74.5M skeletal records with GBIF.

The NHM currently shares 293,703 unusable MIDS 0 records, or about 5% of their 5.5M total. As with the “no-what, no-where” record totals, this is a minimum estimate of (un)usability, because many of the DwC fields that should be filled with something sensible in the NHM’s MIDS 1 records contain only punctuation characters.
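A stricter usability check would treat punctuation-only entries as empty. The following is a minimal Python sketch of such a test, not the check used to get the figures above. The function name `is_punctuation_only` is hypothetical, and it recognises only ASCII punctuation (Python’s `string.punctuation`), which is an assumption about what the affected fields contain.

```python
import string


def is_punctuation_only(value):
    """True if value, after trimming surrounding whitespace, is non-empty
    and consists solely of ASCII punctuation characters."""
    stripped = value.strip()
    return bool(stripped) and all(ch in string.punctuation for ch in stripped)
```

For example, `is_punctuation_only("?")` and `is_punctuation_only(" -- ")` are both true, while an empty string and `"sp. 1"` are not. A value with whitespace between punctuation marks, such as `"- -"`, is not caught by this version.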

The NHM apparently thinks this is OK, and from the lack of comment on this post I worry that no one who follows this forum, including GBIF staff, is prepared to disagree. Would anyone like to suggest a threshold for “overall usability” and a justification for it? If 5% unusable is OK, how about 10%? 20%? 50%?

I admit, I had a chuckle at your Modest Proposal. I also do not think NHM should share 74.5M skeletal records. If it did however, I wonder (a) how long it would take for anyone to notice, and (b) whether it would trigger a call to action. The dialogue about dwc:basisOfRecord and the paralysis over what to do about it is one such example where aggregators might be waiting for the community to reach consensus. If some tipping point in the proportion of unusable records did prompt aggregators to alter their processing and subsequent indexing routines to exclude such records (presumably in response to an outcry), might this result in a chilling effect on publishers who are their core clients? In effect, aggregators would eat their young. Anecdotally, there are organizations that remain slow to publish their occurrence data because they grapple with how to implement internal policies, have persistent technical hurdles to overcome, or other issues. If basic data quality filters were applied post-publication to exclude unusable records, might this be identified by them as yet another reason to be fearful about the quality of their data and elect not to participate?

@dshorthouse, I’ve heard that argument before, and I see it as primarily an aggregator’s concern where aggregators regard collections as their primary suppliers. That might have been true 20 years ago, but it isn’t today, because fewer than 10% of shared aggregated records come from collections and the percentage is dropping.

If collections want to remain relevant as biodiversity data publishers, they need to “implement internal policies” and overcome “persistent technical hurdles”. Aggregators could help.

An aggregator’s alternative is to sign up collections publishers in advance of aggregation. This is what DiSSCo has done, and here is the current result:

[Image: screenshot taken 2024-09-23]

DiSSCo’s current answer to my question is “23% is OK”.

UPDATE: In a blog post I’ve explained the method I used to get the numbers shown here, and have also pointed out a minor error (“796118” should be “796119”).