The Essex (UK) naturalist John Ray began keeping biological records in the mid-1600s. The centuries-old tradition has been continued by the Natural History Museum (London), which shares 5.5 million specimen records with GBIF.
However, a biological record “is essentially a point on a map showing you that a certain species/organism was found at that location by someone on a certain date”, and not all of the NHM’s shared specimen entries have the “what”, “where”, “when” and “by whom” of a usable biological record. From a download of the Darwin Core source archive on 2024-09-13, I checked just the “what” and “where” fields provided in the occurrence.csv file, namely
The table below shows totals for records with nothing in those grouped fields in NHM’s botany, entomology, paleobiology and zoology collections.
Collection | No “what” | No “where” | No “what” and no “where” |
---|---|---|---|
BOT | 27736 | 301987 | 26371 |
ENT | 28668 | 796118 | 6609 |
PAL | 206676 | 199030 | 155717 |
ZOO | 185463 | 335728 | 150746 |
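A tally like the one in the table could be produced along the following lines. This is a minimal Python sketch, not the method actually used for the table: the field names in `WHAT_FIELDS` and `WHERE_FIELDS` are an assumed selection of Darwin Core terms, grouping by `collectionCode` is an assumption, and occurrence.csv is assumed to be comma-separated UTF-8 with a header row.

```python
import csv
from collections import Counter

# Assumed groupings of Darwin Core terms; adjust to the fields actually checked.
WHAT_FIELDS = ["scientificName", "kingdom", "phylum", "class", "order",
               "family", "genus", "specificEpithet"]
WHERE_FIELDS = ["decimalLatitude", "decimalLongitude", "locality",
                "country", "stateProvince", "waterBody"]


def all_blank(row, fields):
    """True if every listed field is missing, empty or whitespace-only."""
    return all(not (row.get(name) or "").strip() for name in fields)


def tally_blanks(rows, group_field="collectionCode"):
    """Count records per collection with no "what", no "where", and neither."""
    counts = {"no_what": Counter(), "no_where": Counter(), "no_both": Counter()}
    for row in rows:
        collection = row.get(group_field) or "(none)"
        no_what = all_blank(row, WHAT_FIELDS)
        no_where = all_blank(row, WHERE_FIELDS)
        if no_what:
            counts["no_what"][collection] += 1
        if no_where:
            counts["no_where"][collection] += 1
        if no_what and no_where:
            counts["no_both"][collection] += 1
    return counts


def tally_file(path):
    """Stream an occurrence.csv row by row rather than loading it whole."""
    with open(path, newline="", encoding="utf-8") as handle:
        return tally_blanks(csv.DictReader(handle))
```

Calling `tally_file("occurrence.csv")` returns three counters keyed by collection code. As in the table, the no-“what” and no-“where” counts include the records that have neither.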
By “nothing” I mean a blank, no entry at all, so the numbers above are minimum estimates of how many records are unusable. For example, 3112 botany records have the invalid entry “Flowering plant” in the scientificName field and a blank in all the other “what” fields. Because scientificName is not blank, those records are not counted as no-“what” in the table. Furthermore, many of the NHM records with no “what” or “where” are skeletal, like the zoology record shown below
(You can view this record at the NHM Data Portal here.)
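The “Flowering plant” case mentioned earlier shows why blank counts understate the problem. A follow-up check for such entries might look like the sketch below. The list of other “what” fields is an assumption for illustration, and only the string “Flowering plant” comes from the data discussed here.

```python
# Only "Flowering plant" comes from the post; extend this set as needed.
PLACEHOLDER_NAMES = {"flowering plant"}

# Assumed "what" fields other than scientificName (Darwin Core terms).
OTHER_WHAT_FIELDS = ["kingdom", "phylum", "class", "order",
                     "family", "genus", "specificEpithet"]


def is_placeholder_only(row):
    """True if scientificName is a known placeholder and nothing else names the taxon."""
    # Matching is case-insensitive and ignores surrounding whitespace.
    name = (row.get("scientificName") or "").strip().lower()
    others_blank = all(not (row.get(f) or "").strip() for f in OTHER_WHAT_FIELDS)
    return name in PLACEHOLDER_NAMES and others_blank


def count_placeholder_only(rows):
    """Count records whose only "what" entry is a placeholder string."""
    return sum(1 for row in rows if is_placeholder_only(row))
```

Records caught this way would be added to the no-“what” totals rather than replacing them, since they have a non-blank scientificName.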
According to the current definitions for the Minimum Information about a Digital Specimen (MIDS), a skeletal record is at MIDS level 0 and has no scientific value, but is useful for digital cataloguing:
A bare or skeletal record making the association between an identifier of a physical specimen and its digital representation, allowing for unambiguous attachment of all other information.
The no-“what” records tallied above don’t even seem to reach MIDS level 1, at which there should be a name:
A name given to the object. Any string of characters and/or numbers by which the object is referenced within a collection. For example, the name the specimen is stored under, its scientific or taxonomic name if known, how it is labelled, etc. This name is not necessarily its name according to an accepted scientific classification, identification, or taxonomic determination (i.e., scientific name) but it often can be the same as that.
As I noted in a previous forum post, NHM isn’t alone in publishing “what”-less records, and a comment after that post from a USA data publisher suggests that such records are “placeholders”. On that view, more information will be added in future, and users who aren’t interested in unusable records can just ignore them.
It’s hard not to conclude that publishers of these unusable records are sharing the Darwin Core version of whatever happens to be in their CMS, with no filtering for usability at the publisher end. There’s also no filtering by GBIF, although GBIF adds issue flags to assist end-users, such as “Taxon match none”.
Do end-users derive any benefit from unusable records? I doubt it. Publishers, on the other hand, gain a performance credit when they mobilise and share N records, whether the records are usable or not.
NHM has a programme to digitise 80 million items in its collections. I might modestly propose that they simply assign “placeholder” IDs to the next 74.5M items and share them with GBIF. This would complete the programme in the short term and missing information could be added to the records in future.
Robert Mesibov (“datafixer”); robert.mesibov@gmail.com