A modest proposal for the NHM

The Essex (UK) naturalist John Ray began keeping biological records in the mid-1600s. This centuries-old tradition has been continued by the Natural History Museum, London (NHM), which shares 5.5 million specimen records with GBIF.

However, a biological record

“is essentially a point on a map showing you that a certain species/organism was found at that location by someone on a certain date” (here)

and not all of the NHM’s shared specimen entries have the “what”, “where”, “when” and “by whom” of a usable biological record. From a download of the NHM’s Darwin Core source archive on 2024-09-13, I checked just the grouped “what” and “where” fields provided in the occurrence.csv file.

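A check along these lines is easy to script. The sketch below is illustrative only and is not the exact script behind the table that follows; the field groupings in it (scientificName as the “what”; decimalLatitude, decimalLongitude and locality as the “where”) and the comma-delimited format are assumptions for the sake of the example.

```python
# Illustrative sketch only: tally occurrence.csv records with blank "what"
# and/or "where" fields, grouped by collectionCode. The field choices and
# the comma-delimited format are assumptions, not the exact groupings
# behind the table below.
import csv
from collections import defaultdict

WHAT = ["scientificName"]                                    # assumed "what" fields
WHERE = ["decimalLatitude", "decimalLongitude", "locality"]  # assumed "where" fields

def all_blank(row, fields):
    """True if every listed field is missing or contains only whitespace."""
    return all(not (row.get(f) or "").strip() for f in fields)

tallies = defaultdict(lambda: {"no_what": 0, "no_where": 0, "neither": 0})

with open("occurrence.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        coll = (row.get("collectionCode") or "").strip() or "UNKNOWN"
        no_what, no_where = all_blank(row, WHAT), all_blank(row, WHERE)
        if no_what:
            tallies[coll]["no_what"] += 1
        if no_where:
            tallies[coll]["no_where"] += 1
        if no_what and no_where:
            tallies[coll]["neither"] += 1

for coll, counts in sorted(tallies.items()):
    print(coll, counts["no_what"], counts["no_where"], counts["neither"])
```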

The table below shows totals for records with nothing in those grouped fields in NHM’s botany, entomology, paleobiology and zoology collections.

| Collection | No “what” | No “where” | No “what” and no “where” |
| --- | --- | --- | --- |
| BOT | 27736 | 301987 | 26371 |
| ENT | 28668 | 796118 | 6609 |
| PAL | 206676 | 199030 | 155717 |
| ZOO | 185463 | 335728 | 150746 |

By “nothing” I mean a blank, no entry at all, so the numbers above are minimum estimates of unusable records. For example, 3112 botany records have the invalid entry “Flowering plant” in the scientificName field and a blank in all the other “what” fields. Furthermore, many of the NHM records with no “what” or “where” are skeletal, like the zoology record shown below:

(You can view this record at the NHM Data Portal here.)

According to the current definitions for the Minimum Information about a Digital Specimen (MIDS), a skeletal record is at MIDS level 0 and has no scientific value, but is useful for digital cataloguing:

A bare or skeletal record making the association between an identifier of a physical specimen and its digital representation, allowing for unambiguous attachment of all other information.

The no-“what” records tallied above don’t even seem to reach MIDS level 1, at which there should be a name:

A name given to the object. Any string of characters and/or numbers by which the object is referenced within a collection. For example, the name the specimen is stored under, its scientific or taxonomic name if known, how it is labelled, etc. This name is not necessarily its name according to an accepted scientific classification, identification, or taxonomic determination (i.e., scientific name) but it often can be the same as that.

As I noted in a previous forum post, NHM isn’t alone in publishing “what”-less records, and a comment on that post from a USA data publisher suggests that such records are “placeholders”: more information will be added in future, and users who aren’t interested in unusable records can simply ignore them.

It’s hard not to conclude that publishers of these unusable records are sharing the Darwin Core version of whatever happens to be in their CMS, with no filtering for usability at the publisher end. There’s also no filtering by GBIF, although GBIF adds issue flags to assist end-users, such as “Taxon match none”.
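
End-users can at least see and filter on those flags. As an illustration (not part of the checks reported here), the public GBIF occurrence search API accepts an issue parameter, so flagged records in a dataset can be counted in a few lines of Python; the datasetKey below is a placeholder, not a real key.

```python
# Illustrative sketch: count records in one GBIF dataset carrying the
# TAXON_MATCH_NONE issue flag, via the public occurrence search API.
# DATASET_KEY is a hypothetical placeholder, not a key taken from this post.
import requests

DATASET_KEY = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

resp = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={"datasetKey": DATASET_KEY, "issue": "TAXON_MATCH_NONE", "limit": 0},
    timeout=60,
)
resp.raise_for_status()
print("Records flagged TAXON_MATCH_NONE:", resp.json()["count"])
```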

Do end-users derive any benefit from unusable records? I doubt it. Publishers, on the other hand, gain a performance credit when they mobilise and share N records, whether the records are usable or not.

NHM has a programme to digitise 80 million items in its collections. I might modestly propose that they simply assign “placeholder” IDs to the next 74.5M items and share them with GBIF. This would complete the programme in the short term, and the missing information could be added to the records in future.


Robert Mesibov (“datafixer”); robert.mesibov@gmail.com

The title of this post is an homage to Jonathan Swift’s A Modest Proposal… (1729), a satirical essay. I don’t really think the NHM should share 74.5M skeletal records with GBIF.

The NHM currently shares 293,703 unusable MIDS 0 records, or about 5% of their 5.5M total. As with the “no-what, no-where” record totals, this is a minimum estimate of (un)usability, because many of the DwC fields that should be filled with something sensible in the NHM’s MIDS 1 records contain only punctuation characters.
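
Screening for that kind of junk entry is straightforward. The snippet below is a minimal sketch (not the script used for the numbers above) that counts, per Darwin Core field, the values in occurrence.csv consisting entirely of punctuation and/or whitespace.

```python
# Minimal sketch: count field values in occurrence.csv that consist only of
# punctuation and/or whitespace, reported per field.
import csv
import string
from collections import Counter

PUNCT_OR_SPACE = set(string.punctuation + string.whitespace)

junk_counts = Counter()
with open("occurrence.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for field, value in row.items():
            if value and all(ch in PUNCT_OR_SPACE for ch in value):
                junk_counts[field] += 1

for field, n in junk_counts.most_common():
    print(f"{field}\t{n}")
```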

The NHM apparently thinks this is OK, and from the lack of comment on this post I worry that no one who follows this forum, including GBIF staff, is prepared to disagree. Would anyone like to suggest a threshold for “overall usability” and a justification for it? If 5% unusable is OK, how about 10%? 20%? 50%?

I admit, I had a chuckle at your Modest Proposal. I also do not think NHM should share 74.5M skeletal records. If it did, however, I wonder (a) how long it would take for anyone to notice, and (b) whether it would trigger a call to action. The dialogue about dwc:basisOfRecord, and the paralysis over what to do about it, is one example of aggregators waiting for the community to reach consensus. If some tipping point in the proportion of unusable records did prompt aggregators to alter their processing and subsequent indexing routines to exclude such records (presumably in response to an outcry), might this have a chilling effect on publishers, who are their core clients? In effect, aggregators would eat their young. Anecdotally, there are organizations that remain slow to publish their occurrence data because they grapple with how to implement internal policies, have persistent technical hurdles to overcome, or face other issues. If basic data-quality filters were applied post-publication to exclude unusable records, might they identify this as yet another reason to be fearful about the quality of their data and elect not to participate?

@dshorthouse, I’ve heard that argument before, and I see it as primarily a concern for aggregators that regard collections as their primary suppliers. That might have been true 20 years ago, but it isn’t today: less than 10% of shared aggregated records come from collections, and the percentage is dropping.

If collections want to remain relevant as biodiversity data publishers, they need to “implement internal policies” and overcome “persistent technical hurdles”. Aggregators could help.

An aggregator’s alternative is to sign up collections publishers in advance of aggregation. This is what DiSSCo has done, and here is the current result:

[Screenshot of the current DiSSCo result, 2024-09-23]

DiSSCo’s current answer to my question is “23% is OK”.

UPDATE: In a blog post I’ve explained the method I used to get the numbers shown here, and have also pointed out a minor error (“796118” should be “796119”).

Not sure if this is helpful, but I know of at least one case where a herbarium fast-digitized 1M records with no georeferences and unprocessed OCR of names, deliberately generating a tall pile of “useless” records as a PR stage. That “useless” million was then highly publicized through press releases etc., the fame was monetized, and the funds are being used for stepwise georeferencing by hired professionals and volunteers, which takes years and is still ongoing. So my question is: is the state of the records just a stage, or the final state?

Is there a final state? Similarly, does any specimen-based data remain static, as the metadata of a peer-reviewed, published article does once produced? I suspect it’s more like an attenuation to a generally stable state (e.g. MIDS and completeness in accordance with Darwin Core or ABCD), with stochastic jumps when new technologies are applied or new taxonomic work is executed. It’d be an interesting exercise to quantify the flux on every Darwin Core term, presented as sparklines of edit frequency relative to their initial, populated state.


@dschigel, I also don’t know if that’s helpful, because you haven’t said whether or not the herbarium records were shared with an aggregator. There’s no problem if the useless million stayed in the herbarium’s CMS. There’s a serious problem if the useless million were shared with (for example) GBIF as occurrence records, because they aren’t occurrence records.

These were shared with GBIF almost immediately after the initial scans and keep improving, especially with georeferences, all in front of users, version after version: https://doi.org/10.15468/cpnhcc. The dataset currently reports 73% of records with coordinates, but this was certainly different when the data were rolled out; the number of occurrences grows more slowly than the percentage with coordinates. I am a big fan of early exposure and gradual improvement versus long perfection before release, but of course there are data publishers who never come back for fixes: it is a matter of scale and of attitude. I would imagine that collections, especially large collections, naturally take a more curatorial, gardening-like approach to data improvement than smaller-scale publishers and contacts such as authors of data papers or project-funded publishers. A manic-depressive project mentality is often a killer for responsible data maintenance, and so are academic mobility, short contracts, etc., but established collections such as NHM and MW, even when project-funded, have long-term responsibility, so I am not worried. In other words, it is hard to expect a graduating PhD student to maintain their own datasets 10-20 years after the initial publication, but data exposure reveals data imperfections, and large institutions have a good chance to react. In fact, I see your original post here as confirmation of this: would we be having this discussion if the only access to the data in question were through the CMS?

@dschigel, please do not confuse data improvement (for example, fixing errors, updating taxon names) with gradually adding data when there was no data to begin with.

If you see GBIF as a repository for the world’s CMSes, then there is nothing at all problematic about NHM sharing 74.5M skeletal records. If you see GBIF as a repository for usable information about the world’s biodiversity, then 1M+ useless herbarium records should never have been shared.

GBIF has changed over the years. It started with a focus on sharing specimen information from collections:

Edwards JL, Lane MA, Nielsen ES (2000) Interoperability of biodiversity databases: biodiversity information on every desktop. Science 289: 2312-2314.

Collections records are now just a minor and steadily shrinking proportion of GBIF’s occurrence records - overwhelmingly, most records are human observations - and GBIF has expanded beyond just occurrence records. I would have thought, however, that GBIF’s overarching aim is to make information available for biological research, for conservation planning and for other activities that require usable data. Skeletal records do not contain usable data.

Like you, I am not worried but for a different reason. The vast bulk of GBIF data is now based on contemporary observations of very high quality. For conservation and policy-making these observations are the resource of greatest value. The low-quality and even skeletal data shared by so many collections, reflecting mainly past states of the world’s biodiversity, has niche uses that will become harder and harder to justify in coming years.
