In 2004, GBIF commissioned Arthur Chapman to produce three papers, and all three are still freely available online through GBIF:
- Chapman AD (2005) Principles and Methods of Data Cleaning – Primary Species and Species-Occurrence Data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen.
- Chapman AD (2005) Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen.
- Chapman AD (2005) Uses of Primary Species-Occurrence Data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen.
The first two papers were primarily addressed to institutions holding collections. The tsunami of citizen-science observations had not yet arrived, and museums and herbaria were GBIF’s major data publishers.
In both of the first two papers Chapman considered the connection between the reputation of an institution and the quality of its data:
“There is no such thing as good quality data or bad quality data (Chapman 2005a). Data are data, and their use will determine their quality. Nevertheless, data providers need to ensure that the data are as free from error as it is possible to make them… In this period of increasing data and information exchange, the reputation of a collection’s institution is likely to hinge on the quality and availability of its information (Redman 1996, Dalcin 2004), rather than on the quality of its scientists, as has been the case in the past. This is a fact of life, and the two can no longer be separated. Good data and information management must run side by side with good science and together they should lead to good data and information.” (…Data Cleaning…, p. 62)
“Can and should species-occurrence data be certified? With increased data becoming available from many agencies, users want to know which institutions they can rely on, and which follow documented quality control procedures. Should they just rely on well-known institutions, or are there lesser-known institutions also with reliable data? What data available from the better-known institutions are reliable and which aren’t? Reputation alone can be the deciding factor on where a user may source their data but reputation is a subjective concept and is a fragile character on which to base actions and decisions (Dalcin 2004). Is this what we want in our discipline? The development of agreed quality certification could lead to an improvement in overall data quality and to increased certainty among users on the value of the data.” (…Principles…, pp. 48-49)
A lot has happened in the 18 years since Chapman’s reports were published. Below I list what I see as the most interesting trends from the point of view of natural history collections (NHCs). Comments very welcome.
NHCs have become minor players in GBIF. Human observations now account for five out of every six occurrence records mediated by GBIF. Preserved specimens make up less than 10%. Citizen-science observations are growing rapidly and cheaply from an enormous base, while digitisation of museum and herbarium specimens is still a slow and expensive process. The “10%” will shrink further.
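Readers can check such proportions for themselves. The following is only an illustrative sketch: the endpoint and the facet parameter are part of GBIF’s documented public occurrence API, but the script and its formatting are mine.

```python
import json
from urllib.request import urlopen

# Ask the public GBIF occurrence API for record counts by basisOfRecord.
# 'limit=0' returns only the total count and the requested facet counts.
url = "https://api.gbif.org/v1/occurrence/search?limit=0&facet=basisOfRecord"
with urlopen(url) as resp:
    result = json.load(resp)

total = result["count"]
for facet in result["facets"]:
    for item in facet["counts"]:
        share = 100 * item["count"] / total
        print(f"{item['name']:<25} {item['count']:>14,} ({share:.1f}%)")
```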
NHCs have remained in quality paralysis. For a range of reasons, the quality of occurrence data shared by NHCs with GBIF has hardly improved, despite the efforts of GBIF and other aggregators to flag at least some of the issues needing to be fixed. Also for a range of reasons, the quality of citizen-science records remains high. It is unfortunate that 1 million out of 9 million Smithsonian records have no eventDate in their Darwin Core versions, but it is encouraging that zero of the 1 billion eBird sighting records shared with GBIF have an invalid, unlikely or mismatched eventDate (here).
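To make “no eventDate” and “invalid eventDate” concrete, here is a minimal sketch of the simplest such check on a Darwin Core occurrence file. The file name is an assumption, and real Darwin Core eventDate values (date ranges, year-only or year-month dates) would need more careful handling than this.

```python
import csv
from datetime import date

# Assumption: 'occurrence.txt' is a tab-separated Darwin Core occurrence
# file with an 'eventDate' column, as in a GBIF Darwin Core Archive.
missing = malformed = total = 0

with open("occurrence.txt", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        total += 1
        value = (row.get("eventDate") or "").strip()
        if not value:
            missing += 1
        else:
            # Accept a leading ISO 8601 calendar date (YYYY-MM-DD).
            # Legal reduced-precision dates (e.g. '2005-01') and other
            # Darwin Core forms would be flagged here and need extra logic.
            try:
                date.fromisoformat(value[:10])
            except ValueError:
                malformed += 1

print(f"{total} records: {missing} missing eventDate, {malformed} malformed")
```

Aggregators run far more elaborate versions of this check; the point is that even the simplest version is cheap for a publisher to run before sharing data.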
NHC reputation has become tied to data volume, not quality. Institutions enthusiastically publicise their digitisation programs, but only a few NHCs celebrate their quality control efforts. An NHC can earn bragging rights (“So excited to be sharing 43000 of our records with GBIF!” on Twitter) simply by mobilising a set of digital records, of unspecified quality.
No NHC certification scheme has appeared. Users typically filter occurrence records according to research requirements (e.g., records after 1950) or selected quality criteria (e.g., records with spatial uncertainty less than 10 km). At least one aggregator, the Atlas of Living Australia, filters out by default any occurrence records with selected problems. In general, users rely for quality assessment on the flags attached to individual records, and on the results of their own quality tests. An NHC data certification scheme doesn’t have a place in this picture.
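As a sketch of that user-side workflow, here is the kind of filter a researcher might apply to a downloaded dataset. The file name is an assumption; year, coordinateUncertaintyInMeters and issue are columns in a “simple” GBIF occurrence download.

```python
import pandas as pd

# Assumption: 'occurrence.txt' is a tab-separated "simple" GBIF download.
occ = pd.read_csv("occurrence.txt", sep="\t", low_memory=False)

# Research requirement: records after 1950.
# Quality criterion: spatial uncertainty under 10 km.
keep = (occ["year"] > 1950) & (occ["coordinateUncertaintyInMeters"] < 10_000)

# Also drop records that GBIF has flagged with a serious geospatial issue.
keep &= ~occ["issue"].fillna("").str.contains("ZERO_COORDINATE")

filtered = occ[keep]
print(f"{len(filtered)} of {len(occ)} records pass the filters")
```

Note that this filter silently discards records with no stated coordinate uncertainty at all, which is itself a quality decision the user, not the publisher, is making.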
Aggregators have applied no pressure on NHCs to improve data quality. Poor-quality datasets are not sandboxed by GBIF and their publishers are not notified of data quality problems (see example). No aggregator awards “orchids” to publishers of high-quality datasets and “onions” to publishers of low-quality datasets.
In summary, Chapman’s concern about institutional reputation and data quality was misplaced. His 2005 papers are still well worth reading for their excellent discussions of data quality and the documentation of quality control. I suspect, however, that their 2023 readers won’t be Chapman’s intended audience, namely data managers at NHCs.
Robert Mesibov (“datafixer”); robert.mesibov@gmail.com