Do citizen-science platforms share tidier data than museums and herbaria?
I’ve gotten the impression that they do, from examining data quality in iNaturalist and from the “scoping audits” I do of collection datasets. A direct comparison would be more convincing, so I looked at occurrence records for Amphibia (an arbitrary choice) as mediated by GBIF.
The graph below shows numbers of “HumanObservation” and “PreservedSpecimen” records for Amphibia in GBIF in the 2000s, by event year. The recent explosion in citizen-science records is obvious, and so is a slow decline in collection records. The sharp drop in the latter in recent years is likely to be an artifact of “data delay”: specimens may have been collected in the past few years but not yet accessioned.
The most recent collection records should have better data quality than earlier ones, I thought, so I queried GBIF on 2023-04-07 for all records starting in 2020 with the taxonKey “Amphibia” and with either “HumanObservation” or “PreservedSpecimen” in basisOfRecord. The resulting dataset (DOI) has 1286699 observation records (118 publishers) and 17018 collection records (66 publishers).
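For readers who prefer to script such a query, here is a minimal sketch of how an equivalent request might be expressed as a predicate for GBIF's occurrence download API. It only builds and prints the JSON body; nothing is sent. The function name, the username and the email address are placeholders, and the backbone key 131 for Amphibia is an assumption that should be checked against the GBIF species pages before use.

```python
import json

def build_download_request(taxon_key, start_year, bases, creator, email):
    """Build the JSON body for a GBIF occurrence download request (not sent here)."""
    return {
        "creator": creator,                      # placeholder GBIF username
        "notificationAddresses": [email],        # placeholder address
        "sendNotification": True,
        "format": "DWCA",                        # Darwin Core archive
        "predicate": {
            "type": "and",
            "predicates": [
                {"type": "equals", "key": "TAXON_KEY", "value": str(taxon_key)},
                {"type": "greaterThanOrEquals", "key": "YEAR", "value": str(start_year)},
                {"type": "in", "key": "BASIS_OF_RECORD", "values": list(bases)},
            ],
        },
    }

request_body = build_download_request(
    taxon_key=131,                               # assumed key for Amphibia; verify first
    start_year=2020,
    bases=["HUMAN_OBSERVATION", "PRESERVED_SPECIMEN"],
    creator="your_gbif_username",
    email="you@example.org",
)
print(json.dumps(request_body, indent=2))
```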
Of course, I was trusting that basisOfRecord had been used correctly by data publishers. This was not wise. In the “HumanObservation” subset, 1212 records had preparations entries for preserved specimens (e.g. “Animal completo (ETOH)”). I moved these 1212 records from the “observations” subset to the “collections” subset, giving new totals of 1285487 and 18230, respectively.
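A reclassification like this one can be sketched in a few lines of Python. The sketch below assumes each record is a dictionary keyed by Darwin Core field names, and it treats any non-blank preparations entry as evidence of a preserved specimen. That rule is a simplification for illustration; in practice the preparations values would need to be inspected before records are moved. The sample rows are invented.

```python
def split_by_basis(records):
    """Return (observations, collections), moving observation records
    that carry a preparations entry into the collections list."""
    observations, collections = [], []
    for rec in records:
        basis = rec.get("basisOfRecord", "")
        # Assumption: a non-blank preparations entry signals a preserved specimen
        has_prep = bool((rec.get("preparations") or "").strip())
        if basis == "PreservedSpecimen" or (basis == "HumanObservation" and has_prep):
            collections.append(rec)
        elif basis == "HumanObservation":
            observations.append(rec)
    return observations, collections

# Invented sample rows
sample = [
    {"gbifID": "1", "basisOfRecord": "HumanObservation", "preparations": ""},
    {"gbifID": "2", "basisOfRecord": "HumanObservation", "preparations": "Animal completo (ETOH)"},
    {"gbifID": "3", "basisOfRecord": "PreservedSpecimen", "preparations": "whole animal (ETOH)"},
]
obs, coll = split_by_basis(sample)
print(len(obs), len(coll))

# The totals reported above follow from moving 1212 records
new_obs_total = 1286699 - 1212
new_coll_total = 17018 + 1212
print(new_obs_total, new_coll_total)
```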
I’ve found in data auditing that Darwin Core entries are sometimes perfectly valid but also incorrect, as in this case for basisOfRecord. Anomalies of this kind can be detected by cross-checking fields within individual records. Another example: as of 2023-04-08 there were 999 “HumanObservation” records in GBIF with holotype, neotype or lectotype in typeStatus (DOI). The winner was a well-known French museum, which published observation records for the holotype of Macrostemum scriptum (Rambur, 1842) from 166 unique locations in Madagascar.
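As an illustration of a within-record cross-check, the following sketch flags records whose basisOfRecord is “HumanObservation” but whose typeStatus names a holotype, neotype or lectotype. It assumes records are dictionaries keyed by Darwin Core field names; the function name and the sample rows are invented. Word boundaries are used so that terms such as “paralectotype” are not matched by accident.

```python
import re

# The three type terms mentioned in the text, matched as whole words
PRIMARY_TYPE = re.compile(r"\b(holotype|neotype|lectotype)\b", re.IGNORECASE)

def is_observed_type(rec):
    """True if a record claims to be an observation of a name-bearing type."""
    basis = (rec.get("basisOfRecord") or "").strip()
    type_status = rec.get("typeStatus") or ""
    return basis == "HumanObservation" and bool(PRIMARY_TYPE.search(type_status))

# Invented sample rows
sample_types = [
    {"basisOfRecord": "HumanObservation", "typeStatus": "Holotype of Macrostemum scriptum"},
    {"basisOfRecord": "HumanObservation", "typeStatus": ""},
    {"basisOfRecord": "HumanObservation", "typeStatus": "paralectotype"},
    {"basisOfRecord": "PreservedSpecimen", "typeStatus": "lectotype"},
]
flagged = [rec for rec in sample_types if is_observed_type(rec)]
print(len(flagged))
```

Each field in the flagged record is valid on its own; it is only the combination that is implausible, which is why a per-field validator would not catch it.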
GBIF checks. Both the observation and collection subsets for Amphibia had very low proportions of the spatial issues flagged by GBIF, namely:
CONTINENT_COORDINATE_MISMATCH
CONTINENT_COUNTRY_MISMATCH
CONTINENT_INVALID
COORDINATE_INVALID
COORDINATE_PRECISION_INVALID
COORDINATE_UNCERTAINTY_METERS_INVALID
COUNTRY_COORDINATE_MISMATCH
COUNTRY_INVALID
COUNTRY_MISMATCH
FOOTPRINT_WKT_INVALID
FOOTPRINT_WKT_MISMATCH
GEODETIC_DATUM_INVALID
PRESUMED_NEGATED_LONGITUDE
PRESUMED_SWAPPED_COORDINATE
ZERO_COORDINATE
Nevertheless, the collections records had proportionally almost eight times as many spatial issues (one or more of the above) as the observation records: 341 out of 18230 (1.87%) vs 3085 out of 1285487 (0.24%).
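The “one or more of the above” test and the proportions can be reproduced with a short Python sketch. It assumes the GBIF issue field is a semicolon-separated string of flag names, which should be checked against the download in hand; the function names are invented.

```python
SPATIAL_ISSUES = frozenset({
    "CONTINENT_COORDINATE_MISMATCH", "CONTINENT_COUNTRY_MISMATCH",
    "CONTINENT_INVALID", "COORDINATE_INVALID", "COORDINATE_PRECISION_INVALID",
    "COORDINATE_UNCERTAINTY_METERS_INVALID", "COUNTRY_COORDINATE_MISMATCH",
    "COUNTRY_INVALID", "COUNTRY_MISMATCH", "FOOTPRINT_WKT_INVALID",
    "FOOTPRINT_WKT_MISMATCH", "GEODETIC_DATUM_INVALID",
    "PRESUMED_NEGATED_LONGITUDE", "PRESUMED_SWAPPED_COORDINATE",
    "ZERO_COORDINATE",
})

def has_spatial_issue(issue_field, sep=";"):
    """True if the issue string carries at least one of the spatial flags."""
    flags = {f.strip() for f in issue_field.split(sep) if f.strip()}
    return not flags.isdisjoint(SPATIAL_ISSUES)

def percent(part, whole):
    return 100 * part / whole

# Counts taken from the text
coll_pct = percent(341, 18230)
obs_pct = percent(3085, 1285487)
ratio = coll_pct / obs_pct
print(f"collections {coll_pct:.2f}%, observations {obs_pct:.2f}%, ratio {ratio:.1f}")
```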
Records with the FOOTPRINT_WKT_MISMATCH issue include 464 with point footprintWKT entries corrupted by the bug I described in a previous community forum post. All 464 records are from the Patrimoine Naturel project (publisher = UMS PatriNat (OFB-CNRS-MNHN), Paris).
The difference was much larger for the TAXON_MATCH_HIGHERRANK issue (“The record can be matched to the GBIF taxonomic backbone at a higher rank, but not with the scientific name given.”). Collections had 4917 records out of 18230 (27.0%) with a name up-matched, compared to 8082 records out of 1285487 (0.63%) for observations. However, 4773 of the 4917 collections up-matches were due to invalid scientificName entries, mainly from an institution in Brazil. Ignoring those entries, the collections up-matches drop to only 0.8%. (Examples of invalid names: “Unidentified sp.”, “Plethodon cinereus or Plethodon electromorphus”.)
If I ignore the 5542 similarly invalid scientificName entries in the observations records, the up-matches there drop to 0.2%, so observations records are again ahead of collections records in quality. Most of the invalid names were contributed by a Dutch citizen-science project, with a French project running second.
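The up-match percentages can be checked with a few lines of Python. The sketch assumes that all of the invalid-name records are among the up-matched records and that the denominator stays at the full subset size; under those assumptions the arithmetic reproduces the figures quoted above. The function name is invented.

```python
def upmatch_pct(upmatched, invalid_names, total):
    """Percentage of records up-matched after setting aside invalid names.
    The denominator is left at the full subset size."""
    return 100 * (upmatched - invalid_names) / total

# Raw up-match rates
coll_raw = upmatch_pct(4917, 0, 18230)
obs_raw = upmatch_pct(8082, 0, 1285487)

# Rates after ignoring invalid scientificName entries
coll_adj = upmatch_pct(4917, 4773, 18230)
obs_adj = upmatch_pct(8082, 5542, 1285487)

print(f"collections: {coll_raw:.1f}% raw, {coll_adj:.1f}% adjusted")
print(f"observations: {obs_raw:.2f}% raw, {obs_adj:.1f}% adjusted")
```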
More checks. There are many ways to assess the tidiness of datasets. I decided to focus on name and location information that would be needed for building a species distribution model (SDM) or ecological niche model (ENM). In peer-reviewed papers based on GBIF-mediated data, SDMs/ENMs are the most frequently reported outputs. Their construction only requires the modeller to know the species name and its occurrence coordinates. Modellers will sometimes check the date of the occurrence and apply a threshold, such as “all reported occurrences since 1970”. They will also sometimes add coordinates for well-defined localities that haven’t been georeferenced, and will discard doubtful occurrences or ones with high spatial uncertainty. Records are usually then “spatially thinned” to one per grid square or coordinate rectangle, to reduce spatial bias. It’s Data Use Lite, but for this exercise I pretended I was a modeller and filtered the Amphibia records accordingly. My filtering can be regarded as a “first pass” intended to remove low-quality records, and it ignores issues like synonyms, locations-as-centroids and Amphibia in zoos and home aquaria.
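Spatial thinning to one record per grid square can be sketched as follows. This is only one simple way of doing it, assuming a regular grid in decimal degrees and keeping the first record met for each species in each cell; dedicated thinning tools use more careful methods. The function name, the cell size and the sample coordinates are invented.

```python
import math

def thin_to_grid(records, cell_deg=0.5):
    """Keep the first record per species per grid cell of cell_deg degrees."""
    seen = set()
    kept = []
    for rec in records:
        lat = float(rec["decimalLatitude"])
        lon = float(rec["decimalLongitude"])
        # Cell index: integer row and column on a regular lat/lon grid
        key = (rec["scientificName"],
               math.floor(lat / cell_deg),
               math.floor(lon / cell_deg))
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

# Invented sample rows
sample_points = [
    {"scientificName": "Plethodon cinereus", "decimalLatitude": "40.23", "decimalLongitude": "-80.41"},
    {"scientificName": "Plethodon cinereus", "decimalLatitude": "40.27", "decimalLongitude": "-80.38"},
    {"scientificName": "Plethodon cinereus", "decimalLatitude": "41.10", "decimalLongitude": "-80.41"},
    {"scientificName": "Plethodon electromorphus", "decimalLatitude": "40.23", "decimalLongitude": "-80.41"},
]
thinned = thin_to_grid(sample_points)
print(len(thinned))
```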
In what follows, the data items are from the verbatim.txt file in the Darwin Core archive, and record counts are given as O for observations and S for specimens (collections).
I started with (see above):
O 1285487 S 18230
Include only records with scientificName entries for a species or subspecies. Exclude entries with informal names and qualifications. Exclude hybrids.
O 1200160 S 12499
Exclude records with doubts in identificationQualifier or identificationVerificationStatus
O 1173953 S 10679
Exclude records with a blank in decimalLatitude, decimalLongitude, geodeticDatum or coordinateUncertaintyInMeters
O 1011566 S 1421
Exclude records with COORDINATE_INVALID, COORDINATE_UNCERTAINTY_METERS_INVALID, COUNTRY_COORDINATE_MISMATCH, GEODETIC_DATUM_INVALID, PRESUMED_NEGATED_LONGITUDE, PRESUMED_SWAPPED_COORDINATE or ZERO_COORDINATE as flagged by GBIF
O 1010769 S 1419
Exclude records with coordinateUncertaintyInMeters greater than 5000
O 897708 S 1413
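The filtering steps above can be sketched as a chain of keep-or-drop tests in Python. This is an illustration under stated assumptions, not the procedure actually used: the name test is a rough heuristic, any non-blank identificationQualifier is treated as a doubt, the list of doubt words is hypothetical, and each record is assumed to be a dictionary that combines the verbatim fields with GBIF's semicolon-separated issue field. All function names and sample rows are invented.

```python
EXCLUDED_ISSUES = {
    "COORDINATE_INVALID", "COORDINATE_UNCERTAINTY_METERS_INVALID",
    "COUNTRY_COORDINATE_MISMATCH", "GEODETIC_DATUM_INVALID",
    "PRESUMED_NEGATED_LONGITUDE", "PRESUMED_SWAPPED_COORDINATE",
    "ZERO_COORDINATE",
}
REQUIRED = ("decimalLatitude", "decimalLongitude", "geodeticDatum",
            "coordinateUncertaintyInMeters")
INFORMAL_TOKENS = {"sp", "sp.", "spp", "spp.", "cf", "cf.", "aff", "aff.",
                   "or", "x", "×"}                              # heuristic list
DOUBT_WORDS = ("doubt", "uncertain", "unverified")              # hypothetical list

def val(rec, field):
    return (rec.get(field) or "").strip()

def is_species_level(rec):
    tokens = val(rec, "scientificName").split()
    if len(tokens) < 2 or any(t in INFORMAL_TOKENS or "?" in t for t in tokens):
        return False
    return tokens[0][:1].isupper() and tokens[1].isalpha() and tokens[1].islower()

def is_undoubted(rec):
    status = val(rec, "identificationVerificationStatus").lower()
    return not val(rec, "identificationQualifier") and not any(w in status for w in DOUBT_WORDS)

def has_coordinates(rec):
    return all(val(rec, f) for f in REQUIRED)

def has_no_excluded_issue(rec):
    flags = {f.strip() for f in val(rec, "issue").split(";")}
    return flags.isdisjoint(EXCLUDED_ISSUES)

def is_precise_enough(rec, max_m=5000):
    try:
        return float(val(rec, "coordinateUncertaintyInMeters")) <= max_m
    except ValueError:
        return False

STEPS = [is_species_level, is_undoubted, has_coordinates,
         has_no_excluded_issue, is_precise_enough]

def run_filters(records, steps):
    """Apply each test in turn; return survivors and the count after each step."""
    counts = [len(records)]
    for keep in steps:
        records = [r for r in records if keep(r)]
        counts.append(len(records))
    return records, counts

def make(**overrides):
    """Build an invented record that passes every test unless overridden."""
    base = {"scientificName": "Plethodon cinereus", "identificationQualifier": "",
            "identificationVerificationStatus": "", "decimalLatitude": "40.1",
            "decimalLongitude": "-80.2", "geodeticDatum": "WGS84",
            "coordinateUncertaintyInMeters": "100", "issue": ""}
    base.update(overrides)
    return base

sample_records = [
    make(),
    make(scientificName="Unidentified sp."),
    make(identificationQualifier="cf."),
    make(geodeticDatum=""),
    make(issue="ZERO_COORDINATE;TAXON_MATCH_HIGHERRANK"),
    make(coordinateUncertaintyInMeters="20000"),
]
kept, counts = run_filters(sample_records, STEPS)
print(counts)
```

Recording the count after each step, as the O and S lines do, makes it easy to see which test removes the most records.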
After this “first pass” filtering for an imagined SDM/ENM, the observations subset for Amphibia had lost 30% of its records (1285487 down to 897708), while the collections subset had lost 92% (18230 down to 1413). As with my review of iNaturalist’s Australian millipede IDs in GBIF, I’m reluctant to generalise to other datasets and other uses. But this result accords with my audits of many more Darwin Core fields in many more datasets: citizen-science observation records are tidier than collections records.
There are plenty of reasons why collections data might be “born” messy and stay messy. There are also reasons why observations data might be “born” tidy. But “HumanObservation” currently accounts for 85% of GBIF’s occurrence records compared to only 9% for “PreservedSpecimen” (2023-04-09). Collections are becoming less and less important in the universe of biodiversity data, and if they hope to increase the value of what they contribute, they need to work on data quality.
Robert Mesibov (“datafixer”); robert.mesibov@gmail.com