21st century Amphibia, collected vs observed

Do citizen-science platforms share tidier data than museums and herbaria?

I’ve gotten the impression that they do, from examining data quality in iNaturalist and from the “scoping audits” I do of collection datasets. A direct comparison would be more convincing, so I looked at occurrence records for Amphibia (an arbitrary choice) as mediated by GBIF.

The graph below shows numbers of “HumanObservation” and “PreservedSpecimen” records for Amphibia in GBIF in the 2000s, by event year. The recent explosion in citizen-science records is obvious, and so is a slow decline in collection records. The sharp drop in the latter in recent years is likely an artifact of “data delay”; i.e., specimens may have been collected in the past few years but not yet accessioned.

[Graph: “HumanObservation” and “PreservedSpecimen” records for Amphibia in GBIF by event year]
The most recent collection records should have better data quality than earlier ones, I thought, so I queried GBIF on 2023-04-07 for all records starting in 2020 with the taxonKey “Amphibia” and with either “HumanObservation” or “PreservedSpecimen” in basisOfRecord. The resulting dataset (DOI) has 1286699 observation records (118 publishers) and 17018 collection records (66 publishers).

Of course, I was trusting that basisOfRecord had been used correctly by data publishers. This was not wise. In the “HumanObservation” subset, 1212 records had preparations entries for preserved specimens (e.g. “Animal completo (ETOH)”). I moved these 1212 records from the “observations” subset to the “collections” subset, giving new totals of 1285487 and 18230, respectively.
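Cross-checks like this are easy to script. Here is a minimal sketch in Python over a few invented records (the field names are Darwin Core; the sample data are not from the real dataset):

```python
# Reclassify "HumanObservation" records that carry a preparations
# entry, since a preparation implies a physical specimen.
records = [
    {"basisOfRecord": "HumanObservation", "preparations": ""},
    {"basisOfRecord": "HumanObservation", "preparations": "Animal completo (ETOH)"},
    {"basisOfRecord": "PreservedSpecimen", "preparations": "whole animal (95% EtOH)"},
]

for rec in records:
    if rec["basisOfRecord"] == "HumanObservation" and rec["preparations"].strip():
        rec["basisOfRecord"] = "PreservedSpecimen"  # move to collections subset

observations = [r for r in records if r["basisOfRecord"] == "HumanObservation"]
collections = [r for r in records if r["basisOfRecord"] == "PreservedSpecimen"]
print(len(observations), len(collections))  # 1 2
```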

I’ve found in data auditing that Darwin Core entries are sometimes perfectly valid but also incorrect, as in this case for basisOfRecord. Anomalies of this kind can be detected by cross-checking fields within individual records. Another example: as of 2023-04-08 there were 999 “HumanObservation” records in GBIF with holotype, neotype or lectotype in typeStatus (DOI). The winner was a well-known French museum, which published observation records for the holotype of Macrostemum scriptum (Rambur, 1842) from 166 unique locations in Madagascar.

GBIF checks. Both the observation and collection subsets for Amphibia had very low proportions of spatial issues as discovered by GBIF, i.e. records flagged with one or more of GBIF’s spatial issue flags:

[List of spatial issue flags not recoverable here; it included FOOTPRINT_WKT_MISMATCH among others.]
Nevertheless, the collections records had proportionally almost eight times as many spatial issues (one or more of the above) as the observation records: 341 out of 18230 (1.87%) vs 3085 out of 1285487 (0.24%).
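Proportions like these can be computed by scanning the semicolon-separated issue column of a GBIF occurrence download. A sketch in Python (the set of flags below is an illustrative subset of GBIF’s spatial issue flags, and the sample values are invented):

```python
# Count records flagged with any of a set of GBIF spatial issues.
SPATIAL_ISSUES = {
    "ZERO_COORDINATE",
    "COUNTRY_COORDINATE_MISMATCH",
    "COORDINATE_OUT_OF_RANGE",
    "FOOTPRINT_WKT_MISMATCH",
}

# "issue" column values from a GBIF download (invented examples);
# each is a semicolon-separated list of flags, possibly empty.
issue_column = [
    "",
    "TAXON_MATCH_HIGHERRANK",
    "ZERO_COORDINATE;COUNTRY_COORDINATE_MISMATCH",
    "FOOTPRINT_WKT_MISMATCH",
]

flagged = sum(
    1 for issues in issue_column
    if SPATIAL_ISSUES & set(issues.split(";"))
)
print(f"{flagged} of {len(issue_column)} records "
      f"({100 * flagged / len(issue_column):.2f}%) have spatial issues")
```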

Records with the FOOTPRINT_WKT_MISMATCH issue include 464 with point footprintWKT entries corrupted by the bug I described in a previous community forum post. All 464 records are from the Patrimoine Naturel project (publisher = UMS PatriNat (OFB-CNRS-MNHN), Paris).

The difference was much larger for the TAXON_MATCH_HIGHERRANK issue (“The record can be matched to the GBIF taxonomic backbone at a higher rank, but not with the scientific name given.”). Collections had 4917 records out of 18230 (27.0%) with a name up-matched, compared to 8082 records out of 1285487 (0.63%) for observations. However, 4773 of the 4917 collections up-matches were due to invalid scientificName entries, mainly from an institution in Brazil. Ignoring those entries, the collections up-matches drop to only 0.8%. (Examples of invalid names: “Unidentified sp.”, “Plethodon cinereus or Plethodon electromorphus”)

If I ignore the 5542 similarly invalid scientificName entries in the observations records, the up-matches there drop to 0.2%, so observations records are again ahead of collections records in quality. Most of the invalid names were contributed by a Dutch citizen-science project, with a French project running second.
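Invalid names of this kind can be caught with simple pattern checks. A sketch (the patterns are my own illustration, not the exact rules used in the audit):

```python
import re

# Heuristic screen for "invalid" scientificName entries: informal
# names, identification qualifiers, and multi-name guesses.
BAD_PATTERNS = [
    r"\bsp{1,2}\.",           # "sp." / "spp."
    r"\bcf\.|\baff\.",        # identification qualifiers
    r"\bor\b",                # "A or B" guesses
    r"[Uu]nidentified|[Uu]nknown",
]
bad_re = re.compile("|".join(BAD_PATTERNS))

names = [
    "Plethodon cinereus",
    "Unidentified sp.",
    "Plethodon cinereus or Plethodon electromorphus",
    "Litoria cf. fallax",
]
invalid = [n for n in names if bad_re.search(n)]
print(invalid)  # the last three names
```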

More checks. There are many ways to assess the tidiness of datasets. I decided to focus on name and location information that would be needed for building a species distribution model (SDM) or ecological niche model (ENM). In peer-reviewed papers based on GBIF-mediated data, SDMs/ENMs are the most frequently reported outputs. Their construction only requires the modeller to know the species name and its occurrence coordinates. Modellers will sometimes check the date of the occurrence and apply a threshold, such as “all reported occurrences since 1970”. They will also sometimes add coordinates for well-defined localities that haven’t been georeferenced, and will discard doubtful occurrences or ones with high spatial uncertainty. Records are usually then “spatially thinned” to one per grid square or coordinate rectangle, to reduce spatial bias.

It’s Data Use Lite, but for this exercise I pretended I was a modeller and filtered the Amphibia records accordingly. My filtering can be regarded as a “first pass” intended to remove low-quality records, and it ignores issues like synonyms, locations-as-centroids and Amphibia in zoos and home aquaria.
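Spatial thinning itself is simple to sketch. A minimal version in Python, keeping at most one occurrence per grid cell (the cell size and the coordinates are invented for illustration):

```python
# "Spatial thinning": keep at most one occurrence per grid cell.
CELL_DEG = 0.5  # grid resolution in decimal degrees

occurrences = [
    (41.02, -74.01), (41.04, -74.03),  # both fall in the same cell
    (41.60, -74.01),                   # a different cell
]

seen_cells = set()
thinned = []
for lat, lon in occurrences:
    cell = (int(lat // CELL_DEG), int(lon // CELL_DEG))
    if cell not in seen_cells:       # first record in this cell wins
        seen_cells.add(cell)
        thinned.append((lat, lon))
print(thinned)  # one record kept per occupied cell
```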

In what follows, the data items were from the verbatim.txt file in the Darwin Core archive, and record numbers are O for observations and S for specimens.

I started with (see above):

        O 1285487    S 18230

Include only records with scientificName entries for a species or subspecies. Exclude entries with informal names and qualifications. Exclude hybrids.

        O 1200160    S 12499

Exclude records with doubts in identificationQualifier or identificationVerificationStatus.

        O 1173953    S 10679

Exclude records with a blank in decimalLatitude, decimalLongitude, geodeticDatum or coordinateUncertaintyInMeters.

        O 1011566    S 1421


        O 1010769    S 1419

Exclude records with coordinateUncertaintyInMeters greater than 5000.

        O 897708     S 1413
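The steps above can be sketched as a single filter function. This is a simplification, not the exact rules used for the audit: the sample records are invented, the binomial test is crude, and any non-blank identificationQualifier is treated as a doubt:

```python
# First-pass SDM/ENM filter over Darwin Core records.
INFORMAL = {"sp.", "spp.", "cf.", "aff.", "unidentified", "or"}

def passes(rec):
    name = rec.get("scientificName", "").strip()
    parts = name.split()
    # crude species/subspecies check: 2-3 words, capitalised genus,
    # no informal tokens
    if len(parts) not in (2, 3) or not parts[0][0].isupper():
        return False
    if any(p.lower() in INFORMAL for p in parts):
        return False
    if rec.get("identificationQualifier", "").strip():
        return False
    for field in ("decimalLatitude", "decimalLongitude",
                  "geodeticDatum", "coordinateUncertaintyInMeters"):
        if not str(rec.get(field, "")).strip():
            return False
    return float(rec["coordinateUncertaintyInMeters"]) <= 5000

records = [
    {"scientificName": "Plethodon cinereus", "decimalLatitude": "41.0",
     "decimalLongitude": "-74.0", "geodeticDatum": "WGS84",
     "coordinateUncertaintyInMeters": "30"},
    {"scientificName": "Plethodon cinereus", "decimalLatitude": "",
     "decimalLongitude": "", "geodeticDatum": "",
     "coordinateUncertaintyInMeters": ""},          # not georeferenced
    {"scientificName": "Unidentified sp.", "decimalLatitude": "41.0",
     "decimalLongitude": "-74.0", "geodeticDatum": "WGS84",
     "coordinateUncertaintyInMeters": "30"},        # informal name
    {"scientificName": "Litoria fallax", "decimalLatitude": "-27.5",
     "decimalLongitude": "153.0", "geodeticDatum": "WGS84",
     "coordinateUncertaintyInMeters": "8000"},      # too uncertain
]
kept = [r for r in records if passes(r)]
print(len(kept))  # 1
```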

After this “first pass” filtering for an imagined SDM/ENM, the observations subset for Amphibia had lost 30% of its records, while the collections subset had lost 88%. As with my review of iNaturalist’s Australian millipede IDs in GBIF, I’m reluctant to generalise to other datasets and other uses. But this result accords with my audits of many more Darwin Core fields in many more datasets: citizen-science observation records are tidier than collections records.

There are plenty of reasons why collections data might be “born” messy and stay messy. There are also reasons why observations data might be “born” tidy. But “HumanObservation” currently accounts for 85% of GBIF’s occurrence records compared to only 9% for “PreservedSpecimen” (2023-04-09). Collections are becoming less and less important in the universe of biodiversity data, and if they hope to increase the value of what they contribute, they need to work on data quality.

Robert Mesibov (“datafixer”); robert.mesibov@gmail.com

Two things:

Firstly, I think you will find that the precipitous drop in specimen records has a lot to do with COVID and the difficulty of collecting during that period, and less to do with accessioning delay (if anything, backlogs would have decreased over COVID, as this is all we were able to accomplish).

Secondly, your biggest drop in specimen numbers during your triage comes from a lack of latitude and longitude information (this accounts for the majority of your 88% loss of records). This is an age-old issue with collections: georeferencing is labor-intensive, so it is a task that always gets put on the back burner in favor of other tasks. There have been some community efforts to georeference, but in many cases the repatriation of that data into CMSs has been problematic. It is much easier to record an observation on a phone and have GPS coordinates automatically inserted than to add them retroactively in a CMS for historical records. However, another metric that would certainly highlight the value of specimen records as opposed to observations would be a relative count of the number of fields of information contributed by each to GBIF. I would also be cautious of inferring (from GBIF-mediated citations) that data use is the primary mechanism of collection use. There are many more uses that entail the actual use of specimens or their derivatives rather than just data.

Now if we could just get SDM/ENM users to repatriate all that effort expended in georeferencing records for their own purposes back to collections that would be of great help.

@abentley Many thanks for the suggestion about a COVID pause in collecting, which sounds likely.

Yes, a big difference between citizen-science records and collection records is in the quality of georeferencing. But please note that I deliberately looked at the most recent records, 2020 onwards, in hopes that most collection records from the last few years would have coordinates, datum and uncertainty. This was not looking at “historical records”. I could go into detail on this point but would prefer not to do it here by naming particular institutions (but see the original data).

“another metric that would certainly highlight the value of specimen records as opposed to observations would be a relative count of the number of fields of information contributed by both to GBIF.” What fields did you have in mind?

“I would also be cautious of inferring (from GBIF-mediated citations) that data use is the primary mechanism of collection use. There are many more uses that entail the actual use of specimens or derivatives rather than just data.”

Of course, but the exercise described above looked only at shared DwC data in GBIF and its quality. Metrics for collection use do exist, and taxonomy publications have their “Material examined” sections, so the value of collections has other measures.
