What I've learned from 500+ biodiversity data audits

Apologies for the click-bait title, but I thought the forum readership might be interested in this brief overview.

The “500+” part comes from the data auditing work I do for Pensoft Publishers (data papers mainly going to Biodiversity Data Journal), some contract work for museums, and numerous unasked-for audits of datasets that have been shared with aggregators.

The datasets I’ve audited fall into four largely non-overlapping categories:

  1. citizen-science observations
  2. literature-based research
  3. field-based research
  4. museum and herbarium catalogues

The four categories are comparable because their primary data components are occurrences, usually of species-level taxa.

Occurrences are also what most end-users seem to want from biodiversity datasets. At one extreme, species distribution modellers want nothing more than taxon name and occurrence coordinates, possibly with some filtering by occurrence date or coordinate uncertainty. (The coordinates will probably then be simplified to grid-based ones, to reduce spatial bias.) At the other extreme, taxonomists, conservation ecologists and “taxon hobbyists” (e.g. orchid lovers) want whatever occurrence details are available, in hopes of finding useful information.
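The modeller's end of that spectrum can be sketched in a few lines. This is my own toy illustration, not anyone's published pipeline: the field names follow Darwin Core, and the year, uncertainty and grid-size thresholds are purely illustrative.

```python
# Hypothetical sketch of the modeller's workflow: keep only taxon name and
# coordinates, filter by date and coordinate uncertainty, then snap the
# coordinates to a coarse grid to reduce spatial bias. Thresholds are
# illustrative assumptions, not recommendations.

records = [
    {"scientificName": "Abax parallelepipedus", "decimalLatitude": 51.507,
     "decimalLongitude": -0.128, "year": 2019,
     "coordinateUncertaintyInMeters": 30},
    {"scientificName": "Abax parallelepipedus", "decimalLatitude": 51.512,
     "decimalLongitude": -0.131, "year": 1975,
     "coordinateUncertaintyInMeters": 5000},
]

def for_modelling(recs, min_year=2000, max_uncertainty_m=1000, grid=0.1):
    """Filter records, then round coordinates to a grid of `grid` degrees."""
    out = []
    for r in recs:
        if r["year"] < min_year:
            continue
        if r["coordinateUncertaintyInMeters"] > max_uncertainty_m:
            continue
        out.append({
            "scientificName": r["scientificName"],
            "gridLat": round(r["decimalLatitude"] / grid) * grid,
            "gridLon": round(r["decimalLongitude"] / grid) * grid,
        })
    return out

print(for_modelling(records))
```

Here the 1975 record is dropped for being too old and too imprecise, and the surviving coordinates are degraded to a 0.1-degree grid.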

A rough (inverse) measure of data quality is the amount of time I spend detailing errors and inconsistencies for the data provider to fix. That time clearly follows the numbering of the categories above, from most to least: 4 > 3 > 2 > 1. GBIF isn’t as persnickety as I am, but I suspect GBIF’s numbers of flagged issues per record would show the same ranking.

Something else I’ve learned is that there’s a formidable gap between Darwin Core “thinkers” and DwC data compilers. While the thinkers work to refine and expand the DwC standard, a large proportion of compilers don’t use the existing DwC categories correctly, or use them correctly for some data items but not others, and hope for the best. In my view this is not the fault of DwC or its documentation, both of which are excellent. Reasons for failure at the user end? I don’t know. My clients don’t tell me.
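To show the kind of misuse I mean, here is a toy checker of my own devising (not a DwC or GBIF tool) that flags two very common slips: free text in `basisOfRecord`, which has a recommended controlled vocabulary, and non-ISO dates in `eventDate`. The date pattern below is deliberately simplified; real ISO 8601 also allows date ranges and times.

```python
# Toy DwC sanity check -- my own sketch, not part of any standard tooling.
# It flags free text in basisOfRecord and non-ISO eventDate values.
import re

# Recommended basisOfRecord values from the Darwin Core documentation.
BASIS_VOCAB = {"PreservedSpecimen", "FossilSpecimen", "LivingSpecimen",
               "MaterialSample", "HumanObservation", "MachineObservation"}

# Simplified ISO 8601 date: 1986, 1986-06 or 1986-06-05 (no ranges/times).
ISO_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")

def check_record(rec):
    """Return a list of problems found in one occurrence record."""
    problems = []
    if rec.get("basisOfRecord") not in BASIS_VOCAB:
        problems.append("basisOfRecord not in recommended vocabulary")
    if not ISO_DATE.match(rec.get("eventDate", "")):
        problems.append("eventDate not ISO 8601 (yyyy, yyyy-mm or yyyy-mm-dd)")
    return problems

# A typical compiler's record: free text where controlled values belong.
print(check_record({"basisOfRecord": "specimen in museum",
                    "eventDate": "5/6/86"}))
```

Both fields are documented clearly on the DwC site; the record above fails not because the standard is obscure but because the compiler never consulted it.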

Finally, I’ve learned to my horror that many biodiversity datasets are compiled in Microsoft Excel. Sometimes I see the result directly (as an .xlsx file), sometimes I see Excel’s finger-marks on a dataset that’s been through an IPT.
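A quick sketch of what those finger-marks look like in practice. The patterns below are illustrative assumptions of mine, not an exhaustive catalogue: date-serial numbers where a date once was, month-year mangling of values Excel guessed were dates, and long identifiers collapsed to scientific notation.

```python
# Sketch of a "smell test" for Excel artefacts in exported CSV values.
# The three patterns are illustrative, not a formal specification.
import re

EXCEL_SMELLS = {
    "date serial": re.compile(r"^4[0-6]\d{3}$"),           # e.g. "44075"
    "month-year mangle": re.compile(r"^[A-Z][a-z]{2}-\d{2}$"),  # e.g. "Sep-21"
    "scientific notation": re.compile(r"^\d\.\d+E\+\d+$"),  # e.g. "1.23E+15"
}

def smell_test(values):
    """Return (value, smell-name) pairs for values matching a known pattern."""
    hits = []
    for v in values:
        for name, pat in EXCEL_SMELLS.items():
            if pat.match(v):
                hits.append((v, name))
    return hits

print(smell_test(["44075", "Sep-21", "1.23E+15", "NHMUK010101"]))
```

The first three values all trip a pattern; the catalogue number passes untouched, which is exactly what a clean export should look like.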

In previous forum posts I’ve detailed Excel’s many contributions to drops in data quality, and I won’t repeat them here. I can only recommend that biodiversity data compilers switch when possible from spreadsheet software to table editors. These do the same table-building job but without the damage that puts Excel-built datasets in hospital. The following two table editors are free and can easily handle very large datasets: