What I've learned from 500+ biodiversity data audits

Apologies for the click-bait title, but I thought the forum readership might be interested in this brief overview.

The “500+” part comes from the data auditing work I do for Pensoft Publishers (data papers mainly going to Biodiversity Data Journal), some contract work for museums, and numerous unasked-for audits of datasets that have been shared with aggregators.

The datasets I’ve audited fall neatly into four largely non-overlapping categories:

  1. citizen-science observations
  2. literature-based research
  3. field-based research
  4. museum and herbarium catalogues

The four categories are comparable because their primary data components are occurrences, usually of species-level taxa.

Occurrences are also what most end-users seem to want from biodiversity datasets. At one extreme, species distribution modellers want nothing more than a taxon name and occurrence coordinates, perhaps with some filtering by occurrence date or coordinate uncertainty. (The coordinates will probably then be simplified to grid-based ones, to reduce spatial bias.) At the other extreme, taxonomists, conservation ecologists and “taxon hobbyists” (e.g. orchid lovers) want whatever occurrence details are available, in hopes of finding useful information.
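To make that first extreme concrete, here's a minimal sketch of the "SDM-style" extraction, assuming a tab-delimited Darwin Core occurrence table with the standard term names (scientificName, decimalLatitude, decimalLongitude, eventDate, coordinateUncertaintyInMeters). The file name, the year and uncertainty thresholds, and the grid size are illustrative only:

```python
# Keep only the columns an SDM user needs, filter, and grid the coordinates.
import pandas as pd

occ = pd.read_csv("occurrence.txt", sep="\t", low_memory=False)
slim = occ[["scientificName", "decimalLatitude", "decimalLongitude",
            "eventDate", "coordinateUncertaintyInMeters"]].copy()

# Coerce to numbers/dates so malformed values drop out rather than crash the script.
for col in ["decimalLatitude", "decimalLongitude", "coordinateUncertaintyInMeters"]:
    slim[col] = pd.to_numeric(slim[col], errors="coerce")
years = pd.to_datetime(slim["eventDate"], errors="coerce").dt.year

# Optional filtering on date and coordinate uncertainty (thresholds are arbitrary here).
slim = slim[(years >= 1990) &
            (slim["coordinateUncertaintyInMeters"].fillna(0) <= 10_000)]
slim = slim.dropna(subset=["scientificName", "decimalLatitude", "decimalLongitude"])

# Simplify to a 0.1-degree grid to reduce spatial bias, then deduplicate.
slim["gridLat"] = (slim["decimalLatitude"] / 0.1).round() * 0.1
slim["gridLon"] = (slim["decimalLongitude"] / 0.1).round() * 0.1
slim = slim.drop_duplicates(subset=["scientificName", "gridLat", "gridLon"])
```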

A rough measure of data quality is the amount of time I spend detailing errors and inconsistencies for the data provider to fix. That time tracks the category numbering above, 4 > 3 > 2 > 1: museum and herbarium catalogues need the most fixing and citizen-science observations the least. GBIF isn’t as persnickety as I am, but I suspect GBIF’s numbers of flagged issues per record would show the same ranking.
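If you want to check that suspicion on your own data, here's a sketch of counting flagged issues per record. It assumes a GBIF "simple" occurrence download, which as far as I know is tab-delimited with an "issue" column of semicolon-separated flags; adjust the column name and delimiter if your export differs:

```python
# Count GBIF issue flags per record in an occurrence export.
import csv
from collections import Counter

def issues_per_record(path: str) -> float:
    """Return the mean number of issue flags per record and print the commonest flags."""
    flag_counts = Counter()
    total_flags = total_records = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            flags = [x for x in row.get("issue", "").split(";") if x]
            flag_counts.update(flags)
            total_flags += len(flags)
            total_records += 1
    print(flag_counts.most_common(10))
    return total_flags / total_records if total_records else 0.0
```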

Something else I’ve learned is that there’s a formidable gap between Darwin Core “thinkers” and DwC data compilers. While the thinkers work to refine and expand the DwC standard, a large proportion of compilers don’t use the existing DwC terms correctly, or use them correctly for some data items but not others, and hope for the best. In my view this is not the fault of DwC or its documentation, both of which are excellent. Reasons for failure at the user end? I don’t know. My clients don’t tell me.
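As a concrete example of the kind of misuse I mean, here's a toy validator for three DwC terms that compilers often get wrong. The record format (one dict per occurrence, keyed by standard term names), the date pattern and the basisOfRecord list are simplified illustrations, not the full standard:

```python
# Toy checks for a few commonly mis-filled Darwin Core terms.
import re

# A subset of the values recommended for dwc:basisOfRecord.
BASIS_OF_RECORD = {"HumanObservation", "MachineObservation", "PreservedSpecimen",
                   "FossilSpecimen", "LivingSpecimen", "MaterialSample"}
# Simplified ISO 8601 date: 1987, 1987-05 or 1987-05-23 (real eventDates can also be ranges).
ISO_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")

def check_record(rec: dict) -> list:
    """Return a list of problems found in one occurrence record."""
    problems = []
    try:
        lat = float(rec.get("decimalLatitude", ""))
        lon = float(rec.get("decimalLongitude", ""))
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append("coordinates out of range")
    except ValueError:
        problems.append("decimalLatitude/decimalLongitude not decimal degrees")
    if rec.get("eventDate") and not ISO_DATE.match(rec["eventDate"]):
        problems.append("eventDate not in ISO 8601 format")
    if rec.get("basisOfRecord") not in BASIS_OF_RECORD:
        problems.append("basisOfRecord not a recommended value")
    return problems

print(check_record({"decimalLatitude": "43°12'S", "decimalLongitude": "147.3",
                    "eventDate": "23/5/1987", "basisOfRecord": "specimen"}))
# -> ['decimalLatitude/decimalLongitude not decimal degrees',
#     'eventDate not in ISO 8601 format', 'basisOfRecord not a recommended value']
```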

Finally, I’ve learned to my horror that many biodiversity datasets are compiled in Microsoft Excel. Sometimes I see the result directly (as an .xlsx file), and sometimes I see Excel’s finger-marks on a dataset that’s been through an IPT (GBIF’s Integrated Publishing Toolkit).
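The finger-marks are usually easy to spot mechanically. Here's a rough scan for a few of the tell-tale signs, assuming a UTF-8 CSV export; the patterns are illustrative and nowhere near a complete list of what Excel can do to a dataset:

```python
# Scan a CSV for a few tell-tale signs of Excel handling.
import csv
import re

EXCEL_SMELLS = {
    "auto-converted date (e.g. '2-Sep')": re.compile(r"^\d{1,2}-[A-Z][a-z]{2}$"),
    "identifier turned into scientific notation": re.compile(r"^\d\.\d+E\+\d+$"),
    "leading or trailing whitespace": re.compile(r"^\s+\S|\S\s+$"),
}

def scan(path: str) -> None:
    with open(path, newline="", encoding="utf-8") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            for value in row:
                for label, pattern in EXCEL_SMELLS.items():
                    if pattern.search(value):
                        print(f"line {lineno}: {label}: {value!r}")
```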

In previous forum posts I’ve detailed the many ways Excel degrades data quality, and I won’t repeat them here. I can only recommend that biodiversity data compilers switch when possible from spreadsheet software to table editors. These do the same table-building job, but without the damage that puts Excel-built datasets in hospital. The following two table editors are free and can easily handle very large datasets:
