Announcing the 2022 Darwin Core Half-Million

The first Darwin Core Million was launched in March 2020. It was an open competition that aimed to find a collections dataset uploaded to an aggregator (GBIF, ALA or other) with no serious data problems among 1 million data items (items, not records; see below). There were no winners.

The second Darwin Core Million ran from 15 July to 1 September 2020 with the same rules. Again there were no winners.

In opening the new Darwin Core Half-Million I hope to encourage the publishers of smaller datasets to enter the competition, but the time window is narrower: from now to 31 October 2022. The rules are much the same:

  • Any museum or herbarium data publisher can enter, but the competition applies only to publicly available Darwin Core occurrence datasets. These might have been uploaded to an aggregator, such as GBIF, ALA or iDigBio, or to an open-data repository.

  • Enter a dataset that contains at least half a million data items. For example, that could be 25000 records in 20 populated Darwin Core fields, or 10000 records in 50 populated Darwin Core fields, or something in between.

  • Email the dataset to me before 31 October as a zipped, plain-text file, together with a DOI or URL for the online version of the dataset.

  • I will audit datasets in the order I receive them. If I don’t find serious data quality problems in your dataset, I’ll pay your institution AUD$150 and declare it the winner of the 2022 Darwin Core Half-Million, here on the GBIF community forum. There will be only one winner in this competition, and datasets received after the first problem-free dataset won’t be checked.

  • If I find serious data quality problems, I will let you know by email. If you want to learn what the problems are, I will send you a “scoping audit” explaining what should be fixed and I’ll charge your institution AUD$150. (And it would be really good to hear, later on, that those problems had indeed been fixed and that corrected data items had replaced the originals online.)

  • Datasets associated with a data paper in a Pensoft journal are ineligible, because they have already been audited and have had their serious problems fixed.
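For clarity on the entry threshold: a “data item” is one populated cell, so the count is simply records × populated fields, summed over the table. As a rough illustration (my own sketch, not part of the official rules, with a made-up helper name and toy data), counting the populated items in a Darwin Core occurrence table might look like this:

```python
import csv
import io

def count_data_items(dwc_csv_text: str) -> int:
    """Count populated (non-blank) data items in a Darwin Core
    occurrence table: one item per non-empty cell, header excluded."""
    reader = csv.DictReader(io.StringIO(dwc_csv_text))
    return sum(
        1
        for row in reader
        for value in row.values()
        if value is not None and value.strip()
    )

# Toy example: 2 records x 3 fields would be 6 items,
# but one cell is blank, so only 5 items are populated.
sample = (
    "occurrenceID,scientificName,eventDate\n"
    "1,Betula nana,1999-07-04\n"
    "2,Betula pendula,\n"
)
print(count_data_items(sample))  # 5
```

By this reckoning, 25000 records with 20 populated fields each and 10000 records with 50 populated fields each both reach the half-million mark.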

What are “serious data quality problems”?

  • duplicate records

  • invalid data items

  • missing-but-expected items

  • data items in the wrong fields

  • data items inappropriate for their field

  • truncated data items

  • records with items in one field disagreeing with items in another

  • character encoding errors

  • wildly erroneous dates or coordinates

  • incorrect or inconsistent formatting of dates, names and other items
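To make a couple of these checks concrete, here is a minimal sketch of two of the simpler tests — whole-record duplicates and character encoding damage. This is my own illustration with hypothetical function names and toy data, not the auditing procedure used in the competition, and a real audit covers all of the problems listed above:

```python
# Illustrative checks for two of the listed problems; function
# names and sample rows are invented for this example.

def find_duplicate_records(rows):
    """Return any row that repeats an earlier row exactly."""
    seen, dupes = set(), []
    for row in rows:
        key = tuple(row)
        if key in seen:
            dupes.append(row)
        seen.add(key)
    return dupes

def find_encoding_errors(rows):
    """Flag cells containing the Unicode replacement character
    (U+FFFD) or a common mojibake sequence such as 'Ã©'."""
    flagged = []
    for i, row in enumerate(rows):
        for cell in row:
            if "\ufffd" in cell or "Ã©" in cell:
                flagged.append((i, cell))
    return flagged

rows = [
    ["1", "Betula nana", "L\u2019H\ufffdrit."],  # damaged authorship
    ["2", "Betula nana", "Roth"],
    ["2", "Betula nana", "Roth"],                # exact duplicate
]
print(len(find_duplicate_records(rows)))  # 1
print(len(find_encoding_errors(rows)))    # 1
```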

Please also note that data quality, as I use the term here, isn’t the same as data accuracy. Is the named locality really at those coordinates? Is the specimen ID correct? Is the barely legible collector name on the specimen label correctly interpreted? These are questions about data accuracy. But it’s possible to have an entirely accurate dataset that is largely unusable because it suffers from one or more of the problems listed above.

I think I’m being reasonable about the “serious” in “serious data quality problems”. One character encoding error, such as “L’H?rit” repeated many times in the scientificNameAuthorship field, isn’t serious, but multiple errors scattered through several fields are grounds for rejection. For an understanding of “invalid”, please refer to the Darwin Core field definitions and recommendations.


Will our dataset be named if it fails the audit?

No. As in the 2020 competition, unsuccessful data publishers will remain anonymous.

Why AUD$150?

That’s what I would charge as a data auditor for a dataset of this kind. For the dataset sizes eligible here, it works out to roughly a cent per record, which is way below commercial data-checking rates.

Why are you running this competition?

The quality of museum and herbarium datasets harvested by GBIF (for example) ranges from OK to shockingly bad. Aggregators are “quality-agnostic” and seem to share with the funders of digitisation projects the notion that “It’s important for collections to share their data now; we can worry about data quality later”.

So far as I know, no aggregator penalises publishers for sharing low-quality data, or rewards publishers for sharing high-quality data. The Darwin Core (Half) Million competitions have been my attempt to celebrate the best data publishers: the ones that

  • take careful note of Darwin Core’s requirements and recommendations when migrating occurrence data from their CMS to a shareable data structure

  • take note of any issues flagged by aggregators after publishing, and fix those issues promptly in their datasets

  • respect the end-users of their data, and don’t leave it to those users to do extensive data cleaning, or to discard unusable records

Robert Mesibov (“datafixer”)
