Third-party data services explained with a shanty

For some years now I’ve been arguing for third-party services to be interposed between collection datasets and aggregators like GBIF.

The services would be human, not coded, because people are the best data cleaners and because there are data problems in collection management systems (CMSes) that are impractical to fix with code.

Not many in the biodiversity informatics community have looked kindly on this suggestion, so to advance the idea (or maybe to kill it with a single post) I offer the following song to explain how a third-party data-cleaning service would work. The tune is “The Wellerman”.

There once was a coll with a CMS
Whose records were a dreadful mess
Sadly did the staff confess
“They’re really not ready to go”

Soon may the Dataman come
To fix our records for a modest sum
Then, when the tidying’s done,
We’ll share them in Darwin Core!

Formatting dates is most accursed
Should day or month or year go first?
At times one way, at times reversed
That’s just the status quo

When not sure where a place is at
We have a way to deal with that
A zero long and zero lat
Is how we make that show

Free-text fields allow for staff who’d
Enter things to be reviewed
So many entries now are queued
And checking is quite slow

When typing name or place or date
These things we oft abbreviate
They’re hard to disambiguate
Unless you’re in the know

Look-up tables aren’t our norm
We enter names in many a form
In lists each name will have a swarm
Of variants below

Robert Mesibov (“datafixer”)
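For what it's worth, some of the problems the verses name can at least be *flagged* by code, even if deciding what each flagged value should become still needs a human. Here is a minimal sketch of two such checks over Darwin Core-style occurrence records (`eventDate`, `decimalLatitude` and `decimalLongitude` are real Darwin Core terms; the sample records are invented for illustration):

```python
import re

def flag_record(rec):
    """Return a list of warnings for one record (a dict of Darwin Core terms).

    These checks only detect problems; they deliberately do not try to fix them.
    """
    warnings = []

    # Ambiguous numeric dates: 03/04/2021 could be 3 April or 4 March.
    date = rec.get("eventDate", "")
    m = re.match(r"^(\d{1,2})[/-](\d{1,2})[/-]\d{2,4}$", date)
    if m and int(m.group(1)) <= 12 and int(m.group(2)) <= 12:
        warnings.append(f"ambiguous day/month order in eventDate: {date!r}")

    # The (0, 0) coordinate dodge: "we don't know where it is".
    if rec.get("decimalLatitude") == 0 and rec.get("decimalLongitude") == 0:
        warnings.append("coordinates are 0,0 (probably a placeholder)")

    return warnings

# Invented sample records, one messy and one clean.
records = [
    {"eventDate": "03/04/2021", "decimalLatitude": 0, "decimalLongitude": 0},
    {"eventDate": "2021-04-03", "decimalLatitude": -41.5, "decimalLongitude": 146.7},
]

for i, rec in enumerate(records):
    for w in flag_record(rec):
        print(f"record {i}: {w}")
```

The first record triggers both warnings; the second, in unambiguous ISO 8601 date format and with real coordinates, passes silently. A Dataman would then work through the flagged records by hand.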

