For some years now I’ve been arguing for third-party services to be interposed between collection datasets and aggregators like GBIF.
The services would be human, not coded, because people are the best data cleaners and because there are data problems in collection management systems (CMSes) that are impractical to fix with code.
Not many in the biodiversity informatics community have looked kindly on this suggestion, so to advance the idea (or maybe to kill it with a single post) I offer the following song to explain how a third-party data-cleaning service would work. The tune is The Wellerman.
There once was a coll with a CMS
Whose records were a dreadful mess
Sadly did the staff confess
“They’re really not ready to go”

Chorus
Soon may the Dataman come
To fix our records for a modest sum
Then, when the tidying’s done,
We’ll share them in Darwin Core!

Formatting dates is most accursed
Should day or month or year go first?
At times one way, at times reversed
That’s just the status quo

Chorus

When not sure where a place is at
We have a way to deal with that
A zero long and zero lat
Is how we make that show

Chorus

Free-text fields allow for staff who’d
Enter things to be reviewed
So many entries now are queued
And checking is quite slow

Chorus

When typing name or place or date
These things we oft abbreviate
They’re hard to disambiguate
Unless you’re in the know

Chorus

Look-up tables aren’t our norm
We enter names in many a form
In lists each name will have a swarm
Of variants below
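
Code can't do the Dataman's job, but it can point him at the records that need him. What follows is a minimal sketch only, not a description of any real service: the records are invented, the function names (`ambiguous_date`, `zero_coords`, `name_key`, `flag_for_review`) are hypothetical, and I'm assuming the fields have already been exported under the Darwin Core terms `eventDate`, `decimalLatitude`, `decimalLongitude` and `recordedBy`. It flags three of the problems in the verses (day/month ambiguity, 0,0 coordinates and name variants) and leaves the fixing to a person.

```python
import re
from collections import defaultdict

# Invented example records; keys are Darwin Core terms, values are made up.
records = [
    {"eventDate": "03/04/1998", "decimalLatitude": "0", "decimalLongitude": "0",
     "recordedBy": "J. Smith"},
    {"eventDate": "25/12/2001", "decimalLatitude": "-41.2", "decimalLongitude": "146.3",
     "recordedBy": "Smith, J."},
    {"eventDate": "1998-04-03", "decimalLatitude": "-42.9", "decimalLongitude": "147.3",
     "recordedBy": "J Smith"},
]


def ambiguous_date(text):
    """True if a nn/nn/yyyy-style date could be read as day-first or month-first."""
    m = re.fullmatch(r"(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})", text.strip())
    if not m:
        return False
    first, second = int(m.group(1)), int(m.group(2))
    return first <= 12 and second <= 12 and first != second


def zero_coords(lat, lon):
    """True if both coordinates parse as numbers and both are zero."""
    try:
        return float(lat) == 0 and float(lon) == 0
    except (TypeError, ValueError):
        return False


def name_key(name):
    """Crude grouping key: lowercase alphabetic tokens, order ignored."""
    return tuple(sorted(re.findall(r"[a-z]+", name.lower())))


def flag_for_review(recs):
    """Return (flags, swarms): per-record flags and groups of name variants."""
    flags = []
    variants = defaultdict(set)
    for i, rec in enumerate(recs):
        if ambiguous_date(rec.get("eventDate", "")):
            flags.append((i, "ambiguous date"))
        if zero_coords(rec.get("decimalLatitude"), rec.get("decimalLongitude")):
            flags.append((i, "0,0 coordinates"))
        name = rec.get("recordedBy", "")
        if name:
            variants[name_key(name)].add(name)
    # Keep only keys that have more than one spelling.
    swarms = {k: sorted(v) for k, v in variants.items() if len(v) > 1}
    return flags, swarms


flags, swarms = flag_for_review(records)
print(flags)
print(swarms)
```

Note what the sketch doesn't do: it doesn't decide whether 03/04/1998 is 3 April or 4 March, where the 0,0 record was really collected, or whether every "J Smith" is the same person. Those calls need someone in the know.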
Robert Mesibov (“datafixer”); robert.mesibov@gmail.com