I recently looked at a dataset containing thousands of unusable occurrence records.
To at least partly protect the identity of the data publisher I won’t name them here. (No, it wasn’t the Natural History Museum (UK), my usual go-to source for Horrible Data Examples.)
It took some time to contact what I might call a publisher’s representative. The dataset had gone through multiple hands, something like this:
(1) Original compiler > (2) project personnel > (3) IPT personnel
What was missing in this pipeline was an assignment of responsibility.
(1) might be assumed to be responsible for dataset quality but didn’t check what they produced.
(2) might be assumed to be responsible for project data quality but didn’t check what (1) passed on.
(3) might be assumed to be responsible for what goes on the IPT but didn’t check what (2) passed on.
After harvesting the dataset from the IPT, GBIF processed it and flagged the error that made those thousands of records unusable. However, GBIF has long made it clear that it isn't responsible for data content:
GBIF Secretariat provides a publication framework for biodiversity data, but is neither the owner nor custodian of such data, and therefore is not responsible for the actual content served by Data Publishers.
I don’t know whether (3), (2) or (1) noticed the GBIF flags in the dataset. Even if they did, they aren’t obliged to fix any errors flagged by GBIF, and GBIF won’t stop harvesting datasets from the same publishers in future if the existing errors aren’t fixed. The flags are there for data users.
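For data users who do want to act on those flags, here is a minimal sketch of how the counts of flagged records in a published dataset might be pulled from GBIF's public occurrence API. This isn't anything the publisher or GBIF supplies; the dataset key is a placeholder and the issue names are just a sample of the flags GBIF can apply during processing.

```python
# Sketch only: tally a few GBIF issue flags for one dataset via the
# occurrence search API. The dataset key below is a placeholder UUID.
import requests

API = "https://api.gbif.org/v1/occurrence/search"
DATASET_KEY = "00000000-0000-0000-0000-000000000000"  # placeholder

# A sample of the issue flags GBIF can attach to records during processing
ISSUES = [
    "ZERO_COORDINATE",
    "COUNTRY_COORDINATE_MISMATCH",
    "TAXON_MATCH_NONE",
    "RECORDED_DATE_INVALID",
]

for issue in ISSUES:
    # limit=0 asks for the record count only, not the records themselves
    resp = requests.get(
        API, params={"datasetKey": DATASET_KEY, "issue": issue, "limit": 0}
    )
    resp.raise_for_status()
    print(f"{issue}: {resp.json()['count']} records flagged")
```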
Those “might be assumed” statements above are meaningless, because in the contemporary world of biodiversity data there is no formal responsibility for data content. Compilers, data managers and IPT personnel don’t suffer penalties for sharing unusable data, and don’t get rewarded for fixing data problems.
This isn't a technical problem; it's a social or management one. What solutions are available?