I recently looked at a dataset containing thousands of unusable occurrence records.
To at least partly protect the identity of the data publisher I won’t name them here. (No, it wasn’t the Natural History Museum (UK), my usual go-to source for Horrible Data Examples.)
It took some time to contact what I might call a publisher’s representative. The dataset had gone through multiple hands, something like this:
(1) Original compiler > (2) project personnel > (3) IPT personnel
What was missing in this pipeline was an assignment of responsibility.
(1) might be assumed to be responsible for dataset quality but didn’t check what they produced.
(2) might be assumed to be responsible for project data quality but didn’t check what (1) passed on.
(3) might be assumed to be responsible for what goes on the IPT but didn’t check what (2) passed on.
After harvesting the dataset from the IPT, GBIF processed it and flagged the error that made those thousands of records unusable. However, GBIF has long made it clear that it isn't responsible for data content:
GBIF Secretariat provides a publication framework for biodiversity data, but is neither the owner nor custodian of such data, and therefore is not responsible for the actual content served by Data Publishers.
I don’t know whether (3), (2) or (1) noticed the GBIF flags in the dataset. Even if they did, they aren’t obliged to fix any errors flagged by GBIF, and GBIF won’t stop harvesting datasets from the same publishers in future if the existing errors aren’t fixed. The flags are there for data users.
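For data users who do want to act on those flags, here is a minimal sketch of how the counts of flagged records in a published dataset might be pulled from GBIF's public occurrence API. This isn't anything the publisher or GBIF supplies; the dataset key is a placeholder and the issue names are just a sample of the flags GBIF can apply during processing.

```python
# Sketch only: tally a few GBIF issue flags for one dataset via the
# occurrence search API. The dataset key below is a placeholder UUID.
import requests

API = "https://api.gbif.org/v1/occurrence/search"
DATASET_KEY = "00000000-0000-0000-0000-000000000000"  # placeholder

# A sample of the issue flags GBIF can attach to records during processing
ISSUES = [
    "ZERO_COORDINATE",
    "COUNTRY_COORDINATE_MISMATCH",
    "TAXON_MATCH_NONE",
    "RECORDED_DATE_INVALID",
]

for issue in ISSUES:
    # limit=0 asks for the record count only, not the records themselves
    resp = requests.get(
        API, params={"datasetKey": DATASET_KEY, "issue": issue, "limit": 0}
    )
    resp.raise_for_status()
    print(f"{issue}: {resp.json()['count']} records flagged")
```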
Those “might be assumed” statements above are meaningless, because in the contemporary world of biodiversity data there is no formal responsibility for data content. Compilers, data managers and IPT personnel don’t suffer penalties for sharing unusable data, and don’t get rewarded for fixing data problems.
This isn't a technical problem; it's a social or management one. What solutions are available?