9. Workforce capacity development and inclusivity

On skills:

In the “100 GBIF datasets, improved” post I showed that many compilers of biodiversity data “do not have the necessary knowledge or skills to produce tidy, complete, consistent and Darwin Core-compliant datasets”.

Some of the present discussion about increasing workforce capacity and inclusivity seems to be about the democratisation of the necessary knowledge and skills - getting more people competent. This sounds like a solution to the problem of low data quality, along the lines of “If you give a man a fish, you feed him for a day. If you teach a man to fish, you feed him for a lifetime”.

That’s indeed a solution for an individual biodiversity data worker, but it isn’t a solution for the global problem.

The “100 GBIF datasets, improved” post was about a demonstrably successful solution to the global problem: put “gatekeeper” data specialists between data compilers and data disseminators, to look for data problems and to advise compilers on what, exactly, needs to be fixed.

Of course you can teach vehicle operators how to service the car or truck they drive, but wouldn’t you expect a better-serviced vehicle fleet if the servicing was done by trained vehicle mechanics?

A discussion on skills would benefit from consideration of (a) how to recruit data specialists for “gatekeeping” roles and (b) how to insert data specialists between compilers and disseminators.

A few words about (b): The aggregator operating model (I’m reluctant to call it a “business model”) doesn’t require the aggregator to serve high-quality data. The operating model assumes that end-users will do data cleaning. To assist end users, aggregators check for a small set of data problems and attach flags to individual records. Data providers can also see these flags, but are not required to do anything about them. The quality bar for outright rejection - how awful does data have to be before it doesn’t get aggregated at all? - is set remarkably low.
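To make that concrete, here is a minimal sketch of what record-level flagging looks like. This is my own toy illustration, not GBIF’s or any other aggregator’s actual implementation: the flag names, checks and rejection rule are invented for the example.

```python
# Illustrative only: a toy version of aggregator-style record flagging.
# Flag names, checks and the rejection rule are invented for this sketch.

def flag_record(record):
    """Return a list of quality flags for one Darwin Core-style record."""
    flags = []
    if not record.get("decimalLatitude") or not record.get("decimalLongitude"):
        flags.append("MISSING_COORDINATES")
    if not record.get("eventDate"):
        flags.append("MISSING_EVENT_DATE")
    if not record.get("scientificName"):
        flags.append("MISSING_SCIENTIFIC_NAME")
    return flags

def aggregate(records):
    """Attach flags to every record; reject only the emptiest ones."""
    accepted = []
    for rec in records:
        rec["flags"] = flag_record(rec)
        # The rejection bar is deliberately low: a record is dropped only
        # if it fails every check, i.e. it has no name, no date and no place.
        if len(rec["flags"]) < 3:
            accepted.append(rec)
    return accepted

occurrences = [
    {"scientificName": "Apis mellifera", "eventDate": "2021-03-14"},
    {"decimalLatitude": "-35.28", "decimalLongitude": "149.13"},
]
for rec in aggregate(occurrences):
    print(rec["flags"])
```

Note that nothing in this model obliges anyone to act on the flags: they travel with the record, and fixing the underlying problem is left to the provider or the end-user.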

Just as aggregators aren’t really troubled when data quality is dreadful, they aren’t really enthusiastic when data quality is excellent. In GBIF’s case, the work described in “100 GBIF datasets, improved” is evidently seen as a private arrangement between data providers and Pensoft, and GBIF has never, to my knowledge, directed a data provider to Pensoft or to any other third-party data-checking service.

To sum up the last two paragraphs, I think we can forget the aggregators. Other participants in this discussion may have a different view, but I think the most that aggregators would be willing to contribute to “gatekeeping” would be advice to data publishers that data-checking services exist.

Now back to (a). Data specialists already exist. They’re turned out every year by information and library science courses, and they also work in the corporate world as “data scientists” (which sounds glamorous until you hear that “80% of a data scientist’s time is spent cleaning data”). The training required to bring a data specialist up to speed, so that they can turn messy biodiversity data into tidy, complete, consistent and Darwin Core-compliant records, needn’t be either long or taxing.
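As a rough indication of the kind of work involved, here is a sketch of dataset-level checks a “gatekeeper” specialist might run before publication. The term names are genuine Darwin Core terms, but the particular rules and the audit function itself are my own invention for illustration.

```python
# Illustrative only: dataset-level checks of the kind a "gatekeeper"
# data specialist might run. Term names are real Darwin Core terms;
# the rules and the reporting format are invented for this sketch.

import re
from collections import Counter

ISO_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")  # YYYY, YYYY-MM or YYYY-MM-DD

def audit(records):
    """Report completeness and consistency problems across a whole dataset."""
    report = []
    # Completeness: how often is each core term actually filled in?
    for term in ("scientificName", "eventDate", "basisOfRecord", "country"):
        filled = sum(1 for r in records if r.get(term, "").strip())
        report.append(f"{term}: {filled}/{len(records)} records populated")
    # Consistency: spelling variants in a controlled-vocabulary column
    # usually mean the same value entered several different ways.
    variants = Counter(r.get("basisOfRecord", "").strip() for r in records)
    if len(variants) > 1:
        report.append(f"basisOfRecord has {len(variants)} distinct values: {dict(variants)}")
    # Format: eventDate should be an ISO 8601 date, not free text.
    bad_dates = [r["eventDate"] for r in records
                 if r.get("eventDate") and not ISO_DATE.match(r["eventDate"])]
    if bad_dates:
        report.append(f"non-ISO eventDate values: {bad_dates}")
    return report

dataset = [
    {"scientificName": "Apis mellifera", "eventDate": "2021-03-14",
     "basisOfRecord": "HumanObservation", "country": "Australia"},
    {"scientificName": "Apis mellifera", "eventDate": "14/3/2021",
     "basisOfRecord": "human observation", "country": ""},
]
for line in audit(dataset):
    print(line)
```

None of this is hard to teach; the harder part, as argued above, is knowing which checks matter and how to tell a compiler exactly what to fix.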

Discussion point: How to set up that specialist training?

Next discussion point: recruit data specialists for what? Apart from Pensoft, is there any organisation in the world that carefully checks biodiversity datasets from any and all sources? Where, and for what reward, would data specialists be working?

To half-answer my own question, I think there might already be in-house data specialists at some of the larger museums/herbaria. Is there a way to expand their remit, so that they clean not only their own institution’s data, but also data from other institutions, as part of their paid work?