On skills:
In the "100 GBIF datasets, improved" post I showed that many compilers of biodiversity data "do not have the necessary knowledge or skills to produce tidy, complete, consistent and Darwin Core-compliant datasets".
Some of the present discussion about increasing workforce capacity and inclusivity seems to be about the democratisation of the necessary knowledge and skills - getting more people competent. This sounds like a solution to the problem of low data quality, along the lines of "If you give a man a fish, you feed him for a day. If you teach a man to fish, you feed him for a lifetime".
That's indeed a solution for an individual biodiversity data worker, but it isn't a solution for the global problem.
The "100 GBIF datasets, improved" post was about a demonstrably successful solution to the global problem: put "gatekeeper" data specialists between data compilers and data disseminators, to look for data problems and to advise compilers on what, exactly, needs to be fixed.
Of course you can teach vehicle operators how to service the car or truck they drive, but wouldnât you expect a better-serviced vehicle fleet if the servicing was done by trained vehicle mechanics?
A discussion on skills would benefit from consideration of (a) how to recruit data specialists for "gatekeeping" roles and (b) how to insert data specialists between compilers and disseminators.
A few words about (b): The aggregator operating model (I'm reluctant to call it a "business model") doesn't require the aggregator to serve high-quality data. The operating model assumes that end-users will do data cleaning. To assist end users, aggregators flag a small set of data problems and attach flags to individual records. Data providers can also see these flags, but are not required to do anything about them. The threshold for outright data rejection - how awful does data have to be before it doesn't get aggregated? - is set remarkably low.
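The flag-and-pass-through approach can be sketched roughly as follows. This is an illustrative Python sketch, not any aggregator's actual pipeline: the field names are standard Darwin Core terms, but the flag names and checks are invented for the example. Note that the record is flagged, never rejected - cleaning is left to the end user.

```python
# Sketch of aggregator-style record flagging (illustrative only).
# Field names are Darwin Core terms; flag names are hypothetical.

def flag_record(record: dict) -> list[str]:
    """Return data-quality flags for one occurrence record.
    The record is never rejected; flags are simply attached."""
    flags = []
    if not record.get("scientificName"):
        flags.append("MISSING_SCIENTIFIC_NAME")
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is None or lon is None:
        flags.append("MISSING_COORDINATES")
    else:
        try:
            if not (-90 <= float(lat) <= 90 and -180 <= float(lon) <= 180):
                flags.append("COORDINATES_OUT_OF_RANGE")
        except ValueError:
            flags.append("COORDINATES_NOT_NUMERIC")
    if not record.get("eventDate"):
        flags.append("MISSING_EVENT_DATE")
    return flags

record = {"scientificName": "Apis mellifera",
          "decimalLatitude": "95.0", "decimalLongitude": "151.2"}
print(flag_record(record))  # ['COORDINATES_OUT_OF_RANGE', 'MISSING_EVENT_DATE']
```

The point of the sketch is the design choice: every outcome is a flag attached to the record, so even a record with an impossible latitude still flows through to end users.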
Just as aggregators aren't really troubled when data quality is dreadful, they aren't really enthusiastic when data quality is excellent. In GBIF's case, the work described in "100 GBIF datasets, improved" is evidently seen as a private arrangement between data providers and Pensoft, and GBIF has never, to my knowledge, directed a data provider to Pensoft or to any other third-party data-checking service.
To sum up the last two paragraphs, I think we can forget the aggregators. Other participants in this discussion may have a different view, but I think the most that aggregators would be willing to contribute to "gatekeeping" would be advice to data publishers that data-checking services exist.
Now back to (a). Data specialists already exist. They're turned out every year by information and library science courses, and they also work in the corporate world as "data scientists" (which sounds glamorous until you hear that "80% of a data scientist's time is spent cleaning data"). The training required to bring a data specialist up to speed to turn messy biodiversity data into tidy, complete, consistent and Darwin Core-compliant records needn't be either long or taxing.
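The kind of work such a specialist does goes beyond per-record flags: it includes dataset-level consistency checks. Here is a minimal Python sketch of two such checks - duplicated occurrenceID values and one scientificName mapped to more than one family. The field names are Darwin Core terms, but the checks themselves are hypothetical examples, not any gatekeeper's actual workflow.

```python
# Sketch of dataset-level "gatekeeper" checks (hypothetical examples).
# These catch inconsistencies across records that per-record flags cannot.
from collections import defaultdict

def check_dataset(records: list[dict]) -> list[str]:
    """Return human-readable problem reports for a whole dataset."""
    problems = []
    # Duplicate occurrenceID values break record-level uniqueness.
    seen = set()
    for r in records:
        oid = r.get("occurrenceID", "")
        if oid in seen:
            problems.append(f"duplicate occurrenceID: {oid}")
        seen.add(oid)
    # The same scientificName placed in two families is internally inconsistent.
    families = defaultdict(set)
    for r in records:
        if r.get("scientificName") and r.get("family"):
            families[r["scientificName"]].add(r["family"])
    for name, fams in families.items():
        if len(fams) > 1:
            problems.append(f"inconsistent family for {name}: {sorted(fams)}")
    return problems

data = [
    {"occurrenceID": "1", "scientificName": "Apis mellifera", "family": "Apidae"},
    {"occurrenceID": "1", "scientificName": "Apis mellifera", "family": "Vespidae"},
]
for problem in check_dataset(data):
    print(problem)
```

Reports like these are the kind of specific, actionable advice a gatekeeper can hand back to a compiler, rather than leaving the cleaning to end users.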
Discussion point: How to set up that specialist training?
Next discussion point: recruit data specialists for what? Apart from Pensoft, is there any organisation in the world that carefully checks biodiversity datasets from any and all sources? Where, and for what reward, would data specialists be working?
To half-answer my own question, I think there might already be in-house data specialists at some of the larger museums/herbaria. Is there a way to expand their remit, so that they clean not only their own institution's data but also data from other institutions, as part of their paid work?