For the third time there was no winner in my “Darwin Core (Half)Million” competition, and in this latest round there was only one entrant.
My AUD$150 is safe for another year but I will continue to audit museum and herbarium datasets. I’ve so far done 25 datasets in my auditing work for Pensoft Publishers, and another 38 as free-of-charge “scoping” audits for institutions worldwide. Among the 38 were a few potential “Darwin Core Million” winners, together with some very disappointing collection datasets. Several well-known institutions in the UK top the “could do MUCH better” list.
I wonder if most institutional publishers simply don’t worry in 2022 about the quality of their collection data. If the CMS suits the institution and its personnel, why work to improve CMS data quality? As I’ve written before,
Museums and herbaria don’t get rewards, kudos, more visitors, more funding or more publicity if staff improve the quality of their collection data, and they don’t get punishments, opprobrium, fewer visitors, reduced funding or less publicity if the data remain messy.
When CMS data is reformatted in Darwin Core (DwC), it becomes share-able through GBIF and other aggregators. GBIF flags a subset of the data problems in the DwC dataset, but those flags are almost universally ignored by publishers. Why bother fixing the flagged issues in the shared dataset, when the dataset that really matters to the institution, in the CMS, is good enough for in-house purposes?
Here’s another radical perspective. On the day I write this (31 October), GBIF indexes 2.2 billion occurrence records with occurrenceStatus “present”. Of the 2.2 billion, only about 220 million have “PreservedSpecimen” or “FossilSpecimen” as the basis of the record. Using those values as proxies, museums and herbaria only contribute about 10% of GBIF’s occurrence records.
Of all the “present” occurrence records, 86% are “HumanObservation” and three-quarters of those observations (1.4 billion) are of birds. In fact, of the 200 most frequently recorded species in all of GBIF’s occurrence records, 196 are bird species.
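As a quick check on the arithmetic, here is a minimal sketch that uses only the rounded figures quoted above. Nothing is fetched from GBIF, and the variable names are mine.

```python
# Rounded figures quoted in the post (GBIF, 31 October 2022); not fetched live.
total_present = 2.2e9      # occurrence records with occurrenceStatus "present"
specimen_records = 220e6   # PreservedSpecimen + FossilSpecimen
human_obs_share = 0.86     # share of "present" records that are HumanObservation
bird_obs = 1.4e9           # HumanObservation records of birds

specimen_share = specimen_records / total_present
human_obs = human_obs_share * total_present
bird_share_of_obs = bird_obs / human_obs
bird_share_of_all = bird_obs / total_present

print(f"Specimen share of all records: {specimen_share:.0%}")
print(f"Bird share of human observations: {bird_share_of_obs:.0%}")
print(f"Bird share of all records: {bird_share_of_all:.0%}")
```

On these rounded inputs, bird observations alone come to a bit under two-thirds of all “present” records, which is consistent with the “three-quarters of 86%” statement.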
While GBIF isn’t (yet) an acronym for “Global Bird Information Facility”, those numbers are impressive and suggest something else. Is it possible that museums and herbaria now see the sharing of data through GBIF and other aggregators as something that perhaps might still be done in 2022 to enhance the institution’s profile, but not something to spend much time on?
Staff might be thinking that “HumanObservation” records are likely to be increasing a lot faster than “PreservedSpecimen” and “FossilSpecimen” records, so the 10% figure for museum and herbarium occurrence records in GBIF is likely to continue dropping. If GBIF becomes principally a gateway to data from eBird, iNaturalist and other platforms, will its value for museums and herbaria decline? Will the migration of CMS data to Darwin Core, and the updating of that shared data for aggregators, become chores that institutions will increasingly avoid?
Disclaimer - I did not enter as my dataset (KUBI Ichthyology Collection) only contains about 43,000 occurrence records - although that represents about 600,000 specimens (fish collections work in lots). I will also say that a lot of potential applicants may have been put off by the possibility of having to pay you $150.
I don’t think this is strictly true. With the ever-increasing use of collections data in niche modeling exercises where users have no interest in loaning the physical specimens, clean data is at a premium and citation of records from a museum is a metric that we can use to show use and attribution for advocacy. Clean records are more likely to be used than messy ones so I think there is some value in cleaning these. There is also a second reason for cleaning these - although they may be secondarily cleaned by the aggregator in question, my data is available from numerous different aggregators (GBIF, iDigBio, Vertnet, Fishnet, OBIS) as well as my own local web portal (https://collections.biodiversity.ku.edu/KU_Fish_voucher/). As such, there is value in cleaning and augmenting the data to ensure consistency across all of these avenues. I know that I personally have used the data cleanup tools in both GBIF and iDigBio to clean my data as much as possible for these exact reasons.
No, observation records will always outstrip museum records as they are easier to collect and are collected by a much larger user base. That being said, there is increased value in a record that is backed up by a voucher specimen that can verify identification and can be interrogated and used in myriad other ways that are precluded by an observation record. Museum voucher records will always be intrinsically more valuable than observation records for this reason.
I doubt it. It has been shown that publishing your museum records to aggregators (the more the better) increases loan traffic and use of collections in general. These are important metrics used by museums to show the value of their collections to the research enterprise which translates into more use, more visitors, more funding and other downstream benefits. I would hope this would be evident to all collections. Also, publishing your data represents an opportunity to expose a lot of the ancillary data associated with the dataset which makes them more valuable - images, sequences, citations, etc.
Of course, there is also now the prospect of a Digital Extended Specimen architecture that will involve collections and aggregators in creating a universal data integration platform that will allow collections to link their datasets to other datasets and records to increase their value and usefulness.
Then those potential applicants didn’t read the DwC competition post carefully:
“If I find serious data quality problems, I will let you know by email. If you want to learn what the problems are, I will send you a “scoping audit” explaining what should be fixed and I’ll charge your institution AUD$150.”
The one unsuccessful entrant this year wasn’t charged.
While your arguments are good ones for the continuing (and in some cases increasing) value of clean institutional DwC records, you need to explain why, then, most institutions (a) don’t clean their data and (b) ignore the problems flagged by GBIF when preparing updates. This failure is the issue about which I’m speculating.
The new DES data model (especially as promoted by DiSSCo) represents another opportunity for institutions to clean their CMS data before migrating it to a new structure. Do you think that will happen?
I think in some cases it seems like a daunting prospect - especially for large collections. This involves a lot of work that most collections personnel may not have the necessary data experience to do in effective ways (batch editing, scripts, etc.). There are also a LOT of demands on collections staff and this unfortunately is at the bottom of a very long list of other priorities that take precedence. In other cases I think collections may just not be aware of the utility of these metrics for cleaning data. In still other cases, as with most of the flags left on my collection, they are not actually data errors but rather inaccuracies in the way the data cleanup tools are implemented. For instance, all remaining Continent invalid flags in my data are due to marine organisms collected in oceans and seas where continent is used as a proxy due to few other viable mechanisms for handling this data. The same is true for taxon fuzzy match and taxon match higherrank - these are all valid species or disagreements in higher level taxonomy that are not in their backbone taxonomy. The recorded date invalid flag is also a GBIF issue as all of these records have valid dates. I have put many hours into cleaning up the data but the remaining issues are ones I am unable to take care of.
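For anyone who wants to see which flags dominate in their own dataset, here is a minimal sketch, not an official GBIF tool. It assumes a tab-separated occurrence download with a semicolon-separated `issue` column. The sample rows are invented for illustration, and the function name `count_issue_flags` is hypothetical.

```python
import csv
import io
from collections import Counter

def count_issue_flags(tsv_text, issue_column="issue"):
    """Tally semicolon-separated flags in the issue column of a TSV export."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    counts = Counter()
    for row in reader:
        for flag in (row.get(issue_column) or "").split(";"):
            flag = flag.strip()
            if flag:
                counts[flag] += 1
    return counts

# Invented sample rows (assumption): real downloads have many more columns.
sample = (
    "gbifID\tscientificName\tissue\n"
    "1\tExample species A\tCONTINENT_INVALID;TAXON_MATCH_FUZZY\n"
    "2\tExample species B\tCONTINENT_INVALID\n"
    "3\tExample species C\t\n"
    "4\tExample species D\tTAXON_MATCH_HIGHERRANK;RECORDED_DATE_INVALID\n"
)

flag_counts = count_issue_flags(sample)
for flag, n in flag_counts.most_common():
    print(f"{n}\t{flag}")
```

A tally like this makes it easier to separate flags that point to real errors from those that come from how the checks are implemented, such as the continent flag on marine records.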
On a broader scale there are also issues with different flags in different aggregators and some not having any which is confusing to the community. I have been advocating for some time for a central data store that collections publish their data to that is then used by all aggregators as their data source. That way all data is always up to date, things like backbone taxonomy and cleanup flags can be implemented once on all the data and aggregators can be left to innovate at the UI level in presenting the data in various different ways for their respective communities. This will assist greatly in the DES ideal and present a single set of flags that the community can also assist in correcting once the annotations are implemented.
@bentley: “I have put many hours into cleaning up the data but the remaining issues are ones I am unable to take care of.”
Many thanks for gently reminding GBIF about their long-standing flagging issues. On the KUBI dataset, I would be happy to do a free-of-charge scoping audit. Let me know off-list whether you would prefer it done on the DwC dataset offered to GBIF or on a CMS dump.
“I have been advocating for some time for a central data store that collections publish their data to that is then used by all aggregators as their data source.”
Would that work something like Arctos? There are still data quality issues in the Arctos datasets, although the Arctos admins and volunteers do a great job in attempting to squash these.
No, it would work similarly to how it does now but each aggregator would not hold its own cache. The cache would be centrally managed (and funded) and would be used by all aggregators as their data repository. Aggregators could then display a subset of the data based on any geographic or taxonomic requirement (Vertnet - only vertebrates, iDigBio - only US collections, etc.) much like GBIF-hosted portals work now. This would simplify the publishing process in that, as a provider, I would simply publish to the central cache and not to each aggregator. There could be a mechanism for publishers to decide which aggregators they wish their data to appear in - simple checkboxes would do.
There is momentum and interest around DES and the services we are piloting. This will take some time. In the next few months we will be talking with GBIF and iDigBio to see how we can broaden our scope for the pilot.
The community annotation and curation services can provide the following advantages:
The institutions and the CMS can keep their infrastructure and data model but still take advantage of new annotations and enrichment (of course, some adjustments need to be made on the CMS side to receive the data).
Introduction of the persistent identifiers at the digital specimen and annotation level provides granularity and linking of different digital objects.
These objects can provide the base for large scale data quality checks and annotation services. Here’s a test annotation digital object with a PID. Serialisation of such records (JSON or JSON-LD form) can be fed back to the CMS or other systems. Some of the basic data checks and annotations can easily be automated.
We can also use these identifiers for data citation and attribution (we are also thinking about authentication, authorisation, trust and verification methods for these annotation objects, which also will not be easy).
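To make the serialisation point above a little more concrete, here is a purely illustrative sketch of an annotation record round-tripped through JSON. The field names, the `Annotation` class and the `example.org` identifiers are all assumptions for illustration. This is not the DiSSCo data model or any CMS API.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Annotation:
    # All field names are hypothetical, chosen only for this sketch.
    pid: str              # persistent identifier of the annotation itself
    target_pid: str       # persistent identifier of the digital specimen
    term: str             # the record field the annotation refers to
    issue: str            # what the check found
    suggested_value: str  # proposed correction, for a curator to accept or reject
    creator: str          # person or service that made the annotation

def to_json(annotation):
    """Serialise an annotation to a JSON string."""
    return json.dumps(asdict(annotation), indent=2, sort_keys=True)

def from_json(text):
    """Rebuild an annotation from its JSON form."""
    return Annotation(**json.loads(text))

example = Annotation(
    pid="https://example.org/pid/annotation-0001",      # placeholder PID
    target_pid="https://example.org/pid/specimen-0001",  # placeholder PID
    term="dwc:eventDate",
    issue="date not parseable",
    suggested_value="1998-07-14",
    creator="automated-date-check",
)

serialised = to_json(example)
restored = from_json(serialised)
print(serialised)
```

Because the annotation carries its own identifier and that of its target, a receiving system could in principle match it to a local record without sharing the sender's data model.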
However, we still need a few basic things in place.
As @abentley already pointed out – collections staff and the museums do not have the capacity to do some of these data clean up tasks. Automating and opening up the records will help. And it is easy to say FAIR this and FAIR that. But most museums do not have proper data steward roles and relevant data management training. DiSSCo will help with some of this capacity building, but each institution and the funding agencies need to support more data management tasks. With this training and capacity as a foundation, we will start seeing the benefit of community curation or annotation at scale.
@sharif.islam, I’m glad to hear that the DiSSCo process is working on data quality mechanisms! But of course (as you know) it is so much better to fix the data at source than to expose messy data and annotate it, and hope that the publisher does something about it in the CMS, so that the next export/migration doesn’t result in the same annotations.
There are three currently available mechanisms for “fixing at source”, i.e. before DiSSCo gets the data:
(1) Do capacity building in the institutions. I would be happy to assist with this. I train people in command-line methods that are faster, simpler, more flexible and more comprehensive than programs like OpenRefine and the various R packages for data cleaning.
(2) Hire data specialists to fix the data before it is exported/migrated from the institution to DiSSCo. These people would be contracted by the institution and would work with institution staff.
(3) Hire data specialists to fix the data after it is exported/migrated from the institution to DiSSCo. These people would be contracted by or through DiSSCo.
None of these three approaches requires the institution to change what it has in its CMS. These are fixing mechanisms for the shared form of the data that will greatly reduce the need for record annotation and the difficulties of “authentication, authorisation, trust and verification methods for … annotation objects”.
Thanks @datafixer et al. for sharing your notes on data quality, along with detailed breakdowns of the datasets indexed and interpreted by GBIF.
To me, your comments show the value of independent peer-review of datasets. And, with currently available point-and-click web tools, a dataset review remains tedious, time consuming, and highly specialized work: I don’t blame the reviewers of (data) papers for being tempted to skip over those data appendices.
Also, who would want to be that reviewer that says:
“Dear Author, I wasn’t able to align the taxonomic name “Aglais io” you recorded in lines 4313,9626,9680,22317,22327 in the data attachment Supplement+1_+Garden+plant-pollinator+data.csv, which was manually exported using LibreOffice Calc v22.214.171.124 on 2022-10-21 after downloading the source Supplement+1_+Garden+plant-pollinator+data.xlsx with content id hash://sha256/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 from https://pollinationecology.org/index.php/jpe/article/view/695/344 associated with Ollerton, J., et al. 2022.”
And being a reviewer is especially tricky when it turns out that a specific 2022 spring version of the Catalogue of Life (Banki et al. 2022; Poelen, 2022a), aka “The most complete authoritative list of the world’s species - maintained by hundreds of global taxonomists”, didn’t include the butterfly name “Aglais io”, first reported by Linnaeus in 1758 (Poelen, 2022b). In my digital forensics on this name alignment issue, I was able to trace the provenance (or origin) of the claims (or in this case non-claim) using specialized tools and data publications. And as this provenance chain is not readily available in the provided point-and-click web tools, I suspect many others will be unable to get to the bottom of these funky name alignment issues.
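For readers unfamiliar with content ids of the `hash://sha256/...` form used in the mock review comment, here is a minimal sketch of how one could be computed and how the line numbers for a name could be listed. It uses only the Python standard library and is not the specialized tooling mentioned above. The helper names and the tiny sample table are invented for illustration.

```python
import hashlib

def content_id(data: bytes) -> str:
    """Return a hash://sha256/ identifier for the exact bytes given."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()

def lines_mentioning(text: str, name: str) -> list:
    """Return 1-based line numbers of lines that contain the given name."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if name in line
    ]

# Invented sample table (assumption), standing in for an exported CSV.
sample_csv = (
    "plant,pollinator\n"
    "Buddleja davidii,Aglais io\n"
    "Lavandula sp.,Bombus terrestris\n"
    "Buddleja davidii,Aglais io\n"
)

cid = content_id(sample_csv.encode("utf-8"))
hits = lines_mentioning(sample_csv, "Aglais io")
print(cid)
print(hits)
```

The identifier depends only on the bytes, so two reviewers holding the same file get the same id, and any re-export that changes even one byte gets a different one.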
And, while I much appreciate the efforts to help establish more meaningful connections between biodiversity datasets and their keepers via elaborate redirection schemes (e.g., a doi for every specimen) and other plans to make plans (e.g., extended digital specimen; Webster et al. 2021), I’d like to spend a significant part of my time making the best of the existing wealth of data, tools, and the many people that keep them alive. Keeping data alive is hard enough (Elliott et al. 2020), and I imagine that improving the datasets is even harder.
I have faith that we’ll get better at reviewing, publishing and archiving datasets, especially when we continue to democratize the tools, infrastructures, education, and societies to help support those that need it most - the hard working folks keeping valuable datasets, and those that put the data to use in their efforts to better understand life on earth.
In short, I’ll continue to try (like @datafixer and the Arctos community, just to name a few) to do my part in getting better at reviewing data. And, in order to get better at reviewing data as a community, not only reviewers and tools are needed, but, more importantly, folks willing and able to act on review comments.
PS. @datafixer - I have not yet been able to independently verify GBIF’s claim that 2B records have been indexed. As far as I know, a recent snapshot of GBIF/iDigBio indexed datasets yielded a few million records over 700M (Salim et al. 2022a; Salim, 2022b), consistent with results from Elliott et al. 2020. But, with the information provided by Salim, 2022, you should be able to independently verify their claims.
Ollerton, J., Trunschke, J., Havens, K., Landaverde-González, P., Keller, A., Gilpin, A.-M., Rodrigo Rech, A., Baronio, G. J., Phillips, B. J., Mackin, C., Stanley, D. A., Treanore, E., Baker, E., Rotheray, E. L., Erickson, E., Fornoff, F., Brearley, F. Q., Ballantyne, G., Iossa, G., Stone, G. N., Bartomeus, I., Stockan, J. A., Leguizamón, J., Prendergast, K., Rowley, L., Giovanetti, M., de Oliveira Bueno, R., Wesselingh, R. A., Mallinger, R., Edmondson, S., Howard, S. R., Leonhardt, S. D., Rojas-Nossa, S. V., Brett, M., Joaqui, T., Antoniazzi, R., Burton, V. J., Feng, H.-H., Tian, Z.-X., Xu, Q., Zhang, C., Shi, C.-L., Huang, S.-Q., Cole, L. J., Bendifallah, L., Ellis, E. E., Hegland, S. J., Straffon Díaz, S., Lander, T. A., Mayr, A. V., Dawson, R., Eeraerts, M., Armbruster, W. S., Walton, B., Adjlane, N., Falk, S., Mata, L., Goncalves Geiger, A., Carvell, C., Wallace, C., Ratto, F., Barberis, M., Kahane, F., Connop, S., Stip, A., Sigrist, M. R., Vereecken, N. J., Klein, A.-M., Baldock, K., & Arnold, S. E. J. (2022). Pollinator-flower interactions in gardens during the COVID-19 pandemic lockdown of 2020. Journal of Pollination Ecology, 31, 87–96. https://doi.org/10.26786/1920-7603(2022)695
Bánki, O., Roskov, Y., Döring, M., Ower, G., Vandepitte, L., Hobern, D., Remsen, D., Schalk, P., DeWalt, R. E., Keping, M., Miller, J., Orrell, T., Aalbu, R., Adlard, R., Adriaenssens, E. M., Aedo, C., Aescht, E., Akkari, N., Alfenas-Zerbini, P., et al. (2022). Catalogue of Life Checklist (Version 2022-03-21). Catalogue of Life. ChecklistBank
Hi @datafixer, thanks for all the great work! I agree that data should be ‘fixed at the source’; however, in DiSSCo we aim to shift the focus of what the source is a bit. Using a classic approach of digitisation, fixing issues in a CMS and then data publishing by the institutions is too slow and restricted by the limited resources in the institutions. DiSSCo infrastructure is designed to become, in the future, the first source for new data, straight from digitisation streets or from collecting new objects in nature, before the data is imported into a CMS. That way it can be quality controlled and enriched before it enters the CMS, making use of external services and capacity to assist with that. When it is of sufficient quality and detail for scientific use it can then be published in GBIF and other data aggregators. Further annotations should be synchronised with the CMS as soon as they become available, and we are talking with the major CMS vendors to make that possible. We have to start with the existing situation, though, with dirty data already in a CMS, published and not yet integrated with DiSSCo, from which every institution needs to go through a digital transformation at its own pace. Capacity building, financial support and coordination at the national level will be key to support this.