Darwin Core Half-Million - UPDATE

Thanks @datafixer et al. for sharing your notes on data quality, along with detailed breakdowns of the datasets indexed and interpreted by GBIF.

To me, your comments show the value of independent peer review of datasets. Yet, with the currently available point-and-click web tools, a dataset review remains tedious, time-consuming, and highly specialized work: I don’t blame the reviewers of (data) papers for being tempted to skip over those data appendices.

Also, who would want to be that reviewer that says:

“Dear Author, I wasn’t able to align the taxonomic name “Aglais io” you recorded in lines 4313, 9626, 9680, 22317, and 22327 of the data attachment Supplement+1_+Garden+plant-pollinator+data.csv, which was manually exported using LibreOffice Calc v7.3.6.2 on 2022-10-21 after downloading the source Supplement+1_+Garden+plant-pollinator+data.xlsx with content id hash://sha256/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 from https://pollinationecology.org/index.php/jpe/article/view/695/344 associated with Ollerton, J., et al. 2022.”

?

And being a reviewer is especially tricky when it turns out that a specific spring 2022 version of the Catalogue of Life (Bánki et al. 2022; Poelen, 2022a), aka “The most complete authoritative list of the world’s species - maintained by hundreds of global taxonomists”, did not include the butterfly name “Aglais io”, first reported by Linnaeus in 1758 (Poelen, 2022b). In my digital forensics on this name alignment issue, I was able to trace the provenance (or origin) of the claims (or, in this case, non-claim) using specialized tools and data publications. And as this provenance chain is not readily available in the provided point-and-click web tools, I suspect many others will be unable to get to the bottom of these funky name alignment issues.
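To give a rough idea of the bookkeeping behind a review comment like the one above, here is a minimal Python sketch. It is not the tooling used for the analysis described here; the function names `content_id` and `lines_mentioning` and the tiny sample table are made up for illustration. It computes a `hash://sha256/...` content id for a blob of bytes and lists the line numbers on which a given name occurs.

```python
import csv
import hashlib
import io


def content_id(data: bytes) -> str:
    """Return a hash://sha256/... identifier for the exact bytes given."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


def lines_mentioning(data: bytes, name: str, encoding: str = "utf-8") -> list[int]:
    """Return 1-based line numbers of CSV records with a cell equal to `name`.

    For a record that spans several physical lines (quoted newlines),
    the number of its last line is reported.
    """
    reader = csv.reader(io.StringIO(data.decode(encoding), newline=""))
    hits = []
    for row in reader:
        if any(cell.strip() == name for cell in row):
            hits.append(reader.line_num)
    return hits


# Hypothetical sample data, not taken from the Ollerton et al. supplement.
sample = (
    b"pollinator,plant\n"
    b"Aglais io,Buddleja davidii\n"
    b"Bombus terrestris,Lavandula\n"
    b"Aglais io,Verbena\n"
)

print(content_id(sample))
print(lines_mentioning(sample, "Aglais io"))  # [2, 4]
```

Because the identifier is derived from the bytes themselves, any re-export of the spreadsheet (different tool, different line endings) yields a different id, which is why a careful review comment has to spell out exactly which copy of the data the line numbers refer to.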

And, while I much appreciate the efforts to help establish more meaningful connections between biodiversity datasets and their keepers via elaborate redirection schemes (e.g., a doi for every specimen) and other plans to make plans (e.g., the extended digital specimen; Webster et al. 2021), I’d like to spend a significant part of my time making the best of the existing wealth of data, tools, and the many people that keep them alive. Keeping data alive is hard enough (Elliott et al. 2020), and I imagine that improving the datasets is even harder.

I have faith that we’ll get better at reviewing, publishing, and archiving datasets, especially when we continue to democratize the tools, infrastructures, education, and societies to help support those that need it most - the hard-working folks keeping valuable datasets, and those that put the data to use in their efforts to better understand life on earth.

In short, I’ll continue to try (like @datafixer and the Arctos community, just to name a few) to do my part in getting better at reviewing data. And, in order to get better at reviewing data as a community, we need not only reviewers and tools but, more importantly, folks willing and able to act on review comments.

-jorrit

PS. @datafixer - I have not yet been able to independently verify GBIF’s claim that 2B records have been indexed. As far as I know, a recent snapshot of GBIF/iDigBio indexed datasets yielded a few million more than 700M records (Salim et al. 2022a; Salim, 2022b), consistent with results from Elliott et al. 2020. But, with the information provided by Salim, 2022, you should be able to independently verify their claims.

References

Ollerton, J., Trunschke, J., Havens, K., Landaverde-González, P., Keller, A., Gilpin, A.-M., Rodrigo Rech, A., Baronio, G. J., Phillips, B. J., Mackin, C., Stanley, D. A., Treanore, E., Baker, E., Rotheray, E. L., Erickson, E., Fornoff, F., Brearley, F. Q., Ballantyne, G., Iossa, G., Stone, G. N., Bartomeus, I., Stockan, J. A., Leguizamón, J., Prendergast, K., Rowley, L., Giovanetti, M., de Oliveira Bueno, R., Wesselingh, R. A., Mallinger, R., Edmondson, S., Howard, S. R., Leonhardt, S. D., Rojas-Nossa, S. V., Brett, M., Joaqui, T., Antoniazzi, R., Burton, V. J., Feng, H.-H., Tian, Z.-X., Xu, Q., Zhang, C., Shi, C.-L., Huang, S.-Q., Cole, L. J., Bendifallah, L., Ellis, E. E., Hegland, S. J., Straffon Díaz, S., Lander, T. A., Mayr, A. V., Dawson, R., Eeraerts, M., Armbruster, W. S., Walton, B., Adjlane, N., Falk, S., Mata, L., Goncalves Geiger, A., Carvell, C., Wallace, C., Ratto, F., Barberis, M., Kahane, F., Connop, S., Stip, A., Sigrist, M. R., Vereecken, N. J., Klein, A.-M., Baldock, K., & Arnold, S. E. J. (2022). Pollinator-flower interactions in gardens during the COVID-19 pandemic lockdown of 2020. Journal of Pollination Ecology, 31, 87–96. https://doi.org/10.26786/1920-7603(2022)695

Bánki, O., Roskov, Y., Döring, M., Ower, G., Vandepitte, L., Hobern, D., Remsen, D., Schalk, P., DeWalt, R. E., Keping, M., Miller, J., Orrell, T., Aalbu, R., Adlard, R., Adriaenssens, E. M., Aedo, C., Aescht, E., Akkari, N., Alfenas-Zerbini, P., et al. (2022). Catalogue of Life Checklist (Version 2022-03-21). Catalogue of Life. ChecklistBank

Poelen, Jorrit H. (2022a). Nomer Corpus of Taxonomic Resources hash://sha256/6224f259190590c7aed4784de2b27b3005eea0042ae02993ebf7a0fe30d87137 (0.4) [Data set]. Zenodo.

Poelen, Jorrit H. (2022b). Inconsistent name alignment review for [Aglais io] using different versions of Catalogue of Life matcher. GitHub, globalbioticinteractions/nomer, Issue #124. Accessed on 2022-11-04.

Salim JA, Seltmann KC, Poelen JH, Saraiva AM (2022a) Indexing Biotic Interactions in GBIF data. Biodiversity Information Science and Standards 6: e93565.

Salim JA (2022b) Searching for Interactions in GBIF/iDigBio Darwin Core Archives. GitHub, globalbioticinteractions/prestonocene wiki. Accessed on 2022-11-04.

Elliott MJ, Poelen JH, Fortes JAB (2020) Toward Reliable Biodiversity Dataset References. Ecological Informatics.

Webster MS, Buschbom J, Hardisty A, Bentley A (2021) The Digital Extended Specimen will Enable New Science and Applications. Biodiversity Information Science and Standards 5: e75736.
