I am excited that GBIF is thinking of novel ways to share public datasets as a way to facilitate research.
As you mention, being able to reliably reference datasets is an important first step in keeping track of the origin (or provenance), and use, of a dataset. In our paper “Towards Reliable Biodiversity Dataset References” (https://doi.org/10.32942/osf.io/mysfp, in review) and the related GBIF forum discussion (Toward Reliable Biodiversity Dataset References), we outline such a reliable referencing method. This method allows for reliably referencing not only a dataset (i.e., a dataset version) but also the origin (or provenance) of that dataset. It is complementary to the existing DOI infrastructure.
Also, I wanted to bring to your attention that, as part of our related work, we have shown that it is possible to reliably move, archive, and cite (!) the entire GBIF corpus (exactly as provided by the publishers, the “raw” data) using commodity servers, consumer-grade internet connections (<10 Mbps!), and consumer-grade hardware. With this, I am able to do research on the GBIF corpus using a 1 TB external hard disk (<$100 at most retailers), an open source computing platform, and a $250 laptop, without having to worry about keeping ~700 GB of data alive on some cloud infrastructure. Needless to say, I do see the benefit of being able to swipe my credit card and get quick access to a data warehouse with managed servers and fast networks (that’s what a cloud is, right?), as long as I am able to keep my research data (and their original sources) under my pillow if my funding dries up.
As a small independent, I try to minimize overhead and avoid having to pay “rent” for cloud services. I prefer solutions that can be easily archived across existing platforms (Internet Archive, Zenodo) and also be kept offline without hurting the integrity of the data. This way, I don’t have to worry about some company cutting off access to my research data because a contract expires, terms of service change, or I forget to pay my bills.
I think GBIF can play an even bigger role in the research community if:
- methods are adopted to reliably/cryptographically reference original raw datasets (as provided by institutions), derived datasets (e.g., GBIF downloads or GBIF-mediated datasets), and their provenance (e.g., which version of an original dataset led to which version of a GBIF-mediated dataset);
- GBIF shares its extensive knowledge of how to implement data processing workflows that work at small and large scales;
- open source tools are made available that allow users to reproduce the process of re-creating the “GBIF-mediated datasets” from their original data sources;
- methods are developed to allow users to produce their own “data downloads” from the original datasets (e.g., selecting records with specific geospatial/taxonomic constraints) and to make it easy for them to publish these derived datasets on a data publication platform while citing their original sources;
- methods are developed to publish datasets across many different data platforms without losing the ability to reliably reference and verify the data.
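To illustrate the first point above: the core idea behind cryptographic dataset referencing is that a reference is derived from the dataset’s bytes rather than from where it happens to be stored. A minimal sketch in Python (the `hash://sha256/` URI convention follows our preprint; the `content_reference` helper name and the toy dataset are illustrative, not part of any existing API):

```python
import hashlib

def content_reference(data: bytes) -> str:
    """Build a content-based reference for a dataset.

    The reference is computed solely from the dataset's bytes, so any
    party holding a copy -- on a cloud server, a Zenodo deposit, or an
    external hard disk -- can recompute and verify it independently.
    """
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()

# A tiny, made-up dataset snapshot for illustration.
snapshot = b"occurrenceID,scientificName\n1,Puma concolor\n"
ref = content_reference(snapshot)

# The same bytes always yield the same reference...
assert ref == content_reference(snapshot)
# ...and any change to the bytes yields a different reference,
# so corruption or silent modification is detectable.
assert ref != content_reference(snapshot + b"2,Lynx rufus\n")
```

Because such references are location-independent, the same identifier keeps working whether the dataset lives on cloud infrastructure or offline under my pillow.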
I very much like the idea of making the biodiversity datasets more accessible for (small- and large-scale) computation, and I thank you for facilitating the discussion around it.