GBIF exports as public datasets in cloud environments

Thanks @jhpoelen

For instance, a one-time download of a 1TB data from the big three commercial cloud providers easily costs ~ $100

This may be true, but I would note:

  1. GBIF.org provides free access which wouldn’t disappear
  2. The proposal would enable a user to easily pre-shape the data they need (filtering, aggregating) so that they pull down only a summary view, shaped to their own schema and under their own control.
  3. “Academic” clouds could be explored if available
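
To illustrate point 2, here is a minimal sketch of what "pre-shaping" could look like: filter the records, aggregate them, and keep only the small summary. It is plain Python over an in-memory sample rather than a real cloud query, and the field names (`countryCode`, `year`), the `summarize` helper and the sample rows are assumptions for illustration only.

```python
from collections import Counter


def summarize(records, min_year):
    """Count records per country code, keeping only those from min_year onward.

    `records` is any iterable of dicts; the keys used here are illustrative
    assumptions, not a statement about the layout of the cloud exports.
    """
    counts = Counter()
    for rec in records:
        year = rec["year"]
        # Filter: skip records with no year or an older year.
        if year is not None and year >= min_year:
            # Aggregate: one counter per country code.
            counts[rec["countryCode"]] += 1
    return dict(counts)


# Hypothetical sample standing in for a much larger table.
sample = [
    {"countryCode": "DK", "year": 2019},
    {"countryCode": "DK", "year": 1950},
    {"countryCode": "US", "year": 2020},
    {"countryCode": "DK", "year": 2021},
    {"countryCode": "US", "year": None},
]

summary = summarize(sample, 2000)
```

In a cloud setting the same filter and aggregation would run next to the data, so only something the size of `summary` would need to be downloaded.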

How are you going to keep track of the versions of source archives as provided by institutions and their usage in the associated GBIF derived datasets?
How are you planning to reliably link datasets to their associated DOIs? Or, in other words, how can users verify that they have an exact copy of a dataset associated with some DOI? Or, how can I lookup a DOI associated to a dataset that I have sitting on my hard disk?

These are good questions, but I’d consider them tangential to the discussion of enabling cloud users. I say this since they could be asked of GBIF.org today and arguably aren’t a concern for many purposes.

For the foreseeable future, I see no way other than to consider these monthly views as point-in-time snapshots.

I also foresee usage being tracked through the existing DOI mechanism, noting that the DOI refers to the concept of a dataset, not to a versioned export of it.

Recognizing that we’re dealing with a myriad of data sources (versioned, living, append-only, etc.) and that the protocols in use don’t all enable strong versioning, we’ve always taken the approach of providing the “raw” record along with the derived view so that the original state can be viewed.

Ensuring the integrity of dataset copies would need to use some kind of checksumming, as you note. Your explorations so far have used the source datasets (i.e. from institutional URLs), which are links to mutable objects by design (the latest version). Since ~2018 we have stored all versions of the datasets in our crawling infrastructure; these are point-in-time snapshots of each source as it was ingested and are indeed immutable. It’s never been asked of us, but we could expose those as individual datasets if you had an interest.
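
As a rough sketch of the checksumming idea, the snippet below hashes a local file and compares it with a digest published alongside a snapshot. It assumes SHA-256 and assumes that such a digest would be published at all; neither is something GBIF does today, and the function names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a local copy against a published digest (hypothetical workflow).

    Surrounding whitespace and letter case in the published value are ignored.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A match only shows that the local copy is byte-identical to the immutable snapshot the digest was computed from; it says nothing about the mutable "latest version" behind an institutional URL.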