Continuing the discussion from “GBIF exports as public datasets in cloud environments”:
We’re stoked to see this moving along! In cases where we really do need to work with all of GBIF, or a really huge chunk of it, all at once in a computationally efficient way to produce some kind of product, this will be a huge time and resource saver. I do have a question on the licensing decisions. As I understand it, we’re starting out by limiting public data cloud instances to CC0 and CC BY licensed data sources/records. That makes a certain amount of sense if we’re pushing data to what is essentially a commercial provider’s platform.
However, that is going to be a pretty big cut to the available data for much of what we’d want to do. For instance, the vast majority of iNaturalist records carry a noncommercial (CC BY-NC) license, maybe because of a default on the iNaturalist end that could be revisited at some point. As government scientists, often collaborating with university colleagues, much of what we want to do is noncommercial work that still has to run on a commercial cloud provider, because that’s where we are currently mandated to do our cloud-based work. Philosophically, I’m not happy that we’re contracted the way we are with Amazon at the moment (it could be Azure or someone else who wins the next one), but that’s our reality. Sure, there are workarounds. We can write a process to use the GBIF API and pull all NC records into another data store in the same AWS region and then do our work, but that puts us back to a still pretty inefficient process.
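To make that workaround concrete, here is a rough Python sketch of the paging loop it implies. It assumes the public occurrence search endpoint at `https://api.gbif.org/v1/occurrence/search`, its `license`, `limit`, and `offset` query parameters, and the `results` and `endOfRecords` fields in the response. The function names (`build_page_url`, `fetch_json`, `iter_nc_records`) and the `max_records` cap are made up for illustration, and writing the records into a data store in our AWS region is left to the caller.

```python
import json
from typing import Callable, Dict, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint and license code; check both against the GBIF API docs.
GBIF_SEARCH_URL = "https://api.gbif.org/v1/occurrence/search"
NC_LICENSE = "CC_BY_NC_4_0"


def build_page_url(offset: int, limit: int = 300,
                   license_code: str = NC_LICENSE, **filters: str) -> str:
    """Build the URL for one page of license-filtered occurrence records."""
    params = {"license": license_code, "limit": limit, "offset": offset}
    params.update(filters)  # any extra search filters the caller supplies
    return f"{GBIF_SEARCH_URL}?{urlencode(params)}"


def fetch_json(url: str) -> Dict:
    """Fetch one page over HTTP and decode the JSON body."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_nc_records(fetch: Callable[[str], Dict] = fetch_json,
                    limit: int = 300,
                    max_records: int = 1000,
                    **filters: str) -> Iterator[Dict]:
    """Yield NC-licensed records page by page, up to max_records.

    `fetch` is injectable so the loop can be exercised without the network.
    """
    offset = 0
    yielded = 0
    while yielded < max_records:
        page = fetch(build_page_url(offset, limit, **filters))
        for record in page.get("results", []):
            yield record
            yielded += 1
            if yielded >= max_records:
                return
        # Stop when the API reports the last page (or omits the flag).
        if page.get("endOfRecords", True):
            return
        offset += limit
```

Something like `for record in iter_nc_records(max_records=500): ...` would then feed whatever writer we point at our own storage. One request per few hundred records is exactly the inefficiency I’m complaining about, and as far as I know the search endpoint also limits how deep you can page, so a real bulk pull would more likely go through GBIF’s asynchronous download service instead.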
The (mostly) global open data movement that’s taken shape over the last decade is really what pushed along the ideas for the “Public Data Cloud” concept and what Amazon, Microsoft, Google, IBM, and others have jumped in on in their own peculiar ways. Say what you will about it being self-serving, but the major commercial cloud moneymakers are providing a value-added capability through their public data offerings. In the US, NOAA really pushed this along a few years ago with their particular take on “Big Data” (even though USGS had already pushed Landsat to both AWS and Google), and being in the US Department of Commerce there was definitely an economic/business stimulus idea at play when they got the major cloud companies to play ball.
There might be an ethical argument that GBIF should not push data flagged for noncommercial use to a commercial platform that supports commercial and noncommercial users and usage, and maybe there’s a legal argument as well. If GBIF takes data entrusted to its care by contributors who specifically signal that they want usage restricted to noncommercial endeavors and puts those data out onto a commercial cloud platform, even if done through a seemingly altruistic thing like “open data sponsorship”, is that a violation of the intent of a noncommercial designation? I’d like to think that the license designation simply passes on through from that point to be (hopefully) adhered to by any users no matter where they are getting the data. That would solve our technical problem, but GBIF as a community and an institution needs to make the right call.
Any trained lawyers or ethicists want to jump in here?