Public Data Cloud Snapshots

Continuing the discussion from GBIF exports as public datasets in cloud environments.

We’re stoked to see this moving along! In cases where we really do need to work with all of GBIF or a really huge chunk all at once in a computationally efficient way to produce some kind of product, this will be a huge time and resource saver. I do have a question on the licensing decisions. I think we’re starting out limiting public data cloud instances to CC-0 and CC-BY licensed data sources/records. That makes a certain amount of sense if we’re pushing data to what is essentially a commercial provider’s platform.

However, that is going to be a pretty big cut to the available data for much of what we’d want to do. For instance, the vast majority of iNaturalist records are licensed NC, maybe because of a default on the iNaturalist end that could be revisited at some point. It means that a lot of what we want to do as government scientists, often collaborating with university colleagues, will be noncommercial work done on a commercial cloud provider, because that’s where we are currently mandated to do our cloud-based work. Philosophically, I’m not happy that we’re contracted the way we are with Amazon at the moment (could be Azure or someone else who wins the next one), but that’s our reality. Sure, there are workarounds. We can write a process to use the GBIF API and pull all NC records into another data store in the same AWS region and then do our work, but that puts us back to a still pretty inefficient process.
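To make the workaround concrete, here is a rough sketch of the paging it implies. It only builds the request URLs and makes no network calls. The endpoint and the `license`, `limit`, and `offset` parameters reflect my understanding of the GBIF occurrence search API; the function name, the page size, and the `CC_BY_NC_4_0` value are assumptions for illustration and should be checked against the current API docs.

```python
from urllib.parse import urlencode

# Assumed base URL of the GBIF occurrence search endpoint.
BASE_URL = "https://api.gbif.org/v1/occurrence/search"


def nc_page_urls(total, page_size=300, license_code="CC_BY_NC_4_0"):
    """Yield one search URL per page of NC-licensed records.

    `total` is the number of records you expect to walk through;
    `page_size` and `license_code` are illustrative defaults.
    """
    for offset in range(0, total, page_size):
        query = urlencode(
            {"license": license_code, "limit": page_size, "offset": offset}
        )
        yield f"{BASE_URL}?{query}"


if __name__ == "__main__":
    for url in nc_page_urls(650):
        print(url)
```

Even in sketch form this shows the problem: every page is a separate request, which is exactly the inefficiency a cloud snapshot is supposed to remove.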

The (mostly) global open data movement that’s taken shape over the last decade is really what pushed along the ideas for the “Public Data Cloud” concept and what Amazon, Microsoft, Google, IBM, and others have jumped in on in their own peculiar ways. Say what you will about it being self-serving, but the major commercial cloud moneymakers are providing a value-added capability through their public data offerings. In the US, NOAA really pushed this along a few years ago with their particular take on “Big Data” (even though USGS had already pushed Landsat to both AWS and Google), and being in the US Department of Commerce there was definitely an economic/business stimulus idea at play when they got the major cloud companies to play ball.

There might be an ethical argument that GBIF should not push data flagged for noncommercial use to a commercial platform that supports commercial and noncommercial users and usage, and maybe there’s a legal argument as well. If GBIF takes data entrusted to its care by contributors who specifically signal that they want usage restricted to noncommercial endeavors and puts those data out onto a commercial cloud platform, even if done through a seemingly altruistic thing like “open data sponsorship”, is that a violation of the intent of a noncommercial designation? I’d like to think that the license designation simply passes on through from that point to be (hopefully) adhered to by any users no matter where they are getting the data. That would solve our technical problem, but GBIF as a community and an institution needs to make the right call.

Any trained lawyers or ethicists want to jump in here?


Maybe selecting only CC0 and CC BY for the cloud service might nudge data publishers (including iNaturalist) towards these better practice data licenses and away from restrictive data licenses (including CC BY-NC). Such a shift, considered in isolation and probably only as an unintended side effect, might still be a positive outcome. That would depend on having “your” data available in this cloud being important to data publishers, or to data users who can influence the data publishers towards choosing better practice data licenses.


Since you mention iNaturalist, it might be noteworthy that the documentation indicates that they include CC-BY-NC images in the iNaturalist Public Dataset on Amazon AWS.

Repeating relevant section of docs here:

Photos with a CC0 license can be attributed as “[observer name, or observer login], no rights reserved (CC0)”. For example “Name, no rights reserved (CC0)”, or “Login, no rights reserved (CC0)”. Photos with other Creative Commons licenses can be attributed as “© [observer name, or observer login], some rights reserved ([license abbreviation])”. For example “© Name, some rights reserved (CC-BY)”, or “© Login, some rights reserved (CC-BY-NC)”

My understanding is that there is no legal limitation to putting CC-BY-NC data in the public cloud as long as the license is still clear and the metadata are still linked. I see no difference from publishing scientific papers based upon data with a non-commercial license. The publisher is making a profit, but not directly from the data.
Granted, the definition of non-commercial is fuzzy, but I don’t think this is even against the spirit of the license, let alone the legal details.
Having said this, CC-BY-NC is not considered “Open Data” by the Open Knowledge Foundation and it is not recommended by the EU Commission for use in Horizon Europe projects.
It does put severe limitations on derived products and creates problems for license stacking. Software will certainly have to be built to enable avoidance of such data, and I’m sure not all the data providers using this license are aware of the downstream consequences.
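As a rough illustration of the kind of software I mean, here is a minimal sketch that splits records by license family so NC data can be set aside before building a derived product. It assumes each record is a dict with a `license` field holding either a short code such as `CC_BY_NC_4_0` or a Creative Commons URL; the function names and the field name are hypothetical, and real data will need more careful handling of unusual values.

```python
def license_family(value):
    """Map a license string to 'CC0', 'CC_BY', 'CC_BY_NC', or 'UNKNOWN'."""
    # Normalise so "CC-BY-NC", "CC BY-NC" and "CC_BY_NC_4_0" look alike.
    text = (value or "").strip().lower().replace("-", "_").replace(" ", "_")
    if "by_nc" in text:
        return "CC_BY_NC"
    if "cc0" in text or "publicdomain/zero" in text:
        return "CC0"
    if "cc_by" in text or "/licenses/by/" in text:
        return "CC_BY"
    return "UNKNOWN"


def split_by_license(records, allowed=("CC0", "CC_BY")):
    """Return (kept, skipped): records whose license family is allowed, and the rest."""
    kept, skipped = [], []
    for record in records:
        family = license_family(record.get("license"))
        (kept if family in allowed else skipped).append(record)
    return kept, skipped


if __name__ == "__main__":
    sample = [
        {"id": 1, "license": "CC0_1_0"},
        {"id": 2, "license": "CC_BY_NC_4_0"},
        {"id": 3, "license": "http://creativecommons.org/licenses/by/4.0/legalcode"},
    ]
    kept, skipped = split_by_license(sample)
    print([r["id"] for r in kept], [r["id"] for r in skipped])
```

Records with a missing or unrecognised license land in the skipped list here, which is the cautious default when the goal is to avoid NC data in a derived product.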

There is a very good guide to NC usage here: NonCommercial interpretation - Creative Commons

They list a number of key points, but one that seems relevant is…

NonCommercial turns on the use, not the identity of the reuser.

and that it is only the “primary purpose of the reuse that needs to be considered”.
So if a business uses these data, but not for generating revenue, then that use is still allowed under this license.

There does need to be further debate within the community regarding the use of this license, and I suspect we will have to deal with its legacy even if we can persuade current data providers not to use it. However, I don’t think this blocks its use on cloud services.