To help foster novel research, lower the technical barriers to large-scale data analysis and raise the visibility of GBIF, we propose to share public datasets containing monthly snapshots of all GBIF-mediated occurrence data on all major cloud infrastructure providers.
GBIF is seeking expressions of support to explore this idea further and invites discussion of any concerns that this idea may raise.
GBIF currently prepares monthly snapshots for others to download for analysis. Each is referenced by a Digital Object Identifier (DOI) (e.g. doi.org/10.15468/dl.otf01c). We propose to enhance this process by uploading and registering the datasets in CSV, DwC-A and Avro formats in each of the major cloud-computing environments. GBIF will respect the licensing of the constituent datasets, so we anticipate that each monthly snapshot will carry a CC-BY-NC licence at the dataset level (individual records retain more open licences where available).
In time, this process may evolve to include reference datasets whose data have been vetted by strict quality-control mechanisms. Enabling others to perform large-scale analysis easily may help accelerate that evolution.
The main cloud-computing providers targeted could be:
- Alibaba Cloud
- Amazon AWS Public Datasets
- European Open Science Cloud (requires exploration)
- Google Cloud Public Datasets
- Google Earth Engine to enable spatial analysis
- Microsoft Azure Open Datasets
- Suggestions for others are welcome
In addition, GBIF should look to improve automation of regular exports of occurrence data to permit the inclusion of new records within modeling and analytic frameworks (e.g. Map of Life and BIEN, among others).
- Reduces the technical barrier for users around the world—anyone, anywhere should have access to the resources needed to analyse content
- Reduces the costs for users of cloud computing services, such as Amazon EC2 instances, by decreasing bandwidth charges and reducing the time to transfer
- Enables users to move data quickly and easily into cloud-native analytical tools, such as Google BigQuery
- Allows development of training material and communities around cloud-native tools to accelerate, simplify and reduce total costs for analysis
- Increases the visibility of the GBIF network for users of open data, potentially expanding into new communities of users and boosting awareness of the available data resources for research groups within tech companies
- May reduce download times for users by placing data on servers closer to them
- May strengthen applications for funds and cloud credits to make use of cloud-computing infrastructure by GBIF nodes, publishers and researchers
- Helps forge technology partnerships that ease the path—and the costs—for future growth in GBIF
- Expands the pool of user feedback needed to guide GBIF’s future activities
GBIF may be seen as aligning too closely with private technology firms
The rapid global adoption of cloud-computing infrastructure means that the GBIF community is already increasingly making use of these resources. GBIF should remain neutral and seek to integrate equivalent content within all major cloud-computing services.
It is possible that the future growth of GBIF will require technology partners, and this proposal may ease the path towards such partnerships.
Reduced control over data citations and tracking
Enabling cloud users to analyse and filter against all data available in a given snapshot may mean that people will only cite “all data”. However:
- Guidelines can be prepared that summarize how to determine the distinct datasets used in a piece of research, so that more fine-grained citations can be created. Combined with a GBIF citation service, the equivalent DOI citation model can be maintained. GBIF could work to provide and integrate such services within core functions and libraries (as is increasingly the case among users of R packages). In some environments, such as Google Earth Engine, functions can be shared under a GBIF library to simplify this.
- More ‘big data’ uses that cite GBIF-mediated data as a whole may raise the profile of the entire GBIF network, providing an overall benefit that may mitigate the loss of some of the fine-grained citation detail provided by our download DOI service.
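As an illustration of the fine-grained citation idea above, the following is a minimal sketch of how a user might determine the distinct datasets behind the records they actually used. It assumes the filtered records are available as tab-delimited text with a `datasetKey` column, as in GBIF's CSV downloads. The function name `count_records_per_dataset` and the sample keys are hypothetical, and the citation service that would consume the result does not yet exist.

```python
import csv
import io
from collections import Counter


def count_records_per_dataset(rows, key_field="datasetKey"):
    """Return a Counter mapping each dataset key to the number of records used.

    `rows` is any iterable of dict-like records (e.g. a csv.DictReader).
    `key_field` is assumed to hold the identifier of the source dataset.
    """
    counts = Counter()
    for row in rows:
        key = (row.get(key_field) or "").strip()
        if key:  # skip records that carry no dataset key
            counts[key] += 1
    return counts


# Tiny inline sample standing in for a filtered subset of a monthly snapshot.
# The dataset keys are made-up placeholders, not real GBIF dataset keys.
SAMPLE_TSV = (
    "gbifID\tdatasetKey\tspecies\n"
    "1\tdataset-a\tPuma concolor\n"
    "2\tdataset-b\tPuma concolor\n"
    "3\tdataset-a\tLynx lynx\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")
dataset_counts = count_records_per_dataset(reader)

# One line per contributing dataset: key, then number of records used.
for key, n in sorted(dataset_counts.items()):
    print(f"{key}\t{n}")
```

A list of dataset keys and record counts like this is the kind of summary that could be submitted to a citation service in order to obtain a DOI crediting only the datasets that contributed to an analysis.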
We welcome thoughts and discussion on this idea.