GBIF exports as public datasets in cloud environments

trobertson · February 27, 2020, 11:44am

Summary

To help foster novel research, lower technical barriers of large-scale data analysis and raise the visibility of GBIF, we propose to share public datasets containing snapshots of all GBIF-mediated occurrence data each month on all major cloud infrastructure providers.

GBIF is seeking expressions of support to explore this idea further and invites discussion of any concerns that this idea may raise.

Process

GBIF currently prepares monthly snapshots for others to download for analysis. Each is referenced by a Digital Object Identifier (DOI) (e.g. doi.org/10.15468/dl.otf01c). We propose to enhance this process by uploading and registering the datasets in CSV, DwC-A and Avro formats into each of the major cloud-computing environments. GBIF will respect the licensing of the constituent datasets, so we anticipate these monthly snapshot datasets will carry CC-BY-NC licence applied to the dataset (individual records retain more open licences if available).

In time, this process may evolve to include reference datasets with data vetted to strict quality-control mechanisms. Enabling others to easily perform large scale analysis may help accelerate this.

Cloud providers

The main cloud-computing providers targeted could be:

Alibaba Cloud
Amazon AWS Public Datasets
European Open Science Cloud (requires exploration)
Google Cloud Public Datasets
Google Earth Engine to enable spatial analysis
Microsoft Azure Open Datasets
Suggestions for others are welcome

In addition, GBIF should look to improve automation of regular exports of occurrence data to permit the inclusion of new records within modeling and analytic frameworks (e.g. Map of Life and BIEN, among others).

Perceived benefits

Reduces the technical barrier for users around the world—anyone, anywhere should have access to the resources needed to analyse content
Reduces the costs for users of cloud computing services, such as Amazon EC2 instances, by decreasing bandwidth charges and reducing the time to transfer
Enables users to move data quickly and easily into cloud-native analytical tools, such as Google Big Query
Allows development of training material and communities around cloud-native tools to accelerate, simplify and reduce total costs for analysis
Increases the visibility of the GBIF network for users of open data, potentially expanding into new communities of users and boosting awareness of the available data resources for research groups within tech companies
May reduce download times for user, by placing data on servers closer to users
May strengthen applications for funds and cloud credits to make use of cloud-computing infrastructure by GBIF nodes, publishers and researchers
Helps forge technology partnerships that ease the path—and the costs—for future growth in GBIF
Expands the pool of user feedback needed to guide GBIF’s future activities

Potential risks

GBIF are seen as aligning too closely with private technology firms
The rapid global adoption of cloud-computing infrastructure means that the GBIF community is already increasingly making use of these resources. GBIF should remain neutral and seek to integrate equivalent content within all major cloud-computing services.
It is possible that future growth of GBIF may necessitate the need for technology partners and this may ease the path.
Reduced control over data citations and tracking
Enabling cloud users to analyse and filter against all data available in a given snapshot may mean that people will only cite “all data”. However:
- Guidelines for creating more fine-grained citations can be prepared that summarize how to determine the distinct datasets used in research. Combined with a GBIF citation service the equivalent DOI citation model can be maintained. GBIF could work to provide and integrate such services within core functions and libraries (as is increasingly the case among users of R packages). In some environments, such as Google Earth Engine, functions can be shared under a GBIF library to simplify this.
- More ‘big data’ uses that cite GBIF-mediated data as a whole may raise the profile of the entire GBIF network, providing an overall benefit that may mitigate the loss of some of the fine-grained citation detail provided by our download DOI service.

We welcome thoughts and discussion on this idea.

dshorthouse · February 27, 2020, 1:33pm

This a great idea to help expose GBIF to other audiences.

re: Reduced control over data citations and tracking

What’s really needed is upload of datasets containing the subset of data used in analyses, not merely the unfiltered, raw downloaded data. This would help eliminate spurious assignment of credit to publishers or individuals.

sckott · February 27, 2020, 4:51pm

One downside is that the useful GBIF API methods for searching for names/occurrences/etc are not present with a data snapshot. It’d be important I think to make client libraries (e.g., rgbif, pygbif) have a similar as possible programmatic interface for the snapshot data as exists for the GBIF API methods to make it easy to use data from the GBIF API or the snapshot data. Having said that, I’m sure some users will be happy to work with the raw data themselves without any help.

How will tracking work? Will GBIF be able to get usage stats from each of the platforms? I imagine that is an important piece?

For citing data, I think we’d need to make a concerted effort to make it as easy as possible with client libraries/web tools/etc. to extract/produce citations for the final dataset used in a research paper

trobertson · February 28, 2020, 8:31am

Thanks @sckott and @dshorthouse

It’d be important I think to make client libraries (e.g., rgbif, pygbif) have a similar as possible programmatic interface for the snapshot data as exists for the GBIF API methods to make it easy to use data from the GBIF API or the snapshot data

Yes, although this would enable two fundamental things that I believe shouldn’t be restricted to a GBIF API.

Firstly it allows for more freedom in using GBIF-mediated content in workflows. In Google Cloud, for example, you could use SQL and BigQuery, in Google Earth Engine you would work in their scripting language, in AWS you might be mixing GBIF data with other geographic layers or remote sensing data using EMR Spark, designing a workflow using Glue or using rGBIF perhaps within Anaconda. The data standards in use would be common, but this would not restrict to only the functions (or processing capabilities) behind the GBIF API.

Secondly, it locates the data in the same environment as the workflow execution making it easier to mix with other content. GBIF is not big data, but even at 2TB is already a challenge for some to manage, and even when not, it incurs unnecessary data transfer. This is only going to get worse with growth (in volume and richness).

For citing data, I think we’d need to make a concerted effort to make it as easy as possible with client libraries/web tools/etc. to extract/produce citations for the final dataset used in a research paper

Yes. This is critical, but I believe achievable with clear guidelines and shared functions.

jhpoelen · March 2, 2020, 5:22pm

I am excited that GBIF is thinking of novel ways to share public datasets as a way to facilitate research.

As you mention, being able to reliably reference datasets is an important first step to keep track of the origin (or provenance), and use, of the dataset. In our paper “Towards Reliable Biodiversity Dataset References” (https://doi.org/10.32942/osf.io/mysfp, in review) and related GBIF forum discussion (Toward Reliable Biodiversity Dataset References) we outline such reliable reference method. This method not only allows for reliably referencing a dataset (i.e. dataset version), but also the origin (or provenance) of that dataset. This method is complementary to the existing DOI infrastructure.

Also, I wanted to bring to your attention that as part of our related work, we have showed to be able to reliably move, archive and cite (!) the entire GBIF corpus (exactly as provided by the publishers, the “raw” data) using commodity servers, consumer-grade internet connections (<10Mbs!) and consumer grade hardware. With this, I am able to do research on the GBIF corpus using a 1TB external hard disk (<$100 at most retailers), an open source computing platform and a $250 laptop without having to worry about keeping ~700GB of data alive on some cloud infrastructure. Needless to say, I do see the benefit of being able to swipe my credit card and have quick access to a data warehouse with managed servers and fast networks (that’s what a cloud is right?), as long as I am able to keep my research data (and their original sources) under my pillow if my funding dries up.

As a small independent, I try to minimize overhead and try to avoid having to pay “rent” for cloud services. I prefer solutions that can be easily archived in across existing platforms (internet archive, Zenodo) and also be kept offline without hurting the integrity of the data. This way, I don’t have to worry about some company cutting off access to my research data because a contract expires, a change in terms of service, or if I forget to pay my bills.

I think GBIF can play an even bigger role in the research community if:

methods are adopted to reliably/cryptographically reference original raw datasets (as provided by institutions), derived datasets (e.g., GBIF downloads or GBIF mediated datasets) and their provenance (e.g., what version of original dataset led to what version of GBIF mediated dataset).
share your extensive knowledge on how to implement data processing workflows that work at small and large scales
open source tools are made available to allow users to reproduce process to re-create the “GBIF mediated datasets” from their original data sources
method are developed to allow users to produce their own “data downloads” from the original datasets (e.g., selecting records with specific geospatial/taxonomic constraints) and make it easy for them to publish these derived dataset on a data publication platform while citing their original sources.
methods are developed to publish datasets across many different data platforms without losing the ability to reliably reference and verify the data.

I much like the idea to make the biodiversity datasets more accessible for (small/large scale) computation and thank you for facilitating the discussion around it.

-jorrit

ylebras · March 4, 2020, 2:49pm

AMAAAZZIIIIINNNNNNNNG!

Some firsts comments / ideas coming to my mind are :

Adding cloud providers / actors as:
- Australia:
  - Australian ARDC cloud https://nectar.org.au/research-cloud/ ?
- US:
  - XSEDE
- for Europe:
  - EOSC
Concerning modeling and analytic frameworks, I think of interest to add Galaxy (https://ecology.usegalaxy.eu/) as Biodiversity virtual lab (BCCVL) https://app.bccvl.org.au/login

ylebras · March 4, 2020, 2:50pm

(as new users can’t put more than 3 links…)

jhpoelen, Cloud is not related to business orientation for me. So we can benefit using academic clouds. Concenrning your work on “internal” identifiers, this is a very good idea I think, I am arguing using such MD5 or checksum method to produce intrinsec uniqu identifiers since 2 years and propose it on the BiodiFAIRse GO FAIR implementation network, linking it to DOI so we can benefit from both systems. Concerning provenance information and relations between raw as derived data and linked to software, workflows as publication or ohters research objects, I am totally agree and that 's why in France we are working on an intensive use of EML as an amazing metadata language who can provide both methods/tools and semantics to capture this!

For EOSC, a focus can be done on this EOSC service, I am using to deploy Galaxy for ecology: European Galaxy instance https://marketplace.eosc-portal.eu/services/european-galaxy-server
- for now I am using our Galaxy dedicated tool https://ecology.usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/ecology/spocc_occ/spocc_occ/0.9.0 to import GBIF occurence data, FYI, this Galaxy tool is using spocc R package https://cran.r-project.org/web/packages/spocc/index.html

ylebras · March 4, 2020, 2:51pm

(as new users can’t post more than 3 links …)

thanks to its conda-“ification” https://anaconda.org/bioconda/r-spocc

I can give a try to use the GBIF monthly snapshot system on European Galaxy Instance. We (Galaxy) have a system to manage automatic update of “reference data” https://galaxyproject.github.io/training-material/topics/admin/tutorials/cvmfs/tutorial.html so we can use it for GBIF snapshot!

Concerning licensing, I always prefer an open data oriented license as CC-BY than CC-BY-NC as 1/ can we really detect if there is a commercial use of data ? and 2/ are we ready to go towards legal procedure in case of detected commercial use of the data ?

kcopas · March 4, 2020, 3:17pm

Thanks for all your suggestions and enthusiasm, Yvan.

Regarding your concluding questions, see the section On commercial use in our Terms of use—a few relevant highlights:

Interpretations vary widely about how to define commercial use. Some would limit it narrowly if straightforwardly to for-profit practices like re-sale of data in contrast with use for example in publications in commercial journals. Broader constructions would extend it, for example, to websites displaying advertisements as a means of operational cost recovery. GBIF does not expect to propose or impose a resolution to this conversation…

We believe that restrictive interpretations of non-commercial use run counter to the spirit and the letter of open access in general and GBIF in particular.…

…GBIF has neither the interest nor the resources to enforce CC BY-NC by legal means.

kcopas · March 4, 2020, 3:20pm

[also bumped up your status—link away]

PhoebeBarnard · March 4, 2020, 10:00pm

I’m also excited to see the new thinking and ideas here to take GBIF to a wider audience where biodiversity data can really make truly global impact, and not just within the conservation and scientific communities. I agree, that GBIF needs to remain a neutral arbiter of information rather than cozying up to one or two tech giants - but it also needs to put public data out there into the world for massive impact. After all, Rome is burning and this decade is the most critical one we’ve ever had for impact. It’s a tricky balance, I know. I’ve asked my geeky technowizard and strategist friends in the Conservation Biology Institute to chime in with their thoughts on pros and cons, too.

kcopas · March 5, 2020, 11:20am

Does @earthdoc respond if you group him among your ‘geeky technowizard and strategist friends’ at CBI?

Asking for a friend…

jhpoelen · March 5, 2020, 6:45pm

Hi @ylebras et al. - I am glad to hear that you have also adopted content hashes as a complement to DOIs. I have yet to be convinced that EML has enough provenance information to systemically keep track of data version, but I am happy to be convinced. Do you have some examples that I can learn from? I am open to scheduling a live discussion on this next week. Interested?

Coincidentally, I had a look at the realities of computing on big-ish ~1TB datasets earlier this week (see https://github.com/bio-linker/organization/wiki/compute-environments ) and found that most academic clouds focus on providing a short-lived sandbox to experiment with cloud computing or running small experiments. With ~1TB of raw data to work with, I found that network bandwidth, network transfer and storage costs become a the main design constraints. For instance, a one-time download of a 1TB data from the big three commercial cloud providers easily costs ~ $100 and this is assuming that the client network connections are stable and fast enough to even be able to transfer that data reliably. This is why I have adopted decentralized storage and discovery techniques that even allow for the option to send physical hard disks by snail mail or by using tools like rsync to enable efficient incremental updates. As I mentioned earlier, using these approaches/tools are a necessity for me because I don’t have access to fast university networks or massive IT budgets.

I am sure that my brief survey of academic clouds is incomplete, so I am open to learning more about the academic services that are now available.

Storage, network and compute infrastructure issues aside: my remaining questions to the GBIF technical team are:

How are you going to keep track of the versions of source archives as provided by institutions and their usage in the associated GBIF derived datasets?
How are you planning to reliably link datasets to their associated DOIs? Or, in other words, how can users verify that they have an exact copy of a dataset associated with some DOI? Or, how can I lookup a DOI associated to a dataset that I have sitting on my hard disk?

ylebras · March 5, 2020, 9:21pm

THANK YOU for your detailled answer and very interested coments / questions!

With EML you can build relatively detailled provenance informations at relative low cost https://nceas.github.io/datateam-training/training/building-provenance.html notably because 1/ with EML you can give info on datasets (raw and derived dataset) + softwares used to generate datasets and/or to come from dataset to another.
BUT DataOne is using uuid and/or DOI to identify each research object so this is not intrinsec and here is a potential provenance issue regarding reliability mentionned in you paper. I hope I can test ASAP generation of intrinsec ID through hash approach so we can give to DataOne research objects unique id maybe coming from a combination of intrinsec ID + uuid/DOI so we can be sure the object is the one referenced.
BUT I am quite new in this complex but so beautifull DataOne environment and as there is MD5 information on each DataOne research object (you can look at this document https://openstack-192-168-100-101.genouest.org/metacatui/view/urn:uuid:ed9db304-fcb9-4c9e-aea2-1237bf58c855 , our test server, at “Authentication”) so material seems to be there, and maybe DataOne is using this information to track provenance…

Interested to take part to such live discussion, … but really under the wave…

True that handling big-ish ~1TB dataset is not straight forward… and that’s why I think of interest to differentiate ““reference data”” from ““no-reference data”” as it appears to me 1/ that this is really rare that we have to handle such big-ish dataset and 2/ such big-ish dataset are maybe more related to reference data coming from aggregated data (from database/databank for example) from where users want to extract a subpart (species/time/geo). SO here again, I really think the ““Galaxy”” approach can be specially a good one for biodiversity world, TBs of referenced data (https://galaxyproject.org/admin/reference-data-repo/ notably reference genomes) are available through cache and using HTTP / rsync / CVMFS. I think this is a good place to test GBI reference data sharing to US, EU or Australia community.
BUT this is cache, so not solving all issues notably regarding very bad networks… And here, these reference data are accessible to anyone for free

trobertson · March 6, 2020, 12:19pm

Thanks @jhpoelen

For instance, a one-time download of a 1TB data from the big three commercial cloud providers easily costs ~ $100

This may be true, but I would note:

GBIF.org provides free access which wouldn’t disappear
The proposal would enable a user to easily pre-shape the data they need (filtering, aggregating) so that they pull down only a summary view to their own schema and control.
“Academic” clouds could be explored if available

How are you going to keep track of the versions of source archives as provided by institutions and their usage in the associated GBIF derived datasets?
How are you planning to reliably link datasets to their associated DOIs? Or, in other words, how can users verify that they have an exact copy of a dataset associated with some DOI? Or, how can I lookup a DOI associated to a dataset that I have sitting on my hard disk?

These are good questions, but I’d consider them tangential to the discussion of enabling cloud users. I say this since they could be asked of GBIF.org today and arguably aren’t a concern for many purposes.

For the foreseeable future, I see no way other than to consider these monthly views as point-in-time snapshots.

I also foresee tracking use as using the existing DOI mechanism, noting that the DOI refers to the concept of a dataset, not a versioned export of it.

Recognizing that we’re dealing with a myriad of data sources (versioned, living, append-only etc) and that protocols in use don’t all enable strong versioning we’ve always taken the approach of providing the “raw” record along with the derived view so that the original state can be viewed.

Ensuring the integrity of dataset copies would need to use some kind of checksumming as you note. Your explorations so far have used the source datasets (i.e. from institutional URLs) which are links to mutable objects by design (the latest version). Since ~2018 we store all versions of the datasets in our crawling infrastructure which represent point-in-time snapshots of each source as it is ingested and are indeed immutable. It’s never been asked of us, but we could expose those as individual datasets if you had an interest.

jhpoelen · March 6, 2020, 11:50pm

Thanks @trobertson for your detailed reply.

First, I’d like to emphasize that I much appreciate that GBIF is looking into ways to distribute their aggregate data products beyond the GBIF “cloud” (e.g., servers that run GBIF processes and store/serve data).

I notice that roughly four topics came up in this forum discussion:

getting data closer to where folks do their analysis
ensuring that data publishers are attributed
ensuring that provenance of data is clear
ensuring that the integrity of the data can be verified

With this in mind, I’d to respond to some of the your comments.

Yes, I am able to download a zipfile by clicking on the “download” button of the occurrence download page DOI Download . However, I have no way of telling that the zip file I receive today is the same zip file that my future self will download in 5, 10, 20 years from now. In addition, I’d say that DOI and the zip file will be around as long as GBIF has the will, capability and funds to do so.

Realizing that the internet is a wild and dynamic place and funding comes and goes, I would want to figure out a way to keep many copies around in different places without losing the ability to cite, find, retrieve and verify the integrity of, the datasets.

A first step towards more reliably referencing data would be to provide checksums (or content hashes) associated to the provided data products and include these in the data citations along with the DOIs. Providing this information is standard practice for distribution of digital content (e.g., Zenodo provides md5 hashes for each file they host).

I’d say that when distributing content across different platforms or “clouds”, including the checksum/content hash would help to ensure that no data gets corrupted in transmission.

“Raw” records are neatly packed in zip files or tar balls and made available through institutional URLs. As you noted, you can version these files using checksums or content hashes. With this, you can version any “raw” record and further enhance the provenance of the GBIF annotated records by including a reliable reference (= content hash) to the version of the provided source archive from which the “raw” material came.

As our study has also kept track of all the dataset versions registered in the GBIF network since 2018, I’d be very interested to compare our collection of immutable/versioned source archives with those that you have.

Thank you for taking the time to respond and for considering my feedback.

tuco · March 11, 2020, 12:47am

I wholeheartedly support this proposal. The reasoning is sound, the targets are apt, the benefits are many, including an important one not stated - that it will save human resources for GBIF. I wrestle with my conscience when I ask something special of GBIF. I know how hard they work, and they have always come through anyway, but if I could get snapshots in BigQuery monthly, it would take a (down and up) load off my mind (kidding here, because GBIF have always created AVRO files for me that I can load directly in BigQuery, but you get my meaning).
I think having the data closer to analytic platforms will foster much more holistic use and new tools. It could enable sandboxes for the community to try to develop and showcase novel ideas, and GBIF in turn could benefit by incorporating those that are appropriate.

trobertson · March 11, 2020, 8:26am

Thank you all

@jhpoelen - I am not ignoring you, but reflecting on your comments; I sense assumptions are being made on living datasets that won’t hold true and I’d like to understand more. A few things spring to mind and I suggest we take a call and then start a dedicated discussion focusing on some of the replication aspects you raise. Would that be OK?

I wholeheartedly support this proposal … [but] … I wrestle with my conscience when I ask something special of GBIF

Thanks for confirming @tuco - enabling the communities you work with was a motivation behind this thinking to enable you to do more, easier and cheaper (vocabulary development, understanding standards use, project impact assessments etc) but hearing it would ease your conscience is an added bonus.

jhpoelen · March 11, 2020, 2:58pm

@trobertson thanks for your considerations. Am happy to take a call to sync up and then report out if needed. Please feel free to contact me off-list.

Also, I’d like to echo @tuco on the benefits of using novel file formats like AVRO and Apache Parquet to enable large scale data analysis (e.g., write custom queries against all of the GBIF registered data yourself) in compute environment like Apache Spark . And . . . from experience I know how much work is takes to convert billions of dwca records into these formats (~ week on a 12-node compute cluster after spending about a week getting all the data transferred and that excludes time writing conversion programs). Haven’t used BigQuery myself yet, perhaps because I am a bit weary of getting locked into some cloud environment. I’d have to dig around some more to see whether BigQuery can run on my laptop and commodity servers (incl. hosted/cloud).

waddink · March 13, 2020, 3:29am

Great idea, the benefits as expressed by @trobertson are clear. I would favor much if it could have CC-By licence instead of CC-By-NC.
re @dshorthourse “What’s really needed is upload of datasets containing the subset of data used in analyses, not merely the unfiltered, raw downloaded data. This would help eliminate spurious assignment of credit to publishers or individuals”: I think this is a different topic. For such subsets it would be nice if you could have downloads containing not only the GBIF data itself but also data linked to it, for that DwC is not usable but it may be possible with a cross-domain developments like SciData.

Topic		Replies	Views
Public Data Cloud Snapshots Alliance for Biodiversity Knowledge	5	1670	May 12, 2022
Toward Reliable Biodiversity Dataset References Data Use	8	2925	February 24, 2020
Are you ready to go BIG with #biodiversity data in the ActivityPub test	0	22	September 17, 2024
Citation of the week/month/you name it Data Use	0	1629	May 7, 2018
Invitation to share DNA metabarcoding data to test early pilot of data-publishing tool Diversifying the GBIF data model	0	3202	June 7, 2023

GBIF exports as public datasets in cloud environments

Related topics