GBIF exports as public datasets in cloud environments

I am excited that GBIF is thinking of novel ways to share public datasets as a way to facilitate research.

As you mention, being able to reliably reference datasets is an important first step in keeping track of the origin (or provenance) and use of a dataset. In our paper “Towards Reliable Biodiversity Dataset References” (https://doi.org/10.32942/osf.io/mysfp, in review) and the related GBIF forum discussion (Toward Reliable Biodiversity Dataset References) we outline such a reliable referencing method. This method allows for reliably referencing not only a dataset (i.e. a dataset version), but also the origin (or provenance) of that dataset. It is complementary to the existing DOI infrastructure.

Also, I wanted to bring to your attention that, as part of our related work, we have shown that we can reliably move, archive and cite (!) the entire GBIF corpus (exactly as provided by the publishers, the “raw” data) using commodity servers, consumer-grade internet connections (<10 Mbps!) and consumer-grade hardware. With this, I am able to do research on the GBIF corpus using a 1TB external hard disk (<$100 at most retailers), an open source computing platform and a $250 laptop, without having to worry about keeping ~700GB of data alive on some cloud infrastructure. Needless to say, I do see the benefit of being able to swipe my credit card and have quick access to a data warehouse with managed servers and fast networks (that’s what a cloud is, right?), as long as I am able to keep my research data (and their original sources) under my pillow if my funding dries up.

As a small independent, I try to minimize overhead and avoid having to pay “rent” for cloud services. I prefer solutions that can be easily archived across existing platforms (Internet Archive, Zenodo) and also kept offline without hurting the integrity of the data. This way, I don’t have to worry about some company cutting off access to my research data because a contract expires, terms of service change, or I forget to pay my bills.

I think GBIF can play an even bigger role in the research community if:

  1. methods are adopted to reliably/cryptographically reference original raw datasets (as provided by institutions), derived datasets (e.g., GBIF downloads or GBIF-mediated datasets) and their provenance (e.g., which version of an original dataset led to which version of a GBIF-mediated dataset);
  2. GBIF shares its extensive knowledge on how to implement data-processing workflows that work at small and large scales;
  3. open source tools are made available that allow users to reproduce the process of re-creating the “GBIF-mediated datasets” from their original data sources;
  4. methods are developed to allow users to produce their own “data downloads” from the original datasets (e.g., selecting records within specific geospatial/taxonomic constraints), making it easy for them to publish these derived datasets on a data publication platform while citing their original sources;
  5. methods are developed to publish datasets across many different data platforms without losing the ability to reliably reference and verify the data.
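To make point 1 concrete, here is a minimal sketch of a content-based (cryptographic) reference using only Python’s standard library; the `hash://sha256/` notation and the sample archive bytes are illustrative choices of mine, not a GBIF convention:

```python
import hashlib

def content_reference(data: bytes) -> str:
    """Derive a location-independent reference from the bytes themselves.

    Any stable notation naming the algorithm and digest would work;
    'hash://sha256/...' is one possible convention.
    """
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()

# The same bytes always yield the same reference, wherever they are stored,
# so a copy on Zenodo, a university mirror, or a hard disk under my pillow
# can all be checked against the same citation.
archive = b"occurrenceID,scientificName\n1,Apis mellifera\n"
reference = content_reference(archive)
```

Because the reference is computed from the content, any party holding a copy can verify it independently, without asking a central registry.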

I very much like the idea of making biodiversity datasets more accessible for (small/large-scale) computation, and I thank you for facilitating the discussion around it.

-jorrit

AMAAAZZIIIIINNNNNNNNG!

Some first comments and ideas that come to mind:

(as new users can’t put more than 3 links…)

jhpoelen, for me “cloud” does not imply a commercial orientation, so we can also benefit from academic clouds. Concerning your work on “internal” identifiers, I think this is a very good idea. I have been arguing for two years for using MD5 or similar checksum methods to produce intrinsic unique identifiers, and have proposed it on the BiodiFAIRse GO FAIR implementation network, linking it to DOIs so we can benefit from both systems. Concerning provenance information and the relations between raw and derived data, and their links to software, workflows, publications and other research objects, I totally agree, and that’s why in France we are working on intensive use of EML, an amazing metadata language that can provide both the methods/tools and the semantics to capture this!

(as new users can’t post more than 3 links …)

thanks to its conda-“ification” https://anaconda.org/bioconda/r-spocc

Concerning licensing, I always prefer an open-data license such as CC-BY over CC-BY-NC, because: 1/ can we really detect whether there is commercial use of the data? and 2/ are we ready to pursue legal proceedings in case of detected commercial use of the data?

Thanks for all your suggestions and enthusiasm, Yvan.

Regarding your concluding questions, see the section On commercial use in our Terms of use—a few relevant highlights:

Interpretations vary widely about how to define commercial use. Some would limit it narrowly if straightforwardly to for-profit practices like re-sale of data in contrast with use for example in publications in commercial journals. Broader constructions would extend it, for example, to websites displaying advertisements as a means of operational cost recovery. GBIF does not expect to propose or impose a resolution to this conversation…

We believe that restrictive interpretations of non-commercial use run counter to the spirit and the letter of open access in general and GBIF in particular.…

…GBIF has neither the interest nor the resources to enforce CC BY-NC by legal means.

[also bumped up your status—link away]

I’m also excited to see the new thinking and ideas here to take GBIF to a wider audience, where biodiversity data can make a truly global impact, not just within the conservation and scientific communities. I agree that GBIF needs to remain a neutral arbiter of information rather than cozying up to one or two tech giants - but it also needs to put public data out there into the world for massive impact. After all, Rome is burning and this decade is the most critical one we’ve ever had for impact. It’s a tricky balance, I know. I’ve asked my geeky technowizard and strategist friends at the Conservation Biology Institute to chime in with their thoughts on pros and cons, too. :wink:

Does @earthdoc respond if you group him among your ‘geeky technowizard and strategist friends’ at CBI?

Asking for a friend…

Hi @ylebras et al. - I am glad to hear that you have also adopted content hashes as a complement to DOIs. I have yet to be convinced that EML carries enough provenance information to systematically keep track of dataset versions, but I am happy to be convinced. Do you have some examples that I can learn from? I am open to scheduling a live discussion on this next week. Interested?

Coincidentally, I had a look at the realities of computing on big-ish ~1TB datasets earlier this week (see https://github.com/bio-linker/organization/wiki/compute-environments ) and found that most academic clouds focus on providing a short-lived sandbox for experimenting with cloud computing or running small experiments. With ~1TB of raw data to work with, I found that network bandwidth, network transfer and storage costs become the main design constraints. For instance, a one-time download of 1TB of data from the big three commercial cloud providers easily costs ~$100, and that assumes the client network connections are stable and fast enough to transfer the data reliably at all. This is why I have adopted decentralized storage and discovery techniques that even allow for sending physical hard disks by snail mail, or for using tools like rsync to enable efficient incremental updates. As I mentioned earlier, using these approaches/tools is a necessity for me because I don’t have access to fast university networks or massive IT budgets.
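As a rough sketch of why incremental updates matter on slow links: if both sides keep per-block checksums, only changed or appended blocks need to cross the wire. The block size, manifest layout and function names below are my own illustrative choices (rsync’s actual rolling-checksum protocol is more sophisticated):

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB per block; tune to the link quality

def block_manifest(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash fixed-size blocks so two copies of a large corpus can be
    compared without re-transferring the whole thing."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def blocks_to_fetch(local: list[str], remote: list[str]) -> list[int]:
    """Indices of blocks that changed or were appended since the last
    sync; only these need to cross the slow connection."""
    return [i for i in range(len(remote))
            if i >= len(local) or remote[i] != local[i]]
```

On a monthly snapshot where most records are unchanged, the bytes transferred shrink to roughly the changed fraction of the corpus plus the manifest itself.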

I am sure that my brief survey of academic clouds is incomplete, so I am open to learning more about the academic services that are now available.

Storage, network and compute infrastructure issues aside: my remaining questions to the GBIF technical team are:

  1. How are you going to keep track of the versions of source archives as provided by institutions and their usage in the associated GBIF derived datasets?
  2. How are you planning to reliably link datasets to their associated DOIs? Or, in other words, how can users verify that they have an exact copy of a dataset associated with some DOI? Or, how can I lookup a DOI associated to a dataset that I have sitting on my hard disk?

THANK YOU for your detailed answer and very interesting comments/questions!

With EML you can build relatively detailed provenance information at relatively low cost (https://nceas.github.io/datateam-training/training/building-provenance.html), notably because EML lets you describe both datasets (raw and derived) and the software used to generate a dataset or to derive one dataset from another.
BUT DataONE uses UUIDs and/or DOIs to identify each research object, so these identifiers are not intrinsic, and here lies the potential provenance-reliability issue mentioned in your paper. I hope to test generating intrinsic IDs through a hash approach ASAP, so that DataONE research objects can get unique IDs, perhaps from a combination of an intrinsic ID and a UUID/DOI, and we can be sure the object is the one referenced.
BUT :wink: I am quite new to this complex but beautiful DataONE environment, and there is MD5 information on each DataONE research object (see this document on our test server, under “Authentication”: https://openstack-192-168-100-101.genouest.org/metacatui/view/urn:uuid:ed9db304-fcb9-4c9e-aea2-1237bf58c855 ), so the material seems to be there, and maybe DataONE already uses this information to track provenance…

I would be interested in taking part in such a live discussion… but I am really swamped at the moment…

True, handling big-ish ~1TB datasets is not straightforward… and that’s why I think it is worth differentiating “reference data” from “non-reference data”, as it appears to me that 1/ it is really rare that we have to handle such big-ish datasets, and 2/ such big-ish datasets are mostly reference data built from aggregated data (from databases/databanks, for example), from which users want to extract a subset (species/time/geography). SO here again, I really think the “Galaxy” approach can be an especially good one for the biodiversity world: TBs of reference data (https://galaxyproject.org/admin/reference-data-repo/ notably reference genomes) are made available through caches using HTTP / rsync / CVMFS. I think this would be a good place to test sharing GBIF reference data with the US, EU or Australian communities.
BUT this is a cache, so it does not solve all issues, notably very bad networks… And here, these reference data are accessible to anyone for free :wink:

Thanks @jhpoelen

For instance, a one-time download of a 1TB data from the big three commercial cloud providers easily costs ~ $100

This may be true, but I would note:

  1. GBIF.org provides free access which wouldn’t disappear
  2. The proposal would enable a user to easily pre-shape the data they need (filtering, aggregating) so that they pull down only a summary view to their own schema and control.
  3. “Academic” clouds could be explored if available

How are you going to keep track of the versions of source archives as provided by institutions and their usage in the associated GBIF derived datasets?
How are you planning to reliably link datasets to their associated DOIs? Or, in other words, how can users verify that they have an exact copy of a dataset associated with some DOI? Or, how can I lookup a DOI associated to a dataset that I have sitting on my hard disk?

These are good questions, but I’d consider them tangential to the discussion of enabling cloud users. I say this since they could be asked of GBIF.org today and arguably aren’t a concern for many purposes.

For the foreseeable future, I see no way other than to consider these monthly views as point-in-time snapshots.

I also foresee tracking use as using the existing DOI mechanism, noting that the DOI refers to the concept of a dataset, not a versioned export of it.

Recognizing that we’re dealing with a myriad of data sources (versioned, living, append-only etc) and that protocols in use don’t all enable strong versioning we’ve always taken the approach of providing the “raw” record along with the derived view so that the original state can be viewed.

Ensuring the integrity of dataset copies would need to use some kind of checksumming as you note. Your explorations so far have used the source datasets (i.e. from institutional URLs) which are links to mutable objects by design (the latest version). Since ~2018 we store all versions of the datasets in our crawling infrastructure which represent point-in-time snapshots of each source as it is ingested and are indeed immutable. It’s never been asked of us, but we could expose those as individual datasets if you had an interest.

Thanks @trobertson for your detailed reply.

First, I’d like to emphasize that I much appreciate that GBIF is looking into ways to distribute their aggregate data products beyond the GBIF “cloud” (e.g., servers that run GBIF processes and store/serve data).

I notice that roughly four topics came up in this forum discussion:

  1. getting data closer to where folks do their analysis
  2. ensuring that data publishers are attributed
  3. ensuring that provenance of data is clear
  4. ensuring that the integrity of the data can be verified

With this in mind, I’d like to respond to some of your comments.

Yes, I am able to download a zip file by clicking the “download” button on the occurrence download page https://doi.org/10.15468/dl.otf01c . However, I have no way of telling that the zip file I receive today is the same zip file that my future self will download 5, 10 or 20 years from now. In addition, I’d say that the DOI and the zip file will only be around as long as GBIF has the will, capability and funds to keep them available.

Realizing that the internet is a wild and dynamic place and that funding comes and goes, I would want to figure out a way to keep many copies around in different places without losing the ability to cite, find, retrieve and verify the integrity of the datasets.

A first step towards more reliable data referencing would be to provide checksums (or content hashes) associated with the provided data products and include these in data citations along with the DOIs. Providing this information is standard practice for distributing digital content (e.g., Zenodo provides md5 hashes for each file it hosts).
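As a minimal sketch of what that could look like (the citation format and function names below are mine, not an existing GBIF or Zenodo convention):

```python
import hashlib

def checksum_citation(doi: str, data: bytes) -> str:
    """Cite a dataset by its DOI plus the md5 of the exact bytes,
    similar to how Zenodo publishes an md5 per hosted file."""
    return f"{doi} (md5:{hashlib.md5(data).hexdigest()})"

def matches_citation(data: bytes, citation: str) -> bool:
    """A future reader can confirm their copy is the one that was cited,
    regardless of which mirror or hard disk it came from."""
    return citation.endswith(f"(md5:{hashlib.md5(data).hexdigest()})")
```

The DOI resolves the dataset concept; the checksum pins the exact bytes, so the two together make the citation both findable and verifiable.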

I’d say that when distributing content across different platforms or “clouds”, including the checksum/content hash would help to ensure that no data gets corrupted in transmission.

“Raw” records are neatly packed in zip files or tarballs and made available through institutional URLs. As you noted, you can version these files using checksums or content hashes. With this, you can version any “raw” record and further enhance the provenance of the GBIF-annotated records by including a reliable reference (= content hash) to the version of the provided source archive from which the “raw” material came.
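A hypothetical sketch of such an enhanced record, with made-up field names rather than GBIF’s actual schema:

```python
import hashlib

def annotate_with_provenance(derived: dict, source_archive: bytes) -> dict:
    """Attach a content hash of the exact source-archive version to a
    derived (GBIF-annotated) record, so the input it came from stays
    citable and verifiable even after the institutional URL changes."""
    return {
        **derived,
        "wasDerivedFrom": "hash://sha256/"
                          + hashlib.sha256(source_archive).hexdigest(),
    }
```

Anyone holding a copy of the source archive can recompute the hash and confirm it is the version the derived record claims to come from.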

As our study has also kept track of all the dataset versions registered in the GBIF network since 2018, I’d be very interested to compare our collection of immutable/versioned source archives with those that you have.

Thank you for taking the time to respond and for considering my feedback.

I wholeheartedly support this proposal. The reasoning is sound, the targets are apt, the benefits are many, including an important one not stated - that it will save human resources for GBIF. I wrestle with my conscience when I ask something special of GBIF. I know how hard they work, and they have always come through anyway, but if I could get snapshots in BigQuery monthly, it would take a (down and up) load off my mind (kidding here, because GBIF have always created AVRO files for me that I can load directly in BigQuery, but you get my meaning).
I think having the data closer to analytic platforms will foster much more holistic use and new tools. It could enable sandboxes for the community to try to develop and showcase novel ideas, and GBIF in turn could benefit by incorporating those that are appropriate.

Thank you all

@jhpoelen - I am not ignoring you, but reflecting on your comments; I sense assumptions are being made on living datasets that won’t hold true and I’d like to understand more. A few things spring to mind and I suggest we take a call and then start a dedicated discussion focusing on some of the replication aspects you raise. Would that be OK?

I wholeheartedly support this proposal … [but] … I wrestle with my conscience when I ask something special of GBIF

Thanks for confirming @tuco - enabling the communities you work with to do more, more easily and more cheaply (vocabulary development, understanding standards use, project impact assessments etc.) was a motivation behind this thinking, but hearing it would ease your conscience is an added bonus.

@trobertson thanks for your considerations. Am happy to take a call to sync up and then report out if needed. Please feel free to contact me off-list.

Also, I’d like to echo @tuco on the benefits of using file formats like AVRO and Apache Parquet to enable large-scale data analysis (e.g., writing custom queries against all of the GBIF-registered data yourself) in compute environments like Apache Spark. And . . . from experience I know how much work it takes to convert billions of dwca records into these formats (~a week on a 12-node compute cluster, after spending about a week getting all the data transferred, and that excludes the time to write conversion programs). I haven’t used BigQuery myself yet, perhaps because I am a bit wary of getting locked into some cloud environment. I’d have to dig around some more to see whether BigQuery can run on my laptop and commodity servers (incl. hosted/cloud).

Great idea; the benefits as expressed by @trobertson are clear. I would much prefer it to have a CC-BY license instead of CC-BY-NC.
re @dshorthourse “What’s really needed is upload of datasets containing the subset of data used in analyses, not merely the unfiltered, raw downloaded data. This would help eliminate spurious assignment of credit to publishers or individuals”: I think this is a different topic. For such subsets it would be nice if downloads could contain not only the GBIF data itself but also data linked to it; for that DwC is not usable, but it may be possible with cross-domain developments like SciData.

With regards to data packaging requirements I defined the following requirements for DiSSCo:

  • Easy to use by end users
  • Flexible (extensible, scalable and customisable)
  • Machine readable metadata that is human-editable
  • Use of existing standard formats
  • Language, technology and infrastructure agnostic

Besides the already-mentioned formats, it might also be interesting to look into Linked Data Fragments (linkeddatafragments.org) and of course Data Packages (frictionlessdata.io). Linked Data Fragments is a potentially interesting approach, although it is very much linked-data oriented (triple-based), is not currently widely supported, and does not have a large developer community behind it. But it is small because it uses a compressed binary format, and you can query it with SPARQL without needing to unpack it first.

I also wanted to add my support for this idea. I am experimenting with using GBIF occurrence data on a Databricks Spark Cluster on Microsoft Azure (running RStudio on the master node). My use case is perhaps slightly unusual in the sense that I am trying to link scientific and patent data with taxonomic data for country, regional and global analysis. I mention this simply to highlight that social scientists are interested in this data.

Just to add a few things it might be helpful to think about:

  1. The bulk of occurrence data is made up of Aves (in my case I don’t normally want to see that). It may be helpful to think about ways to divide up the sets - such as by kingdom - to suit different users’ needs. However, I recognise that this could introduce quite a lot of complexity depending on different use cases.

  2. GBIF data is available in the simple and the Darwin Core format. Different users may have different needs here. For me the simple data is normally enough.

  3. File size issues. In the case of Open Academic Graph (for example), the data is made available in 1.8GB chunks in folders which are easy to handle. The US patent office originally made its full texts available as a single massive file but has now broken it down into similar 1GB+ chunks in response to users struggling. Microsoft Academic Graph makes its data available as a set of individual tab-separated tables which are easy to import (but big). I’ve also noticed that NCBI has started to move the Sequence Read Archive onto AWS and GC… it may be worth taking a look at how they are doing that, as records can now be downloaded from multiple places.

  4. My experience using the Databricks Spark cluster has been that Spark converts the csv/tab-separated files to Parquet for processing, and results then need to be converted back to csv. There really is a lot to like about Parquet, and in R (with, say, sparklyr) it is easy to convert results back to csv @jhpoelen . I’m not suggesting providing the data as Parquet, but one issue that will come up is dealing with parsing issues on import, which can be difficult to solve in Spark. I’m not sure how realistic this is, but maybe some kind of test suite to check for common parsing issues would be helpful (maybe that already happens)?

  5. I use taxize and rgbif in my workflows and agree with @sckott that it would be important to think about the programmatic interface issues to avoid a lot of potential overhead in moving back and forth. @sckott I have used Mark Edmondson’s Google Cloud packages with success… so maybe there is some kind of route with existing pkgs for auth and data access? I also agree on the citation issue, that is, how do we cite the novel outcome set from processing on a cluster (for example)?

  6. On cost. My experience with Azure (and Google Cloud) is that storage is the main cost rather than processing (a Databricks cluster is about US$3 per hour, and payment stops when the cluster is terminated). A very big gotcha on cost is data transfer between regions. I have learned that you absolutely do not want to be passing data stored in one region to processing in another region. So it could well be important to have, say, regional mirror copies as close as possible to where the users are. Also, some kind of guide that warns people about these issues when getting set up would be helpful (and save them a lot of money).

I think this is a really great idea (hence all the comments), and I am happy to assist with testing, say on Azure and Databricks, if needed.
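On point 4 above, such a pre-flight parsing check could start very simply. This is a stdlib sketch of my own (a real check for Darwin Core files would also need to handle encodings and embedded newlines):

```python
import csv
import io

def ragged_rows(text: str, delimiter: str = "\t") -> list[str]:
    """Flag rows whose field count differs from the header, the most
    common cause of silently shifted columns when importing into Spark."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    return [f"line {n}: expected {len(header)} fields, got {len(row)}"
            for n, row in enumerate(reader, start=2)
            if len(row) != len(header)]
```

Running a check like this on each chunk before loading would surface malformed records up front, instead of as cryptic failures mid-query.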

Thank you @poldham - very good ideas in there.

Internally we use Hive on Spark / MapReduce on the Hadoop cluster and generally use Avro file formats for longer term storage due to the schema bindings, and Parquet/ORCFiles for query views. We’ve recently added SIMPLE_AVRO and SIMPLE_WITH_VERBATIM_AVRO formats to the download and are testing their compatibility with tools like BigQuery and finding a few oddities. More details on that soon, but you can try them now (formats still in flux). I suspect SIMPLE_AVRO will meet most of your needs from your description as it is a native format in Spark.

Another cloud provider worth exploring is the European Grid Initiative - EGI.eu. EGI federates a large set of publicly funded national and regional cloud services, grid computing services, and HPC and HTC services. EGI has had an important role in the development of EOSC. Some national GBIF nodes, like Spain and Portugal, host their national data portals in clouds provided by EGI members. In the European context, it might be one of the most cost-efficient ways to make computing resources available to researchers.
