With regard to data packaging, I defined the following requirements for DiSSCo:
Easy to use by end users
Flexible (extensible, scalable and customisable)
Machine readable metadata that is human-editable
Use of existing standard formats
Language, technology and infrastructure agnostic
Besides the formats already mentioned, it might also be interesting to look into Linked Data Fragments (linkeddatafragments.org) and, of course, Data Packages (frictionlessdata.io). Linked Data Fragments is a potentially interesting format, although it is very much Linked Data oriented (triple-based), is not currently widely supported, and does not have a large developer community behind it. On the other hand, the underlying storage is a compact, compressed binary format, yet you can query it with SPARQL without needing to unpack it first.
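To make the Data Packages option concrete: a Frictionless Data Package is just a directory with a `datapackage.json` descriptor next to the data files, which ticks the "machine readable, human-editable" requirement. A minimal sketch for a hypothetical occurrence dump (the file name and field list are illustrative, not a real DiSSCo layout):

```json
{
  "name": "occurrence-snapshot",
  "resources": [
    {
      "name": "occurrences",
      "path": "occurrences.csv",
      "format": "csv",
      "schema": {
        "fields": [
          {"name": "gbifID", "type": "integer"},
          {"name": "scientificName", "type": "string"},
          {"name": "decimalLatitude", "type": "number"},
          {"name": "decimalLongitude", "type": "number"}
        ]
      }
    }
  ]
}
```

Because the descriptor is plain JSON over plain CSV, it stays language-, technology- and infrastructure-agnostic.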
I also wanted to add my support for this idea. I am experimenting with using GBIF occurrence data on a Databricks Spark Cluster on Microsoft Azure (running RStudio on the master node). My use case is perhaps slightly unusual in the sense that I am trying to link scientific and patent data with taxonomic data for country, regional and global analysis. I mention this simply to highlight that social scientists are interested in this data.
Just to add a few things it might be helpful to think about:
The bulk of occurrence data is made up of Aves (in my case I don't normally want to see that). It may be helpful to think about ways to divide up the sets - such as by kingdom - to suit different users' needs. However, I recognise that this could introduce quite a lot of complexity depending on different use cases.
GBIF data is available in the simple format and the Darwin Core Archive format. Different users may have different needs here. For me the simple data is normally enough.
File size issues. In the case of the Open Academic Graph (for example), the data is made available in 1.8 GB chunks in folders, which are easy to handle. The US patent office originally made its full texts available as a single massive file but has now broken it down into similar 1 GB+ chunks in response to users struggling. Microsoft Academic Graph makes its data available as a set of individual tab-separated tables which are easy to import (but big). I've also noticed that NCBI has started to move the Sequence Read Archive onto AWS and GC… it may be worth taking a look at how they are doing that, as records can now be downloaded from multiple places.
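The chunking approach those providers converged on is straightforward to produce. A minimal sketch (function name and chunk size are my own choices, not any provider's tooling) that splits a large tab-separated dump into roughly fixed-size parts, repeating the header in each so every chunk is independently usable:

```python
import os

def split_tsv(src, out_dir, max_bytes=1_000_000_000):
    """Split a large TSV into chunks of roughly max_bytes each
    (a chunk may exceed the limit by one row), repeating the
    header line at the top of every chunk."""
    os.makedirs(out_dir, exist_ok=True)
    chunks = []
    with open(src, encoding="utf-8") as fin:
        header = fin.readline()
        part, out, written = 0, None, 0
        for line in fin:
            # Start a new chunk before the first row or once the limit is hit.
            if out is None or written >= max_bytes:
                if out:
                    out.close()
                path = os.path.join(out_dir, f"part-{part:05d}.tsv")
                out = open(path, "w", encoding="utf-8")
                out.write(header)
                written = len(header.encode("utf-8"))
                chunks.append(path)
                part += 1
            out.write(line)
            written += len(line.encode("utf-8"))
        if out:
            out.close()
    return chunks
```

Keeping the header in every part is what makes the chunks easy to load independently into R, Spark, or BigQuery.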
My experience using the Databricks Spark cluster has been that Spark converts the csv/tab-separated files to Parquet for processing, and the results then need to be converted back to csv. There really is a lot to like about Parquet, and in R (with, say, sparklyr) it is easy to convert results back to csv @jhpoelen. I'm not suggesting providing the data as Parquet, but one issue that will come up is dealing with parsing issues on import, which can be difficult to solve in Spark. I'm not sure how realistic this is, but maybe some kind of test suite to check for common parsing issues would be helpful (maybe that already happens)?
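I don't know of an official tool for this, but a pre-flight check along those lines could be quite small. A sketch (the specific checks are just examples of problems that tend to break Spark imports, and the function name is mine):

```python
import csv

def check_tsv(path, max_report=10):
    """Scan a tab-separated file for common parsing problems:
    rows whose column count differs from the header, and fields
    containing embedded quotes or carriage returns that often
    confuse downstream parsers (e.g. on a Spark import).
    Returns up to max_report (line_number, description) pairs."""
    problems = []
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        for lineno, row in enumerate(reader, start=2):
            if len(row) != len(header):
                problems.append(
                    (lineno, f"expected {len(header)} columns, got {len(row)}"))
            elif any('"' in field or "\r" in field for field in row):
                problems.append((lineno, "embedded quote or carriage return"))
            if len(problems) >= max_report:
                break
    return problems
```

Running something like this on each chunk before publication would surface the ragged rows and stray quote characters that are otherwise painful to diagnose after a cluster import fails.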
I use taxize and rgbif in my workflows and agree with @sckott that it would be important to think about the programmatic interface issues to avoid a lot of potential overhead in moving back and forth. @sckott I have used Mark Edmondson's Google Cloud packages with success… so maybe there is some kind of route with existing pkgs for auth and data access? I also agree on the citation issue. That is, how do we cite the novel outcome set from processing on a cluster (for example)?
On cost. My experience with Azure (and Google Cloud) is that storage is the main cost rather than processing (a Databricks cluster is about US$3 per hour, and payment stops when the cluster is terminated). A very big gotcha on cost is data transfer between regions. I have learned that you absolutely do not want to be passing data stored in one region to processing in another region. So, it could well be important to have, say, regional mirror copies as close as possible to where the users are. Also, some kind of guide that warns people about these issues when getting set up would be helpful (and save them a lot of money).
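To make that gotcha concrete, here is a back-of-the-envelope comparison. The prices below are hypothetical round numbers for illustration only (roughly the right order of magnitude, but check your provider's current rates):

```python
# Hypothetical prices for illustration only; check current provider rates.
cluster_usd_per_hour = 3.00   # e.g. a small Databricks cluster
egress_usd_per_gb = 0.08      # typical order of magnitude for inter-region transfer

dataset_gb = 750              # e.g. a sizeable occurrence snapshot
hours_of_processing = 4

compute_cost = cluster_usd_per_hour * hours_of_processing
egress_cost = egress_usd_per_gb * dataset_gb

print(f"compute: ${compute_cost:.2f}, cross-region egress: ${egress_cost:.2f}")
```

Even with generous assumptions, a single cross-region read of the dataset can cost several times the entire compute bill, which is why co-locating mirrors with users matters so much.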
I think this is a really great idea, hence all the comments, and happy to assist with testing say on Azure and Databricks if needed.
Internally we use Hive on Spark / MapReduce on the Hadoop cluster, and generally use Avro file formats for longer-term storage due to the schema bindings, and Parquet/ORC files for query views. We've recently added SIMPLE_AVRO and SIMPLE_WITH_VERBATIM_AVRO formats to the download and are testing their compatibility with tools like BigQuery, finding a few oddities. More details on that soon, but you can try them now (formats still in flux). I suspect SIMPLE_AVRO will meet most of your needs from your description, as it is a native format in Spark.
Another cloud provider worth exploring is the European Grid Initiative - EGI.eu. EGI federates a large set of publicly funded national or regional cloud services, grid computing services, and HPC and HTC services. EGI has had an important role in the development of EOSC. Some national GBIF nodes, like Spain and Portugal, host their national data portals in clouds provided by EGI members. In the European context, it might be one of the most cost-efficient ways to make computing resources available to researchers.
In early April GBIF was kind enough to put together a snapshot of the occurrence data in the SIMPLE_WITH_VERBATIM_AVRO format @trobertson mentioned. We had one glitch with the original single file when trying to load it into BigQuery from Google Cloud Storage, but that was resolved. After that it was trivial to load the entirety of the occurrence data, including verbatim inputs, GBIF interpretations, and issues. I've been using this every day since for multiple investigations, including a) options for thesauri against raw input values of terms for which Darwin Core recommends controlled vocabularies, and b) construction of a "Location Catalogue" of distinct combinations of the Darwin Core Location class terms and GBIF interpretations, against which to try to find georeferences for records without georeferences from those that already have them. All of the latter is linked with the initiative to establish "Biodiversity Enhanced Location Services" as presented in the Darwin Core Hour "Imagining a Global Gazetteer of Georeferences" and subsequent BBQs (https://www.idigbio.org/content/darwin-core-hour-2-bbqs-imagining-global-gazetteer-georeferences and https://github.com/tdwg/dwc-qa/wiki/Webinars#chapter17).
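The core of that Location Catalogue idea can be sketched in a few lines: group records on the tuple of Location-class terms, then propagate a georeference within each group to the records that lack one. The term subset and function below are illustrative only, and real matching would of course need far more care (normalisation, uncertainty, conflicting donors):

```python
from collections import defaultdict

# Darwin Core-style Location terms used as the grouping key (illustrative subset).
LOCATION_TERMS = ("country", "stateProvince", "county", "locality")

def propagate_georeferences(records):
    """Group records by their Location-term tuple; where any record in a
    group already has a georeference, copy it to group members without one.
    Returns the number of records that received a georeference."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(t, "") for t in LOCATION_TERMS)
        groups[key].append(rec)
    filled = 0
    for members in groups.values():
        # Pick any member that already has coordinates as the donor.
        donor = next((r for r in members
                      if r.get("decimalLatitude") is not None
                      and r.get("decimalLongitude") is not None), None)
        if donor is None:
            continue
        for r in members:
            if r.get("decimalLatitude") is None:
                r["decimalLatitude"] = donor["decimalLatitude"]
                r["decimalLongitude"] = donor["decimalLongitude"]
                filled += 1
    return filled
```

At BigQuery scale the same grouping is a single aggregation over the distinct Location-term combinations, which is what makes the catalogue cheap to build there.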
Big support from my side too on taking the data to where the analysis will be made. It makes science tremendously more affordable and convenient, and therefore I am sure it will increase usage a lot (and that's what ultimately we all want, right?).
We at CARTO have been working on bringing data into the Public Datasets program at Google, but AWS and Azure all have subsidized programs to do so. The cloud providers will subsidize the cost of storage, by the way.
We would definitely be happy to collaborate on this.
And as an example, here you have a simple map of all occurrences coming out of the AVRO file @trobertson mentioned, loaded into BigQuery. Load time into BigQuery: less than 10 minutes; making the map out of 1.4B rows: less than 6 minutes. It can't get much more convenient and cost-effective than this.
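For anyone curious what "making the map" amounts to computationally, the heavy lifting is just binning coordinates into grid cells and counting (on BigQuery that is a single GROUP BY over the whole table). A toy sketch of the aggregation, with my own function name and a 1-degree grid as an arbitrary choice:

```python
import math
from collections import Counter

def bin_occurrences(points, cell_deg=1.0):
    """Count occurrence points per cell_deg x cell_deg grid cell.
    Keys are the (lat, lon) of each cell's lower-left corner."""
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_deg) * cell_deg,
                math.floor(lon / cell_deg) * cell_deg)
        counts[cell] += 1
    return counts
```

The resulting cell counts are what a mapping layer then turns into a density map; the per-row work is trivial, which is why 1.4B rows aggregate in minutes on a columnar engine.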
This is an old thread, but for anyone stumbling upon it… in addition to the AWS and Azure drops, GBIF data is now put on Google BigQuery as a public table each month.