@mgrosjean @trobertson thanks for your prompt reply and for sharing more about the workings of the GBIF clustering method.
So, now I know that (dynamic) snapshots are available on servers managed by Amazon (AWS) and Microsoft (Azure), and that some notes on how to process these data are available in various locations.
I do have some remaining questions / comments.
Data Versioning - in “Global Biodiversity Information Facility (GBIF) Species Occurrences - Registry of Open Data on AWS” you say:
[…] While these data are constantly changing at GBIF.org, periodic snapshots are taken and made available on AWS. […]
With the underlying data constantly changing, how can I reproduce the results derived from a specific snapshot that you, or others, have used to produce a clustering result? How do you compile these snapshots? Which specific versions of non-CC-BY-NC data are included in these snapshots?
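To make the versioning concern concrete: if snapshots are published under dated prefixes, an analysis could pin an exact snapshot instead of reading whatever is current. The sketch below is a hypothetical illustration only; the bucket name and path layout are my assumptions and would need to be checked against the actual AWS registry entry.

```python
# Hypothetical sketch: pin a dated GBIF snapshot rather than "latest".
# Bucket name and prefix layout are ASSUMPTIONS for illustration.
BUCKET = "gbif-open-data-us-east-1"

def snapshot_uri(snapshot_date: str) -> str:
    """Build an S3 URI for one specific, dated occurrence snapshot."""
    return f"s3://{BUCKET}/occurrence/{snapshot_date}/occurrence.parquet/"

# Recording this exact URI with the published results would let a
# reviewer read the same bytes later, not a moving target.
pinned = snapshot_uri("2021-11-01")
print(pinned)
```

If every published clustering result carried a pinned URI like this (plus, ideally, a checksum), the "which snapshot?" question would answer itself.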
Software Versioning - you mention using Hive, Spark, and HBase. Which versions do you use? Do you have specific instructions on how to set up your system? Do you keep track of the versions as part of your clustering results? In my experience, different software versions can yield different results, or may render existing programs/scripts ineffective due to API changes.
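For what it's worth, recording the software environment alongside each result need not be heavy. A minimal sketch, assuming one simply writes a provenance record next to the outputs (the Spark/HBase version strings below are illustrative placeholders, not your actual versions):

```python
import json
import platform

def record_environment(extra=None):
    """Capture interpreter and platform versions to store with results."""
    env = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    if extra:
        # Component versions the pipeline operator supplies,
        # e.g. from spark.version or the HBase shell.
        env.update(extra)
    return env

# Versions here are illustrative placeholders only.
provenance = record_environment({"spark": "3.1.2", "hbase": "2.4.8"})
print(json.dumps(provenance, indent=2))
```

Shipping a small JSON like this with each clustering run would answer the "which versions?" question for future reviewers.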
Pay-to-play - I presume I'd have to pay for the computing services needed to access the data within the constraints of their clouds. I imagine that commercial folks let you share your data for "free" so that others have to rent their stuff to access it. Please confirm. Please see screenshots of the "Launch in Hub" paywall in the Microsoft environment [2]. Also, the hack-a-thon example https://github.com/pensoft/BiCIKL/tree/main/Topic%203%20Enhance%20the%20GBIF%20clustering%20algorithms relies on getting an account with https://databricks.com , a commercial cloud re-seller that runs specific software on a select list of commercial cloud vendors.
Prior Art - If you are not familiar with Thessen et al. 2018 [1] (disclaimer: I am a co-author), you might want to reference it in future publications. This paper describes a very similar architecture (e.g., methods for open big data access, Apache Spark, parquet, Jupyter notebooks) that predates your clustering work. One of your collaborators attended related meetings at the time.
In summary, the information you provided is not sufficient for me to reproduce your results: I don't know which snapshot and software versions were used, the cluster results are not published as a versioned dataset, and I don't have the budget to rent the cloud environments needed to access and process the data.
This makes me wonder: how much time/money would it take for me (or anyone else) to create a reproducible environment in which your clustering method can be reviewed independently?
Curious to hear your and other folks' thoughts,
-jorrit
[1] Thessen, A.E. et al., 2018. 20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration. PeerJ Computer Science, 4, p.e164.
[2] Screenshots related to Planetary Computer taken on 2021-11-10