Identifying potentially related records - How does the GBIF data-clustering feature work? - GBIF Data Blog

trobertson · November 10, 2021, 10:17am

Thanks for the questions @jhpoelen

The environment we run this on is basically set up for Hive, Spark and HBase (HBase stores the output).
If your intentions are exploratory, the easiest input would likely be one of the monthly cloud drops on AWS or on the Planetary Computer (Azure). They don’t contain the CC-BY-NC data at the moment, though (discussed here).

The easiest way to run this might be to use the stripped-down version I created for a hackathon event recently. For that you only need an input table in Hive and a Spark cluster, with the output written back to Hive. During the hackathon we used a Databricks cluster running on Azure and it was fairly easy.
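To give a feel for the shape of such a job, here is a minimal PySpark sketch of the "Hive table in, Spark, Hive table out" flow. It is not the hackathon code: the function names, the `transform` callback and any table names you pass in are hypothetical, and the actual record-matching logic is left as a placeholder.

```python
from typing import Any, Callable


def run_hive_to_hive(
    spark: Any,
    source_table: str,
    target_table: str,
    transform: Callable[[Any], Any],
) -> None:
    """Read a Hive table, apply a DataFrame transform, write the result back to Hive.

    `transform` is a hypothetical hook where the matching logic would go;
    it receives the input DataFrame and returns the DataFrame to store.
    """
    records = spark.table(source_table)  # input table registered in Hive
    result = transform(records)          # placeholder for the clustering step
    result.write.mode("overwrite").saveAsTable(target_table)  # output back to Hive


def build_session(app_name: str = "clustering-sketch") -> Any:
    """Create a Hive-enabled Spark session (assumes pyspark is installed)."""
    # Imported lazily so run_hive_to_hive can be exercised without Spark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()
        .getOrCreate()
    )
```

On a cluster you would call something like `run_hive_to_hive(build_session(), "my_db.occurrence_input", "my_db.related_output", my_transform)`, where both table names and `my_transform` are placeholders for your own setup.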

This is all fairly experimental at the moment, and the outputs are not downloadable as a file.

Best of luck. Contributions are welcome of course.

