Identifying potentially related records - How does the GBIF data-clustering feature work? - GBIF Data Blog

The data available in GBIF include many so-called “duplicate” records. This is something that users might be familiar with. You download data from GBIF, analyze them and realize that some records have the same date, scientific name, catalogue number and location but come from two different publishers or have slightly different attributes.

This is a companion discussion topic for the original entry at

I can see that the first table isn’t showing well on the Discourse version of the blog post so here is a screenshot version:

(note that you can also read the blog post directly on the blog: Identifying potentially related records - How does the GBIF data-clustering feature work? - GBIF Data Blog).

@mgrosjean Thanks for sharing a general overview of a method to identify records that have similar information elements.

In academic tradition, I would very much like to reproduce your results. In order to do so I was wondering about the following:

  1. Which version of GBIF did you use to produce the result you shared? And, where can I get that version?
  2. Where can I find/download the resulting dataset that includes the clustering results?
  3. What combination (and versions) of software packages did you use?
  4. What hardware did you use to produce your results?
  5. How does your method scale with number of records? O(n), O(n^2), O(log n)?
  6. Do you have a small subset of data / results I can reproduce before moving onto the full GBIF index?

As you can tell, I am curious to learn more about how you are grouping similar records.


Hi @jhpoelen thanks for your questions.

You can find the clustered records with the “in cluster” search filter in the GBIF interface and API: Search (you can click on examples there). You are welcome to write to if you would like to get a table of the occurrence relationships (as those aren’t included in the GBIF downloads).
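As a small sketch of using that filter programmatically (field names other than `isInCluster` are illustrative; check the GBIF API documentation for the full parameter list), you can build an occurrence search URL restricted to clustered records:

```python
from urllib.parse import urlencode

# The "in cluster" filter is exposed as the isInCluster parameter on
# the public occurrence search endpoint.
BASE = "https://api.gbif.org/v1/occurrence/search"

def clustered_search_url(taxon_key=None, limit=20):
    """Build a search URL restricted to records that are in a cluster."""
    params = {"isInCluster": "true", "limit": limit}
    if taxon_key is not None:
        params["taxonKey"] = taxon_key
    return f"{BASE}?{urlencode(params)}"

# Fetch with e.g. urllib.request.urlopen(clustered_search_url(taxon_key=212))
```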

The clustering process is run directly on the GBIF index (i.e. on the GBIF infrastructure) on a regular basis; it isn’t a one-time analysis. The code we use is available in our repository (mainly here and here).

Thanks for the questions @jhpoelen

The environment we run this on is basically hooked up for Hive, Spark and HBase (to store the output).
If your intentions are exploratory, the easiest input would likely be one of the monthly cloud drops on AWS or on the Planetary Computer (Azure). They don’t contain the CC-BY-NC data at the moment though (discussed here).

The easiest way to run this, might be to use the stripped down version I created for a Hackathon event recently. In that you only need an input table from Hive, and a Spark cluster with the output dropped back to Hive. During the Hackathon we used a DataBricks cluster running on Azure and it was fairly easy.

This is all fairly experimental at the moment, and the outputs are not downloadable as a file.

Best of luck. Contributions are welcome of course.

This looks interesting @mgrosjean @trobertson, thanks a lot for sharing.

I would like to ask for permission to reproduce the first table. It’s a good summary of the steps to detect duplicates even inside a given dataset (which is what I would like to do with ours, prior to publishing, so I can then properly follow the recommendations to improve linkage between occurrences).

In case there is no problem, what would be the recommended way to cite this blog post (or any other on Discourse)? (dodobot does not look like an author’s name)

BTW, it’s a shame that posts are auto-closed so soon.
Why don’t you let us comment after one month?


Thanks, @sant - I’d suggest referring to the blog using the URL as the source.

I don’t administer the forum, but I presume it is to discourage endless discussion on long threads. Threads can be reopened on request.


@mgrosjean @trobertson thanks for your prompt reply and for sharing some more about the workings of the GBIF clustering method.

So, now I know that (dynamic) snapshots are available on servers managed by Amazon (AWS) and Microsoft (Azure), and that some notes on how to process these data are available in various locations.

I do have some remaining questions / comments.

Data Versioning - in “Global Biodiversity Information Facility (GBIF) Species Occurrences - Registry of Open Data on AWS” you say:

[…] While these data are constantly changing, periodic snapshots are taken and made available on AWS. […]

With a constantly shifting snapshot, how can I reproduce the results of a specific snapshot that you, or others, have used to create some clustering result? How do you compile these snapshots? Which specific versions of non-CC-BY-NC data are included in these snapshots?

Software Versioning - you mention using Hive, Spark, and HBase. Which versions do you use? Do you have specific instructions on how to setup your system? Do you keep track of the versions as part of your clustering results? In my experience, different software versions can yield different results, or may render existing programs/scripts ineffective due to API changes.

Pay-to-play - I presume I’d have to pay to get access to the computer services needed to access the data within the constraints of their clouds. I imagine that commercial folks let you share your data for “free” so that others have to rent their stuff to access it. Please confirm. Please see screenshots of the “Launch in Hub” paywall in the Microsoft environment [2]. Also, the hack-a-thon example BiCIKL/Topic 3 Enhance the GBIF clustering algorithms at main · pensoft/BiCIKL · GitHub relies on getting an account with , a commercial cloud re-seller that runs specific software on a select list of commercial cloud vendors.

Prior Art - If you are not familiar with Thessen et al. 2018 [1] (disclaimer: I am a co-author), you might want to reference it in future publications. This paper describes a very similar architecture (e.g., methods for open big data access, Apache Spark, parquet, jupyter notebooks) that predates your clustering work. One of your collaborators has attended related meetings at the time.

In summary, the information you provided is not sufficient for me to reproduce your results, because I don’t know which snapshot/software versions were used, the cluster results are not published as a versioned dataset, and I don’t have the budget to rent the cloud environments needed to access and process the data.

This makes me wonder: how much time/money would it take for me (or anyone else) to create a reproducible environment in which your cluster method can be reviewed independently?

Curious to hear your and other folks’ thoughts,

[1] Thessen, A.E. et al., 2018. 20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration. PeerJ Computer Science, 4, p.e164.

[2] Screenshots related to the Planetary Computer, taken on 2021-11-10

Thanks @trobertson
I didn’t realize the post URL shows the authors’ names (I was reading it in the Discourse version). Something like this, then?

Grosjean, M.; Robertson, T. 2021. Identifying potentially related records - How does the GBIF data-clustering feature work? GBIF Data Blog, posted November 4 2021.

Thanks @sant. That would be fine.

You can’t, I’m afraid. Similar to the GBIF indexing, it depends on the state of the index at the time it is run. You can generate clusters from a snapshot, which is the approach I’d recommend if you want to do any research. I’d encourage research that compares the performance of the process to other techniques, such as ML-based approaches, rather than research that just tries to reproduce the results of a fairly basic JOIN. I am certain there will be better approaches, as ours is a very naive rules-based approach.

I’d expect software versions to be unlikely to yield different results here, as it’s really just a simple Java method to compare two records, but I guess it could happen. Hive and Spark are only there to provide a SQL interface to the data, and it’s known to run on Spark 2.3, 2.4 and 3.x. The exact versions are listed in the project poms and released like any Java package.

You don’t need to pay to access any GBIF download or snapshot, even if accessing from AWS or Azure. If you want to run a computation cluster, then yes, you’d need to source that somehow (if it were me, I’d use free credits from one of the cloud providers).

This makes me wonder: how much time/money would it take for me (or anyone else) to create a reproducible environment in which your cluster method can be reviewed independently?

Time-wise it really depends on experience and approach. It took a morning to introduce and get a group running this in a hackathon on DataBricks. Someone more familiar with the technology would be able to run it in say an hour or so - perhaps at a cost of around a few 10s of US$. Because the algorithm boils down to a simple compare(record1, record2) you could fairly easily wrap it up in other approaches (i.e. not Spark and Hive) to determine which candidate records you wish to compare to. This is why I say it depends on the approach taken.

Thanks for the link to the paper. On a first glance it looks to be doing far more than what we document here (i.e. linking out to other sources) but the technology stack is similar and could be reused to explore the GBIF version. We don’t anticipate writing any publications on this but I’m happy to answer questions on anyone researching it.

@trobertson Thanks for taking the time to respond to my notes.

Great! I guess I was being a little too pessimistic about the open access. Can you please add some examples on how to download the datasets?

Ok, I can see how the provided snapshots can be used as a way to play around with a big dataset. However, when doing method comparisons (e.g., outcomes, performance, scaling etc.), I imagine that working with a clearly defined (and citable) dataset is key. Let’s say that eBird decides to make their annual dump of occurrence data available in between comparison tests; then the variability of the results may very well be caused by the addition (or removal) of indexed occurrence records.

Also, from your description of the cluster algorithm as a compare(record1, record2), I assume that the computational complexity of the method is at least O(n^2), meaning that the time it takes to process records grows at least with N*N, where N is the number of occurrence records.
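As a toy illustration of that scaling argument (this is not GBIF code), the number of unordered pairs among n records grows quadratically:

```python
from itertools import combinations

# Without any blocking, comparing every record against every other record
# needs n*(n-1)/2 comparisons, i.e. O(n^2).
def naive_pair_count(n):
    return sum(1 for _ in combinations(range(n), 2))
```

So going from 1,000 to 10,000 records multiplies the comparison count by roughly 100, which is why a naive all-pairs approach would be hopeless at GBIF scale.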

I am excited that you are actively soliciting others to improve on the clustering methods that you are currently listing on the GBIF search pages.

Hopefully, at some point in the future, someone will find the time/money to systematically review/document the method with a well-defined test set and hardware/software setup.

Also, I hope we’ll find a way to better capture the provenance of the complex data products (like the GBIF index) so that we can work with well-defined and citable data publications.

In other words, lots to think about (at least for me)!


Not at all - thank you for the interest.

On AWS Public Dataset, just click “Browse bucket” on the region closest to you. US East is here. These are immutable, monthly snapshots with DOIs. The same files are on GBIF, Azure and AWS.

On Azure, they don’t have such an easy browser, and you need to do a call to list the file parts, then download them (see example).

That is why there is the blocking stage to identify candidate records. It’s implemented in SQL and creates a series of hashes on each record. Records with similar hashes (candidate groups) are then compared to each other (NxN within the group). This worked across the 2B records in around 40 mins on the GBIF Spark cluster, which was my operational target. Data skew could creep in and blow this in the future of course.
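A minimal Python sketch of that blocking idea (field names and blocking keys are illustrative, not GBIF’s actual SQL or schema): each record is hashed into one or more candidate groups, and only records sharing a group are compared pairwise.

```python
from collections import defaultdict
from itertools import combinations

def blocking_keys(rec):
    """Derive one or more blocking keys (hashes) from a record."""
    keys = []
    if rec.get("speciesKey") and rec.get("year"):
        keys.append(("species_year", rec["speciesKey"], rec["year"]))
    if rec.get("catalogNumber"):
        keys.append(("catalog", rec["catalogNumber"].lower()))
    return keys

def candidate_pairs(records):
    """Group records by blocking key, then pair up records within groups."""
    groups = defaultdict(list)
    for rec in records:
        for key in blocking_keys(rec):
            groups[key].append(rec["id"])
    pairs = set()
    for ids in groups.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))  # NxN comparisons only within each candidate group
    return pairs

recs = [
    {"id": 1, "speciesKey": 5219404, "year": 2001, "catalogNumber": "ABC-1"},
    {"id": 2, "speciesKey": 5219404, "year": 2001},
    {"id": 3, "speciesKey": 2480528, "year": 1999, "catalogNumber": "abc-1"},
]
```

Here records 1 and 2 become candidates via the species/year key, and 1 and 3 via the normalized catalogue number; record pairs sharing no key are never compared, which is what keeps the full run tractable.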

Given your interest - would you like a crash course on Databricks on an Azure cluster (on GBIF credits) one day soon on a shared zoom session? I can get you running this and perhaps you can put a critical eye on it as I am certain it can be improved or even replaced with a different approach. I’m more in favour of exploring ways to improve it rather than simply have someone struggle to rerun it - we might be able to find some traceable benchmarks to run periodically on the monthly views too, to look at the performance of the clustering (not runtime) over time.

@trobertson thanks for sharing the examples/hints on how to access the GBIF parquet chunks in the commercial clouds of Amazon and Microsoft.

I can see how hashing of properties can help reduce the complexity of the comparisons. However, I am still unclear about the specifics of how the current algorithm scales with the snapshots you used (e.g., a figure that shows how computational cost scales with the number of (randomly sampled) occurrences in various hash/compute configurations).

You did mention that you ran 2B records in 40 minutes. I just checked the big GBIF occurrence counter and saw that it is currently set to 1.9B, less than 2B. Also, I am curious to learn more about the specs of the GBIF Spark cluster that completed your compute job in 40 minutes. Is this Spark cluster running on GBIF’s own servers, or is it running on Databricks?

Thanks for offering a crash course on Databricks on GBIF credits.

After spending some years helping to set up, maintain, and develop algorithms for a Spark cluster in collaboration with the University of Florida, I can see the benefits of outsourcing the complex configuration/maintenance of compute clusters, especially if money is not as much of an issue.

Before jumping back into the operational aspects of running programs (or compute jobs) in a Spark cluster, I’d like to better understand the behavior of the clustering algorithm you developed, the provenance of the data, and to find reliable ways to cite the input and the output of compute jobs. With known input/output and a rough sketch of the expected computational expense, I’d be more comfortable experimenting with/tweaking the algorithm and the system on which it runs.

I’ll keep my eyes peeled for additional documentation created by you or your collaborators.

Very cool to see these data experiments come to life!

Great post on this feature, but I would like to ask for some clarification. Below I will describe my understanding from the post for step 2 of the algorithm.

In each candidate set of occurrence records obtained in step 1, records are compared pairwise against each of the criteria listed in the first table. The rows of this table are to be interpreted as follows (taking the first row as an example):

If the values of the field taxonKey coincide and the value of the field typeStatus is “Holotype” in each record of the pair, then this pair of occurrence records is flagged with “same specimen”.
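In code, I read that first row as a pairwise predicate roughly like this (a sketch of my understanding, not GBIF’s actual Java implementation; field names follow the blog):

```python
# Flag a pair as "same specimen" when taxonKey matches and both
# records are typed as Holotype.
def same_specimen(rec1, rec2):
    return (
        rec1.get("taxonKey") is not None
        and rec1.get("taxonKey") == rec2.get("taxonKey")
        and rec1.get("typeStatus") == "Holotype"
        and rec2.get("typeStatus") == "Holotype"
    )

a = {"taxonKey": 1340503, "typeStatus": "Holotype"}
b = {"taxonKey": 1340503, "typeStatus": "Holotype"}
c = {"taxonKey": 1340503, "typeStatus": "Paratype"}
```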

If a pair of records gets categorized with a flag combination shown in the columns of the second table, they are allocated to a corresponding cluster.

How exactly are the clusters formed among three or more records? For inclusion in a cluster, is it sufficient that a given occurrence record shares a flag combination with one other record in that cluster, or must it share this flag combination with all records in the cluster? Or must all cluster elements share all flag combinations among them? The sentence in the blog post right above the second table seems to indicate the second option, but I want to make sure I understand this correctly.

What happens if records share different flag combinations with other records? Are different clusters formed?

How are the cluster criteria represented in the cluster tab of an occurrence record’s web page?
From the examples linked in this post I get the impression that, for each record related to the record in focus (the “current” record), each of the pairwise established flags is shown; in the linked example, these have no overlap except for the flag same accepted species.

Thanks @cboelling

I think you’ve understood that we allocate records to a “cluster”, which isn’t the case (edited to add: re-reading the blog, I can understand this confusion given our choice of wording). We create a link between two records when the rules are satisfied, storing the reason why a link is created (e.g. identifiers overlap + same_location + same species + same_date). In graph terminology, the occurrence records are the nodes and here we create the edges.
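A minimal sketch of that edge model (illustrative only, not our actual HBase storage): each satisfied rule set becomes a labelled edge between two occurrence nodes, and the “cluster” tab is essentially the set of edges touching the current record.

```python
# Occurrences are nodes; each satisfied rule set becomes a labelled edge
# storing why the link was made.
edges = []

def link(occ_id1, occ_id2, reasons):
    edges.append({"from": occ_id1, "to": occ_id2, "reasons": sorted(reasons)})

def links_of(occ_id):
    """All edges touching a given occurrence (the 'current record' view)."""
    return [e for e in edges if occ_id in (e["from"], e["to"])]

link(101, 202, {"IDENTIFIERS_OVERLAP", "SAME_LOCATION", "SAME_SPECIES", "SAME_DATE"})
link(101, 303, {"SAME_ACCEPTED_SPECIES"})
```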

We are beginning to explore if/how we could create strong cluster objects from these and also how to best visualize and explore the graph.

That is correct as explained above, although in the example you provide you can see several reasons why each of the 3 links are made (e.g. identifiers overlap for one). You can compare the records by clicking on “details”. Perhaps I misunderstand the question?

I hope this helps.