How GBIF identifies related occurrence records (GBIF technical support hour for Nodes)

In this November session, we will explain how we identify related occurrence records on GBIF (which is sometimes referred to as the GBIF “data clustering” feature). For some background information, you can read our blogpost here.
Tim Robertson from the Informatics team will join the session. We will be happy to answer any questions, whether or not they relate to the topic.

The next session is on November the 1st at 4pm CET (UTC+1). Note that Copenhagen will be on winter time by then, so we will no longer be in CEST.

The invitation with registration link will be sent to the GBIF Nodes. If you are interested in attending, you can reach out to your local node.
The edited recording and the transcript of the questions will be made available here.

The video is available here: How GBIF identifies related occurrence records on Vimeo

Here is the transcript of the questions during the session.

If two records are related to each other (for example the same occurrence from an old static dataset and a new version published) but have different associated scientific names, would the algorithm detect the pair?

One combination of assertions doesn’t require the pair to have the same accepted name. If two specimens have overlapping catalogue numbers, they will be linked regardless of the species name. In other words, they are linked if the catalogue number of one specimen matches the value in the catalogueNumber or otherCatalogueNumber field of the other specimen, for the same institution and collection codes. Note that the format of the catalogue number matters here; see the details in the code here as well as the documentation on the blogpost.
All the other combinations of assertions require the occurrences to have names that are at least synonyms. You are welcome to help us improve the algorithm. Don’t hesitate to share your ideas on this thread or by writing to helpdesk@gbif.org. The current rules can be changed and we welcome feedback. Thanks!
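To make the catalogue number rule more concrete, here is a minimal Python sketch of the comparison described above. It is not the GBIF implementation: the `Specimen` class and the function names are hypothetical, and the sketch deliberately ignores the catalogue number format checks that the real code applies.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Specimen:
    # Hypothetical, simplified record holding only the fields the rule needs.
    institution_code: str
    collection_code: str
    catalog_number: str
    other_catalog_numbers: List[str] = field(default_factory=list)


def known_identifiers(s: Specimen) -> Set[str]:
    """All non-empty catalogue identifiers recorded for a specimen."""
    values = [s.catalog_number, *s.other_catalog_numbers]
    return {v.strip() for v in values if v and v.strip()}


def catalogue_numbers_overlap(a: Specimen, b: Specimen) -> bool:
    """True if one specimen's catalogue number appears among the other's
    catalogue numbers, for the same institution and collection codes."""
    if (a.institution_code, a.collection_code) != (b.institution_code, b.collection_code):
        return False
    a_cat, b_cat = a.catalog_number.strip(), b.catalog_number.strip()
    return (bool(a_cat) and a_cat in known_identifiers(b)) or (
        bool(b_cat) and b_cat in known_identifiers(a)
    )
```

Note that the scientific name plays no part in this check, which is why a pair can be linked even when the two records carry different names.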

Where can we check the cluster view and interface demonstrated during the presentation?

This visualisation of the links between specimens is available for GBIF Hosted Portals. This is an experimental feature that can be enabled for any portal. In fact, this view is currently available for the GRSciColl website. See an example on a collection specimen page and an example in the general specimen search interface.
We welcome feedback on this feature.

We have two datasets with overlapping occurrences but not the same catalogue numbers (as they are hosted by two different systems). Could we use this clustering feature to reconcile them and find pairs?

We only run the algorithm on data published on GBIF. There is no service where you can upload a dataset and find out if there are related occurrences published on GBIF.
That being said, we could repurpose the code base. We could reuse the function that compares records and assesses relationships. Please contact us if you are interested.

Is anyone using the clustering feature to find data that can enrich your own records? I have heard that people use it to find citations or DNA sequences that have been published on GBIF for a given specimen and bring those back into their collection management system.

None of the session participants are doing this at the moment.

Having an occurrence with a materialCitation basis of record and another occurrence relating to the same specimen with a preservedSpecimen basis of record would technically be duplicate records.

The reality is that we have those duplicates (especially with the publication of the INSDC datasets, which contain sequences relating to specimens) and our current model cannot really accommodate that. This is one of the reasons why we worked on this clustering feature and the richer data models. Right now an occurrence only tells one part of the story. We have one occurrence for a specimen, one for a sequence, one for a citation, etc. Ideally, the model should be able to accommodate one occurrence for the observation/collection of an organism and a series of activities resulting in records in different databases.
Today though, we should enrich the information at source, in collection management systems, with citations and sequences.

We have been working on transect surveys of butterflies. In that case, the same specimen can be observed multiple times. How to deal with that?

In the cases where you know that the same organism was observed several times, you can have one occurrence per observation and use the same dwc:organismID. This is what is done for bird tracking datasets for example (see an example here).
If there is no way to know if the same organism is being observed, you have to share the recordings as separate occurrences.
Note that the Humboldt Extension task group discusses how to model sampling efforts. Something to keep an eye on.
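As an illustration of the shared organismID approach, here is a small Python sketch that groups occurrences by organism. The rows and identifier values are made up for the example, and the function name is hypothetical; only the Darwin Core term names (occurrenceID, organismID, eventDate) come from the standard.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_organism(occurrences: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each organismID to the occurrenceIDs that share it.
    Rows without an organismID are left out, since nothing links them."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for occ in occurrences:
        organism_id = occ.get("organismID", "").strip()
        if organism_id:
            groups[organism_id].append(occ["occurrenceID"])
    return dict(groups)


# Hypothetical transect data: the same butterfly seen twice, plus one
# sighting where the individual could not be identified.
rows = [
    {"occurrenceID": "obs-1", "organismID": "butterfly-A", "eventDate": "2023-06-01"},
    {"occurrenceID": "obs-2", "organismID": "butterfly-A", "eventDate": "2023-06-08"},
    {"occurrenceID": "obs-3", "organismID": "", "eventDate": "2023-06-08"},
]
print(group_by_organism(rows))
```

Each observation stays a separate occurrence; the shared organismID is what tells data users that the first two rows refer to the same individual.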

I was wondering why the two following occurrences aren’t shown as related to each other: Occurrence Detail 2557230648 and Occurrence Detail 2557436489. Is it because they are published in the same dataset?

Yes, it is because they are published in the same dataset. Our system doesn’t compare occurrences within the same datasets.
The reason why we compare only occurrences across datasets is that there are a few datasets in GBIF which contain host/parasite data. For example, some datasets will contain all the gut microbiome for a given individual. All of those gut occurrences would cluster together and we would have a number of comparisons that isn’t easily manageable. We have an open GitHub issue on exploring how to detect this type of situation and possibly enable comparison within datasets.
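The cross-dataset restriction can be sketched as follows. This is only an illustration of the rule, not how GBIF generates candidates in practice: the `key` and `datasetKey` values are hypothetical, and a brute-force pass over all pairs is used here purely for readability.

```python
from itertools import combinations
from typing import Dict, Iterator, List, Tuple


def cross_dataset_pairs(
    occurrences: List[Dict[str, str]]
) -> Iterator[Tuple[str, str]]:
    """Yield pairs of occurrence keys that are eligible for comparison,
    skipping any pair whose records come from the same dataset."""
    for a, b in combinations(occurrences, 2):
        if a["datasetKey"] != b["datasetKey"]:
            yield a["key"], b["key"]


# Hypothetical records: two from dataset A, one from dataset B.
records = [
    {"key": "1", "datasetKey": "A"},
    {"key": "2", "datasetKey": "A"},
    {"key": "3", "datasetKey": "B"},
]
print(list(cross_dataset_pairs(records)))
```

Without the filter, a dataset with n similar occurrences would produce n(n-1)/2 internal pairs on its own, which is the growth the answer above refers to.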

If we know there are duplicate records in GBIF, how can we make sure they are easily found by the algorithm and by the users? How can we improve the user experience?

Unfortunately, we don’t have a way to download related occurrences together. As a publisher, if you know that two occurrences are related, you can put that information in the dwc:associatedOccurrences field. Note that this field isn’t included in the SIMPLE download format (only in the DWCA download format).
There is also the Darwin Core Resource Relationship extension, where you can specify the relationship between records (note that this extension isn’t available in the GBIF downloads). We have opened this GitHub issue on the topic: GBIF Clustering feature - make outcome more visible to users · Issue #5038 · gbif/portal-feedback · GitHub. Don’t hesitate to share your feedback and ideas there. Thanks!
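For publishers who want to fill in dwc:associatedOccurrences, here is a minimal Python sketch that builds the field value from a list of related identifiers. The helper name and the identifiers are hypothetical, and the " | " separator is an assumption based on the Darwin Core recommendation for list-valued terms; please check the term definition before relying on it.

```python
from typing import Iterable, List


def associated_occurrences(related_ids: Iterable[str]) -> str:
    """Join related occurrence identifiers into a single field value,
    dropping blanks and duplicates while keeping the original order."""
    seen: List[str] = []
    for rid in related_ids:
        rid = rid.strip()
        if rid and rid not in seen:
            seen.append(rid)
    # Assumed separator: space, vertical bar, space.
    return " | ".join(seen)


# Hypothetical identifiers for two records related to the current one.
value = associated_occurrences(["urn:catalog:X:Y:1", " urn:catalog:X:Y:2 ", "urn:catalog:X:Y:1"])
print(value)
```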

Sometimes, it would be really convenient to have information merged from related records. Would that be a possibility?

We wouldn’t want to automate any merging of record information. We could explore making the relationships accessible in a way that would enable users to do this type of merging themselves if they wanted to.

How does the system handle multilingual records when performing data clustering?

We have avoided using fields with multilingual values. The only fields used are coordinates, dates, identifiers and Latin names. There is no full text analysis. Handling fields with multilingual values would be a challenge.

I was working with a new data provider. We tested publishing on the IPT but we couldn’t delete the dataset afterwards. The dataset wasn’t registered on GBIF. Is this a known issue or should I log a GitHub issue?

Please log a GitHub issue on the IPT GitHub repository. Note that the issue was logged here.

Where is the list of all the Technical Support Hour for Nodes subjects so far?

You should have them listed in your email invitation. The support hours have so far covered the following topics:

  • During the January support hour, we introduced the Registry and how a dataset could be debugged by checking the different interpretation steps in the crawl and ingestion history tabs. You can find the questions and answers on the discourse thread for the topic or see the recorded practical session here.
  • In the February support hour, the helpdesk team gave a brief overview of the technical components of GBIF, and how they all contribute to the content on GBIF.org. The questions and answers from the support hour can be found in this discourse thread and you can view the recording of the practical session here.
  • In the March support hour, the Data Products team demonstrated how to install, update and back up the IPT, as well as how to update the extensions. Questions and answers for the support hour were captured in this discourse forum post and you can review the video from the practical session here.
  • In the April support hour, no theme was planned but we got a lot of questions, which are available on our discourse forum post.
  • In the May support hour, the Data Products team showed how to find and generate statistics and graphics for reporting on publishing and usage activities by Nodes, publishers, and projects. The video is available here and the questions and answers are available on the discourse forum post.
  • The theme for the June support hour was The Global Registry of Scientific Collections. You can find the video here and the Q&A on discourse.
  • In the September support hour, Daniel from the Communication team explained how GBIF tracks literature. The video is available here and the transcribed Q&A is on discourse.
  • During the October support hour, the Data Products team explained what occurrenceID stability means and how we try to improve it. The video is available here and the Q&A on discourse.
