Data Use Club practical session: GBIF species occurrence cubes

larussell · May 22, 2025, 1:36pm

In case you missed the last Data Use Club session on GBIF species occurrence cubes, you can visit the session page to watch the full recording of the session and check out the PDFs of the presentations.

The recording features Andrew Rodrigues introducing the GBIF Cube Download Service and Lina Estupinan Suarez presenting From Species Occurrence to Essential Biodiversity Variables Cubes: results from the B-Cubed project.

Below, we’ve summarized some of the questions from the session and included the links that were shared. We hope you use this thread to continue the discussion.

Thanks for a great session!

Links

https://www.gbif.org/occurrence-cubes

Spatial reference grids

Question

With regard to spatial aggregation on hexagon grids: you chose ISEA3H grid cells as the spatial unit. Did you consider H3 (Uber)? Does ISEA3H have specific advantages? Does ISEA3H have globally unique and stable identifiers?

Answer

Regarding hexagonal grids: we implemented ISEA3H as it’s an Open Geospatial Consortium (OGC) standard and was recommended by a paper on choosing grids for global analysis.

A second paper describes the identifier scheme used. “Globally unique identifiers” can mean unique across all identifiers ever, like 50c9509d-22c7-4a22-a47d-8c48425ef4a7. The ISEA3H identifiers are just numbers, so they aren’t globally unique in this computing/software sense. In the other sense, of each hexagon/pentagon on the globe having a unique id, they are unique.

See GBIF documentation on ISEA3H Grid-cell code GBIF_ISEA3HCode.

—-

Question

Is there planned future functionality to allow for custom grids for the occurrence aggregation?

Answer

We have an example of a user-defined grid and in the future aim to provide more documentation on how to do this.

Data quality

Question

What about occurrence coordinates that don’t fall within a single grid cell, for example when the coordinate uncertainty covers multiple cells? Wouldn’t it be better to remove those coordinates? Won’t random placement affect the modelling? Has anyone studied the effects of random placement within the coordinate uncertainty on the outcomes of any modelling?

Answer

Here is a deeper dive into the random placement:

Langeraert, Ward, Wissam Barhdadi, Dimitri Brosens, Rocìo Cortès, Peter Desmet, Michele Di Musciano, Chandra Earl et al. Unveiling ecological dynamics through simulation and visualization of biodiversity data cubes. No. vcyr7_v1. Center for Open Science, 2024.
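To make the idea of random placement concrete, here is a minimal sketch. It is not GBIF's implementation: the square lat/lon grid, the `grid_cell` helper, the example coordinates, and the flat metres-per-degree conversion are all assumptions made for illustration. It samples points uniformly from the uncertainty circle of one record and shows how the record can land in different cells from draw to draw.

```python
import math
import random

METERS_PER_DEGREE = 111_320.0  # rough length of one degree of latitude (assumption)

def random_point_in_uncertainty(lat, lon, uncertainty_m, rng):
    """Sample a point uniformly from the uncertainty circle around (lat, lon)."""
    # Taking the square root keeps the density uniform over the disc area.
    r = uncertainty_m * math.sqrt(rng.random())
    theta = rng.uniform(0.0, 2.0 * math.pi)
    dlat = (r * math.sin(theta)) / METERS_PER_DEGREE
    dlon = (r * math.cos(theta)) / (METERS_PER_DEGREE * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

def grid_cell(lat, lon, cell_deg):
    """Index of the square cell of a hypothetical lat/lon grid containing the point."""
    return math.floor(lat / cell_deg), math.floor(lon / cell_deg)

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# A made-up record close to a cell corner, with 5 km uncertainty, on a 0.1 degree grid.
lat, lon, uncertainty_m = 50.099, 4.001, 5_000.0
cells = {}
for _ in range(1_000):
    point = random_point_in_uncertainty(lat, lon, uncertainty_m, rng)
    cell = grid_cell(*point, 0.1)
    cells[cell] = cells.get(cell, 0) + 1

# Number of draws that fell in each cell.
print(cells)
```

Because the uncertainty circle straddles several cells, repeated draws spread the record over its neighbours. That spread is the effect the question is asking about.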

—-

Question

How would a multi-year date range be handled in a cube? Is it assigned to the earliest or most recent?

Answer

For a range of 2022 to 2024, for example, the cube would contain aggregated counts for each year, i.e. the number of occurrences in 2022, in 2023 and in 2024.
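As a rough illustration of what "aggregated counts for each year" means, the sketch below counts made-up records per species and year. The records and the `yearly_counts` helper are hypothetical; a real cube is produced by the GBIF download service, not by code like this.

```python
from collections import Counter

# Hypothetical occurrence records as (species, year) pairs, made up for illustration.
occurrences = [
    ("Vulpes vulpes", 2022),
    ("Vulpes vulpes", 2022),
    ("Vulpes vulpes", 2023),
    ("Vulpes vulpes", 2024),
    ("Vulpes vulpes", 2024),
    ("Vulpes vulpes", 2024),
]

def yearly_counts(records, first_year, last_year):
    """Count occurrences per (species, year) within an inclusive year range."""
    return Counter(
        (species, year)
        for species, year in records
        if first_year <= year <= last_year
    )

cube = yearly_counts(occurrences, 2022, 2024)
for (species, year), n in sorted(cube.items()):
    print(species, year, n)
```

Each year in the range gets its own row, so nothing is assigned to the earliest or most recent year.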

—-

Question

If an observation was made some time between 2022 and 2024, I suppose it would be randomly assigned to a temporal grid cell (year, or year+month…) analogous to how it’s done spatially? For example, an occurrence with eventDate 2005/2007.

Answer

The default query to generate the cubes ensures the included occurrences have sufficient date resolution to fit in the required dimension, so a year cube needs occurrences with a specific year (year != null). A year-month cube needs the month not to be null, so an occurrence from 2021-01-01/30 (30 days in January) is included, although 2021-01-31/2021-02-01 (2 days) is excluded.

Users could adjust this if they want to, e.g. to assign wider ranges to a random day within that range. We haven’t provided an SQL function to make this easier.
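The two behaviours described above can be sketched as follows. This only mimics the effect of the default filter and of a possible user adjustment. The function names are hypothetical and this is not the SQL that GBIF runs.

```python
import random
from datetime import date, timedelta

def fits_year_month(start: date, end: date) -> bool:
    """True if the whole date range lies within a single calendar month."""
    return (start.year, start.month) == (end.year, end.month)

def random_day_in_range(start: date, end: date, rng: random.Random) -> date:
    """Pick one day uniformly at random from an inclusive date range."""
    span_days = (end - start).days
    return start + timedelta(days=rng.randint(0, span_days))

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# 2021-01-01/30 stays inside January, so it fits a year-month cube.
print(fits_year_month(date(2021, 1, 1), date(2021, 1, 30)))  # True

# 2021-01-31/2021-02-01 spans two months, so it would be excluded.
print(fits_year_month(date(2021, 1, 31), date(2021, 2, 1)))  # False

# A wider range such as 2005/2007 could instead be assigned a random day.
assigned = random_day_in_range(date(2005, 1, 1), date(2007, 12, 31), rng)
print(assigned.isoformat())
```

Random assignment keeps more records in the cube, but it adds noise to the temporal dimension, so whether it is appropriate depends on the analysis.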

—-

Question

Are data cleaning algorithms applied in the process of creating species occurrence cubes?

Answer

Yes, at least for the cubes generated through the UI forms. All fields, verbatim and interpreted, are present in the SQL API, so the user could choose to ignore GBIF’s data cleaning, although that is likely to cause problems when aggregating – “3 March 1999” needs to be turned into 1999-03-03 so it groups with other occurrences recorded on that day.
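A small sketch of why interpreted dates matter for aggregation: the verbatim strings below all describe the same day, yet they only group together after normalization. The list of formats and the `normalize` helper are assumptions for illustration and do not reflect GBIF's actual date parser.

```python
from collections import Counter
from datetime import datetime

def normalize(verbatim: str) -> str:
    """Parse a few hypothetical verbatim date formats into ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%d %B %Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(verbatim, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {verbatim!r}")

# Made-up verbatim values that all refer to the same day.
verbatim_dates = ["3 March 1999", "1999-03-03", "03/03/1999"]

raw_groups = Counter(verbatim_dates)
clean_groups = Counter(normalize(d) for d in verbatim_dates)

print(len(raw_groups))     # 3 separate groups when grouping on verbatim text
print(dict(clean_groups))  # {'1999-03-03': 3}
```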

—-

Question

Considering occurrence counts, if applicable, how reliable is the “is_in_cluster” parameter?

Answer

The process is described here: Occurrence clustering :: Technical Documentation

The way it’s run, clusters can take a few days to be added to records. I recommend testing how many clustered records are included in your search, and doing some quick checks to see how relevant the clusters are.

Full list of available columns (you will need to click to expand) — the ones starting “v_” are exactly as GBIF received the data from the publisher, the others may have had data quality checks applied, e.g. reformatting dates.

—-

Question

It must be a challenge to take sampling effort into account when tallying alien species counts per country, given that monitoring effort differs considerably from state to state. How is this handled?

Answer

As with all use of GBIF-mediated data, users need to carefully consider sampling effort bias in the datasets and employ appropriate methodologies to deal with those biases. “Invasion trends: An interpretable measure of change is needed to support policy targets” by McGeoch et al. provides useful guidance in the context of invasive alien species.

Downloads

Question

Are there new data usage rules on how to use and cite these cubes in publications?

Answer

All downloads are issued a DOI that allows attribution back to the original data providers, as usual. The change for users is that the underlying occurrence records are not included in the download.

—-

Question

Why is there no attribution in the download itself (in the zip file, like for Darwin Core Archives)? If the download is deleted later on the GBIF server per GBIF’s deletion policy, this information could get lost.

Answer

Here is an example SQL download with data providers. We have an open issue about making the downloads frictionless data or similar, which would probably bring additional recommendations for this type of metadata, but it hasn’t been considered specifically. To add to the earlier point about deletion: we never delete the download metadata. It will always be visible on the download page and through the API. When a download is deleted, only the zip file is removed.

—-

Question

Why not GeoParquet instead of netCDF?

Answer

The hierarchical structure of the netCDF format is better suited to the custom structure we developed for the EBVs. GeoParquet support is likely to be added, but there is no timeline yet.

Absence data

Question

Is absence data also available through cubes?

Answer

Yes, you can either filter for it (occurrenceStatus = ‘PRESENT’ or ‘ABSENT’) or add it as a dimension to the cube.
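The difference between the two options can be sketched on a handful of made-up records. The tuples below are hypothetical and stand in for rows of a download. Only the `occurrenceStatus` values `PRESENT` and `ABSENT` come from the answer above.

```python
from collections import Counter

# Hypothetical records as (species, year, occurrenceStatus), made up for illustration.
records = [
    ("Lynx lynx", 2023, "PRESENT"),
    ("Lynx lynx", 2023, "PRESENT"),
    ("Lynx lynx", 2023, "ABSENT"),
    ("Lynx lynx", 2024, "ABSENT"),
]

# Option 1: filter on the status first, then aggregate by species and year.
absences = Counter(
    (species, year) for species, year, status in records if status == "ABSENT"
)

# Option 2: keep the status as an extra dimension of the cube.
by_status = Counter(records)

print(dict(absences))
print(dict(by_status))
```

Filtering yields a smaller cube for one status only, while the extra dimension keeps presences and absences side by side for the same species and year.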

Essential Biodiversity Variables (EBV)

Question

What is the relationship between the EBV portal and that of GBIF?

Answer

GBIF is providing raw occurrences. The EBV Data Portal provides results of models or indicators. Not all the datasets in the EBV Data Portal come from GBIF data. The EBV Data Portal includes a variety of EBV raster datasets. You can also upload your own EBV dataset for sharing with others.

—

Question

Are there any links for forest related data?

Answer

GEDI Lidar

The Global Ecosystem Dynamics Investigation (GEDI) instrument is a full-waveform lidar installed on the International Space Station that produces detailed observations of the 3D structure of Earth’s surface. GEDI’s three lasers precisely measure forest canopy height, canopy vertical structure, and surface elevation. By accurately measuring forests in 3D, GEDI data play an important role in understanding the amounts of biomass and carbon forests store and how much they lose when disturbed.

Forest loss from 2000 to 2020

This EBV dataset is based entirely on the time series analysis developed by Prof. Matthew Hansen and colleagues (2013) in version 1.8, which examines the global Landsat archive at a spatial resolution of 30 meters to characterize global forest extent and change from 2000 through 2020. In this EBV dataset we focus on “Forest Cover loss”, defined as a stand-replacement disturbance, or a change from a forest to non-forest state. The original data from Hansen et al. (2013) were processed using a factor of 15 to aggregate to a new spatial resolution of 450 x 450 meters and are structured in the multidimensional and standard format proposed by the GEO BON community for the Essential Biodiversity Variables.

Global forest cover 2000

Mean percentage of global tree canopy cover in the year 2000, defined as canopy closure for all vegetation taller than 5m in height. The original dataset from Hansen et al. (2013) has been aggregated to a cell size of 900m.

—

Question

Are species distribution models EBV?

Answer

Building essential biodiversity variables (EBVs) of species distribution and abundance at a global scale. Kissling et al. 2018

—-

Question

Would it be possible to add information on the genomics data used to estimate the Genetic composition EBV?

Answer

If the genetic information has a spatial component, it can be saved as an EBV Cube. If it does not have a spatial component, it can still be an EBV, but cannot be an EBV Cube.

—

Question

Are all climate data on WorldClim EBV?

Answer

What makes it an EBV is if you use the climate data as an input to create a species distribution model, for example. So climate per se is not an EBV, but when you combine it with biodiversity data, then you have a product that is an EBV.

Wolfgang · May 23, 2025, 8:47am

Thanks again for the session and for providing detailed answers in the forum.
The links to tech docs and papers are much appreciated.
