In case you missed the last Data Use Club session on GBIF species occurrence cubes, you can visit the session page to watch the full recording of the session and check out the PDFs of the presentations.
The recording features Andrew Rodrigues introducing the GBIF Cube Download Service and Lina Estupinan Suarez presenting “From Species Occurrence to Essential Biodiversity Variables Cubes: results from the B-Cubed project”.
Below, we’ve summarized some of the questions from the session and provided the pertinent links shared. We hope you use this thread to continue the discussion.
Thanks for a great session!
Links
Spatial reference grids
Question
With regard to spatial aggregation on hexagon grids: you chose ISEA3H grid cells as the spatial unit. Did you consider H3 (Uber)? Does ISEA3H have specific advantages? Does ISEA3H have globally unique and stable identifiers?
Answer
Regarding hexagonal grids: we implemented ISEA3H as it’s an Open Geospatial Consortium (OGC) standard and was recommended by a paper on choosing grids for global analysis.
A second paper describes the identifier scheme used. “Globally unique identifiers” can mean unique across all identifiers ever, like 50c9509d-22c7-4a22-a47d-8c48425ef4a7. The ISEA3H identifiers are just numbers, so they aren’t globally unique in this computing/software sense. In the other sense, of each hexagon/pentagon on the globe having a unique id, they are unique.
See GBIF documentation on ISEA3H Grid-cell code GBIF_ISEA3HCode.
—-
Question
Is there planned future functionality to allow for custom grids for the occurrence aggregation?
Answer
We have an example of a user-defined grid and in the future aim to provide more documentation on how to do this.
Data quality
Question
What about occurrence coordinates that don’t fall within a single grid cell, for example when the coordinate uncertainty covers multiple grid cells? Wouldn’t it be better to remove those coordinates? Won’t random placement affect the modelling? Has anyone studied the effects of random placement within the coordinate uncertainty on the outcomes of any modelling?
Answer
Here is a deeper dive into the random placement:
Langeraert, Ward, Wissam Barhdadi, Dimitri Brosens, Rocìo Cortès, Peter Desmet, Michele Di Musciano, Chandra Earl et al. Unveiling ecological dynamics through simulation and visualization of biodiversity data cubes. No. vcyr7_v1. Center for Open Science, 2024.
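To make the idea of random placement concrete, here is a minimal Python sketch. It assumes projected coordinates in metres and a plain square grid. The function names `random_point_in_uncertainty` and `square_cell` are hypothetical, and this is not the code behind GBIF's grid functions.

```python
import math
import random

def random_point_in_uncertainty(x, y, uncertainty_m, rng):
    """Pick a point uniformly at random inside the uncertainty circle."""
    # Taking the square root of the radius draw keeps the density
    # uniform over the circle's area instead of clustering at the centre.
    r = uncertainty_m * math.sqrt(rng.random())
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return x + r * math.cos(theta), y + r * math.sin(theta)

def square_cell(x, y, cell_size_m):
    """Return the (column, row) index of the square cell containing the point."""
    return math.floor(x / cell_size_m), math.floor(y / cell_size_m)

rng = random.Random(42)  # fixed seed so the assignment is reproducible

# Hypothetical occurrence near a cell corner, with 300 m coordinate uncertainty.
x, y, uncertainty_m = 1950.0, 1950.0, 300.0
px, py = random_point_in_uncertainty(x, y, uncertainty_m, rng)
print(square_cell(px, py, 1000.0))
```

Repeating the draw with different seeds and re-running an analysis is one simple way to check how sensitive a result is to the placement.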
—-
Question
How would a multi-year date range be handled in a cube? Is it assigned to the earliest or most recent?
Answer
If you had a range of 2022-2024, for example, you would get aggregated counts for each year, i.e. the number of occurrences in each of 2022, 2023 and 2024.
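As a rough illustration of per-year aggregation, the sketch below counts a handful of made-up records by year. The record layout and the helper `counts_per_year` are assumptions for this example, not part of the cube service.

```python
from collections import Counter

# Hypothetical occurrence records, each with a single interpreted year.
occurrences = [
    {"speciesKey": 1, "year": 2021},
    {"speciesKey": 1, "year": 2022},
    {"speciesKey": 1, "year": 2022},
    {"speciesKey": 1, "year": 2023},
    {"speciesKey": 1, "year": 2024},
    {"speciesKey": 1, "year": 2024},
    {"speciesKey": 1, "year": 2024},
]

def counts_per_year(records, start_year, end_year):
    """Count occurrences per year for years inside the requested range."""
    counts = Counter(
        r["year"] for r in records if start_year <= r["year"] <= end_year
    )
    return dict(sorted(counts.items()))

print(counts_per_year(occurrences, 2022, 2024))  # {2022: 2, 2023: 1, 2024: 3}
```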
—-
Question
If an observation was made some time between 2022 and 2024, I suppose it would be randomly assigned to a temporal grid cell (year, or year+month…) analogous to how it’s done spatially? For example, an occurrence with eventDate 2005/2007.
Answer
The default query to generate the cubes ensures the included occurrences have sufficient date resolution to fit in the required dimension, so a year cube needs occurrences with a specific year (year != null). A year-month cube needs the month not to be null, so an occurrence from 2021-01-01/30 (30 days in January) is included, although 2021-01-31/2021-02-01 (2 days) is excluded.
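The effect of that filter can be sketched in Python. This mimics the outcome described above (a date range must sit inside one year, or one month, to be included) and is not GBIF's actual date interpretation logic. Both function names are hypothetical.

```python
from datetime import date

def fits_year_cube(start, end):
    """A date range fits a year cube if it stays within one calendar year."""
    return start.year == end.year

def fits_year_month_cube(start, end):
    """A date range fits a year-month cube if it stays within one month."""
    return (start.year, start.month) == (end.year, end.month)

# 2021-01-01/30 stays inside January, so it fits a year-month cube.
print(fits_year_month_cube(date(2021, 1, 1), date(2021, 1, 30)))  # True
# 2021-01-31/2021-02-01 crosses a month boundary, so it does not.
print(fits_year_month_cube(date(2021, 1, 31), date(2021, 2, 1)))  # False
# The same two-day range still fits a year cube.
print(fits_year_cube(date(2021, 1, 31), date(2021, 2, 1)))  # True
# A 2005/2007 range fits neither.
print(fits_year_cube(date(2005, 1, 1), date(2007, 12, 31)))  # False
```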
Users could adjust this if they want to, e.g. to assign wider ranges to a random day within that range. We haven’t provided an SQL function to make this easier.
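For users who want to try this outside SQL, a minimal Python sketch of assigning a date range to a random day could look like the following. The helper `random_day_in_range` is hypothetical, and a uniform draw over the range is an assumption, not a GBIF recommendation.

```python
import random
from datetime import date, timedelta

def random_day_in_range(start, end, rng):
    """Pick one day uniformly at random from start to end, both inclusive."""
    span_days = (end - start).days
    return start + timedelta(days=rng.randint(0, span_days))

rng = random.Random(0)  # fixed seed so the assignment is reproducible

# Hypothetical occurrence with eventDate 2005/2007.
assigned = random_day_in_range(date(2005, 1, 1), date(2007, 12, 31), rng)
print(assigned.isoformat(), "-> year cell", assigned.year)
```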
—-
Question
Are data cleaning algorithms applied in the process of creating species occurrence cubes?
Answer
Yes, at least for the cubes generated through the UI forms. All fields, verbatim and interpreted, are present in the SQL API, so the user could choose to ignore GBIF’s data cleaning, although that is likely to cause problems when aggregating – “3 March 1999” needs to be turned into 1999-03-03 so it groups with other occurrences recorded on that day.
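The grouping problem with verbatim dates can be shown with a small Python sketch. The two formats handled here are illustrative only; GBIF's own date interpretation covers far more cases, and `to_iso_date` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime

# Illustrative formats only; real verbatim dates are far more varied.
# "%B" matches English month names under the default C locale.
KNOWN_FORMATS = ["%Y-%m-%d", "%d %B %Y"]

def to_iso_date(verbatim):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(verbatim, fmt).date().isoformat()
        except ValueError:
            continue
    return None

verbatim_dates = ["3 March 1999", "1999-03-03", "4 March 1999"]

raw_groups = Counter(verbatim_dates)
clean_groups = Counter(to_iso_date(d) for d in verbatim_dates)

print(len(raw_groups))             # 3: every verbatim string is its own group
print(clean_groups["1999-03-03"])  # 2: the same day now groups together
```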
—-
Question
Considering occurrence counts, if applicable, how reliable is the “is_in_cluster” parameter?
Answer
The process is described here: Occurrence clustering :: Technical Documentation
The way it’s run, clusters can take a few days to be added to records. I recommend testing how many clustered records are included in your search, and doing some quick checks to see how relevant the clusters are.
Full list of available columns (you will need to click to expand) — the ones starting “v_” are exactly as GBIF received the data from the publisher, the others may have had data quality checks applied, e.g. reformatting dates.
—-
Question
It must be a challenge to take sampling effort into account when tallying alien species counts per country, since monitoring effort is quite different from state to state?
Answer
As with all use of GBIF-mediated data, users need to carefully consider sampling effort bias in the datasets and employ appropriate methodologies to deal with those biases. “Invasion trends: An interpretable measure of change is needed to support policy targets” by McGeoch et al. provides useful guidance in the context of invasive alien species.
Downloads
Question
Are there new data usage rules on how to use and cite these cubes in publications?
Answer
All downloads are issued a DOI that allows for attribution back to the original data providers, as usual. The change for users is that the underlying occurrence records are not provided with the cube.
—-
Question
Why is there no attribution in the download itself (in the zip file, like for Darwin Core Archives)? If the download is deleted later on the GBIF server per GBIF’s deletion policy, this information could get lost.
Answer
Here is an example SQL download with data providers. We do have an issue about making the downloads frictionless data or similar, which probably has additional recommendations for this type of metadata, but it hasn’t been considered specifically. To add to the earlier point about download deletion: we never delete the download metadata, so it will always be visible on the download page and through the API. When downloads are deleted, only the zip file is deleted.
—-
Question
Why not GeoParquet instead of netCDF?
Answer
The hierarchical structure of the netCDF format is better suited to the custom structure we developed for the EBVs. GeoParquet support is likely to happen, but it is not yet known when.
Absence data
Question
Is absence data also available through cubes?
Answer
Yes, you can either filter for it (occurrenceStatus = ‘PRESENT’ or ‘ABSENT’) or add it as a dimension to the cube.
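The difference between the two options can be sketched with a few made-up records in Python. The field names follow the terms used above, while the `speciesKey` values and both helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical records; speciesKey values are invented for illustration.
records = [
    {"speciesKey": 101, "year": 2020, "occurrenceStatus": "PRESENT"},
    {"speciesKey": 101, "year": 2020, "occurrenceStatus": "PRESENT"},
    {"speciesKey": 101, "year": 2020, "occurrenceStatus": "ABSENT"},
    {"speciesKey": 202, "year": 2021, "occurrenceStatus": "ABSENT"},
]

def cube_filtered(rows, status):
    """Option 1: keep one status only, then count per (speciesKey, year)."""
    return Counter(
        (r["speciesKey"], r["year"])
        for r in rows
        if r["occurrenceStatus"] == status
    )

def cube_with_status_dimension(rows):
    """Option 2: keep all rows and add occurrenceStatus as a dimension."""
    return Counter(
        (r["speciesKey"], r["year"], r["occurrenceStatus"]) for r in rows
    )

print(cube_filtered(records, "PRESENT"))
print(cube_with_status_dimension(records))
```

With the status as a dimension, presence and absence counts stay side by side in one cube, which keeps them from being summed together by accident.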
Essential Biodiversity Variables (EBV)
Question
What is the relationship between the EBV portal and that of GBIF?
Answer
GBIF is providing raw occurrences. The EBV Data Portal provides results of models or indicators. Not all the datasets in the EBV Data Portal come from GBIF data. The EBV Data Portal includes a variety of EBV raster datasets. You can also upload your own EBV dataset for sharing with others.
—
Question
Are there any links for forest related data?
Answer
The Global Ecosystem Dynamics Investigation (GEDI) instrument is a full-waveform lidar installed on the International Space Station that produces detailed observations of the 3D structure of Earth’s surface. GEDI’s three lasers precisely measure forest canopy height, canopy vertical structure, and surface elevation. By accurately measuring forests in 3D, GEDI data play an important role in understanding the amounts of biomass and carbon forests store and how much they lose when disturbed.
This EBV dataset is based entirely on the time series analysis developed by Prof. Matthew Hansen and colleagues (2013) in version 1.8, which examines the global Landsat archive at a spatial resolution of 30 meters to characterize global forest extent and change from 2000 through 2020. In this EBV dataset we focus on “Forest Cover loss”, defined as a stand-replacement disturbance, or a change from a forest to non-forest state. The original data from Hansen et al. (2013) were processed using a factor of 15 to aggregate to a new spatial resolution of 450 x 450 meters and are structured in the multidimensional and standard format proposed by the GEO BON community for the Essential Biodiversity Variables.
Mean percentage of global tree canopy cover in the year 2000, defined as canopy closure for all vegetation taller than 5m in height. The original dataset from Hansen et al. (2013) has been aggregated to a cell size of 900m.
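For the forest cover loss dataset, aggregating by a factor of 15 means each new cell covers a 15 x 15 block of the original 30 m cells, and 30 m x 15 = 450 m. The Python sketch below shows this on a toy grid. Averaging each block (here, the fraction of loss cells) is an assumption for illustration, since the statistic used for the published dataset is not stated here, and `aggregate_blocks` is a hypothetical helper.

```python
def aggregate_blocks(grid, factor):
    """Average non-overlapping factor x factor blocks of a 2D list."""
    rows, cols = len(grid), len(grid[0])
    out = []
    # Any incomplete blocks at the right and bottom edges are dropped.
    for r0 in range(0, rows - rows % factor, factor):
        out_row = []
        for c0 in range(0, cols - cols % factor, factor):
            block = [
                grid[r][c]
                for r in range(r0, r0 + factor)
                for c in range(c0, c0 + factor)
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out

SOURCE_RES_M = 30
FACTOR = 15

# Toy 30 x 30 grid of 30 m cells: 1 = forest loss, 0 = no loss.
# Only the top-left 15 x 15 block has loss.
fine = [[1 if (r < 15 and c < 15) else 0 for c in range(30)] for r in range(30)]
coarse = aggregate_blocks(fine, FACTOR)

print(SOURCE_RES_M * FACTOR)  # 450
print(coarse)                 # [[1.0, 0.0], [0.0, 0.0]]
```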
—
Question
Are species distribution models EBV?
Answer
Building essential biodiversity variables (EBVs) of species distribution and abundance at a global scale. Kissling et al. 2018
—-
Question
Would it be possible to add information on the genomics data used to estimate the Genetic composition EBV?
Answer
If the genetic information has a spatial component, it can be saved as an EBV Cube. If it does not have a spatial component, it can still be an EBV, but cannot be an EBV Cube.
—
Question
Are all climate data on WorldClim EBV?
Answer
What makes it an EBV is using the climate data as an input to create, for example, a species distribution model. So climate per se is not an EBV, but when you combine it with biodiversity data, you have a product that is an EBV.