Retrieve randomly sampled occurrence records

Hello everyone!
I am working with the rgbif package in R and was wondering if anyone is familiar with a good way to create randomly sampled sets of occurrence records.
Assuming I already know which dataset I want to look at, this is how I currently create a random (sub-)sample:
I fetch the maximum number of occurrence records possible within a timeframe of 15 minutes, which has proven to be ~17,100 records. I then use base R's sample() function to randomly sample 50 records from this subset of 17,100:

library(rgbif)

# datasetKey is assumed to be defined already
total_records_needed <- 17100
limit <- 300  # records fetched per request (300 is the API maximum)
all_records <- list()
offset <- 0

while (length(all_records) < total_records_needed) {
  paged_records <- occ_search(
    datasetKey = datasetKey,
    mediaType = "StillImage",
    basisOfRecord = "PRESERVED_SPECIMEN",
    limit = limit,
    start = offset
  )$media

  # append the newly fetched batch to the collected records
  all_records <- c(all_records, paged_records)

  # advance the offset to fetch the next batch
  offset <- offset + limit
}

# sample 50 records from everything fetched
sampled_records <- sample(all_records, 50)

This approach is okay-ish, but only pseudo-random, since we can't sample from the whole dataset but only from these 17,100 occurrence records, which are also basically the “first” ones. Many datasets we are looking at consist of more than 1 million entries; fetching all of them is not an option.

I thought maybe there is a way to use the paging system to get a broader range of occurrences to sample from (by “jumping” over entries or something?), or maybe even a way to sample directly?

I’d be thankful for any idea!

This suggestion comes from someone who has not used the GBIF API (through R or any other means), so be skeptical. But…

My first thought is to find a way to ensure randomness AND avoid having to pre-download records only to not use most of them. Having a peek at the GBIF API reference, I see that there is an “Occurrence > key”. This is a simple integer. If one had some idea of the maximum value of this integer (it might take a little probing to find that), then it should be possible to attempt record retrievals based on a random selection of integers within the range of that key (discarding retrieval failures).

By using a random process on top of the GBIF dataset, it should not matter if records are non-randomly distributed within the dataset (which I’m guessing they are: probably clustered by submission). That would allow you to retrieve randomly selected records without having to pre-retrieve records.
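
In rgbif terms that might look something like the untested sketch below; max_key is just a placeholder for whatever upper bound the probing suggests, and I'm assuming occ_get() errors on keys that don't exist.

library(rgbif)

# Placeholder: rough upper bound on occurrence keys, to be estimated by probing
max_key <- 5e9

# Draw random candidate keys, attempt retrieval, and discard failures
# (i.e. keys that don't correspond to an existing record)
candidate_keys <- sample.int(max_key, 500)
hits <- lapply(candidate_keys, function(k) {
  tryCatch(occ_get(k), error = function(e) NULL)
})
hits <- Filter(Negate(is.null), hits)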


My mind went to the same place as @pentcheff. However, I’m not sure I quite grasp what you’re trying to do. Are you trying to randomly sample (1) the entirety of GBIF, (2) a single dataset, or (3) the results of a GBIF query?

If it’s #1, then it’s probably best to work with the GBIF helpdesk to obtain a file of all the gbifIds and any corresponding information you need. For #2 and #3, here is an example of how you could use the SQL API to get the gbifIds and then randomly sample from them. The initial SQL query might take a few minutes, but there should be no unusual bottlenecks after that.

library(rgbif)
library(httr2)
library(dplyr)

# Create download using httr2 and R-built SQL statement

  SQL_statement <- paste0("SELECT ",
                          paste(c('datasetKey', 'gbifId'), collapse = ", "),
                          " FROM occurrence",
                          " WHERE eventDate > '2014-01-01'",
                          " AND eventDate < '2015-01-01'",
                          " AND occurrenceStatus = 'PRESENT'")

#request download using SQL_statement created above

  request(base_url = "https://api.gbif.org/v1/occurrence/download/request") %>% 
    req_auth_basic(username = rstudioapi::askForPassword(prompt = "enter your GBIF username"), 
                   password = rstudioapi::askForPassword(prompt = "enter your GBIF password")) %>% 
    req_body_json(list("sendNotification" = TRUE,
                       "format" = "SQL_TSV_ZIP",
                       "sql" = SQL_statement)
    ) %>% 
    req_perform()

#download TSV from DOI linked to my user account: https://doi.org/10.15468/dl.jtee6p

  t <- tempdir()
  occ_download_get(key = "0013070-241126133413365", path = t) #got this key from the download page url: https://www.gbif.org/occurrence/download/0013070-241126133413365
  
  gbif_ids <- readr::read_tsv(list.files(path = t, pattern = "*.zip", full.names = TRUE))

#get 5 random gbifIds from desired dataset

  id_to_download <- gbif_ids %>% 
    filter(datasetkey == "df8e3fb8-3da7-4104-a866-748f6da20a3c") %>% 
    slice_sample(n = 5) %>% 
    pull(gbifid)

#perform API call for each ID and then bind into one data frame

  lapply(id_to_download, function(x){
    
    occ_search(gbifId = x) %>% 
      .[["data"]]

  }) %>% 
  bind_rows()

If you are only interested in a single dataset, I think you’d be better off fetching the dataset locally and using something like dplyr::slice_sample() on it directly. This will be by far the fastest way to do it.
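
Roughly like the untested sketch below; datasetKey is a placeholder for the dataset of interest, and rgbif needs your GBIF credentials (GBIF_USER, GBIF_PWD, GBIF_EMAIL) set up for downloads.

library(rgbif)
library(dplyr)

# Request a download of the full dataset, wait for GBIF to prepare it,
# import it into R and sample 50 rows locally
dl <- occ_download(
  pred("datasetKey", datasetKey),
  pred("basisOfRecord", "PRESERVED_SPECIMEN")
)
occ_download_wait(dl)

occ <- occ_download_get(dl) %>% 
  occ_download_import()

sampled <- occ %>% slice_sample(n = 50)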

But since it sounds like you are interested in sampling millions and millions of rows, I’d use the monthly Parquet snapshot; there is an excellent guide here: Using Apache Arrow and Parquet with GBIF-mediated occurrences - GBIF Data Blog

Example for the iNaturalist Research Grade observations dataset, which is just over 100M records. This might take a while depending on the speed of your internet connection, but you don’t need to store the whole thing locally this way.

library(arrow)
library(dplyr)

gbif_snapshot <- "s3://gbif-open-data-eu-central-1/occurrence/2024-11-01/occurrence.parquet"
occ <- open_dataset(gbif_snapshot)

occ %>% 
  filter(
    basisofrecord == "PRESERVED_SPECIMEN",
    datasetkey == "50c9509d-22c7-4a22-a47d-8c48425ef4a7"
  ) %>% 
  slice_sample(n = 10, replace = FALSE) %>% 
  collect()

Note that mediatype is stored as an array (as an occurrence can have more than one type of media attached), so it’s more complex to filter on.

If you want to do this over many datasets, dataset by dataset, I’d recommend downloading a local copy of the GBIF snapshot. This will save time and allow you to group_by(datasetkey) and use the grouped version of slice_sample(); slicing by group is currently not supported in {arrow}, but you should be able to get it to work in {sparklyr} or {duckdb}. Storing a local version of the snapshot will really speed things up; it takes around 150 GB of storage if I remember correctly. If that isn’t an option, you can still do it over the network, provided you can muster enough patience.
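
Something along these lines, for example (untested sketch; the snapshot path is a placeholder and it assumes {duckdb} and {dbplyr} are installed so the grouped slice_sample() can be translated to SQL):

library(arrow)
library(dplyr)
library(duckdb)

# Placeholder path to a local copy of the monthly snapshot
local_snapshot <- "/data/gbif/occurrence.parquet"

open_dataset(local_snapshot) %>% 
  to_duckdb() %>%   # hand the Arrow dataset to DuckDB without copying it
  filter(basisofrecord == "PRESERVED_SPECIMEN") %>% 
  group_by(datasetkey) %>% 
  slice_sample(n = 50) %>%   # grouped sampling, translated to SQL by dbplyr
  collect()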

Let me know if you need any help!

Thank you! To answer your question, I guess (3) fits best.
We have a list of datasetKeys and we are trying to get an “overview” of what the datasets’ images look like. So the idea is to work through each dataset, apply the filters, randomly pick 50 occurrence records, and download the associated images.
We are trying to avoid having to save data for all occurrence records of a dataset before sampling.
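
For the image download step, something simple like the sketch below is what I have in mind; media_urls stands for a character vector of image URLs (e.g. the identifier field of the sampled media entries).

# Download each sampled image into a local folder; using a fixed .jpg
# extension is a simplification, real media can come in other formats
dir.create("sampled_images", showWarnings = FALSE)

for (i in seq_along(media_urls)) {
  destfile <- file.path("sampled_images", paste0("img_", i, ".jpg"))
  try(download.file(media_urls[i], destfile, mode = "wb"), silent = TRUE)
}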

I am not familiar with SQL, but if I see it correctly, your proposed approach downloads the whole dataset first and then samples from it, right? That is what we are really trying to avoid. The datasets we want to inspect range from just a few up to more than 5 million occurrence records with associated image files.

I like the idea, thank you! I think working with the maximum value of the keys is helpful when sampling the whole database. But maybe this could work by creating a list/table of the dataset’s occurrence keys (so saving only the keys), sampling that list, and then fetching the records based on the sampled keys.
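
Something like this untested sketch is what I have in mind (datasetKey defined as before; if I read the docs correctly, the search API caps paging at offset + limit = 100,000, so this still would not cover the very largest datasets):

library(rgbif)

# Collect only the occurrence keys, page by page
keys <- c()
offset <- 0
repeat {
  page <- occ_search(
    datasetKey = datasetKey,
    mediaType = "StillImage",
    basisOfRecord = "PRESERVED_SPECIMEN",
    fields = c("key"),
    limit = 300,
    start = offset
  )$data
  if (is.null(page) || nrow(page) == 0) break
  keys <- c(keys, page$key)
  offset <- offset + 300
}

# Sample 50 keys and fetch only those records
sampled_keys <- sample(keys, 50)
sampled_records <- occ_get(sampled_keys)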

I will try that and let you know!

This looks promising! Thank you, I will give it a try!

@pentcheff @sformel @pieter

This is very valuable advice indeed. Here’s some more context about the use case @usersarah and I work on:

What we are really interested in are the media associated with occurrence records for certain taxa. We want to initially evaluate the media files for a number of criteria manually, based on a random sample of occurrence records with media from a given dataset. The data set enters the equation because digitization / observation methodology, data sources and licenses for the media files vary among data providers.

We settled on a sample size of 50 occurrence records from a given dataset (or rather, the media associated with them) for pragmatic reasons for now, but the sample size is essentially just a parameter. What matters more is obtaining a random sample from the dataset.

As we would like to do this systematically for taxa higher up in the hierarchy as well, we are dealing with many datasets, including very large ones.

The goal then is to have, for a considerable number of datasets, a local copy of the media associated with a random sample of occurrence records from each dataset, including relevant metadata to document the origin of the media.