Retrieve randomly sampled occurrence records

Hello everyone!
I am working with the rgbif package in R and was wondering if anyone is familiar with a good way to create randomly sampled sets of occurrence records.
Assuming I already know which dataset I want to look at, this is how I currently create a random (sub-)sample:
I fetch the maximum number of occurrence records possible within a timeframe of 15 minutes, which has proven to be ~17,100 records. I then use base R's sample() function to randomly sample 50 records from this subset of 17,100:

library(rgbif)

# datasetKey is assumed to be defined already
total_records_needed <- 17100
limit <- 300  # records fetched per request (300 is the API maximum)
all_records <- list()
offset <- 0

while (length(all_records) < total_records_needed) {
  paged_records <- occ_search(
    datasetKey = datasetKey,
    mediaType = "StillImage",
    basisOfRecord = "PRESERVED_SPECIMEN",
    limit = limit,
    start = offset
  )$media

  # append the newly fetched batch to the collected records
  all_records <- c(all_records, paged_records)

  # advance the offset to fetch the next batch
  offset <- offset + limit
}

# sample 50 records from everything fetched
sampled_records <- sample(all_records, 50)

This approach is okay-ish, but only pseudo-random, since we can't sample from the whole dataset but only from these 17,100 occurrence records, which are also basically the “first” ones. Many datasets we are looking at consist of more than 1 million entries; fetching all of them is not an option.

I thought maybe there is a way to use the paging system to get a broader range of occurrences to sample from (by “jumping” over entries or something?), or maybe even a way to sample directly?

I’d be thankful for any idea!

This suggestion comes from someone who has not used the GBIF API (through R or any other means), so be skeptical. But…

My first thought is to find a way to ensure randomness AND avoid having to pre-download records only to not use most of them. Having a peek at the GBIF API reference, I see that there is an “Occurrence > key”. This is a simple integer. If one had some idea of the maximum value of this integer (it might take a little probing to find that), then it should be possible to attempt record retrievals based on a random selection of integers within the range of that key (discarding retrieval failures).

By using a random process on top of the GBIF dataset, it should not matter if records are non-randomly distributed within the dataset (which I’m guessing they are: probably clustered by submission). That would allow you to retrieve randomly selected records without having to pre-retrieve records.
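
In rgbif terms that might look something like the untested sketch below; max_key is just a placeholder for whatever upper bound the probing suggests, and I'm assuming occ_get() errors on keys that don't exist.

library(rgbif)

# Placeholder: rough upper bound on occurrence keys, to be estimated by probing
max_key <- 5e9

# Draw random candidate keys, attempt retrieval, and discard failures
# (i.e. keys that don't correspond to an existing record)
candidate_keys <- sample.int(max_key, 500)
hits <- lapply(candidate_keys, function(k) {
  tryCatch(occ_get(k), error = function(e) NULL)
})
hits <- Filter(Negate(is.null), hits)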


My mind went to the same place as @pentcheff. However, I’m not sure I quite grasp what you’re trying to do. Are you trying to randomly sample (1) the entirety of GBIF, (2) a single dataset, or (3) the results of a GBIF query?

If it’s #1, then it’s probably best to work with the GBIF helpdesk to obtain a file of all the gbifIds and any corresponding information you need. For #2 and #3, here is an example of how you could use the SQL API to get the gbifIds and then randomly sample from them. The initial SQL query might take a few minutes, but there should be no unusual bottlenecks after that.

library(rgbif)
library(httr2)
library(dplyr)

# Create download using httr2 and R-built SQL statement

  SQL_statement <- paste0("SELECT ",
                          paste(c('datasetKey', 'gbifId'), collapse = ", "),
                          " FROM occurrence",
                          " WHERE eventDate > '2014-01-01'",
                          " AND eventDate < '2015-01-01'",
                          " AND occurrenceStatus = 'PRESENT'")

#request download using SQL_statement created above

  request(base_url = "https://api.gbif.org/v1/occurrence/download/request") %>% 
    req_auth_basic(username = rstudioapi::askForPassword(prompt = "enter your GBIF username"), 
                   password = rstudioapi::askForPassword(prompt = "enter your GBIF password")) %>% 
    req_body_json(list("sendNotification" = TRUE,
                       "format" = "SQL_TSV_ZIP",
                       "sql" = SQL_statement)
    ) %>% 
    req_perform()

#download TSV from DOI linked to my user account: https://doi.org/10.15468/dl.jtee6p

  t <- tempdir()
  occ_download_get(key = "0013070-241126133413365", path = t) #got this key from the download page url: https://www.gbif.org/occurrence/download/0013070-241126133413365
  
  gbif_ids <- readr::read_tsv(list.files(path = t, pattern = "*.zip", full.names = TRUE))

#get 5 random gbifIds from desired dataset

  id_to_download <- gbif_ids %>% 
    filter(datasetkey == "df8e3fb8-3da7-4104-a866-748f6da20a3c") %>% 
    slice_sample(n = 5) %>% 
    pull(gbifid)

#perform API call for each ID and then bind into one data frame

  lapply(id_to_download, function(x){
    
    occ_search(gbifId = x) %>% 
      .[["data"]]

  }) %>% 
  bind_rows()

If you are only interested in a single dataset, I think you’d be better off fetching the dataset locally and using something like dplyr::slice_sample() on it directly. This will be by far the fastest way to do it.
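
Roughly like the untested sketch below; datasetKey is a placeholder for the dataset of interest, and rgbif needs your GBIF credentials (GBIF_USER, GBIF_PWD, GBIF_EMAIL) set up for downloads.

library(rgbif)
library(dplyr)

# Request a download of the full dataset, wait for GBIF to prepare it,
# import it into R and sample 50 rows locally
dl <- occ_download(
  pred("datasetKey", datasetKey),
  pred("basisOfRecord", "PRESERVED_SPECIMEN")
)
occ_download_wait(dl)

occ <- occ_download_get(dl) %>% 
  occ_download_import()

sampled <- occ %>% slice_sample(n = 50)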

But since it sounds like you are interested in sampling millions and millions of rows, I’d use the monthly Parquet snapshot; there is an excellent guide here: Using Apache Arrow and Parquet with GBIF-mediated occurrences - GBIF Data Blog

Example for the iNaturalist Research Grade observations dataset, which is just over 100M records. This might take a while depending on the speed of your internet connection, but you don’t need to store the whole thing locally this way.

library(arrow)
library(dplyr)

gbif_snapshot <- "s3://gbif-open-data-eu-central-1/occurrence/2024-11-01/occurrence.parquet"
occ <- open_dataset(gbif_snapshot)

occ %>% 
  filter(
    basisofrecord == "PRESERVED_SPECIMEN",
    datasetkey == "50c9509d-22c7-4a22-a47d-8c48425ef4a7"
  ) %>% 
  slice_sample(n = 10, replace = FALSE) %>% 
  collect()

Note that mediatype is stored as an array (as an occurrence can have more than one type of media attached), so it’s more complex to filter on.

If you want to do this over many datasets, dataset by dataset, I’d recommend downloading a local copy of the GBIF snapshot. This will save time and allow you to group_by(datasetkey) and use the grouped version of slice_sample(); slicing by group is currently not supported in {arrow}, but you should be able to get it to work in {sparklyr} or {duckdb}. Storing a local version of the snapshot will really speed things up; it takes around 150 GB of storage if I remember correctly. If that isn’t an option, you can still do it over the network, provided you can muster enough patience.
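
Something along these lines, for example (untested sketch; the snapshot path is a placeholder and it assumes {duckdb} and {dbplyr} are installed so the grouped slice_sample() can be translated to SQL):

library(arrow)
library(dplyr)
library(duckdb)

# Placeholder path to a local copy of the monthly snapshot
local_snapshot <- "/data/gbif/occurrence.parquet"

open_dataset(local_snapshot) %>% 
  to_duckdb() %>%   # hand the Arrow dataset to DuckDB without copying it
  filter(basisofrecord == "PRESERVED_SPECIMEN") %>% 
  group_by(datasetkey) %>% 
  slice_sample(n = 50) %>%   # grouped sampling, translated to SQL by dbplyr
  collect()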

Let me know if you need any help!

Thank you! To answer your question, I guess (3) fits best.
We have a list of datasetKeys and we are trying to get an “overview” of what the datasets’ images look like. So the idea is to work through each dataset, apply the filters, randomly pick 50 occurrence records, and download the associated images.
We are trying to avoid having to save data for all occurrence records of a dataset before sampling.
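
For the image download step, something simple like the sketch below is what I have in mind; media_urls stands for a character vector of image URLs (e.g. the identifier field of the sampled media entries).

# Download each sampled image into a local folder; using a fixed .jpg
# extension is a simplification, real media can come in other formats
dir.create("sampled_images", showWarnings = FALSE)

for (i in seq_along(media_urls)) {
  destfile <- file.path("sampled_images", paste0("img_", i, ".jpg"))
  try(download.file(media_urls[i], destfile, mode = "wb"), silent = TRUE)
}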

I am not familiar with SQL, but if I see it correctly, your proposed approach downloads the whole dataset first and then samples from it, right? That is what we are really trying to avoid. The datasets we want to inspect range from just a few up to more than 5 million occurrence records with associated image files.

I like the idea, thank you! I think working with the maximum value of the keys is helpful when sampling the whole database. But maybe this could work by creating a list/table of the dataset’s occurrence keys (so saving only the keys), sampling that list, and then fetching the records based on the sampled keys.
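
Something like this untested sketch is what I have in mind (datasetKey defined as before; if I read the docs correctly, the search API caps paging at offset + limit = 100,000, so this still would not cover the very largest datasets):

library(rgbif)

# Collect only the occurrence keys, page by page
keys <- c()
offset <- 0
repeat {
  page <- occ_search(
    datasetKey = datasetKey,
    mediaType = "StillImage",
    basisOfRecord = "PRESERVED_SPECIMEN",
    fields = c("key"),
    limit = 300,
    start = offset
  )$data
  if (is.null(page) || nrow(page) == 0) break
  keys <- c(keys, page$key)
  offset <- offset + 300
}

# Sample 50 keys and fetch only those records
sampled_keys <- sample(keys, 50)
sampled_records <- occ_get(sampled_keys)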

I will try that and let you know!

This looks promising! Thank you, I will give it a try!

@pentcheff @sformel @pieter

This is very valuable advice indeed. Here’s some more context about the use case @usersarah and I work on:

What we are really interested in are the media associated with occurrence records for certain taxa. We want to initially evaluate the media files for a number of criteria manually, based on a random sample of occurrence records with media from a given dataset. The data set enters the equation because digitization / observation methodology, data sources and licenses for the media files vary among data providers.

We settled on a sample size of 50 occurrence records from a given dataset (or rather, the media associated with them) for pragmatic reasons for now, but the sample size is essentially just a parameter. What matters more is obtaining a random sample from the dataset.

As we would like to do this systematically for taxa higher up in the hierarchy as well, we are dealing with many datasets, including very large ones.

The goal then is to have, for a considerable number of datasets, a local copy of the media associated with a random sample of occurrence records from each dataset, including relevant metadata to document the origin of the media.