Download occurrences by IDs with rgbif

oleh.prylutskyi · August 1, 2023, 5:17pm

Hi there,
I have a long vector of occurrence IDs (by which I mean GBIF's globally unique identifiers as used in occurrence URLs, e.g. “2992081821”). I would like to download GBIF data for those occurrences, but I did not find a suitable predicate for occ_download(). pred_in("occurrenceId", myvector) seems to match the identifier of the occurrence record as provided by the publisher instead.
Can you help me with that?
Cheers,
Oleh

oleh.prylutskyi · August 1, 2023, 5:19pm

@jwaller @markus @mgrosjean can you give me a piece of advice?

jwaller · August 2, 2023, 2:35pm

Unfortunately it looks like you will have to fetch all of the associated “publisher” occurrenceIds.

I have written an example below.

library(rgbif)
library(dplyr)
library(purrr)

# get some gbif ids for example 
gbif_ids <- occ_search(country="UA",limit=20)$data$key 

# get all of the occurrenceIds (which might take a while if you have a lot). 
occurrenceId <- gbif_ids %>%
  as.numeric() %>%
  map_chr(~ occ_get(.x, fields = "all")[[1]]$data$occurrenceID)

# then send that to occ_download
occ_download(pred_in("occurrenceId", occurrenceId))

the download from the example

oleh.prylutskyi · August 3, 2023, 9:22am

Thank you, John, it works, but I am surprised that there is no way to search by gbifID; it seems the most basic query possible in any database. Is this feature planned for an upcoming release of rgbif?
I see a lot of pitfalls in this approach: what if publisher-supplied occurrence IDs are duplicated between datasets? What if some publishers simply use consecutive integers as IDs in all their datasets?
How is it supposed to behave for occurrences associated with event or taxon cores?

oleh.prylutskyi · August 3, 2023, 10:02am

Moreover, I have found that there may be records with issues in occurrenceID (I cannot figure out the nature of the issue) that cannot be retrieved with your approach, like this one.

jwaller · August 10, 2023, 8:48am

@oleh.prylutskyi

Sorry this took me a while to answer.

Thanks for pointing this out. I had to ask around GBIF to figure out what to do in cases where a publisher doesn’t supply an occurrenceId.

Apparently, in those cases you would have to use the triplet code…

catalogue number/collection code/institution code

Below is an example that I got to work

library(rgbif)
library(purrr)
library(dplyr)

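# example gbif ids
gbif_ids <- c(2283471077, 2283464289)

# fetch the full record for each gbif id and stack them into one data frame
d <- gbif_ids %>%
  map(~ occ_get(.x, fields = "all")[[1]]$data) %>%
  bind_rows()

# occ_download() combines these predicates with AND,
# so a record must match on all four fields
occ_download(
  pred_in("datasetKey", d$datasetKey),
  pred_in("catalogNumber", d$catalogNumber),
  pred_in("collectionCode", d$collectionCode),
  pred_in("institutionCode", d$institutionCode)
)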

This is pretty awkward, but it works.

Can I ask why you are stuck with gbifids in the first place, rather than, say, a list of datasetKeys or some other filter?

I also made an issue about gbifid downloads here

oleh.prylutskyi · August 10, 2023, 11:22am

Thank you, @jwaller .

It’s not easy to explain why. I need to collect occurrences for a long list of scientific names while keeping the relations between GBIF’s occurrences and (my) names, so that I can link them to the names’ attributes. The results should not be mixed up. It seemed to me that occ_download() is designed to return a set of occurrences for one particular query. Though it can be a complex query, I didn’t find a way to run a single query that returns occurrences per name. So I decided that the simplest way to achieve my goal was to loop over my scientific names using separate queries. As a result, I had a list of gbifIDs associated with my names (IDs, actually), and hoped to feed it to occ_download() as a single query.

Additionally, the Backbone failed to resolve all issues and find proper occurrences for a lot of names. So for some ‘bad’ names I searched by verbatimScientificName, and for the rest (let’s call them ‘good’) just by scientificName.

You can, if interested, check my workflow on github. Sorry for some Ukrainian in the flowchart – I made it for myself.

I have just realized that it would be very handy if occ_search() / occ_download() could add some identifier to the search results, like an ID for the query.

dnoesgaard · August 29, 2023, 6:47am

This is now possible in the API, so I suspect rgbif will need to be updated:

E.g. https://www.gbif.org/occurrence/download/0000255-230828120925497:

{
  "type": "in",
  "key": "GBIF_ID",
  "values": [
    "3056619651",
    "3051849318"
  ]
}
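
For anyone who wants to try this before rgbif catches up, here is a minimal sketch of wrapping that predicate into a full download request body from R, assuming jsonlite is installed. The helper name build_gbif_id_request() and the placeholder user name are hypothetical; only the predicate itself comes from the example above. The sketch just builds the JSON and does not send anything to GBIF.

```r
library(jsonlite)

# Hypothetical helper: build the JSON body of an occurrence download request
# that filters on GBIF_ID with an "in" predicate. Nothing is sent to GBIF here.
build_gbif_id_request <- function(gbif_ids, creator, email, format = "SIMPLE_CSV") {
  body <- list(
    creator = creator,
    notificationAddresses = as.list(email),  # a list stays a JSON array
    sendNotification = TRUE,
    format = format,
    predicate = list(
      type = "in",
      key = "GBIF_ID",
      # ids as strings without scientific notation, as in the example above
      values = as.list(trimws(format(gbif_ids, scientific = FALSE)))
    )
  )
  toJSON(body, auto_unbox = TRUE, pretty = TRUE)
}

request_json <- build_gbif_id_request(
  gbif_ids = c(3056619651, 3051849318),
  creator = "my_gbif_username",  # placeholder, replace with your own
  email = "userEmail@example.org"
)
cat(request_json)
```

The resulting string could then be POSTed to the occurrence download request endpoint with your GBIF credentials, for example with httr or curl.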

system · September 28, 2023, 4:47pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.

jwaller · April 3, 2024, 11:27am

sant · April 16, 2024, 3:14pm

@dnoesgaard how many ids can be included and downloaded per request?

dnoesgaard · April 16, 2024, 3:30pm

I believe the limit is 101,000 “things”, where every element in an “in” predicate counts as one, plus every other comparison predicate (equals, not, greaterThan etc).
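
If a list of ids is longer than that, one option is to split it into batches and request one download per batch. A minimal sketch in base R: the helper name split_ids() is hypothetical, and the default batch size of 100,000 is only an assumption that leaves some headroom under the limit mentioned above for other predicates.

```r
# Hypothetical helper: split a vector of ids into batches of at most
# `max_per_request` elements, keeping the original order.
split_ids <- function(ids, max_per_request = 100000) {
  stopifnot(max_per_request >= 1)
  if (length(ids) == 0) return(list())
  unname(split(ids, ceiling(seq_along(ids) / max_per_request)))
}

# toy example with a small batch size so the result is easy to inspect
batches <- split_ids(1:25, max_per_request = 10)
lengths(batches)
```

Each element of batches could then be passed to its own download request.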

sant · April 17, 2024, 7:09am

Thanks @dnoesgaard !

I wonder if there is any API switch to select (limit) the returned fields and so greatly reduce the download size.

We are interested in tracking changes in just one field (scientificName) for a long, selected list of occurrences.

dnoesgaard · April 17, 2024, 2:56pm

Maybe this could be a solution?

https://techdocs.gbif.org/en/data-use/api-sql-downloads

(I haven’t tried it myself)

sant · April 17, 2024, 6:38pm

Thanks @dnoesgaard I’ll give it a try.
You mean running a query like this, don’t you?

{
  "sendNotification": true,
  "notificationAddresses": [
    "userEmail@example.org" 
  ],
  "format": "SQL_TSV_ZIP", 
  "sql": "SELECT key,scientificName FROM occurrence WHERE key IN (13883, 21134, ... 8347)" 
}

But the link says it’s experimental “only available for preview by invited users”.
Do we need to email helpdesk@gbif.org or should this already work with the usual download api authentication?
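
For a long list of ids, the "sql" string in a request like the one above could be generated rather than typed by hand. A minimal sketch in base R: the helper name build_id_sql() is hypothetical, and the id column defaults to the one used in the query above, which has not been checked against the SQL download documentation.

```r
# Hypothetical helper: build the SQL text for a list of ids.
# `id_column` defaults to the column used in the query above; check it
# against the SQL download documentation before relying on it.
build_id_sql <- function(ids,
                         fields = c("key", "scientificName"),
                         id_column = "key") {
  ids <- trimws(format(ids, scientific = FALSE))
  # accept only purely numeric ids so nothing unexpected ends up in the SQL text
  stopifnot(length(ids) > 0, all(grepl("^[0-9]+$", ids)))
  sprintf(
    "SELECT %s FROM occurrence WHERE %s IN (%s)",
    paste(fields, collapse = ", "),
    id_column,
    paste(ids, collapse = ", ")
  )
}

sql <- build_id_sql(c(13883, 21134, 8347))
sql
```

The resulting string would go into the "sql" field of the request body.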

dnoesgaard · April 17, 2024, 7:07pm

Good question, I actually don’t know. Maybe @MattBlissett can help?

MattBlissett · April 18, 2024, 6:32am

Yes, please email helpdesk@gbif.org for access. (Anyone interested may ask for access, we will update the documentation with this information later today.)
