Download occurrences by IDs with rgbif

Hi there,
I have a long vector of occurrence IDs (that is, GBIF’s globally unique identifiers as used in occurrence URLs, e.g. “2992081821”). I would like to download GBIF data for those occurrences, but I could not find a suitable predicate for occ_download(). pred_in("occurrenceId", myvector) seems to match the occurrence identifier provided by the publisher instead.
Can you help me with that?
Cheers,
Oleh

@jwaller @markus @mgrosjean can you give me a piece of advice?

Unfortunately, it looks like you will have to fetch all of the associated publisher-supplied occurrenceIDs first.

I have written an example below.

library(rgbif)
library(dplyr)
library(purrr)

# Get some GBIF IDs (occurrence keys) to use as an example
gbif_ids <- occ_search(country = "UA", limit = 20)$data$key

# Look up the publisher-supplied occurrenceID for each key
# (this might take a while if you have a lot of them)
occurrenceId <- gbif_ids %>%
  as.numeric() %>%
  map_chr(~ occ_get(.x, fields = "all")[[1]]$data$occurrenceID)

# Then send those to occ_download()
occ_download(pred_in("occurrenceId", occurrenceId))

the download from the example

Thank you, John, it works, but I am surprised that there is no way to search by gbifID; it seems like the most basic query possible in any database. Is this feature planned for a future release of rgbif?
I see a lot of pitfalls in this approach: what if publisher-supplied occurrence IDs are duplicated between datasets? What if some publishers simply use consecutive integers as IDs in all their datasets?
How is it supposed to behave for occurrences associated with event or taxon cores?

Moreover, I have found that there may be records with issues in occurrenceID (I cannot figure out the nature of the issue) that cannot be retrieved with your approach, like this one.

@oleh.prylutskyi

Sorry this took me a while to answer.

Thanks for pointing this out. I had to ask around at GBIF to figure out what to do in cases where a publisher doesn’t supply an occurrenceId.

Apparently, in those cases you would have to use the triplet code…

catalogue number/collection code/institution code

Below is an example that I got to work

library(rgbif)
library(purrr)
library(dplyr)

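# Example GBIF IDs
gbif_ids <- c(2283471077, 2283464289)

# Fetch the full record for each ID and stack them into one data frame
d <- gbif_ids %>%
  map(~ occ_get(.x, fields = "all")[[1]]$data) %>%
  bind_rows()

# Filter on the dataset plus the catalogue number / collection code / institution code
occ_download(
  pred_in("datasetKey", d$datasetKey),
  pred_in("catalogNumber", d$catalogNumber),
  pred_in("collectionCode", d$collectionCode),
  pred_in("institutionCode", d$institutionCode)
)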

This is pretty awkward, but it works.
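
One caveat with a call like the one above: predicates passed to occ_download() are combined with AND, so any mix of the listed catalogue numbers, collection codes and institution codes can match, not only the original combinations. A minimal sketch of a stricter variant follows, assuming rgbif's pred(), pred_and() and pred_or() (with its .list argument) behave as documented. The helper names unique_triplets() and triplet_predicate() are hypothetical, and the sketch assumes all four columns are present and non-missing in d.

```r
# Hypothetical helper: keep one row per distinct dataset/triplet combination.
# Pure base R, so it can be checked without network access.
unique_triplets <- function(d) {
  cols <- c("datasetKey", "catalogNumber", "collectionCode", "institutionCode")
  out <- unique(as.data.frame(d)[, cols, drop = FALSE])
  rownames(out) <- NULL
  out
}

# Hypothetical helper: one pred_and() per record, joined with pred_or(),
# so that only the exact combinations are requested. Requires rgbif.
triplet_predicate <- function(d) {
  trip <- unique_triplets(d)
  per_record <- lapply(seq_len(nrow(trip)), function(i) {
    rgbif::pred_and(
      rgbif::pred("datasetKey", trip$datasetKey[i]),
      rgbif::pred("catalogNumber", trip$catalogNumber[i]),
      rgbif::pred("collectionCode", trip$collectionCode[i]),
      rgbif::pred("institutionCode", trip$institutionCode[i])
    )
  })
  # A single record needs no surrounding "or"
  if (length(per_record) == 1) return(per_record[[1]])
  rgbif::pred_or(.list = per_record)
}

# Usage (needs GBIF credentials), with d built as in the example above:
# occ_download(triplet_predicate(d))
```

Every record contributes four comparisons here, so this variant grows much faster than four pred_in() calls and is only practical for modest numbers of records.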

Can I ask why you are stuck with gbifIDs in the first place, rather than, say, a list of datasetKeys or some other filter?

I also made an issue about gbifid downloads here

Thank you, @jwaller .

It’s not easy to explain why. I need to collect occurrences for a long list of scientific names while keeping the relations between GBIF occurrences and my names, so that I can link them to the names’ attributes. The results should not get mixed up. It seemed to me that occ_download() is meant to return a set of occurrences for one particular query. Though that query may be complex, I didn’t find a way to run a single query that returns occurrences per name. I then decided that the simplest way to achieve my goal was to loop across my scientific names using separate queries. As a result, I had a list of gbifIDs associated with my names (with their IDs, actually), and hoped to feed it to occ_download() as a single query.

Additionally, the Backbone failed to resolve all issues and find the proper occurrences for a lot of names. So for some ‘bad’ names I searched by verbatimScientificName, and for the others (let’s call them ‘good’) just by scientificName.

If you are interested, you can check my workflow on GitHub. Sorry for some Ukrainian in the flowchart; I made it for myself.

I have just realized that it would be very handy if occ_search() / occ_download() could add some identifier to the search results, such as an ID for the query.
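
Until something like that exists, the "ID for the query" can be approximated on the client side. A minimal sketch, assuming one search call per name; tag_results() and the query_name column are hypothetical names, and the search function is passed in as an argument so the tagging logic can be tried without network access.

```r
# Hypothetical helper: run one search per name and label every returned row
# with the name that produced it, so results from different names stay apart.
tag_results <- function(query_names, search_fun) {
  per_name <- lapply(query_names, function(nm) {
    res <- search_fun(nm)
    if (is.null(res) || nrow(res) == 0) return(NULL)  # name without records
    res <- as.data.frame(res)
    res$query_name <- nm
    res[, c("query_name", "key", "scientificName"), drop = FALSE]
  })
  per_name <- Filter(Negate(is.null), per_name)
  if (length(per_name) == 0) {
    return(data.frame(query_name = character(), key = character(),
                      scientificName = character()))
  }
  out <- do.call(rbind, per_name)
  rownames(out) <- NULL
  out
}

# With rgbif (assumption: occ_search()$data has key and scientificName columns,
# and is NULL when nothing matches):
# gbif_search <- function(nm) rgbif::occ_search(scientificName = nm, limit = 300)$data
# id_map <- tag_results(my_names, gbif_search)
```

The resulting table maps each query name to the GBIF keys it returned, which can later be joined to a download on the key column.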

This is now possible in the API, so I suspect rgbif will need to be updated:

E.g. https://www.gbif.org/occurrence/download/0000255-230828120925497:

{
  "type": "in",
  "key": "GBIF_ID",
  "values": [
    "3056619651",
    "3051849318"
  ]
}
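
A predicate of this shape is easy to generate from a vector of IDs. A minimal sketch in base R follows; gbif_id_predicate() and predicate_json() are hypothetical helpers, not part of rgbif. Recent rgbif releases may accept the same filter directly (for example as pred_in("gbifId", ids)), but that key name is an assumption to check against the documentation of the installed version.

```r
# Hypothetical helper: build the "in" predicate on GBIF_ID as a plain list.
gbif_id_predicate <- function(ids) {
  # Avoid scientific notation such as "1e+05" when IDs are stored as numbers
  if (is.numeric(ids)) ids <- format(ids, scientific = FALSE, trim = TRUE)
  ids <- unique(as.character(ids))
  stopifnot(length(ids) > 0, all(grepl("^[0-9]+$", ids)))
  list(type = "in", key = "GBIF_ID", values = ids)
}

# Hypothetical helper: render the predicate as JSON without extra packages.
predicate_json <- function(p) {
  vals <- paste0('"', p$values, '"', collapse = ", ")
  sprintf('{"type": "%s", "key": "%s", "values": [%s]}', p$type, p$key, vals)
}

p <- gbif_id_predicate(c(3056619651, 3051849318))
cat(predicate_json(p), "\n")
```

The IDs are kept as strings, as in the request shown above, which also sidesteps any loss of precision for large numeric values.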

@dnoesgaard How many IDs can be included and downloaded per request?

I believe the limit is 101,000 “things”, where every element in an “in” predicate counts as one, plus every other comparison predicate (equals, not, greaterThan etc).
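
If a vector of IDs is longer than that, it can be split across several requests. A minimal sketch in base R; chunk_ids() is a hypothetical helper, and the default batch size of 100,000 is an assumption chosen to stay below the limit quoted above while leaving room for a few other predicates.

```r
# Hypothetical helper: split a vector of IDs into batches that each fit
# into a single download request.
chunk_ids <- function(ids, max_per_request = 100000L) {
  ids <- unique(ids)
  if (length(ids) == 0) return(list())
  # Batch number for every ID: 1 for the first max_per_request IDs, then 2, ...
  groups <- ceiling(seq_along(ids) / max_per_request)
  unname(split(ids, groups))
}

# Usage: one download request per batch, e.g.
# for (batch in chunk_ids(gbif_ids)) { ... build and submit one request ... }
```

Each batch then becomes its own download, so the resulting files have to be combined afterwards.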

Thanks @dnoesgaard !

I wonder if there is an API switch to select (limit) the returned fields and thereby greatly reduce the download size.

We are interested in tracking changes in just one field (scientificName) for a long, selected list of occurrences.

Maybe this could be a solution?

https://techdocs.gbif.org/en/data-use/api-sql-downloads

(I haven’t tried it myself)

Thanks @dnoesgaard I’ll give it a try.
You mean running a query like this, don’t you?

{
  "sendNotification": true,
  "notificationAddresses": [
    "userEmail@example.org" 
  ],
  "format": "SQL_TSV_ZIP", 
  "sql": "SELECT key,scientificName FROM occurrence WHERE key IN (13883, 21134, ... 8347)" 
}
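
The SQL string in such a request can be generated from a vector of IDs. A minimal sketch in base R; build_sql() is a hypothetical helper, and the column names are taken as-is from the query above. Whether the SQL table exposes the identifier as key or under another name (such as gbifid) is an assumption to check against the SQL download documentation.

```r
# Hypothetical helper: build a SELECT statement restricted to a set of IDs.
build_sql <- function(ids, id_col = "key",
                      fields = c("key", "scientificName")) {
  # Avoid scientific notation when IDs are stored as numbers
  if (is.numeric(ids)) ids <- format(ids, scientific = FALSE, trim = TRUE)
  ids <- unique(as.character(ids))
  # Only digits are allowed, so nothing unexpected ends up in the SQL text
  stopifnot(length(ids) > 0, all(grepl("^[0-9]+$", ids)))
  sprintf("SELECT %s FROM occurrence WHERE %s IN (%s)",
          paste(fields, collapse = ","), id_col, paste(ids, collapse = ", "))
}

build_sql(c(13883, 21134, 8347))
```

Keeping id_col and fields as arguments means the column names can be adjusted in one place once the actual schema is confirmed.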

But the link says it’s experimental and “only available for preview by invited users”.
Do we need to email helpdesk@gbif.org, or should this already work with the usual download API authentication?

Good question, I actually don’t know. Maybe @MattBlissett can help?

Yes, please email helpdesk@gbif.org for access. (Anyone interested may ask for access, we will update the documentation with this information later today.)
