Downloading and citing occurrence data for multiple taxa

I’m often asked by data users what the best way is to get and cite data for multiple taxa. We’ve always recommended that people include as many taxa as possible in their searches to limit the number of downloads and thus DOIs requiring citation. Unfortunately, due to technical limits this has not always been feasible, especially for users requiring data for hundreds or thousands of species.

I’m happy to report that we’ve recently been able to significantly ease the restrictions on the number of taxa per query. As there are several systems in place that impose bottlenecks, including the user’s own browser, it’s difficult to give an exact number. But we’ve logged successful downloads of more than 3,500 concurrent species, which should be enough for the majority of users.

For multiple-taxa downloads, keep in mind that you can reduce complexity significantly by requesting data at a higher taxon level and then filtering the data locally.
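As a sketch of what that might look like, assuming you know the taxonKey of a higher taxon (212, used here as the GBIF backbone key for the class Aves, is purely an illustration) and that the download uses the tab-separated SIMPLE_CSV format; the usernames, addresses, and file names are placeholders:

# Sketch: request everything under one higher taxon (taxonKey 212) instead
# of listing thousands of individual species keys.
cat > request_higher_taxon.json <<'EOF'
{
  "creator": "your_username",
  "notification_address": ["you@example.org"],
  "sendNotification": true,
  "format": "SIMPLE_CSV",
  "predicate": { "type": "equals", "key": "TAXON_KEY", "value": "212" }
}
EOF

# After the download arrives, keep only the species you actually need.
# SIMPLE_CSV is tab-separated; the binomial column is assumed here to be
# named "species", so the header row is scanned to locate it.
awk -F'\t' 'NR==FNR { want[$0]; next }
            FNR==1  { for (i = 1; i <= NF; i++) if ($i == "species") col = i; print; next }
            ($col in want)' my_species_list.txt occurrence.txt > filtered.txt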

We will continue to explore ways of making downloading data easier, including removing limits on the number of taxa. In the meantime, should you run into query limits, please get in touch with helpdesk@gbif.org, who may be able to help.


As far as I can tell, the limit seems to be much higher than 3,500.

https://www.gbif.org/occurrence/download/0009331-190621201848488 # 5000 taxon_keys (success)
https://www.gbif.org/occurrence/download/0009335-190621201848488 # 6000 taxon_keys (success)
https://www.gbif.org/occurrence/download/0010210-190621201848488 # 7000 taxon_keys (success)
https://www.gbif.org/occurrence/download/0010212-190621201848488 # 8000 taxon_keys (success)
https://www.gbif.org/occurrence/download/0010219-190621201848488 # 9000 taxon_keys (success)
https://www.gbif.org/occurrence/download/0010226-190621201848488 # 10000 taxon_keys (fails)

I was able to download up to 9,000 taxon_keys. It seems to fail at 10,000 taxon_keys.

Thanks John.

Just to be clear: it’s not a limit on taxon keys per se but on the number of characters/bytes in the request.
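(If you want to see how close a request body is to a size-based limit before POSTing it, assuming the payload sits in a file such as the hypothetical request.json, a quick check is:)

# Report the size of the request body in bytes before POSTing it.
wc -c request.json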

I assume you did these by POSTing to the API?

Yes, I did a POST to the API like this, replacing user:password with my GBIF username and password.

curl --include --user user:password --header "Content-Type: application/json" --data @file_9000.json http://api.gbif.org/v1/occurrence/download/request

The file file_9000.json looked like this, but with 9,000 taxon_keys:

{
  "creator": "jwaller",
  "notification_address": [
    "jwaller@gbif.org"
  ],
  "sendNotification": true,
  "format": "SIMPLE_CSV",
  "predicate": {
    "type": "and",
    "predicates": [
      {
        "type": "in",
        "key": "TAXON_KEY",
        "values": [
          1000003,
          1000094,
          ___MANY MORE KEYS____
          1000095,
          1000096
        ]
      }
    ]
  }
}
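For anyone scripting this at scale, here is a hedged sketch of one way such a file could be generated with jq from a plain list of keys, assuming a hypothetical keys.txt holding one numeric taxonKey per line (the creator and address are placeholders):

# Slurp the key list into a JSON array (-s) and splice it into the
# request template; the output matches the structure shown above.
jq -n --argjson keys "$(jq -s '.' keys.txt)" '{
  creator: "your_username",
  notification_address: ["you@example.org"],
  sendNotification: true,
  format: "SIMPLE_CSV",
  predicate: {
    type: "and",
    predicates: [ { type: "in", key: "TAXON_KEY", values: $keys } ]
  }
}' > file_9000.json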

requesting data at a higher taxon level and then filtering the data locally.

But the DOI will be applied to all records in the initial download. Isn’t the point of citing data to give credit to those who supplied data AND to encourage repeatability of new science? If you encourage local filtering after download then this devalues the DataCite DOI.

Agreed. I wouldn’t recommend doing this unless it is your only option.

On the other hand, in my experience almost every single download of GBIF data does involve some degree of local filtering, with records being discarded. Sometimes this could (and should) have been done before downloading, but a user may also have reasons to remove records by criteria for which GBIF doesn’t provide filters.

Pragmatically, I would prefer that people cite a download DOI containing a few more records than are actually “used” in an analysis rather than not citing a DOI at all.

Ideally, this reinforces the concept of a reference dataset—that a user can download data, clean and filter it, and then somehow re-upload that dataset to GBIF. To avoid duplicate records, however, we would need persistent identifiers for occurrences for this to make sense.

Quick update: you can now download up to 100,000 taxa in a single request. Knock yourselves out, kids.
