Downloading and citing occurrence data for multiple taxa

I’m often asked by data users what the best way is to get and cite data for multiple taxa. We’ve always recommended that people include as many taxa as possible in their searches to limit the number of downloads and thus DOIs requiring citation. Unfortunately, due to technical limits this has not always been feasible—especially for users requiring data for hundreds or thousands of species.

I’m happy to report that we’ve recently been able to significantly ease the restrictions on the number of taxa per query. As there are several systems in place that impose bottlenecks, including the user’s own browser, it’s difficult to give an exact number. But we’ve logged successful downloads of more than 3,500 concurrent species, which should be enough for the majority of users.

For multiple-taxa downloads, keep in mind that you can reduce complexity significantly by requesting data at a higher taxon level and then filtering the data locally.
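As a rough illustration of that local-filtering step, here is a hedged sketch in Python that trims a SIMPLE_CSV download (which is tab-separated) down to a set of species. The file paths and the assumption that your species names appear in a `species` column are placeholders; check the column headers of your own download.

```python
# Sketch: filter a tab-separated GBIF SIMPLE_CSV download locally,
# keeping only rows for the species you actually need.
# Assumes the download has a "species" column (check your own header row).
import csv

def filter_download(in_path, out_path, wanted_species):
    """Copy rows whose 'species' value is in wanted_species to out_path."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                                delimiter="\t")
        writer.writeheader()
        for row in reader:
            if row["species"] in wanted_species:
                writer.writerow(row)
```

For very large downloads you would stream like this rather than load everything into memory; the DOI still covers the full download, as discussed further down this thread.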

We will continue to explore ways of making downloading data easier, including removing limits on number of taxa. In the meantime, should you run into query limits, please get in touch with helpdesk@gbif.org and they might be able to help.


As far as I can tell, the limit seems to be much higher than 3,500.

https://www.gbif.org/occurrence/download/0009331-190621201848488 # 5000 taxon_keys (success)
https://www.gbif.org/occurrence/download/0009335-190621201848488 # 6000 taxon_keys (success)
https://www.gbif.org/occurrence/download/0010210-190621201848488 # 7000 taxon_keys (success)
https://www.gbif.org/occurrence/download/0010212-190621201848488 # 8000 taxon_keys (success)
https://www.gbif.org/occurrence/download/0010219-190621201848488 # 9000 taxon_keys (success)
https://www.gbif.org/occurrence/download/0010226-190621201848488 # 10000 taxon_keys (fails)

I was able to download up to 9,000 taxon_keys. It seems to fail at 10,000 taxon_keys.

Thanks John.

Just to be clear—it’s not a limit on the number of taxon keys per se, but on the number of characters/bytes in the request.
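Since the cap is on request size rather than key count, one practical check is to measure how big your JSON body is before submitting it. A minimal sketch, assuming a simplified request body (the exact server-side byte limit is not documented here, so the threshold you compare against is your own guess):

```python
# Sketch: estimate the size in bytes of a download request body,
# since the limit is on request size rather than number of taxon keys.
import json

def request_size_bytes(taxon_keys):
    """Return the UTF-8 byte length of a minimal JSON request body."""
    body = {
        "creator": "user",
        "format": "SIMPLE_CSV",
        "predicate": {
            "type": "in",
            "key": "TAXON_KEY",
            "values": list(taxon_keys),
        },
    }
    return len(json.dumps(body).encode("utf-8"))
```

Longer taxon keys take more bytes, so two requests with the same number of keys can differ in size.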

I assume you did these by POSTing to the API?

Yes, I did a POST to the API like this, replacing user:password with my GBIF username and password.

curl --include --user user:password --header "Content-Type: application/json" --data @file_9000.json http://api.gbif.org/v1/occurrence/download/request

Where the file file_9000.json looked like this, but with 9,000 taxon_keys:

{
  "creator": "jwaller",
  "notification_address": [
    "jwaller@gbif.org"
  ],
  "sendNotification": true,
  "format": "SIMPLE_CSV",
  "predicate": {
    "type": "and",
    "predicates": [
      {
        "type": "in",
        "key": "TAXON_KEY",
        "values": [
          1000003,
          1000094,
          ___MANY MORE KEYS____
          1000095,
          1000096
        ]
      }
    ]
  }
}
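If you are assembling a request like this for thousands of keys, writing the JSON by hand gets error-prone. A small sketch that generates such a file from any list of taxon keys (the creator name, email, and keys below are placeholders, not a recommendation of specific values):

```python
# Sketch: generate a download-request file like file_9000.json
# from a Python list of taxon keys.
import json

def build_request(creator, email, taxon_keys):
    """Return a request body matching the structure shown above."""
    return {
        "creator": creator,
        "notification_address": [email],
        "sendNotification": True,
        "format": "SIMPLE_CSV",
        "predicate": {
            "type": "and",
            "predicates": [
                {
                    "type": "in",
                    "key": "TAXON_KEY",
                    "values": list(taxon_keys),
                }
            ],
        },
    }

# Placeholder values; substitute your own username, email, and keys.
body = build_request("jwaller", "jwaller@gbif.org", [1000003, 1000094])
with open("file_9000.json", "w", encoding="utf-8") as f:
    json.dump(body, f, indent=2)
```

The resulting file can then be POSTed with the curl command shown earlier.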

requesting data at a higher taxon level and then filtering the data locally.

But the DOI will be applied to all records in the initial download. Isn’t the point of citing data to give credit to those who supplied data AND to encourage repeatability of new science? If you encourage local filtering after download then this devalues the DataCite DOI.

Agreed. I wouldn’t recommend doing this unless it is your only option.

On the other hand, in my experience almost every single download of GBIF data does involve some degree of local filtering, with records being discarded. Sometimes this could (and should) have been done before downloading, but a user may also have reasons to remove records by criteria for which GBIF doesn’t provide filters.

Pragmatically, I would prefer that people cite a download DOI containing a few more records than are actually “used” in an analysis, rather than not citing a DOI at all.

Ideally, this reinforces the concept of a reference dataset—that a user can download data, clean and filter it, and then somehow re-upload that dataset to GBIF. To avoid duplicate records, however, we would need persistent identifiers for occurrences for this to make sense.

Quick update: you can now download up to 100,000 taxa in a single request. Knock yourselves out, kids.