Downloading occurrences from a long list of species in R and Python - GBIF Data Blog

Until recently it was not possible to download more than a few hundred species at the same time. This is unfortunately still true for downloads through the portal, which are limited to around 200 taxon keys (species, genus, family, kingdom …). This is due to limitations in browsers and the HTTP GET size limit. A recent change has now made it possible to request more species names (up to 9000 in some cases) through an HTTP request.
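As a rough illustration, such a request is a JSON body with an "in" predicate over TAXON_KEY, POSTed to https://api.gbif.org/v1/occurrence/download/request. A minimal Python sketch (the user name and taxon keys below are placeholders; taxon keys are serialised as strings, as in the API examples):

```python
import json

def build_taxon_download_query(user, taxon_keys, fmt="SIMPLE_CSV"):
    """Build the JSON body for POST https://api.gbif.org/v1/occurrence/download/request."""
    return {
        "creator": user,
        "sendNotification": False,
        "format": fmt,
        # One "in" predicate can carry a long list of taxon keys:
        "predicate": {
            "type": "in",
            "key": "TAXON_KEY",
            "values": [str(k) for k in taxon_keys],
        },
    }

query = build_taxon_download_query("myuser", [5687869, 2858501])
print(json.dumps(query, indent=2))
```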


This is a companion discussion topic for the original entry at https://data-blog.gbif.org/post/downloading-long-species-lists-on-gbif/

About rgbif and Windows 10:

To make rgbif install, there is a dependency called 'isoband' that will not install through the ordinary install.packages() mechanism. I could only make it work with remotes::install_github("wilkelab/isoband") - this requires installing the 'remotes' package as well. Before the isoband binaries can be built, Rtools needs to be installed (rtools40 is only needed to build R packages with C/C++/Fortran code from source).
When Rtools and isoband are in place, rgbif will install and John's code will work splendidly.

Rtools 4.0 depends on Rcpp on Windows.
Use the repo argument:
install.packages("Rcpp", repos = "https://rcppcore.github.io/drat", type = "source")

It is recommended that you restart the R session after these installations.

In case it is of interest, there is also a small error in the code that obtains data from rgbif. This part:

pull(usagekey) %>% # get the gbif taxonkeys

should be with a capital K:

pull(usageKey) %>% # get the gbif taxonkeys.

I hope I was able to help.


Thanks, I fixed it in the post.


@jwaller @mgrosjean any reason for me getting 401 response when trying your Python function like this?

	download_query = {}
	download_query["creator"] = ""
	download_query["notificationAddresses"] = [""]
	download_query["sendNotification"] = False # if set to be True, don't forget to add a notificationAddresses above
	download_query["format"] = "SIMPLE_CSV"
	download_query["predicate"] = {
		"type": "in",
		"key": "TAXON_KEY",
		"values": [5687869,2858501,5354656,5333411] # tried this and got 401
	}
	# Generate download
	response = create_download_given_query(login='myuser', password='mypass', download_query=download_query)
	print(response.status_code)

My login and password are the same ones I use when creating downloads on the portal page (and they work there).

Maybe an empty "creator" was not mandatory when you created the post, but it is now?
What value should we use there? (email, user, any text …)

Also, is there any way to catch the key/URL of the preparing download from the API response the function returns? (so my script can store it and check the download status later on)

Thanks!

Hi @sant,

I tried your query and it worked for me. I did get a Response [503] if I left the creator empty, though. I am surprised that you get a 401 - are you sure that your login information is correct?

For your other question, response.content should return the download keys for the successful downloads.
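For example, once you have that key you can poll the public status endpoint https://api.gbif.org/v1/occurrence/download/{key}. A hedged sketch (not part of pygbif; the key below is illustrative):

```python
import json
import urllib.request

def status_url(key):
    """Public GBIF endpoint describing one occurrence download."""
    return f"https://api.gbif.org/v1/occurrence/download/{key}"

def download_status(key):
    """Fetch the download's status, e.g. PREPARING, RUNNING, SUCCEEDED."""
    with urllib.request.urlopen(status_url(key)) as resp:
        return json.load(resp)["status"]

# The key would come from response.content of the create call:
print(status_url("0000018-240318150302937"))
```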

Thanks for explaining @mgrosjean
Yes I am sure my login information is correct.

I checked the code used in pygbif download.py and it passes the user name in the creator parameter.
It also has user in other parts of the code, which confuses me a bit.
But I think both contain the same value.

I can confirm my above code works now if I also pass ‘myuser’ as “creator” in the second line.
Maybe your user has special permissions?

This is what I get if I pass different “creator” values:

  • download_query["creator"] = "blahblah"

    401: "myuser not allowed to create download with creator blahblah"

  • download_query["creator"] = ""

    401: "myuser not allowed to create download with creator "

  • download_query["creator"] = "myuser"

    201: "0000018-240318150302937"

For me it only worked that way. But it works! :+1:

Actually, I just found "creator": "userName" explained in the documentation, so I think there is no other way. It's odd needing to pass the name twice, in both the HTTP request (as the authenticated user) and the JSON data.
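A small sketch of why the name appears twice: it goes once into the HTTP Basic Auth header and once into the "creator" field of the JSON body, and the API rejects the request if they disagree. The helper below simply keeps them in sync (the function name and arguments are my own, not from pygbif):

```python
import base64
import json

def build_request(user, password, download_query):
    """Return (headers, body) for POST https://api.gbif.org/v1/occurrence/download/request."""
    # Force "creator" to match the Basic Auth user, since the API requires it.
    download_query = dict(download_query, creator=user)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
    }
    return headers, json.dumps(download_query)

headers, body = build_request("myuser", "mypass", {"format": "SIMPLE_CSV"})
print(json.loads(body)["creator"])  # the creator now matches the auth user
```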