I’m considering letting users of an R package I maintain, spocc (https://github.com/ropensci/spocc/), use GBIF occurrence downloads in addition to the occurrence search API. I imagine this would be a good addition because it would make it easier for users to get a citation for the GBIF data they’ve used (since they’d have gone the downloads route rather than search).
However, AFAIK there’s no way to limit the number of results with occurrence downloads, correct? Or maybe I’m wrong?
I worry that if I allow users to use occurrence downloads, they won’t really grasp what is going on behind the scenes and will easily trigger a bunch of enormous queries they didn’t intend, and then wait a long time for results without understanding why. Of course, sometimes a user will want a lot of data, but I’d at least like to make it possible to limit results so that on a first pass users can quickly get data back without waiting a long time (and put less burden on your async download servers?).
Might there be a way to say in the occurrence downloads API that you want a maximum of X records returned?
First off, I would love to see downloads implemented in spocc! That being said, I don’t believe there’s a way to limit the number of records returned in a download.
> but I’d at least like to make it possible to limit results so on a first pass users can quickly get back data without waiting a long time (and putting less burden on your async download servers?).
Wouldn’t the search API be the best option for a first pass? Frankly, this is the approach I would like to see users take (whether direct API consumers or R users), mimicking what happens when getting data from the GBIF.org website. You do a first-pass search to adjust your parameters and filters (using the search API in the background), and once you’re satisfied, you hit the download button to get the data (triggering a call to the download API).
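To make the "search first, download later" idea concrete, here is a minimal sketch in Python (chosen only for illustration; no network calls are made). It builds a search API URL capped with the `limit` parameter and uses the `count` field of a search response to decide whether the user should confirm before a full download is triggered. The helper names `preview_url` and `needs_confirmation` and the 100,000-record threshold are assumptions made up for this example, not part of spocc, rgbif, or the GBIF API.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.gbif.org/v1/occurrence/search"


def preview_url(filters, limit=100):
    """Build a search API URL that returns at most `limit` records.

    `filters` maps search parameter names (e.g. "taxonKey") to values.
    """
    params = dict(filters)
    params["limit"] = limit
    # Sort the parameters so the URL is deterministic.
    return SEARCH_URL + "?" + urlencode(sorted(params.items()))


def needs_confirmation(search_response, max_records=100_000):
    """Return True if the full result set exceeds `max_records`.

    `search_response` is the parsed JSON body of a search request; its
    `count` field holds the total number of matching records.
    The default threshold is an arbitrary, hypothetical value.
    """
    return search_response["count"] > max_records


if __name__ == "__main__":
    print(preview_url({"taxonKey": 2435099, "country": "US"}, limit=50))
```

A wrapper along these lines could show the first page of results immediately and only offer the download step once the user has seen how many records the same filters would return.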
I’m not sure how this would work in practice, though…
You’re right that it makes sense for users to use the search API as a first pass, then the download API to get the final data.
There are a number of bumps in the road, however:
User interfaces differ: the way you query differs between the search and download APIs. Ideally I will be able to hide this complexity in spocc, but I’ll still need to allow power users to use the more flexible download API query interface.
If there were a way to limit results in the download API, we’d be able to have a default limit of, say, 100, and only have to deal with the download query interface, simplifying the package.
Immediate data return vs. async waiting: I’m not quite sure how to deal with this. R users are generally not very technical, so it will be hard to educate people about the difference between data being returned immediately and having to wait for data to be prepared and then download it later. It will introduce cognitive dissonance because the other data sources return data immediately.
Rate limits for the async download service: rate limits are definitely justified, but they may cause some issues. I do have a download queue method in rgbif (https://docs.ropensci.org/rgbif/reference/occ_download_queue.html), so that may help users avoid having to worry about their rate limits.
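The first bump above, the differing query interfaces, can be sketched as a small translation layer: the same flat filters a user would pass to the search API are turned into the JSON predicate body that the download API expects. This is only an illustration in Python under stated assumptions. The function name `to_download_request` and the four-entry mapping table are hypothetical, only `equals` and `and` predicates are covered, and nothing is sent over the network.

```python
import json

# Search API parameter name -> download API predicate key.
# Only a handful of common filters are mapped in this sketch.
SEARCH_TO_DOWNLOAD_KEY = {
    "taxonKey": "TAXON_KEY",
    "country": "COUNTRY",
    "year": "YEAR",
    "hasCoordinate": "HAS_COORDINATE",
}


def _as_text(value):
    """Render a filter value as a string; booleans become 'true'/'false'."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def to_download_request(filters, creator, email, fmt="SIMPLE_CSV"):
    """Translate flat search-style filters into a download request body."""
    if not filters:
        raise ValueError("refusing to build an unfiltered download request")
    predicates = [
        {"type": "equals",
         "key": SEARCH_TO_DOWNLOAD_KEY[name],
         "value": _as_text(value)}
        for name, value in sorted(filters.items())
    ]
    # A single filter stands alone; several filters are combined with "and".
    if len(predicates) == 1:
        predicate = predicates[0]
    else:
        predicate = {"type": "and", "predicates": predicates}
    return {
        "creator": creator,
        "notificationAddresses": [email],
        "format": fmt,
        "predicate": predicate,
    }


if __name__ == "__main__":
    body = to_download_request(
        {"taxonKey": 2435099, "hasCoordinate": True},
        creator="some_user", email="user@example.org",
    )
    print(json.dumps(body, indent=2))
```

Refusing an empty filter set is a deliberate choice here: since the download API has no record limit, an unfiltered request is exactly the kind of accidental enormous query discussed above. Power users would still need direct access to the full predicate syntax, which this mapping does not attempt to cover.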