Web client considerations for a largish dataset

Hello everyone!

I am building a website for UT Austin to present a portion of our GBIF data: about 25,000 occurrence records that we have not yet uploaded to GBIF. I’m having trouble finding the information I need to plan my GBIF web client, as most of the available documentation seems geared toward desktop client tools.

The coordinates for this data are all sensitive, so we’ll be retrieving them from our local database and making them available only to approved, logged-in users. We plan to make the occurrence data itself a subset of our existing dataset, so that we can stay within our normal Specify/GBIF workflow while limiting the accuracy of the published coordinates for these particular specimens. We are applying lessons learned from an earlier project and planning to pull all other data from GBIF rather than from the local Specify database.

I’m considering three possible approaches:

(A) Have my web server download all the relevant data from GBIF once per week, storing it in a local database. (We upload the data to GBIF weekly.)
(B) Have my web server query GBIF on demand, temporarily caching the data in a local database. That is, a user visiting our website would get the data from our website, but the website would query GBIF behind the scenes and cache the result. (See the sketch just after this list.)
(C) Have the web browser make cross-origin requests to GBIF to retrieve data directly, without using our website as an intermediary.
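
To make (B) concrete, here is the kind of cache-through proxy I have in mind. This is only a sketch: the Express handler, the in-memory cache, and the collectionCode value “TEX” are placeholders for illustration, not our actual setup.

```typescript
import express from "express";

// Hypothetical cache-through proxy for approach (B): the browser talks to
// our server, which forwards requests to GBIF and caches the responses.
// (A real version would use our local database and handle expiry/errors.)
const app = express();
const cache = new Map<string, { body: unknown; fetchedAt: number }>();
const TTL_MS = 60 * 60 * 1000; // keep cache entries for one hour

app.get("/api/occurrences", async (req, res) => {
  const key = String(req.query.offset ?? "0");
  const hit = cache.get(key);
  if (hit && Date.now() - hit.fetchedAt < TTL_MS) {
    res.json(hit.body); // serve from cache without touching GBIF
    return;
  }
  const gbifUrl =
    `https://api.gbif.org/v1/occurrence/search?collectionCode=TEX` + // placeholder code
    `&limit=300&offset=${encodeURIComponent(key)}`;
  const body = await (await fetch(gbifUrl)).json();
  cache.set(key, { body, fetchedAt: Date.now() });
  res.json(body);
});

app.listen(3000);
```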

To select an approach, I need to answer the following questions:

(1) Does GBIF provide a way to programmatically download large data sets without requiring user participation such as visiting a download URL sent via email? If so, and if we make this data a subset of an existing dataset, how might we identify this subset of occurrences?
(2) I’ll have a list of all 25,000-or-so catalog numbers that we need from GBIF. What is an efficient way to download these occurrences regularly, if not by a common field? An XML query listing every number looks like it would be extremely bulky, and it’s not clear that the ‘q’ parameter allows disjunctions. (See the sketch just after this list for the kind of batched query I’m hoping is possible.)
(3) At what point will GBIF throttle my requests, or is there no excessive-request detection at all? At a maximum of 300 results per request, we’re looking at about 85 requests to download all the data. Would I need to spread those out over time? Could approach (B) become a problem with too many users?
(4) Will I need a special account or API key to accomplish any of the above?
(5) Does GBIF support CORS for cross-origin requests? Browsers normally disallow JavaScript downloaded from one website from issuing requests to another website, making an exception for websites that support the CORS header protocol.
(6) What sort of latency can I expect for occurrence data requests? I don’t want the browser timing out under approach (B), and I’d like to decide whether approach (C) is even a reasonable thing to do in today’s world of impatient web users. (We also plan to provide a degree of public access.)
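
To make question (2) concrete, below is the kind of batched query I’m hoping is possible. I’m assuming here that repeating the catalogNumber parameter combines the values with OR; that’s exactly the kind of thing I’d like confirmed.

```typescript
// Hypothetical batched lookup by catalog number, assuming repeated
// catalogNumber parameters are OR'd by /occurrence/search (unconfirmed).
async function fetchByCatalogNumbers(catalogNumbers: string[]) {
  const results: unknown[] = [];
  const batchSize = 50; // arbitrary; keeps URLs to a manageable length

  for (let i = 0; i < catalogNumbers.length; i += batchSize) {
    const params = new URLSearchParams({ limit: "300" });
    for (const cn of catalogNumbers.slice(i, i + batchSize)) {
      params.append("catalogNumber", cn);
    }
    const page = await (
      await fetch(`https://api.gbif.org/v1/occurrence/search?${params}`)
    ).json();
    // A real implementation would also page within a batch, in case one
    // batch matches more than 300 records.
    results.push(...page.results);
  }
  return results;
}
```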

Thanks for any help you can provide!

~joe

It looks like I should be able to accomplish (A) by issuing a POST to /occurrence/download/request and keying off a value I provide for COLLECTION_CODE, presumably given by the attribute collectionCode. The limit is 100,000 records per REST download, which will be plenty. I would do this in the background to avoid latency issues.
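
For the record, here’s a sketch of what that request might look like. I haven’t run this yet; the SIMPLE_CSV format, the predicate shape, and the “TEX” collection code are my reading of the docs plus placeholders, so treat it as an untested draft:

```typescript
// Untested sketch of a weekly download request against the GBIF download
// API. Assumes Node 18+ (global fetch) and GBIF_USER / GBIF_PASS env vars
// for a GBIF account; "TEX" is a placeholder collection code.
const auth = Buffer.from(
  `${process.env.GBIF_USER}:${process.env.GBIF_PASS}`
).toString("base64");

const response = await fetch(
  "https://api.gbif.org/v1/occurrence/download/request",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Basic ${auth}`,
    },
    body: JSON.stringify({
      creator: process.env.GBIF_USER,
      format: "SIMPLE_CSV",
      predicate: { type: "equals", key: "COLLECTION_CODE", value: "TEX" },
    }),
  }
);

// If accepted, the response body is a plain-text download key that can be
// used to poll for completion and fetch the finished archive.
const downloadKey = await response.text();
console.log("queued download:", downloadKey);
```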

Hey Joe,

Why this approach, if you don’t mind my asking? I might be missing some information but it seems a bit counterintuitive to download your “own” data from GBIF?

Something you might want to consider is the difference between getting occurrences synchronously via the search API (/occurrence/search) and asynchronously, with authentication, via the download API (/occurrence/download). If you plan to do a weekly batch, the latter would be the preferable option. However, if you want to do on-demand requests, you’ll need to use the former.

There is no limit on the number of records in the download API, whereas the search API is capped at 100k occurrences (the 100,000-record limit you mention actually applies there, not to downloads). We run the occurrence search web interface on the search API, so that might give you an idea of the performance to expect.

Anyway, just a few considerations to get a dialogue going. I’m sure others from the Secretariat, perhaps @mhoefft or @thomasstjerne, might be able to provide some advice on creating such a web app.

Best,
Daniel

Thank you for the response, @dnoesgaard.

> I might be missing some information but it seems a bit counterintuitive to download your “own” data from GBIF?

That’s a great question, and I also found this approach dubious when I first got started. However, now that I’ve been working with the Specify database for about six months, I think it’s wise. The problem is that the Specify schema is complex, constantly evolving, and minimally documented. I’m here on a grant that runs dry in October, so they need a solution that doesn’t require constant maintenance: I need a stable API. Specify 7 provides a new REST API, but as far as I can tell it’s mainly a one-to-one mapping of the schema, making it no more stable than the schema itself. And besides, being new, that API can’t yet be treated as stable. We need minimal client code with maximal stability. Hence GBIF.

Okay, I missed the bit about the download API requiring authentication. Can we create an account dedicated to our server, so the server isn’t logging in as one of us? And after account creation, can the entire process be automated, or is there an expectation of human intervention (e.g. weekly intervention for weekly downloads)?
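
To be concrete, here’s roughly what I’m imagining for the automated side, continuing from the request sketch above. The status values and the downloadLink field are my assumptions from skimming the docs, so take this as a hedged draft:

```typescript
// Hedged sketch: poll a queued download until the archive is ready,
// assuming GET /occurrence/download/{key} reports a status field and,
// on success, a downloadLink for the finished archive.
async function waitForDownload(downloadKey: string): Promise<string> {
  while (true) {
    const meta = await (
      await fetch(`https://api.gbif.org/v1/occurrence/download/${downloadKey}`)
    ).json();

    if (meta.status === "SUCCEEDED") return meta.downloadLink;
    if (meta.status === "FAILED" || meta.status === "KILLED") {
      throw new Error(`download ${downloadKey} ended with ${meta.status}`);
    }
    // Still preparing or running; check again in a minute.
    await new Promise((resolve) => setTimeout(resolve, 60_000));
  }
}
```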

That occurrence search web interface seems lickety-split! Thanks!

This is only a minor justification, but UT Austin also has dreams of eventually incorporating other people’s data into the website, and that would be done via GBIF. Starting with the GBIF API would reduce that eventual effort.

Oh, I see the collectionCode parameter under /occurrence/search, so unauthenticated requests are an option for us after all. It looks like I may have a solution!
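
For anyone who lands on this thread later, here’s the rough shape of the paged, unauthenticated search I’m now planning. Untested, and “TEX” is again a placeholder collection code:

```typescript
// Untested sketch: page through /occurrence/search for one collection.
// No authentication is needed for the search API. Note the search API
// caps paging at 100k records, which is fine for our ~25k.
async function fetchAllOccurrences(collectionCode: string) {
  const all: unknown[] = [];
  const limit = 300; // maximum page size for /occurrence/search
  let offset = 0;

  while (true) {
    const url =
      `https://api.gbif.org/v1/occurrence/search` +
      `?collectionCode=${encodeURIComponent(collectionCode)}` +
      `&limit=${limit}&offset=${offset}`;
    const page = await (await fetch(url)).json();
    all.push(...page.results);
    if (page.endOfRecords) break;
    offset += limit;
  }
  return all;
}

// e.g. const records = await fetchAllOccurrences("TEX");
```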

Thanks so much for your help!
