Feedback on manuscript on R/Python/Ruby client libraries for GBIF

We wrote a preprint a few years back https://peerj.com/preprints/3304/ on the R, Python, and Ruby client libraries for GBIF.

I’m preparing a manuscript to submit to Methods in Ecology and Evolution - draft at https://github.com/sckott/gbifms/blob/gh-pages/manuscript_mee.pdf

I’m curious to hear thoughts on sections that could be added - or ideas/etc that are important in the context of the paper topic, but have been left out.

I added a section, Citing GBIF Data, covering how to get citations for GBIF data. Before I submit the manuscript, I'm planning to add some citation methods to pygbif https://github.com/sckott/pygbif/issues/60 so that both the R and Python clients have citation helpers.
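As a rough sketch of what such a citation helper might produce (the function name, its signature, and the DOI below are illustrative, not the planned pygbif API), one could format the standard GBIF download citation from a download's DOI and access date:

```python
from datetime import date

def format_download_citation(doi: str, accessed: date) -> str:
    """Build a citation string in the style GBIF recommends for
    occurrence downloads. `doi` is the download DOI; the example DOI
    used below is made up for illustration."""
    return (
        f"GBIF.org ({accessed.isoformat()}) "
        f"GBIF Occurrence Download https://doi.org/{doi}"
    )

print(format_download_citation("10.15468/dl.example", date(2018, 1, 15)))
# → GBIF.org (2018-01-15) GBIF Occurrence Download https://doi.org/10.15468/dl.example
```

A real helper would pull the DOI from the download's metadata rather than take it as an argument, but the string-building step would look much like this.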


Hi Scott,

One thing that comes to mind with regard to citing data is the Data user agreement. I know that probably no one reads it or even knows of its existence, but when GBIF data publishers agree to share their data openly with GBIF, they do so knowing that users are required to properly acknowledge them when the data is used.

So while there are many arguments for citing data, an important one is simply: it’s the rules :wink:

Reading your paper makes me think more about the 200k cap on occurrences through the search API. I wonder what the consequences would be of lowering this significantly—say to 20k. On the one hand, a move like that wouldn't exactly make it easier to access data, but perhaps it would help nudge people towards using the download API instead?

I would argue that as many as 90% of papers that rely on the search API functions of rgbif don’t live up to the requirements of the data user agreement—that is, they don’t acknowledge data publishers at all :frowning:


Thanks @dnoesgaard - I will add discussion of the data user agreement, and I think the gbif_citation function prints that out; if it doesn't, I'll fix it. Hopefully this paper will push more papers to cite data and mind the data user agreement.

I think we did discuss this before: whether lowering the 200K max in the search API could push users towards the download API. Seems reasonable. I imagine the GBIF website search interface wouldn't want that limit, but there are easy ways to let GBIF itself skip the cap.
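A client-side version of that nudge could be as simple as checking the requested record count against the cap before paginating. Here is a minimal sketch (the helper names and the fake fetcher are illustrative, not part of rgbif or pygbif; only the 200,000 cap comes from the discussion above):

```python
SEARCH_API_CAP = 200_000  # current hard cap on the GBIF search API

def fetch_occurrences(fetch_page, wanted, page_size=300):
    """Paginate a search-style API up to `wanted` records.

    `fetch_page(offset, limit)` returns a list of records. Requests
    beyond the search API cap raise, pointing users to the download API.
    """
    if wanted > SEARCH_API_CAP:
        raise ValueError(
            f"{wanted} records exceeds the search API cap of "
            f"{SEARCH_API_CAP}; use the download API instead"
        )
    records, offset = [], 0
    while offset < wanted:
        page = fetch_page(offset, min(page_size, wanted - offset))
        if not page:
            break  # no more results available
        records.extend(page)
        offset += len(page)
    return records

# demo with a fake fetcher serving 1,000 numbered records
fake = lambda off, lim: list(range(off, min(off + lim, 1000)))
print(len(fetch_occurrences(fake, 500)))  # → 500
```

Lowering `SEARCH_API_CAP` is then a one-line change on the server or client side; the error message does the nudging.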

Very interesting initiative. I don't think I mentioned before that we are using the rgbif package, through the spocc R package, in our dedicated Galaxy tool https://ecology.usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/ecology/spocc_occ/spocc_occ/0.9.0 to import GBIF occurrence data. We worked on packaging spocc for conda https://anaconda.org/bioconda/r-spocc, which gives us better reproducibility of installation than "just" the R package. Since we can add BibTeX citations at the bottom of a Galaxy tool (so the user can extract the full citation list of a complete analytic workflow), it seems worthwhile to cite this manuscript in the tool once it is published. Don't hesitate to comment or ask for more information.
It appears to me that there are a couple of "issues" with using the spocc R package: 1) there is a limit on the number of occurrences we can download (10,000, if I'm not wrong), and 2) re-running the same tool with the same parameters gives different results over time (notably, columns are rearranged)—again, if I'm not wrong.

@ylebras Thanks for your comments!

Are you aware that the GBIF option in spocc uses the GBIF search API (i.e., rgbif::occ_data())? So the maximum number of occurrence results should be 200,000. I haven't integrated the GBIF download API (i.e., rgbif::occ_download()) because 1) all the other data sources return data immediately, and 2) it's not a trivial task to design an easy-to-use interface when you have to kick off a download, then wait some indeterminate amount of time, then download the data. If you want to work with the download API, using rgbif directly would be easier.
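To illustrate why that download workflow is awkward to wrap: the client starts a job, polls its status until it succeeds (or fails), and only then retrieves the file. A generic polling loop might look like the sketch below (the function names and arguments are hypothetical, not the actual rgbif/pygbif download interface; the status strings mirror those shown on GBIF download pages but are stubbed here):

```python
import time

def wait_for_download(get_status, key, poll_every=5.0, timeout=3600.0):
    """Poll `get_status(key)` until the job succeeds, fails, or times out."""
    waited = 0.0
    while waited < timeout:
        status = get_status(key)
        if status == "SUCCEEDED":
            return key  # the download file can now be fetched
        if status in ("FAILED", "KILLED", "CANCELLED"):
            raise RuntimeError(f"download {key} ended with status {status}")
        time.sleep(poll_every)
        waited += poll_every
    raise TimeoutError(f"download {key} not ready after {timeout}s")

# demo with a fake status function that succeeds on the third poll
states = iter(["PREPARING", "RUNNING", "SUCCEEDED"])
print(wait_for_download(lambda k: next(states), "0001-demo", poll_every=0.0))
# → 0001-demo
```

The hard interface questions—how long to poll, whether to block the session, how to surface partial failures—sit outside this loop, which is exactly why it doesn't fold neatly into spocc's "return data immediately" model.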

You should expect different results through time, since the data in GBIF's database changes as records are added and edited. If columns are rearranged, that could be because the order of the data returned from GBIF changed slightly (or for some other reason—I'm not sure right now), but if you open an issue at https://github.com/ropensci/spocc/issues/ I can help in more detail.
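One way to make comparisons across runs more robust is to normalise the result before diffing: project each record onto a fixed, ordered set of columns, so a server-side rearrangement no longer shows up as a difference. A small sketch (the field names are illustrative Darwin Core-style terms, not a fixed spocc schema):

```python
def normalize(records, columns):
    """Project each record (a dict) onto a fixed, ordered column list.

    Missing fields become None, so two result sets compare equal even
    if the server added, dropped, or reordered columns.
    """
    return [tuple(rec.get(col) for col in columns) for rec in records]

run1 = [{"species": "Puma concolor", "lat": 36.7, "lon": -119.4}]
run2 = [{"lon": -119.4, "species": "Puma concolor", "lat": 36.7}]  # reordered
cols = ["species", "lat", "lon"]
print(normalize(run1, cols) == normalize(run2, cols))  # → True
```

This only addresses column order, of course; genuinely new or edited records will (and should) still show up as differences.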

Great! Glad to hear it.