Search, download, analyze and cite (repeat if necessary) - GBIF Data Blog

There is a lot of GBIF-mediated data available: more than 1.3 billion occurrence records covering hundreds of thousands of species in all parts of the world. All free, open and available at the touch of a button. Users can download data through the GBIF.org portal, via the GBIF API, or with one of the third-party tools available for programmatic access, e.g. rgbif.


This is a companion discussion topic for the original entry at https://data-blog.gbif.org/post/search-download-analyze-cite/

rgbif has a function, gbif_citation(), that helps users get citations and works with the various data-fetching functions. For example:

library(rgbif)

# works also for occ_data()
x <- occ_search(taxonKey = 9206251, limit = 2)
gbif_citation(x)
#> [[1]]
#> <<rgbif citation>>
#>    Citation: naturgucker.de. naturgucker. Occurrence dataset
#>         https://doi.org/10.15468/uc1apo accessed via GBIF.org on 2019-10-04..
#>         Accessed from R via rgbif (https://github.com/ropensci/rgbif) on
#>         2019-10-04
#>    Rights:
#> 
#> [[2]]
#> <<rgbif citation>>
#>    Citation: Vanreusel W, Barendse R, Steeman R, Gielen K, Swinnen K, Desmet P,
#>         Herremans M (2019). Waarnemingen.be - Non-native plant occurrences in
#>         Flanders and the Brussels Capital Region, Belgium. Version 1.10.
#>         Natuurpunt. Occurrence dataset https://doi.org/10.15468/smdvdo accessed
#>         via GBIF.org on 2019-10-04.. Accessed from R via rgbif
#>         (https://github.com/ropensci/rgbif) on 2019-10-04
#>    Rights:

It also works for downloads:

d1 <- occ_download_get("0000122-171020152545675")
gbif_citation(d1)
#> $download
#> [1] "GBIF Occurrence Download https://doi.org/10.15468/dl.yghxj7 Accessed from R via rgbif (https://github.com/ropensci/rgbif) on 2017-10-20"
#> 
#> $datasets
#> $datasets[[1]]
#> <<rgbif citation>>
#>    Citation: Grant S, Jones J (2017). Field Museum of Natural History (Zoology)
#>         Invertebrate Collection. Version 18.6. Field Museum. Occurrence Dataset
#>         https://doi.org/10.15468/6q5vuc accessed via GBIF.org on 2017-10-20..
#>         Accessed from R via rgbif (https://github.com/ropensci/rgbif) on
#>         2019-10-04
#>    Rights: To the extent possible under law, the publisher has waived all
#>         rights to these data and has dedicated them to the Public Domain (CC0
#>         1.0). Users may copy, modify, distribute and use the work, including
#>         for commercial purposes, without restriction.
#> 
# .... cutoff

It also works for an occurrence key or a dataset key.


Thanks Scott—I should’ve mentioned that in the blog post. It is, however, mentioned on our rgbif page on gbif.org: https://www.gbif.org/tool/81747/rgbif

I wish that everyone would use rgbif. Whether they combine searches with downloads or just use downloads, either way is great, as long as they #CiteTheDOI :)

Slightly off-topic: do you have any idea how many hits the API gets from rgbif?

Thanks for raising this. What should I do if I make a change to the downloaded (and DOI’d) dataset, e.g. drop a few records due to data cleaning? Is the DOI still valid? Thanks

“Countless variations of a reference to GBIF—often the GBIF portal—with one single thing in common: Not one of them acknowledges the data publishers whose work their papers rely on.”

This makes good sense when you realise the data that users rely on is not, in fact, the data as provided by the data publishers. It’s instead the data as modified by GBIF in processing. The differences between provided data and processed data can be substantial, especially with regard to taxonomic names.

“It is a little surprising that GBIF asks users to cite their downloaded data as authored by the provider (See Methods, Data sources), and that ALA likewise asks (in each download’s citation.csv file) that data be cited as records from the provider. Clearly this is not the case for processed data. It would be more correct to say that aggregated data are made available as the combined work of provider and aggregator, and that the aggregator is solely responsible for any differences between original and processed data.”
(https://doi.org/10.3897/zookeys.751.24791; 2018 paper)

If you cite GBIF as the source of your data, you are only telling the truth. The alternative would be to say something like: “Occurrence data from GBIF, as modified from these original sources…”

I would say, yes, absolutely. Especially if you describe your exact data cleaning steps. If you end up removing a lot of records before your analysis for reasons other than data quality, you might want to consider re-downloading with refined filters.
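One way to make the cleaning steps explicit is to log how many records survive each filter, so the counts can be reported alongside the download DOI. The following is a minimal sketch in base R. It assumes a made-up data frame in place of a real download, and the apply_step() helper, the filters and the thresholds are hypothetical; they only illustrate the idea.

```r
# Hypothetical stand-in for a GBIF download. Column names follow Darwin Core
# terms, but the values are invented for illustration.
occ <- data.frame(
  gbifID = c(1, 2, 3, 4, 5),
  decimalLatitude = c(50.8, NA, 51.1, 0, 50.9),
  decimalLongitude = c(4.4, 4.5, NA, 0, 4.3),
  coordinateUncertaintyInMeters = c(30, 100, 50, 10, 20000)
)

# Hypothetical helper: apply one named filter and record how many rows remain.
apply_step <- function(data, log, label, keep) {
  data <- data[keep(data), , drop = FALSE]
  log[[label]] <- nrow(data)
  list(data = data, log = log)
}

state <- list(data = occ, log = list(downloaded = nrow(occ)))

# Step 1: drop records without coordinates.
state <- apply_step(state$data, state$log, "has coordinates",
  function(d) !is.na(d$decimalLatitude) & !is.na(d$decimalLongitude))

# Step 2: drop records placed exactly at 0, 0.
state <- apply_step(state$data, state$log, "not at 0,0",
  function(d) !(d$decimalLatitude == 0 & d$decimalLongitude == 0))

# Step 3: drop records with coordinate uncertainty above 10 km (arbitrary threshold).
state <- apply_step(state$data, state$log, "uncertainty <= 10 km",
  function(d) d$coordinateUncertaintyInMeters <= 10000)

# Summary table that could be reported next to the cited DOI.
cleaning_log <- data.frame(
  step = names(state$log),
  records_remaining = unname(unlist(state$log))
)
print(cleaning_log)
```

If the table shows that most records were removed for reasons that could have been expressed as download filters, that is a hint that a new, narrower download (with its own DOI) may describe the data better.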

Thanks @datafixer for your input.
Users can actually download original, uninterpreted data on GBIF. See the three download formats currently available:


When a user cites a DOI, anyone can access the download page, where the file format is specified. In other words, one can tell whether a study is based on raw or interpreted data.
See the example below:

Glad to see the citation function is mentioned on the page.

Only GBIF has access to the logs. Maybe @MattBlissett would know. We do include a user agent string for rgbif in each request, and similarly for pygbif, so it can be determined.
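As a rough illustration of how such a count could be derived from user agent strings, here is a small base R sketch. The log layout and the user agent values are assumptions made up for this example; the real GBIF log format is not described in this thread.

```r
# Hypothetical log extract: one row per API request. The user agent strings
# are illustrative only and may not match what the clients actually send.
requests <- data.frame(
  date = as.Date(c("2019-10-01", "2019-10-01", "2019-10-01",
                   "2019-10-02", "2019-10-02")),
  user_agent = c(
    "r-curl/4.2 crul/0.8.4 rOpenSci(rgbif/1.3.0)",
    "python-requests/2.22.0,pygbif/0.3.0",
    "Mozilla/5.0",
    "r-curl/4.2 crul/0.8.4 rOpenSci(rgbif/1.3.0)",
    "r-curl/4.2 crul/0.8.4 rOpenSci(rgbif/1.3.0)"
  ),
  stringsAsFactors = FALSE
)

# Flag requests whose user agent mentions rgbif, then count them per day.
is_rgbif <- grepl("rgbif", requests$user_agent, fixed = TRUE)
daily <- table(requests$date[is_rgbif])

# Average number of rgbif requests per day in this toy extract.
mean_per_day <- mean(as.vector(daily))
print(daily)
print(mean_per_day)
```

The same pattern would work for pygbif by matching on a different substring.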

@mgrosjean Citing the DOI is a good idea. But citing a DOI for a dataset does not clarify whether the data items used in a study are the original data items or the GBIF-interpreted items.

While obviously it’s the responsibility of the data users to explain how the data were managed in their study, the GBIF-recommended citation form (see examples in the above blog post) attributes the data to the original data providers, which is incorrect because of GBIF interpretation, and not to the data publisher, which in all cases is GBIF.

GBIF has not yet come up with a way for users to make this distinction, and continues to confuse the two, as @dnoesgaard does when writing “acknowledges the data publishers whose work their papers rely on”.

It might be helpful to clarify the “terms and conditions” of providing data to GBIF, expanding on what is already stated in the Data Publisher Agreement (https://www.gbif.org/terms/data-publisher). Here is a possible set of points to include:

(1) We reserve the right to modify or delete any of the data items you provide.

(2) In the case of taxonomic names, we reserve the right to modify both the name you provided and its classification.

(3) We will not advise you of modifications or deletions we make in your dataset. It is your responsibility to compare the data items you provide with their interpreted versions.

(4) If you object to modifications or deletions we make, we may consider changing our interpretation, but we will not guarantee that this will be done in a timely manner.

(5) We will also not advise data users of modifications or deletions we make in your dataset. It is their responsibility to compare the data items you provide with their interpreted versions.

(6) Whether or not we modify or delete any of the data items you provide, we will recommend that data users attribute the data to you as provider. You are therefore responsible for both the original and the interpreted data items in the dataset.

(7) In our Data Publisher and Data User terms and conditions, we will continue to use the term “Data Publisher” to refer to you as data provider, although we are, in fact and in law, the publisher of the data.

(8) In our Data Publisher and Data User terms and conditions, we will continue to say

“GBIF disclaims responsibility for the accuracy and reliability of the data as well as for the suitability of its application for any particular purpose”

without dividing the responsibility for accuracy and reliability between you as provider and GBIF as interpreter of data items.

I had a quick peek at the logs and was honestly surprised at the volume of entries with the rOpenSci user agent (average > 70,000/day). I hope to be able to analyze and provide stats more consistently soon…

Nice to hear so many requests come from rgbif!