Using rgbif to explore dataset usage

A user recently opened an issue in the rgbif repo about the occ_download_dataset_activity function (https://github.com/ropensci/rgbif/issues/382). The details are not important here, but working on the fix made me think the function is worth mentioning, since it could be useful to data providers who want programmatic access to the activity on their datasets.

rgbif::occ_download_dataset_activity() uses the GBIF API route /occurrence/download/dataset/{dataset-ID} and returns the metadata for downloads that include the dataset.

As an example, take the dataset _Fundación Carl Faust: Herbario del Jardí Botànic Marimurtra: HMIM_, whose activity page is at https://www.gbif.org/dataset/7f2edc10-f762-11e1-a439-00145eb45e9a/activity

# install the GitHub version to get the fixes to the function
remotes::install_github("ropensci/rgbif")
library(rgbif)
dataset_id <- "7f2edc10-f762-11e1-a439-00145eb45e9a"
res <- occ_download_dataset_activity(dataset_id)
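For reference, the route mentioned above can also be written out as a plain URL. This is only a sketch: the base URL is the public GBIF API (v1), and I am assuming that `limit` and `offset` are the query parameters behind the function's pagination settings. The variable names are my own.

```r
# Base URL of the public GBIF API (v1).
gbif_base <- "https://api.gbif.org/v1"
dataset_id <- "7f2edc10-f762-11e1-a439-00145eb45e9a"

# Assumption: `limit` and `offset` are the paging query parameters.
activity_url <- sprintf(
  "%s/occurrence/download/dataset/%s?limit=%d&offset=%d",
  gbif_base, dataset_id, 20L, 0L
)
activity_url
```

Pasting that URL into a browser should show the same information as JSON, which can help when checking what the function returns.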

The meta slot is a data.frame with the pagination settings (by default the function returns 20 results, starting at offset 0), whether this is the last page of records, and how many records were found for the dataset.

res$meta
#>   offset limit endofrecords count
#>        0    20        FALSE 20605
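With 20605 records and 20 per page, getting everything means paging. Here is a minimal sketch of working out the offsets from meta; `page_starts` is a hypothetical helper, not part of rgbif. The commented-out loop assumes the function accepts `limit` and `start` arguments, so check ?occ_download_dataset_activity before relying on it.

```r
# Hypothetical helper: offsets needed to walk every page, given `meta`.
page_starts <- function(meta) {
  if (meta$count == 0) return(numeric(0))
  seq(0, meta$count - 1, by = meta$limit)
}

# meta values copied from the output above
meta <- data.frame(offset = 0, limit = 20, endofrecords = FALSE, count = 20605)
starts <- page_starts(meta)
length(starts)
#> [1] 1031

# Not run: this is one API request per page, so 1031 requests here.
# Assumes `limit` and `start` are valid arguments of the function.
# pages <- lapply(starts, function(s) {
#   occ_download_dataset_activity(dataset_id, limit = meta$limit, start = s)$results
# })
```

Raising the limit, if the API allows it, would cut the number of requests.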

The results slot is a data.frame with the metadata for each download request that included the dataset (the first 6 rows are shown here):

head(res$results)
#> # A tibble: 6 x 23
#>   downloadKey datasetKey datasetTitle datasetDOI datasetCitation numberRecords download.key download.doi
#>   <chr>       <chr>      <chr>        <chr>      <chr>                   <int> <chr>        <chr>
#> 1 0036796-19… 7f2edc10-… Fundación C… 10.15468/… Fundación Carl…             2 0036796-190… 10.15468/dl…
#> 2 0036645-19… 7f2edc10-… Fundación C… 10.15468/… Fundación Carl…             3 0036645-190… 10.15468/dl…
#> 3 0036604-19… 7f2edc10-… Fundación C… 10.15468/… Fundación Carl…             2 0036604-190… 10.15468/dl…
#> 4 0036549-19… 7f2edc10-… Fundación C… 10.15468/… Fundación Carl…             1 0036549-190… 10.15468/dl…
#> 5 0036280-19… 7f2edc10-… Fundación C… 10.15468/… Fundación Carl…          4277 0036280-190… 10.15468/dl…
#> 6 0036095-19… 7f2edc10-… Fundación C… 10.15468/… Fundación Carl…             4 0036095-190… 10.15468/dl…
#> # … with 15 more variables: download.license <chr>, download.created <chr>, download.modified <chr>,
#> #   download.eraseAfter <chr>, download.status <chr>, download.downloadLink <chr>, download.size <int>,
#> #   download.totalRecords <int>, download.numberDatasets <int>, download.request.sendNotification <lgl>,
#> #   download.request.format <chr>, download.request.predicate.type <chr>, download.request.predicate.key <chr>,
#> #   download.request.predicate.value <chr>, download.request.predicate.predicates <list>

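Some interesting columns in the data.frame are numberRecords (records from this dataset in the download), download.totalRecords (all records in the download) and download.numberDatasets (how many datasets the download drew from):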

library(dplyr)
head(select(res$results, numberRecords, download.totalRecords, download.numberDatasets))
#> # A tibble: 6 x 3
#>   numberRecords download.totalRecords download.numberDatasets
#>           <int>                 <int>                   <int>
#> 1             2                  9317                     155
#> 2             3                 28285                     400
#> 3             2                 24075                     409
#> 4             1                  8174                     182
#> 5          4277               3264922                   26207
#> 6             4                701407                    1258
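One thing these columns let you work out is how much of each download came from your dataset. A small sketch in base R, using the six rows printed above retyped by hand so it runs without calling the API; the `pct_of_download` column name is my own.

```r
# The six rows shown above, retyped so the example is self-contained.
usage <- data.frame(
  numberRecords           = c(2L, 3L, 2L, 1L, 4277L, 4L),
  download.totalRecords   = c(9317L, 28285L, 24075L, 8174L, 3264922L, 701407L),
  download.numberDatasets = c(155L, 400L, 409L, 182L, 26207L, 1258L)
)

# Percentage of each download that came from this dataset.
usage$pct_of_download <- round(
  100 * usage$numberRecords / usage$download.totalRecords, 3
)

# Largest contributions first.
usage[order(-usage$numberRecords), ]
```

On real data the same two lines should work on res$results directly, since the column names are the same.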

You can also get the actual search requests, called predicates. For example, the 2nd request shows that the person searched for records with a basis of record of PRESERVED_SPECIMEN and a taxon key of 5375388 (Capsella bursa-pastoris):

res$results$download.request.predicate.predicates[[2]]
#>     type             key              value
#> 1 equals BASIS_OF_RECORD PRESERVED_SPECIMEN
#> 2 equals       TAXON_KEY            5375388
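To scan many requests at once, it can help to flatten each predicate table into one line of text. A rough sketch, with the table above retyped by hand: `describe_predicates` is a hypothetical helper, and joining with "AND" assumes the parent predicate type (the download.request.predicate.type column) is "and". It only handles a flat table like this one, not deeper nesting.

```r
# The predicate table shown above, retyped so the example is self-contained.
pred <- data.frame(
  type  = c("equals", "equals"),
  key   = c("BASIS_OF_RECORD", "TAXON_KEY"),
  value = c("PRESERVED_SPECIMEN", "5375388"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse a flat predicate table into one string.
# `joiner` should reflect the parent predicate type; "AND" is an assumption.
describe_predicates <- function(p, joiner = "AND") {
  if (is.null(p) || NROW(p) == 0) return(NA_character_)
  paste(paste(p$key, p$type, p$value), collapse = paste0(" ", joiner, " "))
}

describe_predicates(pred)
#> [1] "BASIS_OF_RECORD equals PRESERVED_SPECIMEN AND TAXON_KEY equals 5375388"
```

Something like vapply(res$results$download.request.predicate.predicates, describe_predicates, character(1)) should then give one string per request, with NA where a request has no nested predicates.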

Hopefully this will make it easier for GBIF data providers to track their dataset usage.
