A user recently opened an issue in the rgbif repo about the occ_download_dataset_activity function (https://github.com/ropensci/rgbif/issues/382). The details are not important here, but working on the fix made me think that mentioning the function here could be useful for data providers who want to access activity on their datasets programmatically.
The rgbif::occ_download_dataset_activity() function uses the GBIF API route /occurrence/download/dataset/{dataset-ID}, and gives you the metadata for downloads of the dataset.
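To make the route concrete, here is a minimal sketch of the URL behind such a request. The helper name `gbif_activity_url` is hypothetical (it is not part of rgbif), and the sketch assumes the public GBIF API base URL https://api.gbif.org/v1 with the usual `limit` and `offset` paging parameters. The dataset key is the one used as the example in this post.

```r
# Hypothetical helper (not part of rgbif): build the URL for the
# dataset activity route, assuming the public GBIF API base URL.
gbif_activity_url <- function(dataset_id, limit = 20, offset = 0,
                              base = "https://api.gbif.org/v1") {
  sprintf("%s/occurrence/download/dataset/%s?limit=%d&offset=%d",
          base, dataset_id, limit, offset)
}

url <- gbif_activity_url("7f2edc10-f762-11e1-a439-00145eb45e9a")
print(url)
```

In practice rgbif builds and sends the request for you, so this is only meant to show what is being queried.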
As an example, take the dataset _Fundación Carl Faust: Herbario del Jardí Botànic Marimurtra: HMIM_, whose activity page is at https://www.gbif.org/dataset/7f2edc10-f762-11e1-a439-00145eb45e9a/activity:
# install the GitHub version for the fixes to the function
remotes::install_github("ropensci/rgbif")
library(rgbif)
dataset_id <- "7f2edc10-f762-11e1-a439-00145eb45e9a"
res <- occ_download_dataset_activity(dataset_id)
The meta slot is a data.frame with the pagination settings (by default the function returns 20 results with an offset of 0), whether this is the last page of records, and how many records were found for the dataset:
res$meta
#> offset limit endofrecords count
#> 0 20 FALSE 20605
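With 20605 records and a page size of 20, a single call only covers a small part of the activity. The sketch below computes the offsets needed to page through everything. `page_offsets` is a hypothetical helper, and the commented-out call assumes that occ_download_dataset_activity() accepts `limit` and `start` arguments, so check the function's help page before relying on it.

```r
# Hypothetical helper: offsets needed to page through `count` records,
# `limit` records at a time, starting from offset 0.
page_offsets <- function(count, limit = 20) {
  if (count <= 0) return(numeric(0))
  seq(0, count - 1, by = limit)
}

offsets <- page_offsets(count = 20605, limit = 20)
length(offsets)  # number of requests needed
range(offsets)   # first and last offset

# Assumption: `limit` and `start` are the paging arguments of the function.
# pages <- lapply(offsets, function(o) {
#   occ_download_dataset_activity(dataset_id, limit = 20, start = o)$results
# })
```

Fetching every page means over a thousand requests for this dataset, so a larger page size (if the API allows it) or a pause between calls may be worth considering.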
The results slot has a data.frame with the metadata for each download that included the dataset (here looking at the first 6 rows):
head(res$results)
#> # A tibble: 6 x 23
#> downloadKey datasetKey datasetTitle datasetDOI datasetCitation numberRecords download.key download.doi
#> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 0036796-19… 7f2edc10-… Fundación C… 10.15468/… Fundación Carl… 2 0036796-190… 10.15468/dl…
#> 2 0036645-19… 7f2edc10-… Fundación C… 10.15468/… Fundación Carl… 3 0036645-190… 10.15468/dl…
#> 3 0036604-19… 7f2edc10-… Fundación C… 10.15468/… Fundación Carl… 2 0036604-190… 10.15468/dl…
#> 4 0036549-19… 7f2edc10-… Fundación C… 10.15468/… Fundación Carl… 1 0036549-190… 10.15468/dl…
#> 5 0036280-19… 7f2edc10-… Fundación C… 10.15468/… Fundación Carl… 4277 0036280-190… 10.15468/dl…
#> 6 0036095-19… 7f2edc10-… Fundación C… 10.15468/… Fundación Carl… 4 0036095-190… 10.15468/dl…
#> # … with 15 more variables: download.license <chr>, download.created <chr>, download.modified <chr>,
#> # download.eraseAfter <chr>, download.status <chr>, download.downloadLink <chr>, download.size <int>,
#> # download.totalRecords <int>, download.numberDatasets <int>, download.request.sendNotification <lgl>,
#> # download.request.format <chr>, download.request.predicate.type <chr>, download.request.predicate.key <chr>,
#> # download.request.predicate.value <chr>, download.request.predicate.predicates <list>
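Some interesting columns in the data.frame are numberRecords (the number of records from this dataset in the download), download.totalRecords (the total number of records in the download), and download.numberDatasets (the number of datasets that contributed to the download):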
library(dplyr)
head(select(res$results, numberRecords, download.totalRecords, download.numberDatasets))
#> # A tibble: 6 x 3
#> numberRecords download.totalRecords download.numberDatasets
#> <int> <int> <int>
#> 1 2 9317 155
#> 2 3 28285 400
#> 3 2 24075 409
#> 4 1 8174 182
#> 5 4277 3264922 26207
#> 6 4 701407 1258
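One thing these three columns allow is estimating how much of each download came from your dataset. The sketch below rebuilds the six rows shown above as a plain data.frame, so it runs without a network call, and adds a percentage column. The column name `share_pct` is introduced here for illustration only.

```r
# The six rows shown above, rebuilt by hand so the example is self-contained.
usage <- data.frame(
  numberRecords           = c(2L, 3L, 2L, 1L, 4277L, 4L),
  download.totalRecords   = c(9317L, 28285L, 24075L, 8174L, 3264922L, 701407L),
  download.numberDatasets = c(155L, 400L, 409L, 182L, 26207L, 1258L)
)

# Percentage of each download's records that came from this dataset.
usage$share_pct <- 100 * usage$numberRecords / usage$download.totalRecords

# Sort so the downloads using the most records from the dataset come first.
usage_sorted <- usage[order(-usage$numberRecords), ]
print(usage_sorted)
```

The same two lines should work on res$results directly, since the column names are the same.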
You can also get the actual search requests, called predicates. For example, looking at the 2nd request, we can see that the person searched for records with a basis of record of PRESERVED_SPECIMEN and a taxon key of 5375388 (Capsella bursa-pastoris):
res$results$download.request.predicate.predicates[[2]]
#> type key value
#> 1 equals BASIS_OF_RECORD PRESERVED_SPECIMEN
#> 2 equals TAXON_KEY 5375388
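To scan many requests at once, it can help to collapse each predicate table into a single line of text. This is a minimal sketch under the assumption that each element of the predicates column is either NULL or a data.frame with `type`, `key`, and `value` columns, as in the output above. More complex or nested predicates may look different. `describe_predicates` is a hypothetical helper, not an rgbif function.

```r
# Hypothetical helper: turn one predicate table into a readable string.
# Returns NA when there is no usable predicate table.
describe_predicates <- function(p) {
  if (is.null(p) || !is.data.frame(p) || nrow(p) == 0) return(NA_character_)
  paste(paste(p$key, p$type, p$value), collapse = "; ")
}

# The predicate table shown above, rebuilt by hand for a self-contained example.
example <- data.frame(
  type  = c("equals", "equals"),
  key   = c("BASIS_OF_RECORD", "TAXON_KEY"),
  value = c("PRESERVED_SPECIMEN", "5375388"),
  stringsAsFactors = FALSE
)

described <- describe_predicates(example)
print(described)

# Applied to the real results (requires the `res` object from earlier):
# vapply(res$results$download.request.predicate.predicates,
#        describe_predicates, character(1))
```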
Hopefully this will make it easier for GBIF data providers to track their dataset usage.