Annual growth per dataset

For the annual reporting of the Dutch node - NLBIF - I need summary stats on how many records were added for each dataset from the Netherlands in 2021. Is there an easy API call to derive that information for e.g. dataset_key=15f819bd-6612-4447-854b-14d12ee1022d?

Hi Niels.

I actually don’t think we log changes in the number of records per dataset, at least not in a systematic way that can be easily retrieved. However, each dataset has an ingestion log covering every processing step, and the record count is logged there as well.

You can see the ingestion history here (example): GBIF Registry

The API call for a dataset ingestion history would be:

https://api.gbif.org/v1/ingestion/history/15f819bd-6612-4447-854b-14d12ee1022d?limit=<number of events>

In the history, for each ingestion event, look in pipelineExecutions.steps for numberRecords, e.g.

```json
"pipelineExecutions": [
  {
    "key": 1812560,
    "stepsToRun": [
      "VERBATIM_TO_INTERPRETED"
    ],
    "rerunReason": "IUCN_RELEASE",
    "created": "2022-01-18T14:10:26.038015",
    "createdBy": "nvolik",
    "steps": [
      {
        "key": 4603740,
        "type": "VERBATIM_TO_INTERPRETED",
        "runner": "DISTRIBUTED",
        "started": "2022-01-18T19:13:42.498",
        "finished": "2022-01-18T19:37:09.822",
        "state": "COMPLETED",
        "message": "{\"datasetUuid\":\"15f819bd-6612-4447-854b-14d12ee1022d\",\"attempt\":224,\"interpretTypes\":[\"TAXONOMY\",\"BASIC\"],\"pipelineSteps\":[\"HDFS_VIEW\",\"INTERPRETED_TO_INDEX\",\"VERBATIM_TO_INTERPRETED\"],\"runner\":\"DISTRIBUTED\",\"endpointType\":\"DWC_ARCHIVE\",\"extraPath\":null,\"validationResult\":{\"tripletValid\":false,\"occurrenceIdValid\":true,\"useExtendedRecordId\":null,\"numberOfRecords\":2000000},\"resetPrefix\":\"202201181405\",\"executionId\":1812560,\"routingKey\":\"occurrence.pipelines.verbatim.finished.distributed\"}",
        "numberRecords": 4972211,
        ...
```

One would then find two ingestion events roughly one year apart and simply compare their numberRecords values.

Will require a bit of scripting, but I suppose it could be done :slight_smile:
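A minimal sketch of that scripting in Python (rather than the R mentioned later in the thread), assuming the field names shown in the example response above; the synthetic `executions` list stands in for the API response, and the commented-out fetch shows how you might retrieve the real history:

```python
from datetime import datetime

def number_records_near(executions, target_date):
    """Return numberRecords from the ingestion execution whose 'created'
    timestamp is closest to target_date. Field names follow the example
    response above; steps lacking numberRecords are skipped."""
    best, best_gap = None, None
    for execution in executions:
        created = datetime.fromisoformat(execution["created"])
        gap = abs((created - target_date).total_seconds())
        for step in execution.get("steps", []):
            if "numberRecords" in step and (best_gap is None or gap < best_gap):
                best, best_gap = step["numberRecords"], gap
    return best

# In practice, fetch the history first (requires the requests package), e.g.:
# resp = requests.get("https://api.gbif.org/v1/ingestion/history/"
#                     "15f819bd-6612-4447-854b-14d12ee1022d?limit=100")
# executions = [e for r in resp.json()["results"]
#               for e in r.get("pipelineExecutions", [])]

# Synthetic stand-in data for illustration:
executions = [
    {"created": "2021-01-15T10:00:00", "steps": [{"numberRecords": 4000000}]},
    {"created": "2022-01-18T14:10:26", "steps": [{"numberRecords": 4972211}]},
]

start = number_records_near(executions, datetime(2021, 1, 1))
end = number_records_near(executions, datetime(2022, 1, 1))
print(end - start)  # records added during 2021 for this dataset
```

The exact shape of the paged history response is an assumption here; the core idea is just to pick the execution nearest each year boundary and subtract.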


Hi Daniel, thanks! That works. I’ll post some lines of R-code later. Niels


Great, do share, other people might find it useful too!

@nl-bif, this might not be useful for you this year, but we started making some numbers for datasets available over time (including the occurrence counts). However, it only starts in July 2021: Index of /registry/dataset

Something that could be useful now would be to use the metadata associated with the whole GBIF downloads each month.
For example, here is a download generated on the 1st of January 2021 with no filter: https://doi.org/10.15468/dl.djx2hq. You can access the list of datasets and the number of records associated with each of them via the API (either https://api.gbif.org/v1/occurrence/download/0147082-200613084148143/datasets, or https://api.gbif.org/v1/occurrence/download/0147082-200613084148143/datasets/export if you want to get a file out of it). You can compare it with this download generated on the 1st of January 2022: https://doi.org/10.15468/dl.c2ycac (https://api.gbif.org/v1/occurrence/download/0088430-210914110416597/datasets/export). You then just have to extract the datasets you need to compare from each file.


Thanks, @mgrosjean, this approach seems more sound than what I had suggested!

Interesting solution! One question, where can I find the whole GBIF downloads for each month?

I don’t know if we have a place where all those are listed. @MattBlissett might know.

In the meantime, here is a list I made (we have only been making this type of monthly download since 2018):
monthly_whole_GBIF_downloads_key_doi_date.csv (3.4 KB)

Edit: if you attempt to download all the occurrences on GBIF without any filter from the UI, you should be redirected to the download of the current month.
