Dataset keys not documented?

In response to a previous question, I learned that the canonical iNaturalist dataset should be referenced by the datasetKey “50c9509d-22c7-4a22-a47d-8c48425ef4a7”.

I think this corresponds to the iNaturalist Research-grade Observations dataset.

But the datasetKey isn’t actually included on the page the DOI for this dataset points to. In the process of copying the URL into this question, I can see now that the key is present in the URL itself! But that’s not the most intuitive ‘documentation’.

I wonder if it would be helpful to include the datasetKey in the “Data Description” or “GBIF Registration” sections of the dataset description page?

I know how to find this stuff now, but it took me an hour of poking around before I finally succeeded.

Hi Tyler,

I think I was the one who answered your previous question about selecting iNaturalist records in GBIF. It is correct that the iNaturalist Research-grade Observations dataset is identified in the GBIF registry by the key “50c9509d-22c7-4a22-a47d-8c48425ef4a7”. However, this is not the identifier people should use when referencing the dataset.

As with all GBIF-mediated datasets, we assign persistent DOIs that should be used when referencing a dataset. In the case of iNaturalist, that would be https://doi.org/10.15468/ab3s5x.

I don’t think most web users need to worry about datasetKeys, but perhaps you have a specific use-case that I’m overlooking?

/Daniel


Thanks Daniel,

Yes, I’m thinking specifically of people accessing records through the API. I’m writing a tutorial, and in the process I’ve found myself advising my students:

Which leaves us with the unwieldy datasetKey == "50c9509d-22c7-4a22-a47d-8c48425ef4a7" as the most reliable way to get the
official iNaturalist dataset.

I’ve also included an explanation of why datasetName doesn’t work. It would be nice to be able to direct them to a canonical source for these keys, but the available methods are all very circuitous.

If the aim is to obtain records from a given dataset, I would suggest the following steps as a general procedure:

"key": "50c9509d-22c7-4a22-a47d-8c48425ef4a7"

There are of course easier ways to do this in R or Python by leveraging existing libraries such as rgbif and pygbif.
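For readers not using those libraries, the same general procedure can be sketched directly against the public GBIF REST API (`api.gbif.org/v1`). This is a minimal Python sketch, not an official recipe; the title string is the registry title quoted earlier in the thread.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://api.gbif.org/v1"

def find_dataset_key(title, query):
    """Step 1: search the dataset registry, then match on the exact title."""
    url = f"{API}/dataset/search?{urlencode({'q': query})}"
    with urlopen(url) as resp:
        results = json.load(resp)["results"]
    for ds in results:
        if ds["title"] == title:
            return ds["key"]
    return None

def occurrence_search_url(dataset_key, limit=5):
    """Step 2: use that key as the datasetKey filter in an occurrence search."""
    return f"{API}/occurrence/search?{urlencode({'datasetKey': dataset_key, 'limit': limit})}"

# Network call left commented out:
# key = find_dataset_key("iNaturalist Research-grade Observations", "iNaturalist")
# print(occurrence_search_url(key))
```

rgbif’s `dataset_search()` wraps this same registry endpoint, so either route should surface the same key.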

@dnoesgaard, just out of curiosity, why did GBIF choose version 4 UUIDs as identifiers for dataset and other keys?


Ok. That’s more or less what I had done. My first draft of the tutorial is here:

https://plantarum.ca/2024/04/04/record-cleaning/

The section on the iNaturalist record set is here:

https://plantarum.ca/2024/04/04/record-cleaning/#inaturalist-and-human-observations

I’d be happy for any feedback!


@plantarum note that you can do a lot more filtering at the query stage. It helps to have as much of the filtering as possible done before the download. My colleague John wrote a very good blog post explaining what you can filter and how to post-process data downloaded from GBIF: Common things to look out for when post-processing GBIF downloads - GBIF Data Blog
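To illustrate the point, here is a hedged Python sketch that pushes a few common filters into the occurrence search URL itself. The parameter names (hasCoordinate, hasGeospatialIssue, country, year) are standard GBIF occurrence search parameters, but this particular combination is just an example.

```python
from urllib.parse import urlencode

def filtered_search_url(dataset_key, **filters):
    """Build an occurrence search URL with filters applied server-side,
    so less post-processing is needed after the download."""
    params = {"datasetKey": dataset_key, **filters}
    return "https://api.gbif.org/v1/occurrence/search?" + urlencode(params)

url = filtered_search_url(
    "50c9509d-22c7-4a22-a47d-8c48425ef4a7",  # iNaturalist Research-grade Observations
    hasCoordinate="true",        # keep only georeferenced records
    hasGeospatialIssue="false",  # drop records with flagged coordinate problems
    country="CA",                # ISO 3166 country code
    year="2000,2024",            # an inclusive year range
)
```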


Small note: you probably don’t want to encourage students to put their passwords in scripts in plain text; rather, use something like askpass or, even better, keyring
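askpass and keyring are R packages; the same idea in a Python-flavoured sketch is to read the credential from the environment or an interactive prompt rather than from the script itself (`GBIF_PWD` follows the environment-variable convention rgbif documents for downloads):

```python
import os
import getpass

def gbif_password():
    """Fetch the GBIF password without hard-coding it in the script:
    prefer an environment variable set outside version control,
    and fall back to an interactive prompt."""
    pwd = os.environ.get("GBIF_PWD")
    if pwd is None:
        pwd = getpass.getpass("GBIF password: ")
    return pwd
```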

I also feel it would be valuable to add a note that many species distribution modelling algorithms do in fact use absence data as an input.

Good point!

True. But the more I look at ABSENT records in GBIF, the more I think they aren’t appropriate for use in broad scale distribution modeling - at least not without careful inspection of each dataset.

As an example, the dataset I’m working on now includes a number of records from the Mohonk Forest Health Monitoring Data. For the invasive grass Microstegium vimineum there are 11 presences and 28 absences, all from within a 16 km stretch, and with presence and absence records within 100 m of each other.

Using one of those ABSENCES as evidence that climatic conditions in that 1 km^2 climate grid were unsuitable for M. vimineum would be an error. Of course, if we’re working at a local scale and have habitat data or very fine scale environmental variables, the PRESENCE/ABSENCE records would be invaluable, but that’s not what I do.
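For anyone who does want presence-only data, one option is to exclude absences at the query stage rather than after download; the occurrence search API accepts an occurrenceStatus filter. A minimal sketch:

```python
from urllib.parse import urlencode

def presence_only_url(dataset_key):
    """Occurrence search restricted to PRESENT records, so ABSENT
    rows never enter the download at all."""
    params = {"datasetKey": dataset_key, "occurrenceStatus": "PRESENT"}
    return "https://api.gbif.org/v1/occurrence/search?" + urlencode(params)
```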

In any case, I’ll try to clarify the text.

Thanks for your interest!

Just a heads up regarding this and the new version of the rgbif package (see changelog).

I used to get the datasetKey for iNaturalist using this code:

library(rgbif)
library(tidyverse)

### GBIF key for iNaturalist's dataset
iNat_KEY <- datasets(data='all', query='iNaturalist')$data %>% 
  filter(title == 'iNaturalist Research-grade Observations') %>% 
  pull(key)

# character(0)

Now the function datasets() is deprecated, and in fact, if I run that code, I can no longer find the iNat dataset listed in my query results :confused:

The equivalent is now to do this:

library(rgbif)
library(tidyverse)

### GBIF key for iNaturalist's dataset
iNat_KEY <- dataset_search(query='iNaturalist')$data %>% 
    filter(datasetTitle == 'iNaturalist Research-grade Observations') %>% 
    pull(datasetKey)

# "50c9509d-22c7-4a22-a47d-8c48425ef4a7"

Best,
Flo

@flograttarola you shouldn’t need any code at all to get the iNaturalist datasetKey, since that information is VERY unlikely to change. I would simply hard code

"50c9509d-22c7-4a22-a47d-8c48425ef4a7" # iNat

into whatever you are running.

Thanks, but I generally use more than just the iNat dataset, which is why I don’t hard-code it. This way I also don’t have to remember all the different, unintuitive UUIDs.
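A middle ground between hard-coding one key and re-querying the registry on every run is a small hand-maintained lookup; a sketch (only the iNaturalist key below is real, the commented entry is a placeholder):

```python
# Hand-maintained lookup of the GBIF datasetKeys a project actually uses,
# so scripts stay readable without memorizing UUIDs.
DATASET_KEYS = {
    "inat": "50c9509d-22c7-4a22-a47d-8c48425ef4a7",  # iNaturalist Research-grade Observations
    # "other": "...",  # add further datasets here as needed
}

def dataset_key(name):
    return DATASET_KEYS[name]
```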

Best,
Flo