Dataset keys not documented?

In response to a previous question, I learned that the canonical iNaturalist dataset should be referenced by the datasetKey “50c9509d-22c7-4a22-a47d-8c48425ef4a7”.

I think this corresponds to the iNaturalist Research-grade Observations dataset.

But the datasetKey isn’t actually included on the page the DOI for this dataset points to. In the process of copying the URL into this question, I can see now that the key is present in the URL itself! But that’s not the most intuitive ‘documentation’.

I wonder if it would be helpful to include the datasetKey in the “Data Description” or “GBIF Registration” sections of the dataset description page?

I know how to find this stuff now, but it took me an hour of poking around before I finally succeeded.

Hi Tyler,

I think I was the one who answered your previous question about selecting iNaturalist records in GBIF. It is correct that the iNaturalist Research-grade Observations dataset is identified in the GBIF registry by the key “50c9509d-22c7-4a22-a47d-8c48425ef4a7”. However, this is not the identifier people should use when referencing the dataset.

As with all GBIF-mediated datasets, we assign persistent DOIs that should be used when referencing a dataset. In the case of iNaturalist, that would be https://doi.org/10.15468/ab3s5x.

I don’t think most web users need to worry about datasetKeys, but perhaps you have a specific use-case that I’m overlooking?

/Daniel


Thanks Daniel,

Yes, I’m thinking specifically of people accessing records through the API. I’m writing a tutorial, and in the process I’ve found myself advising my students:

Which leaves us with the unwieldy datasetKey == "50c9509d-22c7-4a22-a47d-8c48425ef4a7" as the most reliable way to get the
official iNaturalist dataset.

I’ve also included an explanation of why datasetName doesn’t work. It would be nice to be able to direct them to a canonical source for these keys, but the available methods are all very circuitous.

If the aim is to obtain records from a given dataset, I would suggest the following steps as a general procedure:

"key": "50c9509d-22c7-4a22-a47d-8c48425ef4a7"

There are of course easier ways to do this in R or Python by leveraging existing libraries such as rgbif and pygbif.
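For readers not using those libraries, the same general procedure can be sketched directly against the public GBIF REST API (`api.gbif.org/v1`). This is a minimal Python sketch, not an official recipe; the title string is the registry title quoted earlier in the thread.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://api.gbif.org/v1"

def find_dataset_key(title, query):
    """Step 1: search the dataset registry, then match on the exact title."""
    url = f"{API}/dataset/search?{urlencode({'q': query})}"
    with urlopen(url) as resp:
        results = json.load(resp)["results"]
    for ds in results:
        if ds["title"] == title:
            return ds["key"]
    return None

def occurrence_search_url(dataset_key, limit=5):
    """Step 2: use that key as the datasetKey filter in an occurrence search."""
    return f"{API}/occurrence/search?{urlencode({'datasetKey': dataset_key, 'limit': limit})}"

# Network call left commented out:
# key = find_dataset_key("iNaturalist Research-grade Observations", "iNaturalist")
# print(occurrence_search_url(key))
```

rgbif’s `dataset_search()` wraps this same registry endpoint, so either route should surface the same key.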

@dnoesgaard, just out of curiosity, why did GBIF choose version 4 UUIDs as identifiers for dataset and other keys?


Ok. That’s more or less what I had done. My first draft of the tutorial is here:

https://plantarum.ca/2024/04/04/record-cleaning/

The section on the iNaturalist record set is here:

https://plantarum.ca/2024/04/04/record-cleaning/#inaturalist-and-human-observations

I’d be happy for any feedback!


@plantarum note that you can do a lot more filtering at the query stage. It helps to have as much of the filtering as possible done before the download. My colleague John wrote a very good blog post explaining what you can filter and how to post-process data downloaded from GBIF: Common things to look out for when post-processing GBIF downloads - GBIF Data Blog
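To illustrate the point, here is a hedged Python sketch that pushes a few common filters into the occurrence search URL itself. The parameter names (hasCoordinate, hasGeospatialIssue, country, year) are standard GBIF occurrence search parameters, but this particular combination is just an example.

```python
from urllib.parse import urlencode

def filtered_search_url(dataset_key, **filters):
    """Build an occurrence search URL with filters applied server-side,
    so less post-processing is needed after the download."""
    params = {"datasetKey": dataset_key, **filters}
    return "https://api.gbif.org/v1/occurrence/search?" + urlencode(params)

url = filtered_search_url(
    "50c9509d-22c7-4a22-a47d-8c48425ef4a7",  # iNaturalist Research-grade Observations
    hasCoordinate="true",        # keep only georeferenced records
    hasGeospatialIssue="false",  # drop records with flagged coordinate problems
    country="CA",                # ISO 3166 country code
    year="2000,2024",            # an inclusive year range
)
```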


Small note: you probably don’t want to encourage students to put their passwords in scripts in plain text; rather, use something like askpass or, even better, keyring
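askpass and keyring are R packages; the same idea in a Python-flavoured sketch is to read the credential from the environment or an interactive prompt rather than from the script itself (`GBIF_PWD` follows the environment-variable convention rgbif documents for downloads):

```python
import os
import getpass

def gbif_password():
    """Fetch the GBIF password without hard-coding it in the script:
    prefer an environment variable set outside version control,
    and fall back to an interactive prompt."""
    pwd = os.environ.get("GBIF_PWD")
    if pwd is None:
        pwd = getpass.getpass("GBIF password: ")
    return pwd
```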

I also feel it would be valuable to add a note that many species distribution modelling algorithms do in fact use absence data as an input.

Good point!

True. But the more I look at ABSENT records in GBIF, the more I think they aren’t appropriate for use in broad scale distribution modeling - at least not without careful inspection of each dataset.

As an example, the dataset I’m working on now includes a number of records from the Mohonk Forest Health Monitoring Data. For the invasive grass Microstegium vimineum there are 11 presences and 28 absences, all from within a 16 km stretch, and with presence and absence records within 100 m of each other.

Using one of those ABSENCES as evidence that climatic conditions in that 1 km^2 climate grid were unsuitable for M. vimineum would be an error. Of course, if we’re working at a local scale and have habitat data or very fine scale environmental variables, the PRESENCE/ABSENCE records would be invaluable, but that’s not what I do.
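For anyone who does want presence-only data, one option is to exclude absences at the query stage rather than after download; the occurrence search API accepts an occurrenceStatus filter. A minimal sketch:

```python
from urllib.parse import urlencode

def presence_only_url(dataset_key):
    """Occurrence search restricted to PRESENT records, so ABSENT
    rows never enter the download at all."""
    params = {"datasetKey": dataset_key, "occurrenceStatus": "PRESENT"}
    return "https://api.gbif.org/v1/occurrence/search?" + urlencode(params)
```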

In any case, I’ll try to clarify the text.

Thanks for your interest!

Just a heads up regarding this and the new version of the rgbif package (see changelog).

I used to get the datasetKey for iNaturalist using this code:

library(rgbif)
library(tidyverse)

### GBIF key for iNaturalist's dataset
iNat_KEY <- datasets(data='all', query='iNaturalist')$data %>% 
  filter(title == 'iNaturalist Research-grade Observations') %>% 
  pull(key)

# character(0)

Now the function datasets() is deprecated, and in fact, if I run that code, I can no longer find the iNat dataset listed in my query results :confused:

The equivalent is now to do this:

library(rgbif)
library(tidyverse)

### GBIF key for iNaturalist's dataset
iNat_KEY <- dataset_search(query='iNaturalist')$data %>% 
    filter(datasetTitle == 'iNaturalist Research-grade Observations') %>% 
    pull(datasetKey)

# "50c9509d-22c7-4a22-a47d-8c48425ef4a7"

Best,
Flo

@flograttarola you shouldn’t need any code at all to get the iNaturalist datasetKey, since that information is VERY unlikely to change. I would simply hard code

"50c9509d-22c7-4a22-a47d-8c48425ef4a7" # iNat

into whatever you are running.

Thanks, but I generally use more than just the iNat dataset, which is why I don’t hard-code it. This way I also don’t have to remember all the different, unintuitive UUIDs.
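A middle ground between hard-coding one key and re-querying the registry on every run is a small hand-maintained lookup; a sketch (only the iNaturalist key below is real, the commented entry is a placeholder):

```python
# Hand-maintained lookup of the GBIF datasetKeys a project actually uses,
# so scripts stay readable without memorizing UUIDs.
DATASET_KEYS = {
    "inat": "50c9509d-22c7-4a22-a47d-8c48425ef4a7",  # iNaturalist Research-grade Observations
    # "other": "...",  # add further datasets here as needed
}

def dataset_key(name):
    return DATASET_KEYS[name]
```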

Best,
Flo