In response to a previous question, I learned that the canonical iNaturalist dataset should be referenced by the datasetKey “50c9509d-22c7-4a22-a47d-8c48425ef4a7”:
I think this corresponds to this dataset:
But the datasetKey isn’t actually included on the page that the DOI for this dataset points to. While copying the URL into this question, I noticed that the key is present in the URL itself! But that’s not the most intuitive ‘documentation’.
I wonder if it would be helpful to include the datasetKey in the “Data Description” or “GBIF Registration” sections of the dataset description page?
I know how to find this stuff now, but it took me an hour of poking around before I finally succeeded.
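For anyone else hunting for the key, here is a minimal sketch of pulling it out of a dataset page URL. It assumes the page URL has the form https://www.gbif.org/dataset/ followed by the key, and the function name is made up for this example.

```python
import re
from typing import Optional

# Standard 8-4-4-4-12 hexadecimal UUID layout.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def dataset_key_from_url(url: str) -> Optional[str]:
    """Return the first UUID found in the URL (lower-cased), or None."""
    match = UUID_PATTERN.search(url)
    return match.group(0).lower() if match else None


# Assumed URL shape for a GBIF dataset page.
page_url = "https://www.gbif.org/dataset/50c9509d-22c7-4a22-a47d-8c48425ef4a7"
print(dataset_key_from_url(page_url))
```

The DOI link itself contains no UUID, so the function returns None for it; the key only shows up once the DOI has resolved to the dataset page.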
I think I was the one who answered your previous question about selecting iNaturalist records in GBIF. It is correct that the iNaturalist Research-Grade Observations dataset is identified in the GBIF registry by the key “50c9509d-22c7-4a22-a47d-8c48425ef4a7”. However, this is not the identifier people should use when referencing the dataset.
As with all GBIF-mediated datasets, we assign persistent DOIs that should be used when referencing a dataset. In the case of iNaturalist, that would be https://doi.org/10.15468/ab3s5x.
I don’t think most web users need to worry about datasetKeys, but perhaps you have a specific use-case that I’m overlooking?
Yes, I’m thinking specifically of people accessing records through the API. I’m writing a tutorial, and in the process I’ve found myself advising my students:
Which leaves us with the unwieldy datasetKey == "50c9509d-22c7-4a22-a47d-8c48425ef4a7" as the most reliable way to get the official iNaturalist dataset.
I’ve also included an explanation of why datasetName doesn’t work. It would be nice to be able to direct them to a canonical source for these keys, but the available methods are all very circuitous.
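A minimal sketch of what filtering on the key might look like through the API, assuming the public occurrence search endpoint at api.gbif.org/v1/occurrence/search and its datasetKey parameter. The helper name and the example taxon are illustrative only, and the code just builds the request URL without sending it.

```python
from urllib.parse import urlencode

INAT_DATASET_KEY = "50c9509d-22c7-4a22-a47d-8c48425ef4a7"
OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def occurrence_search_url(dataset_key: str, **filters: str) -> str:
    """Build a search URL restricted to one dataset (hypothetical helper)."""
    params = {"datasetKey": dataset_key, **filters}
    return f"{OCCURRENCE_SEARCH}?{urlencode(params)}"


# Example filters are assumptions chosen for illustration.
url = occurrence_search_url(
    INAT_DATASET_KEY,
    scientificName="Microstegium vimineum",
    limit="5",
)
print(url)
```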
A small note: you probably don’t want to encourage students to put their passwords in scripts in plain text, but rather to use something like askpass or, even better, keyring.
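In Python terms, the same advice might look like the sketch below: read the password from an environment variable and fall back to an interactive prompt, so nothing secret is written into the script. GBIF_PWD is a variable name assumed for this example, and the helper is hypothetical.

```python
import getpass
import os


def get_password(env_var="GBIF_PWD", environ=os.environ, prompt=getpass.getpass):
    """Return the password from the environment, else ask for it interactively."""
    value = environ.get(env_var)
    if value:
        return value
    # getpass hides the typed characters, so the secret never appears on screen.
    return prompt(f"Password ({env_var} not set): ")
```

A keyring-style store goes one step further by keeping the secret in the operating system's credential manager rather than in the shell environment.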
I also feel it would be valuable to add a note that many species distribution modelling algorithms do in fact use absence data as an input.
True. But the more I look at ABSENT records in GBIF, the more I think they aren’t appropriate for use in broad scale distribution modeling - at least not without careful inspection of each dataset.
As an example, the dataset I’m working on now includes a number of records from the Mohonk Forest Health Monitoring Data. For the invasive grass Microstegium vimineum there are 11 presences and 28 absences, all from within a 16 km stretch, and with presence and absence records within 100 m of each other.
Using one of those ABSENCES as evidence that climatic conditions in that 1 km^2 climate grid were unsuitable for M. vimineum would be an error. Of course, if we’re working at a local scale and have habitat data or very fine scale environmental variables, the PRESENCE/ABSENCE records would be invaluable, but that’s not what I do.
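One quick way to spot this situation is to bin records into coarse grid cells and look for cells that hold both statuses. The sketch below assumes each record carries decimalLatitude, decimalLongitude and occurrenceStatus fields, uses a 0.01 degree cell as a rough stand-in for a 1 km grid, and runs on made-up coordinates rather than the Mohonk data.

```python
import math
from collections import defaultdict


def mixed_status_cells(records, cell_deg=0.01):
    """Return the grid cells that contain both PRESENT and ABSENT records."""
    statuses = defaultdict(set)
    for rec in records:
        # Integer cell index along each axis.
        cell = (
            math.floor(rec["decimalLatitude"] / cell_deg),
            math.floor(rec["decimalLongitude"] / cell_deg),
        )
        statuses[cell].add(rec["occurrenceStatus"])
    return sorted(c for c, s in statuses.items() if {"PRESENT", "ABSENT"} <= s)


# Made-up points for illustration only.
records = [
    {"decimalLatitude": 41.7712, "decimalLongitude": -74.1553, "occurrenceStatus": "PRESENT"},
    {"decimalLatitude": 41.7718, "decimalLongitude": -74.1559, "occurrenceStatus": "ABSENT"},
    {"decimalLatitude": 41.8053, "decimalLongitude": -74.1234, "occurrenceStatus": "ABSENT"},
]
print(mixed_status_cells(records))
```

Any cell this returns is one where an absence record cannot sensibly be read as a statement about climate, since a presence shares the same climate pixel.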
@flograttarola you shouldn’t need any code at all to get the iNaturalist datasetKey, since that information is VERY unlikely to change. I would simply hard code
Thanks, but I generally use more than just the iNat dataset, which is why I don’t hard code it. This way I also don’t have to remember all the different, unintuitive UUIDs.
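A minimal sketch of that kind of lookup, not the code referred to above. It assumes the registry's dataset search endpoint at api.gbif.org/v1/dataset/search returns a JSON object with a results list whose items carry key and title fields; the function names are hypothetical, and the sample payload is hand-written so the example runs offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

DATASET_SEARCH = "https://api.gbif.org/v1/dataset/search"


def titles_to_keys(payload):
    """Map dataset titles to keys from a parsed search response."""
    return {item["title"]: item["key"] for item in payload.get("results", [])}


def search_datasets(query, limit=5):
    """Query the registry (needs network access) and return title -> key."""
    url = f"{DATASET_SEARCH}?{urlencode({'q': query, 'limit': limit})}"
    with urlopen(url, timeout=30) as response:
        return titles_to_keys(json.load(response))


# Hand-written payload in the assumed response shape.
sample = {
    "results": [
        {
            "key": "50c9509d-22c7-4a22-a47d-8c48425ef4a7",
            "title": "iNaturalist Research-Grade Observations",
        }
    ]
}
print(titles_to_keys(sample))
```

Looking keys up by title keeps scripts readable when several datasets are involved, at the cost of one extra request per lookup.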