Readers may recall that several years ago, GBIF mandated the use of a few CC licenses (or a CC0 waiver). I am wondering if we might now take the same approach with occurrenceID, especially as we are getting serious about digital extended specimens that will evidently require an identifier of this nature to safeguard identity in the face of all the links that will accrue.
Although meant to be globally unique, occurrenceID across all of GBIF is anything but. Some occurrence records lack them altogether, some are numeric auto-increments, some are merely copied/pasted from catalogNumber, some are http URIs, some are https URIs, some are UUIDs. Data publishers frequently change them for various reasons that often have little to do with the identity of the occurrence record but more to do with the administration of the dataset as a whole.
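To make that variety concrete, here is a minimal Python sketch that sorts occurrenceID values into the rough categories listed above. The function name, the category labels and the sample records are all made up for illustration, and the rules are deliberately crude. It is a starting point for an audit, not a validator.

```python
import uuid
from collections import Counter


def classify_occurrence_id(record):
    """Return a rough label for the form of a record's occurrenceID.

    `record` is a plain dict keyed by Darwin Core term names. The labels
    are illustrative only and are not part of any standard.
    """
    value = (record.get("occurrenceID") or "").strip()
    if not value:
        return "missing"
    if value == (record.get("catalogNumber") or "").strip():
        return "same as catalogNumber"
    if value.isdigit():
        return "numeric"
    lowered = value.lower()
    if lowered.startswith("https://"):
        return "https URI"
    if lowered.startswith("http://"):
        return "http URI"
    try:
        uuid.UUID(value)  # raises ValueError if not a well-formed UUID
        return "UUID"
    except ValueError:
        return "other"


# Hypothetical records, invented for this example.
records = [
    {"occurrenceID": "", "catalogNumber": "A-1"},
    {"occurrenceID": "42", "catalogNumber": "A-2"},
    {"occurrenceID": "A-3", "catalogNumber": "A-3"},
    {"occurrenceID": "http://example.org/occ/4"},
    {"occurrenceID": "https://example.org/occ/5"},
    {"occurrenceID": "3f2504e0-4f89-11d3-9a0c-0305e82c3301"},
]
summary = Counter(classify_occurrence_id(r) for r in records)
```

Running something like this over a real download would give a first impression of how mixed the identifiers in a dataset are, though it says nothing about whether they have stayed stable over time.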
Could we / should we require that occurrenceID be populated and structured in a particular way now to help pave the way? And could we / should we enforce its handling as a persistent identifier, inclusive of its use in GBIF’s own occurrence URLs as a sign of faith, its occurrence APIs, use in BioSchemas & elsewhere? Or, do we wait until digital extended specimens are operational & later decide if their identifiers are the same as what we’d expect to use in occurrenceID?
I tend to think of materialSampleID as THE persistent identifier for THE specimen and occurrenceID simply as an identifier for the “simple Darwin Core record” - and would thus rather suggest a focus on strengthening the persistence of a materialSampleID (with regard to specimens).
You may well be right, but the definition for occurrenceID states:
An identifier for the Occurrence (as opposed to a particular digital record of the occurrence)
…which strongly suggests that the identifier is meant for the object & not its digital representation. However, an Occurrence may not have a physical manifestation, acting merely as a pointer to the “existence” of an organism. And, that existence may have no evidence whatsoever except the digital record itself. That definition of occurrenceID then is a logical knot.
The example provided for materialSample states:
A whole organism preserved in a collection.
…and so materialSampleID is obviously the identifier we want for whole physical specimens preserved in a collection. But how many publishers use it in this way compared to those that populate occurrenceID? For those that publish data via the Integrated Publishing Toolkit, it’s not a required field, whereas occurrenceID is. To top it off, basisOfRecord may only be expressed as a single value, not both PreservedSpecimen and MaterialSample.
(…) In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique.
So even if the occurrenceID identifies THE species occurrence (and not the digital record), there can be (and evidently are!) many occurrenceID identifiers for the same Occurrence.
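As a rough illustration of why the fallback rule quoted above produces multiple identifiers, here is a small Python sketch. The function `fallback_identifier`, its default field list and the sample values are assumptions made for the example. The definition does not say which identifiers to combine or how to join them, and that gap is exactly what the sketch shows.

```python
def fallback_identifier(record,
                        fields=("institutionCode", "collectionCode", "catalogNumber"),
                        sep=":"):
    """Build an identifier by joining other identifiers found in the record.

    The choice of fields and separator is an assumption for illustration;
    the quoted definition leaves both up to the publisher.
    """
    parts = [str(record.get(f) or "").strip() for f in fields]
    if not all(parts):
        raise ValueError(f"cannot construct identifier, missing one of {fields}")
    return sep.join(parts)


# One hypothetical specimen record, handled under two publishing conventions.
record = {
    "institutionCode": "XYZ",
    "collectionCode": "Herb",
    "catalogNumber": "12345",
}
id_a = fallback_identifier(record)
id_b = fallback_identifier(record,
                           fields=("institutionCode", "catalogNumber"),
                           sep="-")
```

Both `id_a` and `id_b` follow the guidance, yet they are different strings for the same thing, and either one changes as soon as a code or catalogue number is revised.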
However, I think we often want to identify the actual collection specimen anyway, and not only the occurrence that was the origin of the specimen. A species occurrence is an event at a point in time, and a specimen is a physical (durable) thing. [Besides, a single species occurrence can sometimes be the origin of multiple collection specimens.]
So I tend to think that we should focus our efforts on a persistent materialSampleID for collection specimens rather than keep fighting an unwinnable battle to achieve persistence for occurrenceIDs (which do not really identify the actual specimen).
However, the definition for materialSampleID states:
An identifier for the MaterialSample (as opposed to a particular digital record of the material sample). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the materialSampleID globally unique.
Thanks @DagEndresen, this is very helpful and better explains the relationship between an Occurrence (and its occurrenceID) and a MaterialSample (and its materialSampleID). It is unfortunate that the nuance/distinction is not better appreciated by data publishers who share specimen-based data with any aggregator, including GBIF.
I guess the first thing we should investigate is the coverage of materialSampleID across the GBIF network and then investigate what specimen records are likely to be merely MaterialSamples but are expressed as broader, more abstract Occurrences. It’s telling that the advanced search on GBIF does not include materialSampleID as a filtering option on searches for occurrences.
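A first pass at that coverage question could be run on records that have already been downloaded, without relying on any particular search filter. The Python sketch below assumes the records are available as plain dicts keyed by Darwin Core term names. The function name `coverage_by_basis` and the sample data are invented for illustration.

```python
from collections import defaultdict


def coverage_by_basis(records, term="materialSampleID"):
    """Return the fraction of records with `term` populated, per basisOfRecord."""
    counts = defaultdict(lambda: [0, 0])  # basisOfRecord -> [populated, total]
    for r in records:
        basis = r.get("basisOfRecord") or "UNKNOWN"
        counts[basis][1] += 1
        if str(r.get(term) or "").strip():
            counts[basis][0] += 1
    return {basis: populated / total
            for basis, (populated, total) in counts.items()}


# Hypothetical records, invented for this example.
sample = [
    {"basisOfRecord": "PreservedSpecimen", "materialSampleID": "ms-001"},
    {"basisOfRecord": "PreservedSpecimen", "materialSampleID": ""},
    {"basisOfRecord": "MaterialSample", "materialSampleID": "ms-002"},
]
coverage = coverage_by_basis(sample)
```

Grouping by basisOfRecord is a design choice for this sketch: it makes visible how many records published as PreservedSpecimen carry a materialSampleID at all, which is the gap being discussed here.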
And finally, we need to come to grips with a digital extended specimen and what will be the relationship between its identifier and materialSampleID. Based on the arguments you provided, it would be a mistake to use occurrenceID as the identifier for the specimen.
Maybe what we need instead (??) is a new term, extendedSpecimenID and/or digitalSpecimenID, which MUST be a more strictly defined persistent identifier, e.g. a DOI with a specific minimum metadata profile defined!
Hi @DagEndresen yes I think we need a new term, something like digitalExtendedSpecimenID.
Is it bad that part of me wants to scream? It’s hard to do any linking when we make new identifiers. But, if these are to be DOIs, then I suppose we’ll have no choice. Regardless, @DagEndresen is spot on to emphasize that we be strict and precise about what the identifier circumscribes and what the minimum metadata is. I suggest we also offer guidance on what to do with materialSampleID, i.e. would it make for a good suffix?
It’s still a discussion in progress. There was consensus that the physicalSpecimenId is “whatever the institution uses to uniquely identify the item within that institute” and that the ways DwC has been used for such identities are very messy. We’ll probably have to live with multiple current practices but might be able to develop better guidance going forward.