Forcing the issue on occurrenceID

dshorthouse · March 16, 2021, 7:29pm

Readers may recall that several years ago, GBIF mandated the use of a few CC licenses (or a CC0 waiver). I am wondering if we might now take the same approach with occurrenceID, especially as we are getting serious about digital extended specimens that will evidently require an identifier of this nature to safeguard identity in the face of all the links that will accrue.

Although meant to be globally unique, occurrenceID across all of GBIF is anything but. Some occurrence records lack them altogether, some are numeric auto-increments, some are merely copied/pasted from catalogNumber, some are http URIs, some are https URIs, some are UUIDs. Data publishers frequently change them for various reasons that often have little to do with the identity of the occurrence record but more to do with the administration of the dataset as a whole.
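To make that variety concrete, here is a minimal sketch of how one might bucket occurrenceID values by shape. The function name and the category labels are hypothetical, chosen only for illustration; they are not part of any GBIF tool or API.

```python
import uuid


def classify_occurrence_id(occurrence_id, catalog_number=None):
    """Return a rough label for the shape of an occurrenceID value.

    The labels are illustrative assumptions, not a GBIF vocabulary.
    """
    if occurrence_id is None or not occurrence_id.strip():
        return "missing"
    value = occurrence_id.strip()
    # Same string as catalogNumber: probably copied/pasted.
    if catalog_number is not None and value == catalog_number.strip():
        return "copied_from_catalogNumber"
    # Digits only: likely a database auto-increment.
    if value.isdigit():
        return "numeric"
    lowered = value.lower()
    if lowered.startswith("https://"):
        return "https_uri"
    if lowered.startswith("http://"):
        return "http_uri"
    try:
        uuid.UUID(value)  # raises ValueError if not a well-formed UUID
        return "uuid"
    except ValueError:
        return "other"
```

Running something like this over a dataset's occurrenceID column would give a quick tally of which patterns a publisher uses, though it says nothing about whether the values stay stable over time.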

Could we / should we require that occurrenceID be populated and structured in a particular way now to help pave the way? And could we / should we enforce its handling as a persistent identifier, inclusive of its use in GBIF’s own occurrence URLs as a sign of faith, its occurrence APIs, use in BioSchemas & elsewhere? Or, do we wait until digital extended specimens are operational & later decide if their identifiers are the same as what we’d expect to use in occurrenceID?

DagEndresen · March 17, 2021, 6:50am

I tend to think of materialSampleID as THE persistent identifier for THE specimen and occurrenceID simply as an identifier for the “simple Darwin Core record”, and would thus rather suggest a focus on strengthening the persistence of materialSampleID (with regard to specimens).

DagEndresen · March 17, 2021, 12:56pm

See also GBIF’s data-clustering feature, which shows how the same collection specimen can have multiple legitimate occurrence records.

dshorthouse · March 17, 2021, 8:53pm

You may well be right, but the definition for occurrenceID states:

An identifier for the Occurrence (as opposed to a particular digital record of the occurrence)

…which strongly suggests that the identifier is meant for the object & not its digital representation. However, an Occurrence may not have a physical manifestation, acting merely as a pointer to the “existence” of an organism. And, that existence may have no evidence whatsoever except the digital record itself. That definition of occurrenceID then is a logical knot.

The example provided for materialSample states:

A whole organism preserved in a collection.

…and so materialSampleID is obviously the identifier we want for whole physical specimens preserved in a collection. But how many publishers use it in this way compared to those that populate occurrenceID? For those that publish data via the Integrated Publishing Toolkit, it’s not a required field, as occurrenceID is. To top it off, basisOfRecord may only be expressed as a single value, not both PreservedSpecimen and MaterialSample.

DagEndresen · March 18, 2021, 9:53am

… and then the definition continues

(…) In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique.

So even if the occurrenceID identifies THE species occurrence (and not the digital record), there can be (and evidently are!) many occurrenceID identifiers for the same Occurrence.
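The fallback that the definition describes, constructing an identifier from other identifiers in the record, can be sketched as follows. This is only an illustration under stated assumptions: the helper name is hypothetical, and the choice of institutionCode, collectionCode and catalogNumber joined under a `urn:catalog:` prefix is one pattern assumed here, not something the definition mandates.

```python
def build_fallback_id(record):
    """Build an identifier from institutionCode, collectionCode and catalogNumber.

    Assumption: the record is a dict keyed by Darwin Core term names.
    The "urn:catalog:" prefix is an illustrative convention only.
    """
    keys = ("institutionCode", "collectionCode", "catalogNumber")
    parts = [(record.get(key) or "").strip() for key in keys]
    if not all(parts):
        raise ValueError("record lacks one of: " + ", ".join(keys))
    return "urn:catalog:" + ":".join(parts)


example = {
    "institutionCode": "UWBM",
    "collectionCode": "Bird",
    "catalogNumber": "89776",
}
print(build_fallback_id(example))  # urn:catalog:UWBM:Bird:89776
```

An identifier built this way is only as unique and as stable as the codes it is built from, and two publishers of the same Occurrence could each construct a different one.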

However, I think we often want to identify the actual collection specimen anyway, and not only the occurrence that was the origin of the specimen. A species occurrence is an event at a point in time, and a specimen is a physical (durable) thing. [Besides, a single species occurrence can sometimes be the origin of multiple collection specimens.]

So I tend to think that we should focus our efforts on a persistent materialSampleID for collection specimens rather than keep fighting an unwinnable battle for persistent occurrenceIDs (which do not really identify the actual specimen).

However, materialSampleID:

An identifier for the MaterialSample (as opposed to a particular digital record of the material sample). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the materialSampleID globally unique.

dshorthouse · March 18, 2021, 1:53pm

Thanks @DagEndresen, this is very helpful and better explains the relationship between an Occurrence (and its occurrenceID) and a MaterialSample (and its materialSampleID). Unfortunate that the nuance/distinction is not better appreciated by data publishers that share specimen-based data to any aggregator including GBIF.

I guess the first thing we should investigate is the coverage of materialSampleID across the GBIF network and then investigate what specimen records are likely to be merely MaterialSamples but are expressed as broader, more abstract Occurrences. It’s telling that the advanced search on GBIF does not include materialSampleID as a filtering option on searches for occurrences.
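As a rough sketch of what such a coverage check could look like, the snippet below computes the share of records in which a term is populated. It assumes the records have already been loaded as dicts keyed by Darwin Core term names (for example from a downloaded archive); the function name is hypothetical and no GBIF API call is implied.

```python
def term_coverage(records, terms=("occurrenceID", "materialSampleID")):
    """Return the fraction of records in which each term is non-empty.

    Assumption: each record is a dict keyed by Darwin Core term names.
    """
    records = list(records)
    total = len(records)
    coverage = {}
    for term in terms:
        # A term counts as populated only if it holds non-whitespace text.
        filled = sum(1 for r in records if str(r.get(term) or "").strip())
        coverage[term] = filled / total if total else 0.0
    return coverage


sample = [
    {"occurrenceID": "a1", "materialSampleID": "m1"},
    {"occurrenceID": "a2", "materialSampleID": ""},
    {"occurrenceID": "a3"},
    {"occurrenceID": "  "},
]
print(term_coverage(sample))
```

The sample data is made up; real coverage figures would have to come from the published datasets themselves.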

And finally, we need to come to grips with a digital extended specimen and what will be the relationship between its identifier and materialSampleID. Based on the arguments you provided, it would be a mistake to use occurrenceID as the identifier for the specimen.

cc @hardistyar @waddink @abentley

DagEndresen · March 18, 2021, 2:30pm

Maybe what we need instead (??) is a new term, extendedSpecimenID and/or digitalSpecimenID, which MUST be a more strictly defined persistent identifier, e.g. a DOI with a specific minimum metadata profile defined!

waddink · March 19, 2021, 11:54am

Hi @DagEndresen yes I think we need a new term, something like digitalExtendedSpecimenID.

dshorthouse · March 19, 2021, 12:53pm

@waddink wrote: “Hi @DagEndresen yes I think we need a new term, something like digitalExtendedSpecimenID.”

Is it bad that part of me wants to scream? It’s hard to do any linking when we make new identifiers. But, if these are to be DOIs, then I suppose we’ll have no choice. Regardless, @DagEndresen is spot on to emphasize that we be strict and precise with what the identifier circumscribes and what is the minimum metadata. I suggest we also offer guidance on what to do with materialSampleID. i.e. would it make for a good suffix?

hardistyar · March 19, 2021, 12:54pm

It will be ‘id’ as in this example: openDS-schemas/basic-structure.md (in the hardistyar/openDS-schemas repository on GitHub), and it will be a DOI in DiSSCo.

We will discuss physicalSpecimenId at the next TDWG TG MIDS meeting, April 1st and it would be good to have the materialSampleId input there please.

The DS, identified by ‘id’ will maintain a link to the physical specimen through physicalSpecimenId and institutionCode.
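The relationship described here can be pictured with a small data structure. This is a sketch only, not the openDS schema: the class, its field names in snake_case, and the example DOI are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigitalSpecimen:
    """Hypothetical stand-in for a digital specimen (DS) record."""

    id: str                    # persistent identifier of the DS (a DOI in DiSSCo)
    physical_specimen_id: str  # whatever the institution uses locally
    institution_code: str      # scopes the local identifier

    def physical_key(self):
        """Return the pair that points back to the physical specimen."""
        return (self.institution_code, self.physical_specimen_id)


ds = DigitalSpecimen(
    id="10.5555/example-ds-1",  # made-up DOI for illustration
    physical_specimen_id="89776",
    institution_code="UWBM",
)
print(ds.physical_key())  # ('UWBM', '89776')
```

The point of the pair is that physicalSpecimenId on its own need only be unique within the institution, so institutionCode is what disambiguates it.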

pieter · April 14, 2021, 12:03pm

Did the MIDS meeting bring any clarity on this topic?

hardistyar · April 15, 2021, 12:56pm

It’s still a discussion in progress. There was consensus that the physicalSpecimenId is “whatever the institution uses to uniquely identify the item within that institute” and that the ways DwC has been used for such identities are very messy. We’ll probably have to live with multiple current practices but might be able to develop better guidance going forward.

