7. Persistent identifier (PID) schemes

Moderators: Wouter Addink, Alex Hardisty and Hong Cui

Background

Persistent identifiers (PIDs) are long-lasting references that can be used to unambiguously identify any kind of object. They are foundational elements of data infrastructure, not only as identifiers but also as connectors of one thing to another. PIDs can uniquely identify anything from physical objects and digital artefacts to records of transactions and specific vocabulary terms and concepts. Different kinds of PID scheme can be used for identifying different things: DOIs for documents and datasets, for example, ORCID for persons and ROR for organisations. In this consultation, a PID scheme means not only the technical elements but the whole arrangement around PIDs for using and operating them. This includes the ownership, authority, governance and financial elements. A scheme that we aim to discuss in particular is DOI adapted with characteristics specific to natural sciences, being the scheme proposed for Digital extended Specimens.

PIDs are a foundation for achieving the FAIR Guiding Principles of being ‘findable, accessible, interoperable and reusable’. There are many different kinds of PID and several kinds can be used in combination in any specific domain or application. What is used at the level of the institution to identify physical objects, database records and collections of those can be different from one another. And these will often be different from what is used elsewhere in data manipulation, aggregation, archival, federation and integration, which is different from what is needed for Digital extended Specimens.

The ability of machines to process Digital extended Specimen (DS) data depends on trustworthy, reliable PIDs for those DS. The challenge is not the choice of identifier scheme for DS, for which DOIs are proposed by DiSSCo, but that there is presently no adequate global scheme for assigning and recording DOIs for DS. This is one issue for the present consultation.

Allied to this is the question of what else needs to be persistently identified and how. DS are a special category of object, with a special status. But there is a wide range of other object types associated with the manipulation and processing of Digital Specimens, including for example transactions of loans and visits, annotations and determinations, images and other digitized artefacts, instruments, facilities, collections, provenance and attribution events, and more. Each of these objects must be identified but each category has its own requirements regarding the information associated with the PID and the mechanisms by which PIDs are allocated. It isn’t necessary or even desirable to use the same identifier scheme as for DS when simpler schemes with different Handle prefix(es) can suffice. Establishing global PID schemes for such objects is a further challenge and also a topic for the consultation.
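
As a minimal illustration of how different Handle prefixes can keep schemes apart, the sketch below splits an identifier into prefix and suffix at the first slash and checks whether the prefix is a DOI prefix (DOI prefixes begin with `10.`). The function names are made up for this example, and the second identifier in the usage note is hypothetical.

```python
def split_handle(identifier: str) -> tuple[str, str]:
    """Split a Handle-style identifier into (prefix, suffix) at the first '/'."""
    prefix, sep, suffix = identifier.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix identifier: {identifier!r}")
    return prefix, suffix


def is_doi(identifier: str) -> bool:
    """DOIs are Handles whose prefix starts with '10.'."""
    prefix, _ = split_handle(identifier)
    return prefix.startswith("10.")
```

For instance, `is_doi("10.3897/rio.7.e67379")` is true, whereas a loan transaction registered under a hypothetical non-DOI prefix such as `20.500.99999/loan-0001` would return false while still being resolvable as a Handle.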

Translating proposals to actionable steps the community can trust, align to and support through smooth, non-disruptive transitions is the key hurdle to overcome for success and widespread adoption. Steps include developing a supporting community around mechanisms for registration/resolution and the services needed for the administration of that, along with the associated business model to remain viable over the long term. This business model should be adapted to the circumstances of the community served. Custom PID services address the needs of the natural sciences community, with specific characteristics of metadata for describing specimens. Trust in the validity of metadata and referential integrity both imply the need for institutional commitments embodied in formalisations such as service level agreements or memoranda of understanding. Multiple stakeholders must begin a journey to frame an agreement on the necessary technical, ownership, authority, governance and financial elements. This must lead to: i) robust technical implementation; ii) stable policy, governance and funding models; iii) trust in the validity of metadata; iv) referential integrity; and v) guarantees of long-term persistence.

The goal of this topic of consultation is to discuss the shape of such a framework and to identify the significant milestones towards achieving it. A secondary but equally important objective is to allow organisations with a stakeholding interest to begin to develop trust, alignment, and eventually commitment towards a global PID service framework for the domain of Digital extended Specimens.

Questions to promote discussion

  1. If DOIs were available for Digital extended Specimen referring to the physical specimens in your collection with links to extended information and annotations, what role could they play in your work?
  2. What added benefits/services should be provided to convince your institution to invest in using DOIs for DS?
  3. Implementing DOIs for DS can enable a transformation in how collections data are accessed and used. What transformation would you like to see and how can this be made to succeed?
  4. How should the costs of a PID scheme be paid for and who (which kind of organisations) should be responsible for that?
  5. What advantages are there from moving forward with a community-specific branding of DOIs under the name of ‘Natural Science Identifiers’ (NSId)?
  6. What challenges have you encountered in the past / do you foresee when introducing a new type of PID to your collections? If you’ve used DOIs or other PIDs for identifying things or been responsible for administering the assignment of DOIs/PIDs, for example as a member of a registration agency such as Crossref or DataCite, what’s your experience been like?
  7. What other data elements, objects and/or terms/concepts should be identified with a PID but are not yet able to be identified? What kind of scheme(s) is needed for assigning Handles to these kinds of things?

Information resources

  • Handle.Net registry. http://handle.net/.
  • International DOI Foundation. DOI Handbook. Digital Object Identifier System Handbook.
  • The DONA Foundation. https://www.dona.net/.
  • Hardisty A, Addink W, Glöckler F, Güntsch A, Islam S, Weiland C. (2021) A choice of persistent identifier schemes for the Distributed System of Scientific Collections (DiSSCo). Research Ideas and Outcomes 7: e67379. doi: 10.3897/rio.7.e67379.
  • Davies N, Deck J, Kansa EC, Whitcher S, Kunze J, Meyer C et al. (2021) Internet of Samples (iSamples): Toward an Interdisciplinary Cyberinfrastructure for Material Samples. GigaScience, Volume 10, Issue 5, May 2021, giab028. doi: 10.1093/gigascience/giab028.
  • European Commission (2020) Directorate-General for Research and Innovation. A Persistent Identifier (PID) policy for the European Open Science Cloud. Publications Office of the EU. https://doi.org/10.2777/926037.
  • European Commission (2021) Directorate-General for Research and Innovation. PID architecture for the EOSC. Publications Office of the EU. https://doi.org/10.2777/525581.
  • Madden, F and Woodburn M. (2021) Persistent Identifiers at the Natural History Museum. Case study by the PIDs as IRO Infrastructure AHRC funded project. Can be found on this page: TANC HeritagePIDs - resources along with other case studies and outputs from the project.
  • Towards a national collection. Website of the UK’s AHRC 5-year programme taking the first steps towards opening the UK’s heritage collections to the world by creating a unified virtual ‘national collection’. https://www.nationalcollection.org.uk/.

In the context of question 4 (costs of PID scheme), I would like to point to the outcomes of a 2.5-year-long strategic planning and road mapping effort of the IGSN Global Sample Number that was aimed toward identifying ways for the PID system to scale to growing demands and to operate with a sustainable business model. The outcome of this effort is a road map toward a partnership of the IGSN e.V. with DataCite e.V. that will support the global adoption, implementation, and use of physical sample identifiers. See blog by DataCite CEO Matt Buys at Bringing together communities: IGSN and DataCite.

Following the good example set by Kerstin (being the first contributor to topic 7), I am tossing in some very preliminary thoughts on Q5 (NSId). Strong branding is certainly great for incentivizing adoption, but ‘Nature Science Identifiers’ seems to be a bit too broad and a bit too limiting at the same time. It is too broad because the IDs will only be used for DES, not everything identifiable in natural science (correct me if I am wrong here). It is too limiting because it may exclude samples collected in e.g. archaeology. I know natural history specimens are the current focus, but do we intend to always keep this focus? Could “DESID” be considered as a candidate brand?

Small correction: NSID stands for Natural Science Identifier.

I think that successful branding leads to the fact that the connotation of a term does not have to be defined in the term itself. I am very sure that NSIDs can be marketed in such a way that it is always clear that they are identifiers for Digital Specimens.

I never asked myself the archaeology question. It was actually always clear to me that we were talking about natural history.

This post doesn’t actually address costs. I would be interested to know how the program will be funded as I assume museums will have to come up with funds to support PIDs no matter how they come to be. It seems that the expenses related to “digitization” keep piling up.

Having worked with cultural collections, they shy away from the term “specimen”. Digital extended specimen might need to be “extended” and perhaps be re-branded “digital extended object” if we are going to include archaeology.

The costs of having PIDs are directly related to FAIR data, as PIDs are fundamental to FAIR data. There are many studies available about the economic cost of not having FAIR data. For example, the Cost-benefit analysis for FAIR research data (doi 10.2777/02999) estimates that the minimum cost of not having FAIR data in the EU is €10.2bn per year, and this is only the measurable cost; the publication mentions that figures for the open data economy suggest that the impact of FAIR on innovation could add another €16bn. The costs of PIDs are very minor compared to the costs of not having them (e.g. not meeting the first requirement of FAIR data). Minting and storing PIDs have very minor costs, and you can get PIDs like ROR and ORCID for free. For digital objects, if an institution wants to mint handles and have them globally resolvable with their own prefix, it has to pay only $50 a year. I think that is affordable for any institution. The costs are mainly in additional services offered, like services that check there is only one PID for a resource, that guard against broken links, have suitable metadata schemas, make the PIDs discoverable, link them with other PIDs etc. An RA can provide such services, and the cost for operating an RA in DOI like DataCite and providing such services is max. around 1M per year even with the 30B PIDs we may need for digital extended specimens. If these costs were shared globally, that would be very affordable and minor in comparison with the digitisation costs or the economic benefits. There are different cost models possible for this: institutions could bear these, or this could be done by the research infrastructures (e.g. by money coming from the governments). What will be a requirement, though, is that there are no costs for researchers who want to use them.
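
To put those figures side by side, here is a back-of-the-envelope calculation using only the two numbers quoted above (around 1M per year to operate an RA, 30B PIDs). The currency is not stated, so the results are in the same unspecified unit, and the number of institutions sharing the cost is purely an assumed value for illustration.

```python
RA_COST_PER_YEAR = 1_000_000      # "max. around 1M per year" (currency unit not stated)
TOTAL_PIDS = 30_000_000_000       # "the 30B PIDs we may need"
ASSUMED_INSTITUTIONS = 1_000      # hypothetical number of institutions sharing the cost

# Yearly operating cost spread over every registered PID.
cost_per_pid_per_year = RA_COST_PER_YEAR / TOTAL_PIDS

# Yearly operating cost under a simple equal split between institutions.
cost_per_institution = RA_COST_PER_YEAR / ASSUMED_INSTITUTIONS

print(f"per PID per year: {cost_per_pid_per_year:.7f}")
print(f"per institution per year (equal split): {cost_per_institution:,.0f}")
```

Under these assumptions the yearly cost per PID is a tiny fraction of one currency unit, which is the point being made: the operating cost matters at the level of the scheme, not of the individual identifier.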

I think this roadmap towards a partnership of IGSN with DataCite is great for making the IGSN model more sustainable and it provides IGSN as a valuable option for physical sample identifiers next to other options like ARKs or CETAF identifiers. For digital extended specimen identifiers we discussed a similar possibility with DataCite and other DOI members, but for these the advice was to create a new RA given the much larger number of DOIs we need. If we were to mint these through DataCite it would completely change the focus of their current business. DiSSCo is now a member of the DOI foundation so we can discuss the establishment of an RA for digital extended specimens further with DataCite and other RAs. It is not in DiSSCo’s mission to establish a global RA on its own though; this will need to be done in a wide collaboration with partners globally. Do you have an opinion on what would be the incentives for potential partners to join in such a collaborative effort and what should be the steps toward that? Would you see IGSN as one of these potential partners?

DataCite have, over the years, supported scaling of various identifier communities through governance, sustainability, insurance, and technical implementation facets. These partnerships support convergence on our vision “Connecting research, identifying knowledge”. Supporting larger volumes of registrations would not change our focus, as this is our trajectory and aligned with our mission. Together with IGSN and others we expect to scale our infrastructure services to support increasing registration, resolution and discovery of identifiers. As an example, one of our members registers DOIs on the species seen in DNA clusters such as https://dx.doi.org/10.15156/BIO/TH005107. We welcome continued discussions about collaboration with communities such as DiSSCo.

This touches upon the question of what objects should be in scope to get a digital specimen identifier and, related to that, what we see as a specimen. Over the past two years there has been a lot of discussion on this, for example in the last TDWG conference and in relation to DarwinCore and the MaterialSample class (see for instance: Change term - MaterialSample · Issue #314 · tdwg/dwc · GitHub).

What I get from these discussions is, in short, that a specimen is a material sample and what people in collections usually refer to as a sample is actually a subsample (usually of a specimen as material sample). I think this is not the whole picture but I will get to that later.

During TDWG2020 we had a survey “What do you think a specimen is”, see for results: Do the "What is a (physical) specimen" test - SurveyMonkey Dashboard. With only 18 responses it is probably not representative of the community, but nevertheless it revealed some interesting things: most people seem to agree that every curated object with its own physical identifier in a collection is a specimen. For example, if a fish is dissected into a head, tail and fins and these are preserved as individual objects, then there are three specimens. Also, most people see an object as a specimen as soon as it has been gathered (sampled). This means that an object should ideally have a digital specimen identifier already in the field (as in this picture: digital extended specimen infosheet - Google Slides), while it gets a physical specimen identifier later, when accessioned in a collection.

This makes the distinction between a material sample and specimen even more fuzzy, because a material sample is the result of a sampling event while a specimen is the result of a curation process. So you should expect the object to be a material sample after the sampling event in the field, and it becoming a specimen after it went through a curation process at the institution. However you could argue that the curation process often already starts in the field with initial preservation of the sample (for example an insect that is put in alcohol after catching it).

There is one category of specimens that is not a material sample though, in my opinion, and these are specimens that are the result of a recording, for example a sound record or a drawing. You could even see field notebooks or register books as specimens in that they are the results of multiple recordings.

A thing for discussion then is whether these always need to be physical objects as results of recordings, like a sound tape, or photo on paper, or if these can also be digital objects like digital photographs or models. I think these can be seen as specimens as well, when they go through a curation process. This means they will need to have a physical specimen identifier as well then. Likewise any (sub-)sample should be treated as a specimen, when it went through a curation process (got an identifier and is preserved and stored).

For digital specimen identifiers we have so far been focusing on natural history only, excluding any human related collections such as anthropological collections, archaeology etc. I think we should go for an extendible model where other collections can also use digital specimen identifiers in the future, if there is demand for it.

The more generic we can keep the metadata in a PID record, the more inclusive it can be. However, in the discussions around MIDS (Minimum Information of a Digital Specimen) we see that the requirements for metadata are already different at a very minimal level, where we can probably distinguish different needs for preserved specimens, fossil specimens, living specimens, earth sample specimens and recorded specimens. For instance, for an earth sample specimen you want to know the material type, but for a preserved, fossil or living specimen this will always be ‘organic matter’. For a recorded specimen like a drawing you probably want to know the artist, and for a fossil specimen you want to know the geological time period. We cannot create schemas for all possible collections at once, so my proposal would be to start with the natural history specimens which have been our main focus up to now.
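
A rough sketch of what such type-dependent minimal metadata could look like: a small shared core plus one extra required field per specimen type, following the examples above. The field names (`material_type`, `artist`, `geological_period`) and the core fields are assumptions made for illustration; they are not taken from MIDS or any published schema.

```python
# Hypothetical core fields that every PID record would carry.
CORE_FIELDS = {"physical_specimen_id", "institution", "specimen_type"}

# Extra required fields per specimen type, following the examples in the post.
# Types for which the post gives no example get no extra fields here.
EXTRA_FIELDS = {
    "preserved": set(),
    "living": set(),
    "fossil": {"geological_period"},
    "earth_sample": {"material_type"},
    "recorded": {"artist"},
}


def missing_fields(record: dict) -> set:
    """Return the required fields that are absent from a PID record."""
    specimen_type = record.get("specimen_type")
    if specimen_type not in EXTRA_FIELDS:
        raise ValueError(f"unknown specimen type: {specimen_type!r}")
    required = CORE_FIELDS | EXTRA_FIELDS[specimen_type]
    return {name for name in required if name not in record}
```

With this layout, a fossil record that carries only the core fields would be reported as missing `geological_period`, while the same core fields would be sufficient for a preserved specimen. Adding a new collection type later means adding one entry to the table, not redesigning the core.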

Provenance is a core part of a digital extended specimen, which means that information about agents that perform actions on these specimens needs to be captured. We therefore need PIDs for agents.

Should we go for an inclusive model where we allow any type of PID for agents or should we try to reach consensus in this area? There seems to be some consensus already about using ORCID for living people and Wikidata for deceased people.

What about PIDs for organisations? ROR seems a good candidate now it has evolved to include the full metadata schema from GRID (only in the ROR API at the moment), has a curation workflow implemented and has wide support by other PID providers like DataCite and ORCID (it is on the ORCID implementation roadmap for this year).

But what to do with organisations that cannot get a ROR because they are not a top-level research organisation, like museums that are a department in a university? Do we record provenance in that case only on the top-level organisation, or do we need other identifiers for them? Wikidata seems a good choice, also because a group within ROR is working on a departmental extension using Wikidata. Or should we accommodate just any PID for an organisation (ISNI, GRID and others)?
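
If we went for the inclusive model, software processing provenance would at least need to recognise which scheme an agent PID belongs to. The sketch below shows one way that could be done for three of the schemes mentioned. The regular expressions are deliberately simplified assumptions for illustration (each scheme's own documentation is the authority on its exact syntax), and the function name is made up.

```python
import re

# Simplified patterns, assumed for this sketch only.
PATTERNS = {
    "orcid": re.compile(r"^https://orcid\.org/\d{4}-\d{4}-\d{4}-\d{3}[\dX]$"),
    "wikidata": re.compile(r"^https?://www\.wikidata\.org/(wiki|entity)/Q\d+$"),
    "ror": re.compile(r"^https://ror\.org/0[0-9a-z]{8}$"),
}


def detect_scheme(pid: str) -> str:
    """Return the name of the first scheme whose pattern matches, else 'unknown'."""
    for scheme, pattern in PATTERNS.items():
        if pattern.match(pid):
            return scheme
    return "unknown"
```

Anything that falls through to `"unknown"` (an ISNI or GRID identifier, say) would need its own pattern. That is the practical price of an inclusive model: every accepted scheme adds a case that all consumers have to handle.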

Recently a new RA (Registration Agency) became operational under the DOI umbrella: BSI Identify (https://identify.bsigroup.com/about/). They offer a very simple service: a BSI UPIN (Universal Persistent Identification Number), which is a DOI for construction materials. It will survive product discontinuations so it will allow you to get what is known about a product such as the water pipes in your house long after the product is fabricated, installed, operated or even decommissioned. A manufacturer can provide product data to BSI Identify to create UPIN identifiers for their products and attach it to them as a barcode for example.

I think that for a Digital Specimen identifier a little more would be needed in terms of services. There may be a need for a reverse lookup for a physical specimen identifier for example. What would be the minimum services needed to be provided by an RA for digital specimen identifiers?
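
One way to picture the reverse lookup: alongside the forward mapping from digital specimen identifier to its record, the RA keeps an index keyed on institution plus physical specimen identifier. This is only a sketch with in-memory dictionaries; the class, its methods and any identifiers used with it are hypothetical. It also shows the kind of "only one PID per resource" check such a service might make at registration time.

```python
from typing import Optional


class SpecimenRegistry:
    """Toy in-memory registry with forward and reverse lookup."""

    def __init__(self) -> None:
        self._by_pid: dict = {}        # PID -> (institution, physical_id)
        self._by_physical: dict = {}   # (institution, physical_id) -> PID

    def register(self, pid: str, institution: str, physical_id: str) -> None:
        key = (institution, physical_id)
        if pid in self._by_pid:
            raise ValueError(f"PID already registered: {pid}")
        if key in self._by_physical:
            # Guard against minting a second PID for the same physical specimen.
            raise ValueError(f"{key} already has PID {self._by_physical[key]}")
        self._by_pid[pid] = key
        self._by_physical[key] = pid

    def resolve(self, pid: str) -> tuple:
        """Forward lookup: PID to (institution, physical specimen identifier)."""
        return self._by_pid[pid]

    def reverse_lookup(self, institution: str, physical_id: str) -> Optional[str]:
        """Reverse lookup: physical specimen identifier to PID, or None."""
        return self._by_physical.get((institution, physical_id))
```

The reverse index is keyed on the institution as well as the physical identifier because catalogue numbers are typically only unique within one collection.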

That is great to hear Matt, let’s discuss that further in IDF. Do you think DataCite would be able to accommodate different metadata schemas and membership models as well?

Our metadata schema has evolved over the last 10 years to support the evolving needs of our community. It is not possible to say what changes would be needed before we consider the nature of the relationship. In addition, we have found that our membership model provides flexibility across different groups to support their needs (e.g. membership types, service fees, fee caps etc). There are already various unique relationships between different RAs e.g. mEDRA and the OP (Publications Office of the European Union) that others can learn from.

@aguentsch but the general principles of what we propose can, in wider application, be used for the digital object representation of any kind of artefact in all kinds of heritage collections. And that may be an argument against specific natural sciences branding.

@jegelewicz I would not express it as the ‘…expenses related to digitization keep piling up’ but would rather talk about the cost of doing business in the digital age. Such costs can be considered not always as additional costs but as displacement costs, when doing something digitally instead of physically.

The costs of a PID scheme are reasonably fixed, can be known in advance and shared according to some formula among an agreed set of stakeholders. The higher the number of stakeholders, the lower the cost for everyone. One common formula for cost sharing is to weight stakeholders’ contributions according to their turnover. The marginal cost of creating and assigning an additional PID is almost zero. The main cost is in collecting and maintaining the (meta)data associated with the PID and that, with efficient software-connected processes, can also be very low cost. A change in the core (meta)data in the (collection-management or other authoring) system can automatically lead to a change in the registered (meta)data.
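
The turnover-weighted formula can be sketched in a few lines. The stakeholder names and turnover figures in the usage note are invented for illustration; only the principle (fixed total cost, shares proportional to turnover) comes from the paragraph above.

```python
def share_costs(total_cost: float, turnovers: dict) -> dict:
    """Split total_cost among stakeholders in proportion to their turnover."""
    total_turnover = sum(turnovers.values())
    if total_turnover <= 0:
        raise ValueError("total turnover must be positive")
    return {
        name: total_cost * turnover / total_turnover
        for name, turnover in turnovers.items()
    }
```

For example, `share_costs(1_000_000, {"museum_a": 50, "museum_b": 30, "garden_c": 20})` assigns half of the cost to `museum_a`. Adding a fourth stakeholder to the dictionary leaves the total unchanged and lowers every existing share, which is the "more stakeholders, lower cost for everyone" effect.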

As a principle, there is no expectation that researchers/authors/others should have to pay per PID to either obtain or resolve one.

Your point is noted. Digital extended Specimens are a specific kind of digital object, explicitly geared towards the needs of natural science specimens and the community surrounding those. Other, similar kinds of digital object could be equally well defined for other kinds of artefacts and communities and the scope of the PID scheme could be broadened.

In the context of the general definition of what a digital object is (which I recently wrote) it doesn’t really make sense to include the word ‘extended’ there. All digital objects can be extended at any time.

@hardistyar I agree. DataCite went through an activity-based cost exercise last year where we considered the value chain for our community and analysed how we continue to sustain and scale our infrastructure services long term. We found it helpful to consider this from two perspectives: (1) supply - the registration of the PIDs and metadata deposit and (2) aggregation and use of metadata and related services. Each stakeholder in the community has an impact on these and it was then possible to define a model that aligned with cost recovery. It may be worthwhile to add that POSI recommends going beyond basic cost-recovery and establishing contingency funds to support operations for 12 months.

Finally, as you aptly mention, there should be no expectation that researchers/authors/others should have to pay per PID to either obtain or resolve one.

To some extent this is true, but generic means people tend to squeeze different kinds of information into a single field in a general purpose schema because they can’t find somewhere else to put it. In this example, both the catalog number and the scientific name have been squeezed into the ‘title’ field of the DataCite schema, and the collector name has been given as the creator of the record, when in fact there are two collectors listed in Arctos and it’s not clear which one of them actually picked the worm by hand. Surely, they didn’t do it as a couple! The content of the title field, while OK for humans, is not helpful for machines unless you know the structure a priori. Worse, different people/systems most likely shoehorn data into fields in different ways!

I advocate that we need a specific schema for registering Digital extended Specimens, which is in fact the same schema as for the DS itself (the openDS schema(s) under development by DiSSCo) - because often we cannot distinguish when something is data or metadata. Depending on purpose we might want to process it either way.

I don’t see how PIDs could be considered “displacement” costs - in general I think this might involve adding something to physical objects as well as their digital representation if the two are intended to stay connected, but adding PIDs is a new thing (and therefore an additional cost) no matter which way you slice it. No matter how we distribute the cost - it is an additional cost to someone, somewhere.

This is my accounting side speaking…

So far, I had thought that the PID for a physical specimen or a digital object is a quite straightforward issue, once one has decided on an agency generating and maintaining it. Already the discussion here has suggested that it isn’t. During the meeting of the CWFR-working group last Friday (see my post under Topic 10 and its update) part of the discussion revolved around PIDs, their required functionality, the information they store and a range of characteristics. Here is a summary of some key elements, as I understood them:

  • PIDs need to be FDOs (FAIR digital objects) themselves
  • The persistence of PIDs needs to be defined by the users or institutions, and this information needs to be stored somewhere (within the PID or on the platform of the Registration Agency?).
    • Generally, long-term PIDs should be maintained for at least e.g. 20-30 years. I guess it is not 100+ years, since technical development might mean that any approach might have become obsolete after a human generation.
    • In contrast, PIDs assigned during a research project (e.g. for preliminary versions of images, DNA sequences, alignments) might only be meaningful for the duration of the project, i.e. until the publication of a final version.
    • The information about the duration of valid persistence needs to be machine-readable.
  • PIDs might have histories themselves, depending on their predefined characteristics:
    • The physical object - data - metadata represented by a PID might be versioned and thus the entity associated with the PID is mutable. Eg. consider the PID of a health marker analyzed from blood samples or feces gathered from a specific individual (or clone, microbe lineage, …) over time. Its results might be collected as time series data in a table, which grows incrementally with each added result. The PID refers to the table with its changing content.
    • Alternatively, PIDs might refer to a static entity, cp. snapshots of the entity at a certain time. Thus, the data associated with the PID is immutable. Here, eg. each version of the table in the time series gets its own PID.
  • PIDs can be considered shadows of the public and/or private data that they represent.
    • If I understood the discussion right, there was support for the idea that PIDs should store license information.
    • This is so, since via the PID it should be possible to find existing resources. However, before access is granted to the (meta) data themselves, you/a machine need(s) to read and understand if the resource you/it want(s) access to is all or in parts open/accessible, or if access is regulated by a license (it should be, to avoid surprises later on …). If a license is attached, you/the machine need(s) to agree to/fulfill the license requirements (which might need human intervention).
    • In this way, PIDs allow us to talk about (private) data and/or metadata as if they were there, even if they are not. Only once access rights are granted are the (meta)data transferred or made visible.
  • PIDs can be assigned to metadata and data.
    • Thus, hierarchies of PIDs granting ever more access can be created: first you only see the PID → then you pass the license to all or some metadata and might see more PIDs → via these additional PIDs at the lower level, you pass the licenses to more metadata → then the PID-attached license to the data (to varying degrees) → at one point it might be easier to actually talk to a human and ask for access (→ cp. use agreement, Topic 8)
    • Accordingly, metadata and data can have different PIDs and licenses.
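
The points in this summary can be pictured as a small machine-readable PID record. The sketch below is only an illustration under stated assumptions: the class, the field names, the example license string and the access check are all hypothetical and do not come from the CWFR discussion or from any existing FDO specification.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PidRecord:
    pid: str
    valid_until: Optional[date]  # machine-readable end of guaranteed persistence; None = open-ended
    mutable: bool                # True: PID follows a changing entity; False: PID names a snapshot
    license: Optional[str]       # None means openly accessible in this sketch

    def is_persistent_on(self, day: date) -> bool:
        """Is the persistence guarantee still in force on the given day?"""
        return self.valid_until is None or day <= self.valid_until

    def may_access(self, accepted_licenses: set) -> bool:
        """Release (meta)data only if there is no license or it has been accepted."""
        return self.license is None or self.license in accepted_licenses
```

A project-internal PID for a preliminary alignment might carry a `valid_until` a few years ahead and `mutable=True`, while each snapshot of a time-series table would get its own record with `mutable=False`. A client can read the record, learn that a license applies, and only then decide whether to request the data itself.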