7. Persistent identifier (PID) schemes

@JuttaBuschbom Thank you for these points, some content of which is also relevant to topic 8. Meeting legal/regulatory, ethical and sensitive data obligations and even to topics 10. Transactional mechanisms and provenance and 6. Robust access points and data infrastructure alignment as well. Let me try to go through them one by one, quoting only the main bullet points, not the subsidiary ones in an attempt to keep this as short as I can. The reader may need to refer back to your post for details.

This isn’t true at all. A PID is just an identifying string – a name like mine or yours, or more obscure(!). A PID itself has no characteristics that would put it into some class of digital objects. However, a PID can be persistently and reliably resolved to digitally actionable meaningful information about the identified thing - such as an FDO, of which a Digital extended Specimen is a kind.
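To make the "just a string that can be resolved" point concrete, here is a minimal sketch that turns a PID string into a resolver URL. It assumes the public proxies at doi.org and hdl.handle.net; the function name and the example identifiers are placeholders made up for illustration, not real DiSSCo identifiers.

```python
from urllib.parse import quote


def resolver_url(pid: str) -> str:
    """Build a resolver URL for a Handle or DOI name (illustrative helper)."""
    pid = pid.strip()
    if pid.lower().startswith("doi:"):
        pid = pid[4:]
    prefix, sep, suffix = pid.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix identifier: {pid!r}")
    # DOI prefixes start with "10."; other Handle prefixes go to the Handle proxy.
    base = "https://doi.org/" if prefix.startswith("10.") else "https://hdl.handle.net/"
    return base + quote(prefix) + "/" + quote(suffix, safe="/")


# Placeholder identifiers, not real ones.
print(resolver_url("doi:10.1234/example-specimen"))
print(resolver_url("20.5000.9999/example-object"))
```

The string itself carries no behaviour; everything useful happens when a resolver returns information about the identified thing.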

Agreed, persistence is principally a socio-organisational construct, just as maintaining a building over the long term is mainly socio-organisational. It has technical components, of course, but principally it means a commitment to maintain the scheme and to spend money doing so. This commitment (the persistence policy) must be agreed, stored and mandated to an agency that will administer a PID scheme on behalf of a community of stakeholders.

In the heritage sector, preservation and curation timescales are measured in many decades, so it seems reasonable to think in similar terms for a PID scheme identifying digital objects related to heritage and preservation. Adopting a scheme that is somewhat/quite immune to underlying technological changes, as DiSSCo will do for DS, supports that. DiSSCo technical experts believe that the Handle System (on which DOIs are based) can endure for 100+ years, even with changes in underlying technologies.

As well as using PIDs with long persistence, there are many cases where PIDs with shorter, more transient persistence are sufficient, even on per-project timescales of just one, two or three years. Not all PIDs have to have the same characteristics and policies. DiSSCo, for example, will use (an)other Handle System scheme variant(s) with different characteristics and requirements for identifying other kinds of digital object, including those with shorter persistence times and/or of a more transient nature.

All information associated with PIDs must be machine-readable so that PIDs can play their full role in helping research data infrastructures to achieve the goal of becoming FAIR infrastructures.

‘FAIRness’ is a characteristic exhibited by an infrastructure (or a component of an infrastructure) and the data it manages when that infrastructure maintains compliance with the Guiding Principles of FAIR (Findable, Accessible, Interoperable, and Reusable). This characteristic must now be protected throughout the lifetime of any data infrastructure. As we have discussed in our article on FAIR Data and Services in Biodiversity Science and Geoscience, achieving FAIRness is substantially assisted by adopting Digital Object Architecture, as DiSSCo proposes, and treating digital specimen and related data as FAIR Digital Objects.

I prefer to look at the history of a PID (as opposed to the history of the identified object) more straightforwardly in terms of series of states a PID can have, such as (for example): i) created but not yet assigned to an object; ii) actively identifying and resolving to an object that exists; iii) identifying an object that no longer exists and resolving to a ‘tombstone’ that gives a reason why the object no longer exists (including ‘unknown’).
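The series of states above can be sketched as a small state machine. This is only an illustration under my own assumptions: the class, method and state names are made up, the allowed transitions are one plausible reading of the list, and none of this is a Handle System or DOI API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PidState(Enum):
    RESERVED = "created, not yet assigned to an object"
    ACTIVE = "identifies and resolves to an existing object"
    TOMBSTONED = "object no longer exists; resolves to a tombstone"


# Assumed transitions: a tombstoned PID never becomes active again.
ALLOWED = {
    PidState.RESERVED: {PidState.ACTIVE},
    PidState.ACTIVE: {PidState.TOMBSTONED},
    PidState.TOMBSTONED: set(),
}


@dataclass
class PidRecord:
    pid: str
    state: PidState = PidState.RESERVED
    target: Optional[str] = None            # where the identified object lives
    tombstone_reason: Optional[str] = None  # set only once tombstoned

    def assign(self, target: str) -> None:
        self._move(PidState.ACTIVE)
        self.target = target

    def retire(self, reason: str = "unknown") -> None:
        self._move(PidState.TOMBSTONED)
        self.target = None
        self.tombstone_reason = reason

    def _move(self, new_state: PidState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.name} -> {new_state.name} is not allowed")
        self.state = new_state


record = PidRecord("20.5000.9999/example")      # placeholder identifier
record.assign("https://example.org/objects/1")  # placeholder location
record.retire()                                 # reason defaults to 'unknown'
```

Keeping the tombstone reason on the PID record itself means the history of the PID stays readable even after the object has gone.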

Indeed, the question of whether something identified by a PID is mutable or immutable (both are possible) is important, with consequences. At the moment, we consider that Digital extended Specimens are mutable. When PIDs identify mutable objects, additional mechanisms are needed to identify and associate the different versions of the object as it evolves. There are several different ways to do this (I won’t go into those here). However, an important question with mutable objects is what makes a version sufficiently different from its predecessor to be worth tracking. Not every change has to result in a new version, just as not every change to computer software instantly results in a new version/snapshot/release of that software. With immutable objects the situation is different. Any operation that could change the content of an object must instead create and identify a new object. This results in a massive proliferation of identifiers in a realm (bio/geodiversity sciences) where we already estimate tens of billions of identifiers will be needed in the future.
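One possible way to express "not every change has to result in a new version" is a policy that names which fields matter. The sketch below is only an assumption for illustration: the field names, the `VERSIONED_FIELDS` policy and the class are hypothetical and not how DiSSCo does (or will do) versioning.

```python
from dataclasses import dataclass, field

# Hypothetical policy: only changes to these fields count as a new version.
VERSIONED_FIELDS = {"scientificName", "locality", "collectionDate"}


@dataclass
class MutableObject:
    pid: str                       # stays the same across versions
    content: dict
    version: int = 1
    history: list = field(default_factory=list)  # (version, snapshot) pairs

    def update(self, changes: dict) -> bool:
        """Apply changes; return True if they produced a new version."""
        significant = any(
            key in VERSIONED_FIELDS and self.content.get(key) != value
            for key, value in changes.items()
        )
        if significant:
            # Keep a snapshot of the outgoing version before changing it.
            self.history.append((self.version, dict(self.content)))
            self.version += 1
        self.content.update(changes)
        return significant


specimen = MutableObject("placeholder-pid", {"scientificName": "A", "note": "x"})
minor = specimen.update({"note": "y"})            # annotation only
major = specimen.update({"scientificName": "B"})  # re-determination
```

With an immutable design, by contrast, both updates would have to mint a new object and a new identifier, which is where the proliferation comes from.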

As a rule, PIDs are NEVER re-assigned (or re-used) to identify a different object. When a PID is ‘finished with’ it must be re-directed to a tombstone as above. For PIDs with transient persistence, it may be acceptable to have a re-use policy, provided the scheme is designed to ensure that confusions can never arise.

In the sense that, if a PID exists then there is something behind it that is now identified. However, usage conditions/licensing are not the means by which access to the thing should be controlled.

Generally speaking, usage conditions and licenses state the terms under which you are allowed to use the data and the obligations upon you. Controlling access to data is a separate issue based on: a) the danger you represent to the organization holding the data or to the thing/place described by the data; and b) the (official, trusted) role you have in relation to uses of the data. If you are not trusted to use the data in the expected or controlled manner then no usage condition can police that. Anyone can tick a checkbox saying they’ll abide by the terms of use.

Indeed, PIDs should point to usage terms and conditions relevant to what is being identified, as part of the descriptive metadata of that thing. For reasons I’ve explained in topic 8 (here and here) this isn’t or shouldn’t always be referred to as ‘license conditions’ but as ‘usage conditions’ that can in some circumstances include licensing.

As a rule in an open world, metadata should not have usage conditions associated with it. No one, for example, exercises control over the conditions under which the title/author/publication/date/etc. details of individual books and journal articles (the metadata) are used. (The situation is different if you copy a publisher’s entire list/database of publications metadata, add new value to it and re-publish in a new form. Then you would clearly be in breach of some terms.)

Yes, this is correct. PIDs can identify anything that needs to be identified. Nevertheless, we must take care because it is not always so easy to distinguish between data and metadata. It depends on the purpose you are using the information for.

Take some standard information about a specimen, for example: what it is, a place it occurs (or may occur) in, and the date (say, year) when it was collected. When using that information to search for examples of such specimens across collections, it is metadata (strictly speaking, metainformation) because it describes the thing(s) you are looking for. When a list of multiple items of such information is used as the basis of an analysis of the known places and times of occurrence of the species represented by the specimens you found, then it is not metadata but data. The distinction is artificial, but sometimes it can be useful to say ‘the information this data represents is describing some other data’, in which case name it as metadata. There are many cases where the description of data can be easily separated from the data itself, but in our domain this is not so. And so, with Digital extended Specimens we have to be careful because a lot of the information we deal with can be both data and metadata at the same time. Therefore, we do foresee metadata objects separate from DS objects.

DiSSCo has chosen DOI Foundation DOIs for Digital extended Specimens by evaluating twenty-two Handle System PID scheme options. Our evaluation was published 6th July in Research Ideas and Outcomes (RIO) journal. If you’d like to find out how we evaluated and why we chose DOIs, read the article: A choice of persistent identifier schemes for the Distributed System of Scientific Collections (DiSSCo) by @hardistyar, @waddink, fglöckner, @aguentsch, @sharif.islam and @cweiland.

It covers DiSSCo requirements for a Handle-based PID scheme for DS, explanation of the different Handle System variants and modes of operationalising the use of these, as well as discussion of the value of branding identifiers to address a specific community of use, and the next steps we need to take to implement the chosen scheme.

The first of the next steps has already been taken, with the DiSSCo Coordination and Support Office becoming a member of the DOI Foundation, and with moderating the present consultation topic on PID schemes. Further steps include developing the governance, operations, financing and service portfolio models for a new Registration Agency in collaboration with the DOI Foundation to provide durable persistent identifiers for DS.

I think a PID is more than that? A PID also has measures in place to make it persistent and resolvable; otherwise it would just be an identifier (ID).

Well, perhaps I almost shot myself in the foot! I meant it’s just a string and not an object. But then I went on and said it can be persistently and reliably resolved. So we agree. :)

Thank you for this - I am certain that most people in the collections community have no idea.


This was so very informative! Thank you! Now I just need to get about 100 people to read it.


I agree that NSid seems too broad and likely to lead to confusion. Based on the discussion, it seems like it’s still not clear what the proposed DOIs are going to identify. Once the scope is more defined, I think it will be possible to come up with a brand name that reflects and reinforces it.

Just saw this on Twitter (https://twitter.com/salgo60/status/1417033942583107586) about best practices for designing a system with an external identifier scheme like Wikidata.

Even though this is Wikidata and Linked Data focused, it has some good pointers for us about quality, data round-tripping, new skills and maturity levels.

https://www.wikidata.org/wiki/User:Salgo60/ExternalIdentifiers

Like I did for Topic 6, I’ll leave a few points responding to the questions up top based on our experience with the Movebank platform for animal-borne sensor data, also addressing some of the discussion above.

Question 1 (possible role of DOIs): We imagine DES PIDs as something that data owners could optionally register for when managing their data within studies (i.e., datasets) on Movebank.

  • Within Movebank, this would allow data owners (currently over 3,000) to identify the same data or animals stored in multiple studies in Movebank. This occurs, for example, when the same data are included in different studies managed by different co-owning organizations; when data subsets from the same long-term project are stored separately, e.g., to represent data subsets analyzed in specific publications; or when multiple movement trajectories are estimated from the same sensor measurements using different data processing algorithms. Currently, users create non-persistent identifiers, but these are not always consistent across studies due to local naming schemes that vary over time or differ between use/analysis contexts or co-owning organizations.
  • Beyond Movebank, PIDs could be used to define data stored in different platforms. For example, here is the first data availability statement I’m aware of (Gu et al., 2021) that refers to animal tracking and genome sequence data archived at Movebank and GenBank, respectively:
    Climate-driven flyway changes and memory-based long-distance migration | Nature
    It could also help ensure that unique organisms are correctly identified if someone obtains the “same” information from different sources: Owners often archive data at Movebank and also at other tracking data platforms (like the Seabird Tracking Database or MOTUS), and some will eventually be published at GBIF. Data miners will acquire the same data in different formats from multiple sources and could derive misleading information if they can’t easily recognize and address overlap and duplication. The use of PIDs could enable protocols to deal with this.
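As a rough sketch of the kind of protocol a shared PID could enable, the snippet below groups records from several platforms by PID and reports where the same organism appears in more than one source. The record layout, field names and identifiers are invented for illustration; they are not the export format of Movebank, GBIF or any other platform.

```python
from collections import defaultdict


def group_by_pid(records):
    """Group records from different platforms by their shared PID."""
    groups = defaultdict(list)
    unidentified = []  # records without a PID cannot be matched reliably
    for record in records:
        pid = record.get("pid")
        if pid:
            groups[pid].append(record)
        else:
            unidentified.append(record)
    return dict(groups), unidentified


def overlapping(groups):
    """Return PIDs whose records come from more than one source."""
    return {
        pid: sorted({r["source"] for r in recs})
        for pid, recs in groups.items()
        if len({r["source"] for r in recs}) > 1
    }


# Made-up records: the same animal under different local names.
records = [
    {"pid": "pid-A", "source": "Movebank", "local_id": "stork-17"},
    {"pid": "pid-A", "source": "GBIF", "local_id": "occ-993"},
    {"pid": "pid-B", "source": "Movebank", "local_id": "stork-18"},
    {"pid": None, "source": "MOTUS", "local_id": "tag-5"},
]
groups, unidentified = group_by_pid(records)
print(overlapping(groups))
```

Without the shared PID, the only option is to guess from local names and coordinates, which is exactly the overlap problem described above.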

Question 6 (challenges to implementation):

  • Deciding what a DES PID can represent. In our case, I would hope that the/a core goal is to trace information back to a unique individual organism. While a broader scope for what a DES PID defines could make them usable by more parties and types of collections, if it is not clear what a PID represents, that could limit their usefulness or lead to inaccurate interpretation of data.
  • Deciding who should be responsible for registering the PIDs. It seems like the original specimen/data holder needs to be responsible for registering and maintaining the DOIs and related information. If data platforms that serve multiple data holders are responsible, we’d quickly run into the problem of multiple DOIs being created for the same animal/specimen. (In the case above, the value of the DES PID is only achieved if the same PIDs are used at both Movebank and GenBank.) This even occurs within GBIF, where there is a smaller number of approved data providers, e.g., as illustrated by the recent data-clustering feature to identify possible related or duplicate records.
  • Feedback on DOI maintenance: In our experience minting DOIs and maintaining DataCite metadata for datasets, one relevant challenge is in versioning and identifying relationships, i.e., when simple sequential relationships are not accurate (DataCite’s relationType property doesn’t have a good option for overlapping data, or related resources without one identified as the original/complete version).
  • Supporting maintenance globally. The more resources are available, the more work can be done to verify use of PIDs; prevent, identify and merge duplicate records; find and apply PIDs to link additional relevant data, retain valid contact information for PID holders, etc. The extent of maintenance will impact the reliability and usefulness of the PIDs.
  • Identifying what information can/must be included with a registered DES PID. I hope the system would minimize requirements for compliance (to increase participation). In our case, core information should include a taxon and any other identifiers related to the PID (ring numbers, sensor IDs, nicknames used for public engagement).

Question 7 is closely related to the question posed under topic 10: “There has been some interesting discussion so far that seems to be getting at the building blocks of the DES system, so, in putting together a network of Digital Extended Specimens, what constitutes the building blocks of a ‘digital specimen’?” This raises a question for this topic too: do they all need their own digital objects, and what PIDs are needed to identify these building blocks? Are there existing PID schemes that could be used for some of these building blocks?

Wanted to point to a couple of other Digital Extended Specimen PID use cases:

  1. An effort led by my colleague Ana Sequeira at the University of Western Australia Oceans Institute and School of Biological Sciences, outlined in a recent publication:
    A standardisation framework for bio-logging data to advance ecological research and conservation
    It directly calls for unique PIDs for animals to enable standardization of bio-logging data and to recognize the same data for individuals stored in different datasets and platforms: “To guarantee the uniqueness of OrganismID used, we suggest that Darwin Core guidelines (dwc.tdwg.org/terms/#dwc:organismID) could be used to assign a unique-per-animal ID code with a URN (Uniform Resource Name) notation that is adopted across national and international data centres.”

  2. Through a project recently funded through NLBIF led by Peter Desmet at the Research Institute for Nature and Forest, we’ll be developing methods and policy for publishing GPS animal tracking data from Movebank to GBIF. Here a unique animal ID would likewise help ensure that users recognize data for the same animals stored in different places (in this example, UvA-BiTS, Movebank, the Movebank Data Repository, Zenodo, and GBIF).

I think these questions will be use case and context specific and we might need to drill down more to align the technical solutions with day-to-day workflow.

Two overarching questions come to my mind (using the PROV terminologies here for agent and entity):

  1. Does the entity (or digital object) need to be referred to by other agents or entities?
  2. Does the entity need to be linked with other entities or agents?

If yes, then the use case of the identifier can be further broken down to different levels and the use case might determine if all components need a DOI or other types of identifiers.

Level 0: Informal identifiers with no standards and policies (like filenames, ad hoc naming schemes)
Level 1: Local uniqueness (uniqueness is guaranteed within a domain)
Level 2: Global uniqueness (uniqueness is guaranteed via established standards and practices)

There are other factors that could be included here: levels and types of persistence (see Kunze et al. 2017), SLAs, and machine readability and actionability (not all entities might be required to have machine-readable metadata).

Thinking about a hypothetical example here: a storage unit for a specific collection might require a locally unique identifier, since it needs to be tracked and linked with the specimens. But this might not need to be exposed to the wider world. This local identifier could still be part of the DES building blocks. Maybe others have better/real-world examples?
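To make the storage unit example a bit more tangible, here is a minimal sketch of informal, local and global identifiers, with a locally identified storage unit linked to globally identified specimens. All names, the `domain:value` qualification and the identifiers are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum


class Uniqueness(Enum):
    INFORMAL = "no standards or policies (filenames, ad hoc names)"
    LOCAL = "unique within one domain"
    GLOBAL = "unique via established standards and practices"


@dataclass(frozen=True)
class Identifier:
    value: str
    uniqueness: Uniqueness
    domain: str = ""  # needed to disambiguate local identifiers

    def qualified(self) -> str:
        """Return a form that stays unambiguous outside the home domain."""
        if self.uniqueness is Uniqueness.GLOBAL:
            return self.value
        if self.uniqueness is Uniqueness.LOCAL and self.domain:
            return f"{self.domain}:{self.value}"
        raise ValueError("informal identifiers cannot be safely shared")


@dataclass
class StorageUnit:
    identifier: Identifier                              # local is enough here
    specimen_pids: list = field(default_factory=list)   # links to global PIDs


# Placeholder values: a cabinet known only inside one institution.
cabinet = StorageUnit(Identifier("cabinet-12", Uniqueness.LOCAL, "museum-a"))
cabinet.specimen_pids.append(
    Identifier("10.1234/example-specimen", Uniqueness.GLOBAL)
)
```

The storage unit never needs a DOI of its own, yet it can still take part in the graph of building blocks because the link to the specimens is explicit.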