7. Persistent identifier (PID) schemes

@JuttaBuschbom Thank you for these points, some of which are also relevant to topic ‘8. Meeting legal/regulatory, ethical and sensitive data obligations’ and even to topics ‘10. Transactional mechanisms and provenance’ and ‘6. Robust access points and data infrastructure alignment’. Let me try to go through them one by one, quoting only the main bullet points and not the subsidiary ones, in an attempt to keep this as short as I can. The reader may need to refer back to your post for details.

This isn’t true at all. A PID is just an identifying string – a name like mine or yours, or more obscure(!). A PID itself has no characteristics that would put it into some class of digital objects. However, a PID can be persistently and reliably resolved to digitally actionable meaningful information about the identified thing - such as an FDO, of which a Digital extended Specimen is a kind.

Agreed, persistence is principally a socio-organisational construct, just as maintaining a building over the long term is mainly socio-organisational. It has technical components, of course, but principally it means a commitment to maintain and to spend the money needed to do so. This commitment (the persistence policy) must be agreed, stored and mandated to an agency that will administer a PID scheme on behalf of a community of stakeholders.

In the heritage sector, preservation and curation timescales are measured in many decades, so it seems reasonable to think in similar terms for a PID scheme identifying digital objects related to heritage and preservation. Adopting a scheme that is largely immune to underlying technological changes, as DiSSCo will do for DS, supports that. DiSSCo technical experts believe that the Handle System (on which DOIs are based) can endure for 100+ years, even with changes in underlying technologies.

As well as PIDs with long persistence, there are many cases where PIDs with shorter, more transient persistence are sufficient, even on per-project timescales of just one, two or three years. Not all PIDs have to have the same characteristics and policies. DiSSCo, for example, will use one or more other Handle System scheme variants with different characteristics and requirements for identifying other kinds of digital object, including those with shorter persistence times and/or of a more transient nature.

All information associated with PIDs must be machine-readable so that PIDs can play their full role in helping research data infrastructures to achieve the goal of becoming FAIR infrastructures.

‘FAIRness’ is a characteristic exhibited by an infrastructure (or a component of an infrastructure) and the data it manages when that infrastructure maintains compliance with the Guiding Principles of FAIR (Findable, Accessible, Interoperable, and Reusable). This characteristic must now be protected throughout the lifetime of any data infrastructure. As we have discussed in our article on FAIR Data and Services in Biodiversity Science and Geoscience, achieving FAIRness is substantially assisted by adopting Digital Object Architecture, as DiSSCo proposes, and treating digital specimen and related data as FAIR Digital Objects.

I prefer to look at the history of a PID (as opposed to the history of the identified object) more straightforwardly in terms of series of states a PID can have, such as (for example): i) created but not yet assigned to an object; ii) actively identifying and resolving to an object that exists; iii) identifying an object that no longer exists and resolving to a ‘tombstone’ that gives a reason why the object no longer exists (including ‘unknown’).
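To make the idea of PID states concrete, here is a minimal sketch of the three example states as a small state machine. The state names and the set of allowed transitions are my own assumptions for illustration, not part of any PID scheme's specification.

```python
from enum import Enum


class PidState(Enum):
    """Example PID states; the names are illustrative assumptions."""
    RESERVED = "created but not yet assigned to an object"
    ACTIVE = "identifies and resolves to an object that exists"
    TOMBSTONED = "object no longer exists; resolves to a tombstone"


# Assumed transition rules: a PID only moves forward, and a tombstoned
# PID never becomes active again.
ALLOWED = {
    PidState.RESERVED: {PidState.ACTIVE},
    PidState.ACTIVE: {PidState.TOMBSTONED},
    PidState.TOMBSTONED: set(),
}


def advance(current: PidState, target: PidState) -> PidState:
    """Return the new state, or raise if the transition is not permitted."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

A real scheme may well need more states (or different transitions) than these three; the point is only that the PID's history is a short sequence of states, separate from the history of the object it identifies.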

Indeed, the question of whether something identified by a PID is mutable or immutable (both are possible) is important, with consequences. At the moment, we consider that Digital extended Specimens are mutable. When PIDs identify mutable objects, additional mechanisms are needed to identify and associate the different versions of the object as it evolves. There are several different ways to do this (I won’t go into those here). However, an important question with mutable objects is what makes a version sufficiently different from its predecessor to be worth keeping track of. Not every change has to result in a new version, just as not every change to computer software instantly results in a new version/snapshot/release of that software. With immutable objects the situation is different. Any operation that could change an object’s content must instead create a new object with its own identifier. This results in a massive proliferation of identifiers in a realm (bio/geodiversity sciences) where we already estimate tens of billions of identifiers will be needed in the future.

As a rule, PIDs are NEVER re-assigned (or re-used) to identify a different object. When a PID is ‘finished with’ it must be re-directed to a tombstone as above. For PIDs with transient persistence, it may be acceptable to have a re-use policy, provided the scheme is designed to ensure that confusions can never arise.
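As a rough sketch of the ‘never re-assign, redirect to a tombstone’ rule, the toy registry below refuses to assign a PID twice and resolves a retired PID to a tombstone carrying a reason. The class, its method names and the example PID string are hypothetical; this is not how the Handle System or any particular scheme implements it.

```python
class PidRegistry:
    """Toy in-memory registry (hypothetical), for illustration only."""

    def __init__(self):
        self._records = {}  # pid -> {"location": ..., "tombstone": ...}

    def assign(self, pid, location):
        # A PID that has ever been assigned can never identify something else.
        if pid in self._records:
            raise ValueError(f"{pid} has already been assigned; no re-use")
        self._records[pid] = {"location": location, "tombstone": None}

    def retire(self, pid, reason="unknown"):
        # The PID stays in the registry; only its target changes.
        record = self._records[pid]
        record["location"] = None
        record["tombstone"] = {"reason": reason}

    def resolve(self, pid):
        # Retired PIDs resolve to their tombstone, never to nothing.
        record = self._records[pid]
        return record["tombstone"] or {"location": record["location"]}


registry = PidRegistry()
registry.assign("example/123", "https://example.org/objects/123")
registry.retire("example/123", reason="object withdrawn")
print(registry.resolve("example/123"))
```

A scheme for transient PIDs with a re-use policy would need a different `assign` rule, plus whatever safeguards that policy requires to rule out confusion.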

In the sense that, if a PID exists then there is something behind it that is now identified. However, usage conditions/licensing are not the means by which access to the thing should be controlled.

Generally speaking, usage conditions and licenses state the terms under which you are allowed to use the data and the obligations upon you. Controlling access to data is a separate issue based on: a) the danger you represent to the organization holding the data or to the thing/place described by the data; and b) the (official, trusted) role you have in relation to uses of the data. If you are not trusted to use the data in the expected or controlled manner then no usage condition can police that. Anyone can tick a checkbox saying they’ll abide by the terms of use.
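The separation can be sketched in a few lines. In this hypothetical model the access decision looks only at the requester’s trusted role, while the usage conditions are descriptive text that plays no part in it. All class, field and role names are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Requester:
    role: str             # e.g. "curator" or "public" (hypothetical roles)
    accepted_terms: bool  # whether they ticked the terms-of-use checkbox


@dataclass
class Dataset:
    usage_conditions: str     # what you may do with the data once you have it
    trusted_roles: frozenset  # roles the holding organization trusts with access


def may_access(requester: Requester, dataset: Dataset) -> bool:
    # accepted_terms is deliberately ignored here: ticking a checkbox
    # says nothing about whether the requester can be trusted.
    return requester.role in dataset.trusted_roles


sensitive = Dataset(
    usage_conditions="Attribution required; do not publish precise localities.",
    trusted_roles=frozenset({"curator"}),
)
print(may_access(Requester("public", accepted_terms=True), sensitive))
print(may_access(Requester("curator", accepted_terms=True), sensitive))
```

In practice the trust decision would rest on something far richer than a single role string, but the shape is the same: access control and usage conditions are answered by different mechanisms.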

Indeed, PIDs should point to usage terms and conditions relevant to what is being identified, as part of the descriptive metadata of that thing. For reasons I’ve explained in topic 8 (here and here), these shouldn’t always be referred to as ‘license conditions’ but rather as ‘usage conditions’, which can in some circumstances include licensing.

As a rule in an open world, metadata should not have usage conditions associated with it. No-one, for example, exercises control over the conditions under which the title/author/publication/date/etc. details of individual books and journal articles (the metadata) are used. (The situation is different if you copy a publisher’s entire list/database of publications metadata, add new value to it and re-publish in a new form. Then you would clearly be in breach of some terms.)

Yes, this is correct. PIDs can identify anything that needs to be identified. Nevertheless, we must take care because it is not always so easy to distinguish between data and metadata. It depends on the purpose you are using the information for.

Take some standard information about a specimen, for example: what it is, a place where it occurs (or may occur), and the date (say, the year) when it was collected. When you use that information to search for examples of such specimens across collections, it is metadata (strictly speaking, metainformation) because it describes the thing(s) you are looking for. When a list of multiple items of such information is used as the basis of an analysis of the known places and times of occurrence of the species represented by the specimens you found, then it is not metadata but data. The distinction is artificial but sometimes it can be useful to say ‘the information this data represents is describing some other data’, in which case name it as metadata. There are many cases where the description of data can be easily separated from the data itself but in our domain this is not so. And so, with Digital extended Specimens we have to be careful because a lot of the information we deal with can be both data and metadata at the same time. Therefore, we do foresee metadata objects separate from DS objects.
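The dual role is easy to see in a small sketch: the same three fields act as metadata when used to find specimens and as data when aggregated for analysis. The records, field names and function names below are invented purely for illustration.

```python
from collections import Counter

# Invented example records: what it is, where, and when it was collected.
records = [
    {"taxon": "Bellis perennis", "country": "NL", "year": 1998},
    {"taxon": "Bellis perennis", "country": "DE", "year": 2004},
    {"taxon": "Bellis perennis", "country": "NL", "year": 2011},
    {"taxon": "Quercus robur", "country": "NL", "year": 1987},
]


def find_specimens(items, taxon):
    """Use the fields as metadata: they describe the things being sought."""
    return [item for item in items if item["taxon"] == taxon]


def occurrences_by_country(items):
    """Use the same fields as data: they are the input to an analysis."""
    return Counter(item["country"] for item in items)


daisies = find_specimens(records, "Bellis perennis")
print(occurrences_by_country(daisies))
```

Nothing about the records themselves changes between the two calls; only the purpose they are used for does.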