Until now, I had thought that a PID for a physical specimen or a digital object was a fairly straightforward matter once one had decided on an agency to generate and maintain it. The discussion here has already suggested that it isn’t. During the meeting of the CWFR working group last Friday (see my post under Topic 10 and its update), part of the discussion revolved around PIDs: their required functionality, the information they store and a range of other characteristics. Here is a summary of some key elements, as I understood them:
- PIDs need to be FDOs (FAIR digital objects) themselves
- The persistence of PIDs needs to be defined by the users or institutions, and this information needs to be stored somewhere (within the PID or on the platform of the Registration Agency?).
- Generally, long-term PIDs should be maintained for at least, e.g., 20-30 years. I guess it is not 100+ years, since technical development might make any approach obsolete within a human generation.
- In contrast, PIDs assigned during a research project (e.g. for preliminary versions of images, DNA sequences, alignments) might only be meaningful for the duration of the project, i.e. until the publication of a final version.
- The information about the duration of valid persistence needs to be machine-readable.
- PIDs might have histories themselves, depending on their predefined characteristics:
- The physical object, data or metadata represented by a PID might be versioned, so the entity associated with the PID is mutable. E.g., consider the PID of a health marker analyzed from blood samples or feces gathered from a specific individual (or clone, microbe lineage, …) over time. The results might be collected as time series data in a table, which grows incrementally with each added result. The PID refers to the table with its changing content.
- Alternatively, PIDs might refer to a static entity, cp. snapshots of the entity at a certain time. In that case, the data associated with the PID is immutable. Here, e.g., each version of the table in the time series gets its own PID.
- PIDs can be considered shadows of the public and/or private data that they represent.
- If I understood the discussion correctly, there was support for the idea that PIDs should store license information.
- The reason is that it should be possible to find existing resources via the PID. However, before access is granted to the (meta)data themselves, you or a machine need to read and understand whether the resource in question is wholly or partly open, or whether access is regulated by a license (it should be, to avoid surprises later on …). If a license is attached, you or the machine need to agree to or fulfill the license requirements (which might need human intervention).
- In this way, PIDs allow us to talk about (private) data and/or metadata as if they were there, even if they are not. Only once access rights have been granted are the (meta)data transferred or made visible.
- PIDs can be assigned to metadata and data.
- Thus, hierarchies of PIDs granting ever more access can be created: first you only see the PID → then you pass the license to all or some metadata and might see more PIDs → via these additional PIDs at the lower level, you pass the licenses to more metadata → then the PID-attached license to the data (to varying degrees) → at some point it might be easier to actually talk to a human and ask for access (→ cp. use agreement, Topic 8).
- Accordingly, metadata and data can have different PIDs and licenses.
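
To make the points above a bit more concrete, here is a minimal Python sketch of how I picture a PID record carrying a machine-readable persistence period, a mutable/immutable flag and a license, with access granted level by level. This is only an illustration under my own assumptions: `PidRecord`, `resolve`, the field names and the `pid:example/...` identifiers are all hypothetical and do not correspond to any existing PID system or to anything the working group specified.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Set


@dataclass
class PidRecord:
    """Hypothetical PID record; all field names are illustrative assumptions."""
    pid: str
    valid_until: date        # machine-readable end of the promised persistence
    mutable: bool            # True: PID follows a changing entity; False: snapshot
    license: Optional[str]   # None means no license has to be accepted
    children: List["PidRecord"] = field(default_factory=list)
    payload: Optional[dict] = None  # the (meta)data the PID is a "shadow" of


def resolve(record: PidRecord, accepted: Set[str], today: date) -> dict:
    """Return what a client may see, given the licenses it has accepted."""
    # The PID-level information is always visible, even for private data.
    view = {
        "pid": record.pid,
        "valid_until": record.valid_until.isoformat(),
        "mutable": record.mutable,
        "license": record.license,
    }
    if today > record.valid_until:
        view["status"] = "expired"
        return view
    if record.license is not None and record.license not in accepted:
        # The resource can be found and talked about, but nothing is transferred.
        view["status"] = "license-required"
        return view
    view["status"] = "granted"
    view["payload"] = record.payload
    # Passing this level reveals the PIDs one level down, each with its own license.
    view["children"] = [resolve(c, accepted, today) for c in record.children]
    return view


# Metadata is open; the data snapshot behind it requires a license.
data = PidRecord("pid:example/data-1", date(2045, 1, 1), False, "CC-BY-4.0",
                 payload={"rows": 3})
meta = PidRecord("pid:example/meta-1", date(2045, 1, 1), True, None,
                 children=[data], payload={"title": "health marker time series"})

today = date(2025, 1, 1)
anonymous = resolve(meta, set(), today)
licensed = resolve(meta, {"CC-BY-4.0"}, today)

print(anonymous["children"][0]["status"])  # license-required
print(licensed["children"][0]["status"])   # granted
```

In this sketch the metadata and the data have separate PIDs and separate licenses, and a client that has not accepted the data license still learns that the data exist, how long the PID is promised to persist and which license applies. A real system would of course also need the step where a human is asked for access, which is left out here.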