Making FAIR data for specimens accessible

In all presentations on digital specimens, a bit sequence is presented at the kernel of the digital object. This gives the impression that a digital object would be created as metadata for any digital representation of that specimen, be that an image, a genome, a sample, etc. The metadata for all these digital objects would therefore have links to all the other digital objects related to that specimen.

The other view is that the digital specimen doesn’t contain the bit sequence at all, but is instead a collection of links, with minimal metadata, that ties all the objects together.

Which is it, or is a hybrid imagined?

I think this is an important distinction that gets to the issue for how these data are hosted and accessed.

If it is to be a hybrid, much like what I imagine publishers send to Crossref – article metadata plus references with DOIs – what happens when metadata in the specimen object kernel necessarily require adjustment, as is often the case with Darwin Core content? Or is there a core set of fields in the kernel of a digital specimen object that do NOT ever change and that are sufficient in number and composition to permit unambiguous linking? If, however, there are many metadata fields in the kernel of a digital specimen object, there is a real risk that any local or downstream links will rot, because the content in those fields is more likely to experience flux.

2 Likes

A digital object has four components: a PID, kernel metadata, metadata and, optionally, a bit sequence. The kernel should contain a minimal set of metadata, just enough for resolution and for defining the object type and allowed operations. This is metadata that should not change frequently. The metadata section should contain richer metadata, and that is where it gets fuzzy, since much of the specimen label data could be seen as metadata or as data depending on the use case.

Whether a digital specimen should contain a bit sequence holding all the data or just a collection of links is a matter of implementation, but currently I think it will rather be a collection of links in a central metadata catalogue (NSIDR), where the data is stored in local object stores (which could be Cordra instances) hosted by the institutions or, for smaller institutions, hosted by DiSSCo in the cloud. These likely need to remain separate from the CMS, since many people may have write access to these stores, e.g. for annotations, while a CMS may be writable only by local staff. The local object store would then contain a copy of the CMS data plus the extended specimen data, where the CMS data could be kept in sync through a DOIP interface if the CMS supports this. The PID should resolve to a landing page which could, through content negotiation, provide just a structured JSON with metadata and pointers to the data for machines.
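To make the four components concrete, here is a minimal sketch in Python. All class, field and URL names are hypothetical illustrations of the structure described above, not a DiSSCo data model:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class KernelMetadata:
    """Minimal, rarely-changing attributes: enough for resolution
    and for defining the object type and allowed operations."""
    object_type: str   # e.g. "DigitalSpecimen"
    created: str       # ISO 8601 timestamp
    location: str      # pointer to where the object is stored

@dataclass
class DigitalSpecimen:
    """Sketch of the four components: PID, kernel metadata,
    richer metadata, and an optional bit sequence."""
    pid: str                                        # persistent identifier
    kernel: KernelMetadata                          # minimal resolution metadata
    metadata: dict = field(default_factory=dict)    # fuzzier label data
    bit_sequence: Optional[bytes] = None            # optional payload
    links: dict = field(default_factory=dict)       # pointers to related objects

# A "collection of links" style digital specimen: no bit sequence,
# just metadata plus pointers to the image, sequence, etc.
ds = DigitalSpecimen(
    pid="20.5000.1025/ABC-123",
    kernel=KernelMetadata(
        object_type="DigitalSpecimen",
        created="2021-03-01T12:00:00Z",
        location="https://objectstore.example.org/ds/ABC-123",
    ),
    metadata={"scientificName": "Quercus robur", "catalogNumber": "B-12345"},
    links={"image": "https://images.example.org/img/987",
           "sequence": "https://sequences.example.org/XY123456"},
)
```

The design choice in question is then simply whether `bit_sequence` is ever populated, or whether `links` carries everything.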

1 Like

Should we have a thread to identify what these are?

How will aspirations for FAIR data be informed by need to address data sovereignty issues, as id’d in this paper introducing CARE? http://doi.org/10.5334/dsj-2020-043

4 Likes

From: Dirk Neumann 03:43 PM
And on which legal basis, and with which standards, can these data be shared? Published data is visible to a broad audience, including e.g. legislators tracing compliance or national governments restricting the use of such data in the public domain (not only, but also, current legal obligations to share or upload ‘DSI’ to databases in the public domain). Recent examples (e.g. the current discussion on the introduction of a Global Multilateral Benefit Sharing system) have strongly targeted specimen data generated and shared FAIRly (without offering benefits to the original provider countries), including discussion of retroactive coverage of data derived e.g. from historic specimens collected in the 19th century or earlier.

2 Likes

From: Barbara Magagna 03:36 PM
I would like to ask whether you are aware of the work done by the RDA I-ADOPT WG. We focus on interoperable observable property descriptions and think our work should fit into the work you are doing, and vice versa.

These legal aspects addressed by Dirk Neumann are indeed important and often overlooked. In the European Open Science Cloud study, next to FAIR the consultant introduced the concept of JUST (Judicious, Unbiased, Safe and Transparent), which helps to identify responsibilities in case of legal issues, though without addressing them directly.
linked to comment : Making FAIR data for specimens accessible - #10 by hardistyar

2 Likes

@austinmast linked data with PIDs could better trace the data back to origins

1 Like

I was wondering about this too. How to treat outgoing links from the Digital Specimens and Extended Specimens to genomic data, e.g. sequences held in INSDC databases, which might (I say might!) have a CBD/Nagoya Protocol dimension in the future?

Recall also that FAIR data does not always mean open data. The maxim is ‘as open as possible, as closed as necessary’. In Europe at least, data can be restricted on objective grounds that include intellectual property law, national security, the protection of endangered species, privacy and other regulations. I’m not familiar with other jurisdictions. Of course, whatever international agreements are made around access and benefit sharing will also have to be applied.
Here then we run into the need for technical mechanisms for access control to specimen data, which can only work on a wide-area basis and so might merit its own topic!

1 Like

@JoeMiller I prefer to say ‘data linked with PIDs’, which is different to ‘linked data’ :slight_smile:

3 Likes

@dshorthouse It’s not possible to search kernel metadata, only to return it as part of resolving the PID for the object. Searchable metadata has to be stored elsewhere.
Nevertheless, it would be interesting to have your suggestions for what information should be returned at that level (as opposed to being found by a metadata search in the database of a Registration Agency - example: https://search.crossref.org/ or https://search.datacite.org/).
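As a rough illustration of that separation – an in-memory sketch with made-up PIDs and field names, not a real resolver or Registration Agency API:

```python
# Kernel metadata is only returned by exact-match PID resolution;
# richer, searchable metadata lives in a separate index (e.g. an RA
# database like search.crossref.org or search.datacite.org).

pid_records = {  # what a resolver returns for a PID (kernel only)
    "20.5000.1025/ABC-123": {
        "objectType": "DigitalSpecimen",
        "location": "https://store.example.org/ABC-123",
    },
}

search_index = {  # what a metadata search service covers
    "20.5000.1025/ABC-123": {
        "scientificName": "Quercus robur",
        "catalogNumber": "B-12345",
    },
}

def resolve(pid):
    """Exact-match resolution: no querying over kernel attributes."""
    return pid_records[pid]

def search(field_name, value):
    """Searching happens in the separate metadata index, returning PIDs."""
    return [pid for pid, md in search_index.items()
            if md.get(field_name) == value]
```

So a question like "which specimens are Quercus robur?" can only be answered by `search(...)`, never by the resolver itself.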

Yes, and a collection manager with a biology degree has to wrangle with all of this often with no legal help at all. How can we also make this fair to the providers who have limited knowledge/resources?

1 Like

I may have misunderstood what is meant by kernel metadata versus searchable metadata. What is in the former? Is it human-readable? Does it contain anything resembling Darwin Core? Or are Darwin Core-like terms solely in what you call searchable metadata, which sits alongside and separate from the kernel metadata? We may risk getting lost in the weeds with this discussion, but I think it is critical to express clearly what precisely constitutes the identity of the core item that accrues links, versus the more fluid, human-readable metadata that gives the object value and shape.

1 Like

[hardistyar]

Indeed, under the motto of EOSC, even closed data behind a login/password or paywall can be considered FAIR. Open and FAIR are often confused as synonymous, which they are not. Making data FAIR is more likely to comply with legal constraints than making data Open, but it makes the I and R (interoperable and re-usable) potentially more difficult, notably if re-use should be automated and access is conditional. As such the data are re-usable, but you may not have the rights to do so …

The terms “PID kernel information (PID KI)” and “PID Kernel Information Profile” come from the RDA output RDA Recommendation on PID Kernel Information (see https://doi.org/10.15497/rda00031). The recommendation distinguishes the PID kernel information (PID KI) held in the PID record itself from the richer metadata stored elsewhere – I think the former is what @hardistyar was referring to above.

PID KI is for the machine, so the recommendation is to keep it as minimal as possible, containing only key-value pairs. One of the RDA principles also highlights:

Every attribute in a profile [PID Kernel Information Profile] depends only on the identified object and nothing else. Every attribute also depends on the object directly and not through another attribute.
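A minimal way to picture the "flat key-value pairs only" constraint – a hypothetical validator, not part of the RDA output itself:

```python
def is_valid_kernel_record(record):
    """Accept only flat key -> scalar pairs: every attribute describes
    the identified object directly, never through a nested attribute."""
    return all(isinstance(v, (str, int, float, bool))
               for v in record.values())

# Flat, scalar attributes pass:
flat = {"objectType": "DigitalSpecimen", "schemaVersion": 1}

# A nested structure (an attribute reached through another attribute) fails:
nested = {"object": {"type": "DigitalSpecimen"}}
```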

A use case example we provide in our paper about FAIR digital objects (http://doi.org/10.5334/dsj-2020-050) is the following:

One use case that can exploit kernel information is submitting a large number (millions) of specimen images in long-term storage to a workflow for optical character/text recognition (OCR), making the results findable with a full-text search (Cazenave et al. 2019). These images and OCR’d label texts will reside in an ecosystem with millions of other digital objects (including research artefacts from different domains). Full resolution of each PID might not be feasible in such cases, so for quick machine interpretation and processing, appropriate kernel information will be vital.
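The selection step of that use case could look roughly like this – the kernel attribute names (`objectType`, `hasOCR`) are illustrative assumptions, not a defined profile:

```python
# Cheap filtering over kernel records alone: only the selected PIDs
# would then be fully resolved and fetched from long-term storage
# for the OCR workflow.

kernel_records = [
    {"pid": "20.5000.1025/IMG-1", "objectType": "SpecimenImage",
     "hasOCR": False},
    {"pid": "20.5000.1025/IMG-2", "objectType": "SpecimenImage",
     "hasOCR": True},
    {"pid": "20.5000.1025/SEQ-1", "objectType": "GenomeSequence",
     "hasOCR": False},
]

def select_for_ocr(records):
    """Pick specimen images that have not yet been OCR'd, using only
    kernel information (no full PID resolution per object)."""
    return [r["pid"] for r in records
            if r["objectType"] == "SpecimenImage" and not r["hasOCR"]]
```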

Thanks @sharif.islam. You write that a PID kernel contains key:value pairs and that every attribute depends only on the identified object and nothing else. What I assume this means is that under no circumstances does PID kernel metadata change: it is the canonical identity of the thing. The “thing” here is the digital object itself and nothing more, inclusive of the physical specimen from which it was derived. Provenance is held elsewhere in the searchable (editable? static?) metadata. Have I mischaracterized this? What then is the verifiable thread (checksums?) that ties the searchable metadata to the kernel PID metadata, for humans or machines to verify that the digital specimen object is unique and persistent with respect to its physical counterpart?

To be sure, these are technical matters, but they outline a socio-technical contract, ownership, and chains of responsibility. Who creates the kernel PID metadata? And, as a result of that action, do they assume responsibility for the unequivocal link between it and the physical specimen, even if the latter were transferred to another museum? I suppose this would be comparable in spirit to what happens when a publisher is purchased by another. Although branding can be inferred from a DOI prefix, under these circumstances the purchasing publisher must accept that they become responsible for the prefix.

There are a lot of case-by-case decisions involved here, where the FAIR principles can provide only initial guidance. Primarily concerned are the accessibility and reusability of (specimen) data:

  • FAIR accessibility means that access conditions for both humans and machines have to be transparently specified.
  • FAIR reusability requires correspondingly clear (again: for humans and machines) descriptions of the license status.
    There is a preference for CC licenses, but these are not set terms. So the FAIR principles might support quite a lot of restrictions, but they guide you to give comprehensive (machine-readable) information why :-).
    This of course does not provide sufficient guidance to formulate and implement the (specimen) data policies required - I think an extended survey/overview, resulting optimally in a kind of license application/construction kit/recommendation under interoperability aspects, would help (to some extent such recommendations are in place, e.g. Hagedorn [https://doi.org/10.3897/zookeys.150.2189]).
1 Like

@dshorthouse @sharif.islam Here some answers to David’s most recent remarks:

PID kernel information (or PID record attributes, to use an alternative name) only changes occasionally. The most likely time is when the storage location of the identified digital specimen digital object changes. Then it is necessary to update the pointer to that. The PID record can also contain other pointers to other kinds of information, such as metadata, provenance, etc. but what and where depends on some design choices. For simplicity at the moment, let’s just assume there’s a metadata record associated with the digital specimen ‘thing’ (DS), as well as a trail of provenance and that this metadata record appears in a publicly searchable database.

The DS is inclusive of the physical specimen only by the fact that there is a maintained reference from itself to the physical specimen it represents. This reference will be some kind of identifier - the physicalSpecimenId - which most likely equates to the catalog number or barcode of the object in its collection. This may not be unique of course, so something else like institutionCode is also needed. The PID does not directly identify the physical object. There’s a further complexity from another layer of indirection that’s added by the existence of catalog records in a database that are publicly accessible. These records also have their identifiers.
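A sketch of how such a qualified reference could be built. Only the field names (`physicalSpecimenId`, `institutionCode`) come from the text above; the colon-delimited format is an assumption for illustration:

```python
def physical_specimen_ref(institution_code, physical_specimen_id):
    """Combine institutionCode with physicalSpecimenId (catalog number
    or barcode) to build an unambiguous reference to the physical
    object, since the barcode alone may not be globally unique."""
    return f"{institution_code}:{physical_specimen_id}"

# Two institutions can reuse the same barcode without collision:
ref_a = physical_specimen_ref("NHMUK", "B-12345")
ref_b = physical_specimen_ref("MNHN", "B-12345")
```

Note this reference identifies the physical object only indirectly; the PID of the digital specimen still does not identify the physical object itself.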

The PID record and other elements can contain checksums so a verifiable thread can be maintained but that doesn’t prevent link rot, so responsibilities must be taken. This is the social contract. We see that already when it comes to assigning PIDs (e.g., DOI) to journal articles and datasets. The publisher (perhaps with assistance from an author) remains responsible for the accuracy of the metadata and for the reliability of the primary pointer to the object. When these change, a proxy - generally, a Registration Agency (RA) - will be instructed by the publisher to update the metadata record and the PID record. In the case of DOIs for journal articles and datasets, Crossref and DataCite are the RAs (proxies). So, the publisher creating the content also creates and maintains its metadata and primary link. The proxy creates and maintains the PID record and proper resolution to the pointers. But the proxy is not responsible for those cases where the publisher fails to inform that metadata and links have changed.
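A minimal sketch of such a checksum-based thread, assuming the metadata record is JSON and the PID record stores a SHA-256 digest of its canonical form (field names and values are illustrative):

```python
import hashlib
import json

def metadata_checksum(metadata):
    """Canonicalise the metadata (sorted keys) before hashing, so that
    key order does not change the digest."""
    canonical = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

metadata = {"scientificName": "Quercus robur", "catalogNumber": "B-12345"}

# The PID record carries a pointer to the object plus a digest of the
# metadata record it claims to describe:
pid_record = {
    "location": "https://store.example.org/ABC-123",
    "metadataChecksum": metadata_checksum(metadata),
}

# On resolution, anyone can re-hash the fetched metadata and compare.
# A mismatch signals that the metadata changed without the PID record
# being updated - the responsibility (social contract) part.
verified = metadata_checksum(metadata) == pid_record["metadataChecksum"]
```

As the post says, the checksum makes drift detectable but does not prevent link rot by itself; someone must still be responsible for updating the records.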

When a specimen is transferred to another museum, responsibility for maintaining the integrity of the corresponding DS also transfers - unless, of course, that had been delegated previously to some third party.

2 Likes