Making FAIR data for specimens accessible

Moderators: Alex Hardisty, Barbara Thiers, and Wouter Addink

Summaries - 1. Making FAIR data for specimens accessible

Background

FAIR data is data that is findable, accessible, interoperable and reusable according to the FAIR Guiding Principles. Generating and providing access to FAIR data during and after the specimen digitisation process is one of several important value streams to be better supported by the digital specimen / extended specimen framework.

Note: A value stream is a sequence of activities that creates an overall result or outcome for a stakeholder (end-user). A stakeholder can be a scientist, a collection manager, a curator, an educator, etc. The result or outcome has a worth or usefulness to the stakeholder.

When it comes to mobilizing data for digital representations of specimens, everyone wants wider and better access to specimen data: data about the specimen and the digitisation process itself, data related to the specimen, and data derived from study and analysis of the specimen. Everyone wants data to be findable, accessible, interoperable and reusable (FAIR).

The idea of Webster et al. of extending the scope of the specimen concept to include things other than biological or geological materials, such as audio, video and photographic recordings and a wide range of other data types (both directly derived and indirectly related), leads to the idea of the Digital Specimen as proposed by DiSSCo and to the notion of a network of extended specimens as proposed by BCoN. Convergence of the two leads to the idea of the extended digital specimen. Here ‘extended’ refers to adding derived data, and ‘digital’ refers to distinguishing between the physical specimen as an identifiable object in the real/natural world and a coupled or corresponding information representation of that object in the digital realm, i.e., an identifiable digital object on the Internet that can be manipulated independently of the physical object.

The past twenty years have made data more easily findable and accessible, yet it still does not fully meet all the requirements of the FAIR Guiding Principles. The focus must turn increasingly to making specimen data interoperable and reusable, not only by humans but also by machines, i.e., by software. Digital specimens, as described by DiSSCo, are FAIR by design. They are representations on the Internet (surrogates) corresponding to identifiable physical specimens in a natural science collection, or ‘Specimens on the Internet’, that can be manipulated by both humans and machines. Standardized as open Digital Specimens in a new specification (openDS), these form the basis for the next generation of collections data infrastructure.

The goal of this category is to discuss the business (including scientific) outcomes that can be achieved by adopting a converged open digital and extended specimen technical framework that we call openDS, and the opportunities that this affords to various stakeholders. It is relevant to discuss the technical aspects and capabilities needed, as well as the way forward from the present day, including new models of cooperation and the actions that are needed. This category can also consider financial, social, governance, legal and professional implications.

This category is concerned with the overarching issues on which outcomes in the other categories depend.

Principal resources

Additional resources

Questions to promote discussion

We suggest three groups of questions to promote discussion. These are concerned with new models of curation and governance, functions needed for new data science and improving engagement of participants. However, we welcome contributions on any matter related to mobilizing FAIR extended digital specimen data.

Group 1 questions: New models of digitisation, curation and governance

  1. What capabilities do natural science collections (NSC) need to serve comprehensive FAIR data about their specimens when that data is a combination of the collection holder’s data, collector’s data and data from external specialists and third-party sources?
  2. How and where should such combinations of value-added data be stored and curated and who should take the responsibility for that?
  3. What FAIR data needs to be generated or made available during different steps in the digitisation process?
  4. Extended digital specimen data is not the responsibility of a single organisation. What new models of cooperation do NSCs and other actors need to govern and serve such data?
  5. Where are new standards needed to make that possible?
  6. What investments are needed and by whom? How can we make the return on investment concrete for serving comprehensive FAIR data?

Group 2 questions: Functions needed for new data science

  1. What types of scientific questions do you want to be able to address with extended digital specimen data?
  2. What functions (services, capabilities) do you need to pursue the scientific questions you wish to address?
  3. What kinds of operations would you want to perform on specimen data remotely across a network (i.e., without having to bring the data to your local computer)?

Group 3 questions: Improving engagement of participants

  1. What metrics/measures motivate you to contribute/share your value-adding data to extended digital specimens? What do you expect in return?
  2. What could be done to broaden the use of extended specimen data and to make it accessible to a broader audience?
  3. How could the experience of users (e.g., of portals, search, etc.) be improved and their lives made easier?

In all presentations on digital specimens, a bit sequence is shown at the kernel of the digital object. This gives the impression that a digital object would be created as metadata for any digital representation of that specimen, be that an image, a genome, a sample, etc. The metadata for all these digital objects would therefore have links to all the other digital objects related to that specimen.

The other view is that the digital specimen doesn’t contain the bit sequence at all, but is a collection of links with minimal metadata that links all the objects together.

Which is it, or is a hybrid imagined?

I think this is an important distinction that gets to the issue of how these data are hosted and accessed.

If it is to be a hybrid, much like what I imagine publishers send to Crossref – article metadata plus references with DOIs – what happens when metadata in the specimen object kernel necessarily require adjustment, as is often the case in Darwin Core content? Or, is there a core set of fields in the kernel of a digital specimen object that do NOT ever change & are sufficient in number and composition to permit unambiguous linking? If, however, there are many metadata fields in the kernel of a digital specimen object, there is a real risk that any local or downstream links made will rot, because it is more likely that the content in those fields will experience flux.

A digital object has 4 components: a PID, kernel metadata, metadata and optionally a bit sequence. The kernel should contain a minimum set of metadata, just enough for resolution and for defining the object type and allowed operations. This is metadata that should not change frequently. The metadata section should contain more metadata, and this is where it gets fuzzy, since much of the specimen label data could be seen as metadata or as data depending on the use case. Whether a digital specimen should contain a bit sequence containing all the data or just a collection of links is a matter of implementation, but currently I think it will rather be a collection of links in a central metadata catalogue (NSIDR), with the data stored in local object stores (which could be Cordra instances) hosted by the institutions or (for smaller institutions) hosted by DiSSCo in the cloud. These likely need to remain separate from the CMS, since many people may have write access to these stores, e.g. for annotations, while a CMS may give write access only to local staff. The local object store would then contain a copy of the CMS data plus the extended specimen data, where the CMS data could be kept in sync through a DOIP interface if the CMS supports this. The PID should resolve to a landing page which could, through content negotiation, just provide structured JSON with metadata and pointers to the data for machines.
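
To make the four components and the content-negotiation idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the class name `DigitalObject`, the function `represent`, the field names, the placeholder PID `TEST/specimen-0001` and the `example.org` URL are all hypothetical and are not taken from openDS, NSIDR, Cordra or DOIP.

```python
import html
import json
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class DigitalObject:
    """Hypothetical model of the four components described above."""
    pid: str                                    # persistent identifier
    kernel: dict                                # minimal, rarely changing key-value pairs
    metadata: dict                              # richer, more fluid descriptive metadata
    links: dict = field(default_factory=dict)   # pointers to data held in other stores
    bit_sequence: Optional[bytes] = None        # optional embedded payload


def represent(obj: DigitalObject, accept: str) -> Tuple[str, str]:
    """Pick a representation from the value of an HTTP Accept header.

    Machines asking for JSON get structured metadata plus pointers;
    everything else gets a very small HTML landing page.
    """
    if "application/json" in accept:
        body = json.dumps(
            {
                "pid": obj.pid,
                "kernel": obj.kernel,
                "metadata": obj.metadata,
                "links": obj.links,
            },
            sort_keys=True,
        )
        return "application/json", body
    title = html.escape(obj.metadata.get("scientificName", obj.pid))
    return "text/html", f"<h1>{title}</h1><p>{html.escape(obj.pid)}</p>"


# Illustrative object: all values are placeholders, not real identifiers.
specimen = DigitalObject(
    pid="TEST/specimen-0001",
    kernel={"digitalObjectType": "DigitalSpecimen"},
    metadata={"scientificName": "Bellis perennis"},
    links={"image": "https://example.org/media/0001.jpg"},
)
```

Calling `represent(specimen, "application/json")` returns the JSON form with links only, while any other Accept value falls back to the HTML page. The sketch deliberately leaves `bit_sequence` out of the JSON, matching the "collection of links" option rather than the embedded-data option.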

Should we have a thread to identify what these are?

How will aspirations for FAIR data be informed by the need to address data sovereignty issues, as identified in this paper introducing CARE? http://doi.org/10.5334/dsj-2020-043

From: Dirk Neumann 03:43 PM
And on which legal basis and with which standards are these data shared, and can they be shared? Published data is visible to a broad audience, including e.g. legislators tracing compliance or national governments restricting the use of such data in the public domain (not only, but also, current legal obligations to share or upload ‘DSI’ to databases in the public domain). Recent examples (e.g. the current discussion on introducing a Global Multilateral Benefit Sharing system) strongly targeted specimen data generated and shared FAIRly (without offering benefits for the original Provider countries) and included discussion on retroactive coverage of data derived e.g. from historic specimens collected in the 19th century or earlier.

From: Barbara Magagna 03:36 PM
I would like to ask you if you are aware of the work done by the RDA I-ADOPT WG. We focus on interoperable observable property descriptions and think our work should fit into the work you are doing and vice versa.

These legal aspects addressed by Dirk Neumann are indeed important and often overlooked. In the European Open Science Cloud study, next to FAIR the consultant introduced the concept of JUST (Judicious, Unbiased, Safe and Transparent), which helps to identify responsibilities in case of legal issues, without addressing them directly though.
Linked to comment: Making FAIR data for specimens accessible - #10 by hardistyar

@austinmast linked data with PIDs could better trace the data back to origins

I was wondering about this too. How to treat outgoing links from the Digital Specimens and Extended Specimens to genomic data, e.g. sequences held in INSDC databases, which might (I say might!) have a CBD/Nagoya Protocol dimension in the future?

Recall also that FAIR data does not always mean open data. The maxim is ‘as open as possible, as closed as necessary’. In Europe at least, data can be restricted on objective grounds that include intellectual property law, national security, the protection of endangered species, privacy and other regulations. I’m not familiar with other jurisdictions. Of course, whatever international agreements are made around access and benefit sharing will also have to be applied.
Here then we run into the need for technical mechanisms for access control to specimen data, which can only work on a wide-area basis and so might merit its own topic!

@JoeMiller I prefer to say ‘data linked with PIDs’, which is different to ‘linked data’ 🙂

@dshorthouse It’s not possible to search kernel metadata, only to return it as part of resolving the PID for the object. Searchable metadata has to be stored elsewhere.
Nevertheless, it would be interesting to have your suggestions for what information should be returned at that level (as opposed to being found by a metadata search in the database of a Registration Agency - example: https://search.crossref.org/ or https://search.datacite.org/).
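
The difference between resolving a PID and searching metadata can be sketched in a few lines of Python. This is a toy illustration, not how any Handle or DOI service is implemented: the two dictionaries, the functions `resolve` and `search`, the attribute names and the `TEST/...` identifiers are all hypothetical.

```python
# Stand-in for PID records: exact lookup by identifier, kernel information only.
PID_RECORDS = {
    "TEST/specimen-0001": {
        "digitalObjectType": "DigitalSpecimen",
        "digitalObjectLocation": "https://example.org/objects/0001",
    },
    "TEST/specimen-0002": {
        "digitalObjectType": "DigitalSpecimen",
        "digitalObjectLocation": "https://example.org/objects/0002",
    },
}

# Stand-in for a separate, searchable metadata catalogue keyed by the same PIDs.
SEARCH_INDEX = {
    "TEST/specimen-0001": {"scientificName": "Bellis perennis", "country": "Netherlands"},
    "TEST/specimen-0002": {"scientificName": "Quercus robur", "country": "Belgium"},
}


def resolve(pid: str) -> dict:
    """Resolution: you must already know the PID; you get its kernel record back."""
    return PID_RECORDS[pid]


def search(term: str) -> list:
    """Search: scan the metadata catalogue and return the PIDs that match."""
    needle = term.lower()
    return sorted(
        pid
        for pid, fields in SEARCH_INDEX.items()
        if any(needle in str(value).lower() for value in fields.values())
    )
```

A typical flow would be `search("bellis")` to discover the PID, followed by `resolve(...)` on the result to get the type and location. Note that `resolve` never sees the Darwin Core-like fields, which is the separation described in the post above.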

Yes, and a collection manager with a biology degree has to wrangle with all of this, often with no legal help at all. How can we also make this fair to the providers who have limited knowledge/resources?

I may have misunderstood what is meant by kernel metadata vs what is meant by searchable metadata. What is in the former? Are they human-readable? Do they have anything that resembles Darwin Core? Or, are Darwin Core-like terms solely in what you call searchable metadata that sits alongside & is separate from kernel metadata? We may risk getting lost in the weeds with this discussion, but I think it is critical to express clearly what precisely is the identity of the core item that accrues links vs the more fluid, human-readable metadata that gives the object value and shape.

[hardistyar]

Indeed, with the motto of EOSC, even closed data behind a login/password or paywall can be considered FAIR. Open and FAIR are often treated as synonymous, which they are not. Making data FAIR is more likely to comply with legal requirements than making data Open. But conditional access potentially makes the I and R (interoperable and reusable) more difficult, notably if reuse is to be automated. The data as such are reusable, but you may not have the rights to reuse them …

“PID kernel information (PID KI)” and “PID Kernel Information Profile” are terms from an RDA output, the RDA Recommendation on PID Kernel Information (see https://doi.org/10.15497/rda00031). The recommendation singles out the PID kernel information (PID KI) held in the PID record itself; I think this is what @hardistyar was referring to above.

PID KI is for the machine, so the recommendation is to keep PID KI as minimal as possible and to have it contain only key-value pairs. One of the RDA principles also states:

Every attribute in a profile [PID Kernel Profile] depends only on the identified object and nothing else. Every attribute also depends on the object directly and not through another attribute.

A use case example we provide in our paper about FAIR digital objects (http://doi.org/10.5334/dsj-2020-050) is the following:

One use case that can exploit kernel information is submitting large number (millions) of specimen images in long term storage to a workflow for optical character/text recognition (OCR), making the results findable with a full-text search ( Cazenave et al. 2019 ). These images and OCR’d label texts will reside in an ecosystem with millions of other digital objects (also with research artefacts from different domains). Full resolution of each PID might not be feasible in such cases. So for quick machine interpretation processing appropriate kernel information will be vital.
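
As a rough sketch of that use case, the Python below selects images for an OCR workflow by looking only at flat key-value kernel records, without fetching the full metadata of each object. The attribute names (`digitalObjectType`, `mediaType`, `digitalObjectLocation`), the type value `SpecimenImage`, the function `select_for_ocr` and all identifiers are assumptions made for illustration; a real kernel information profile would define its own attributes.

```python
from typing import Dict, Iterable, Iterator, Tuple


def select_for_ocr(kernel_records: Iterable[Dict[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (pid, location) for records whose kernel information marks them as images.

    Only the small kernel record is inspected, so the decision is cheap
    even when it has to be made for millions of objects.
    """
    for record in kernel_records:
        is_specimen_image = record.get("digitalObjectType") == "SpecimenImage"
        is_image_media = record.get("mediaType", "").startswith("image/")
        if is_specimen_image and is_image_media:
            yield record["pid"], record["digitalObjectLocation"]


# Placeholder kernel records from a mixed ecosystem of digital objects.
records = [
    {"pid": "TEST/image-0001", "digitalObjectType": "SpecimenImage",
     "mediaType": "image/tiff", "digitalObjectLocation": "https://example.org/media/0001.tif"},
    {"pid": "TEST/sequence-0002", "digitalObjectType": "Sequence",
     "mediaType": "text/plain", "digitalObjectLocation": "https://example.org/seq/0002.fasta"},
    {"pid": "TEST/image-0003", "digitalObjectType": "SpecimenImage",
     "mediaType": "image/jpeg", "digitalObjectLocation": "https://example.org/media/0003.jpg"},
]

ocr_queue = list(select_for_ocr(records))
```

Because `select_for_ocr` is a generator, it can be fed a stream of kernel records and hand matching locations to the OCR step one at a time, rather than holding the whole collection in memory.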