Making FAIR data for specimens accessible

JoeMiller · January 30, 2021, 4:18pm

Moderators: Alex Hardisty, Barbara Thiers, and Wouter Addink

Summaries - 1. Making FAIR data for specimens accessible

Background

FAIR data is data that is findable, accessible, interoperable and reusable according to the FAIR Guiding Principles. Generating and providing access to FAIR data during and after the specimen digitisation process is one of several important value streams to be better supported by the digital specimen / extended specimen framework.

Note: A value stream is a sequence of activities that creates an overall result or outcome for a stakeholder (end-user). A stakeholder can be a scientist, a collection manager, a curator, an educator, etc. The result or outcome has a worth or usefulness to the stakeholder.

When it comes to mobilizing data for digital representations of specimens, everyone wants widened and improved access to specimen data: data about the specimen and the digitisation process itself, data related to the specimen, and data derived from study and analysis of the specimen. Everyone wants data to be findable, accessible, interoperable and reusable (FAIR).

The idea of Webster et al., of extending the scope of the specimen concept itself to include things other than biological or geological materials, such as audio, video, photographic recordings and a wide range of other data types (both directly derived and indirectly related) leads to the idea of the Digital Specimen as proposed by DiSSCo and the notion of a network of extended specimens, as proposed by BCoN. Convergence leads to the idea of the extended digital specimen where ‘extended’ refers to adding derived data and ‘digital’ references the idea of distinguishing between the physical specimen as an identifiable object in the real/natural world and a coupled or corresponding information representation of that object in the digital realm i.e., an identifiable digital object on the Internet that can be manipulated independently of the physical object.

The past twenty years have led to data becoming more easily findable and accessible, yet still not fully meeting all the requirements of the FAIR Guiding Principles. The focus must turn increasingly to making specimen data interoperable and reusable, not only by humans but also by machines i.e., by software. Digital specimens, as described by DiSSCo, are FAIR by design and are representations on the Internet (surrogates) corresponding to identifiable physical specimens in a natural science collection – or ‘Specimens on the Internet’ – that can be manipulated by both humans and machines. Standardized as open Digital Specimens in a new specification (openDS), these form the basis for the next generation of collections data infrastructure.

The goal of this category is to discuss the business (including scientific) outcomes that can be achieved by adopting a converged open digital and extended specimen technical framework that we call openDS, and the opportunities that this affords to various stakeholders. It is relevant to discuss technical aspects and capabilities needed, as well as the way forward from the present day, including new models of cooperation and actions that are needed. This category can also consider financial, social, governance, legal and professional implications.

This category is concerned with the overarching issues on which outcomes in the other categories depend.

Principal resources

Webster MS, editor. The extended specimen: emerging frontiers in collections-based ornithological research. CRC Press; 2017 Jul 20. doi: 10.1201/9781315120454; especially Chapter 1.
DiSSCo Tech - What is a Digital Specimen? https://bit.ly/DigitalSpecimen
BCoN, 2019 - Extending US biodiversity collections to address national challenges. https://bcon.aibs.org/wp-content/uploads/2019/01/Report-Public-Comment-draft.pdf
ES/DS Framework, A technical explanation towards convergence. Video presentation: Archi-v0.4-16Dec2020-export.mp4 (Dropbox).
TDWG 2020 SYM07: Standards development to support transformation of collection data into digital specimens. Recording of session: TDWG 2020: Standards development to support transformation of collection data into digital specimens (YouTube).
TDWG 2020 PD03: Panel discussion on enabling digital specimen & extended specimen concepts in current tools & services. Recording of session: TDWG 2020 Enabling digital specimen & extended specimen concepts in current tools & services - PD03 (YouTube).
TDWG 2020 BoF 01: Birds of a Feather session on converging Digital & Extended Specimens towards global specification. Recording of session: TDWG 2020: Converging Digital & Extended Specimens towards global specification - Working Sessions (YouTube).

Additional resources

Lannom, L., Koureas, D., and Hardisty, A.R. (2020). FAIR Data and Services in Biodiversity Science and Geoscience. Data Intelligence 2(1):122-130. doi: 10.1162/dint_a_00034
Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten JW, da Silva Santos LB, Bourne PE, Bouwman J (2016) The FAIR Guiding Principles for scientific data management and stewardship. Scientific data, 3. doi: 10.1038/sdata.2016.18.
De Smedt K, Koureas D, Wittenburg P (2020) FAIR Digital Objects for Science: From data pieces to actionable knowledge units. Preprints 2020, 2020030073. doi: 10.20944/preprints202003.0073.v1.
Specification for open Digital Specimens (openDS), Github repository: https://github.com/DiSSCo/openDS; especially:
- Answers to frequently asked questions about Digital Specimens and openDS;
- Explanation of how openDS and the Extended Specimen Network are related;
- The openDS data model relates to other important structures, standards and initiatives in the wider world, as well as to information science in different domains of scientific discourse. Positioning openDS in the landscape is one of the first and most important steps in development of the specification. Making sure that everyone likely to make use of the model agrees on this is essential to progress.
- openDS consists of three principal and interrelated components, as follows:
  - The openDS data model. Read the introduction to the openDS data model;
  - The Ontology for open Digital Specimens (ODS). Read the introduction to the ODS ontology; and,
  - The openDS Application Programming Interface (API). Read the introduction to the openDS API.
Lendemer J, Thiers B, Monfils AK, Zaspel J, Ellwood ER, Bentley A, LeVan K, Bates J, Jennings D, Contreras D, Lagomarsino L (2019) The Extended Specimen Network: A Strategy to Enhance US Biodiversity Collections, Promote Research and Education. BioScience, biz140. doi: 10.1093/biosci/biz140.
Hardisty A, Saarenmaa H, Casino A, Dillen M, Gödderz K, Groom Q, Hardy H, Koureas D, Nieva de la Hidalga A, Paul DL, Runnel V, Vermeersch X, van Walsum M, Willemse L (2020) Conceptual design blueprint for the DiSSCo digitization infrastructure - DELIVERABLE D8.1. Research Ideas and Outcomes 6: e54280. doi: 10.3897/rio.6.e54280; especially sections 2 (The DiSSCo Research Infrastructure) and 4 (Architecture, tools and technologies).
National Academies of Sciences, Engineering, and Medicine 2020. Biological Collections: Ensuring Critical Research and Education for the 21st Century. Washington, DC: The National Academies Press. https://doi.org/10.17226/25592.
Creating Darwin Core OWL files for OBO Foundry ontologies. GitHub repository: BiodiversityOntologies/dwcobo (code for creating Darwin Core ontology modules to import to BCO).

Questions to promote discussion

We suggest three groups of questions to promote discussion. These are concerned with new models of curation and governance, functions needed for new data science and improving engagement of participants. However, we welcome contributions on any matter related to mobilizing FAIR extended digital specimen data.

Group 1 questions: New models of digitisation, curation and governance

What capabilities do natural science collections (NSC) need to serve comprehensive FAIR data about their specimens when that data is a combination of the collection holder’s data, collector’s data and data from external specialists and third-party sources?
How and where should such combinations of value-added data be stored and curated and who should take the responsibility for that?
What FAIR data needs to be generated or made available during different steps in the digitisation process?
Extended digital specimen data is not the responsibility of a single organisation. What new models of cooperation do NSCs and other actors need to govern and serve such data?
Where are new standards needed to make that possible?
What investments are needed and by whom? How can we make the return on investment concrete for serving comprehensive FAIR data?

Group 2 questions: Functions needed for new data science

What types of scientific questions do you want to be able to address with extended digital specimen data?
What functions (services, capabilities) do you need to pursue the scientific questions you wish to address?
What kinds of operations would you want to perform on specimen data remotely across a network (i.e., without having to bring the data to your local computer)?

Group 3 questions: Improving engagement of participants

What metrics/measures motivate you to contribute/share your value-adding data to extended digital specimens? What do you expect in return?
What could be done to broaden the use of extended specimen data and to make it accessible to a broader audience?
How could the experience of users (e.g., of portals, search, etc.) be improved and their lives made easier?

qgroom · February 16, 2021, 9:18am

In all presentations on digital specimens a bit sequence is presented at the kernel of the digital object. This gives the impression that a digital object would be created as metadata for any digital representation of that specimen, be that an image, a genome, a sample etc. Therefore the metadata for all these digital objects would have links to all the other digital objects related to that specimen.

The other view is that the digital specimen doesn’t contain the bit sequence at all, but is a collection of links with minimal metadata that links all the objects together.

Which is it, or is a hybrid imagined?

I think this is an important distinction that gets to the issue of how these data are hosted and accessed.

dshorthouse · February 16, 2021, 1:20pm

If it is to be a hybrid, much like what I imagine publishers send to Crossref – article metadata plus references with DOIs – what happens when metadata in the specimen object kernel necessarily require adjustment as is often the case in Darwin Core content? Or, is there a core set of fields in the kernel of a digital specimen object that do NOT ever change & are sufficient in number and composition to permit unambiguous linking? If however there are many metadata fields in the kernel of a digital specimen object, there is real risk that any local or downstream links made will rot because it is more likely that the content in those fields will experience flux.

waddink · February 16, 2021, 2:11pm

A digital object has 4 components: a PID, kernel metadata, metadata and optionally a bit sequence. The kernel should contain a minimum set of metadata, just enough for resolution and defining the object type and allowed operations. This is metadata that should not change frequently. The metadata section should contain more metadata, and this is where it gets fuzzy, since much of the specimen label data could be seen as metadata or as data depending on the use case. Whether a digital specimen should contain a bit sequence containing all the data or just a collection of links is a matter of implementation, but currently I think it will rather be a collection of links in a central metadata catalogue (NSIDR), where the data is stored in local object stores (which could be Cordra instances) hosted by the institutions or (for smaller institutions) hosted by DiSSCo in the cloud. These likely need to remain separate from the CMS, since many people may have write access to these stores for e.g. annotations, while a CMS may allow write access only to local staff. The local object store would then contain a copy of the CMS data + the extended specimen data, where the CMS data could be kept in sync through a DOIP interface if the CMS supports this. The PID should resolve to a landing page which could, through content negotiation, just provide a structured JSON with metadata and pointers to the data for machines.
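To make the four-part structure above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the class name, field names, PID value and URL are hypothetical and are not taken from openDS, Cordra or DOIP. The `render` function imitates content negotiation on a landing page by returning JSON when a machine asks for it and a plain-text line otherwise.

```python
from dataclasses import dataclass, field
from typing import Optional
import json


@dataclass
class DigitalObject:
    """Illustrative four-part digital object; all field names are assumptions."""
    pid: str                                        # persistent identifier
    kernel: dict                                    # minimal, rarely changing: type, allowed operations
    metadata: dict = field(default_factory=dict)    # richer, more fluid descriptive metadata
    bit_sequence: Optional[bytes] = None            # optional payload; None when the object only links out
    links: list = field(default_factory=list)       # pointers to data held in local object stores


def render(obj: DigitalObject, accept: str) -> str:
    """Imitate content negotiation: JSON for machines, a text line for humans."""
    if "application/json" in accept:
        return json.dumps(
            {"pid": obj.pid, "kernel": obj.kernel, "metadata": obj.metadata, "links": obj.links},
            sort_keys=True,
        )
    return f"Landing page for {obj.pid} ({obj.kernel.get('type', 'unknown type')})"


# Hypothetical example object: a collection of links, with no bit sequence of its own.
specimen = DigitalObject(
    pid="hypothetical-prefix/abc123",
    kernel={"type": "DigitalSpecimen", "operations": ["retrieve", "annotate"]},
    metadata={"scientificName": "Example species"},
    links=["https://example.org/object-store/abc123/image-1"],
)
```

Calling `render(specimen, "application/json")` yields the structured JSON with pointers for machines, while any other `accept` value falls back to the human-readable line. Keeping `kernel` small and separate from `metadata` reflects the point that only the kernel needs to stay stable.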

dshorthouse · February 16, 2021, 3:03pm

Should we have a thread to identify what these are?

austinmast · February 16, 2021, 3:51pm

How will aspirations for FAIR data be informed by the need to address data sovereignty issues, as identified in this paper introducing CARE? http://doi.org/10.5334/dsj-2020-043

hardistyar · February 16, 2021, 3:58pm

From: Dirk Neumann 03:43 PM
And on which legal basis and with which standards are and can these data be shared? Published data is visible for a broad audience, including e.g. legislators tracing compliance or national governments restricting the use of such data in the public domain (not only but also current legal obligations to share or upload ‘DSI’ to databases in the public domain). Recent examples (e.g. current discussion for the introduction of a Global Multilateral Benefit Sharing system) strongly targeted specimen data generated and shared FAIRly (without offering benefits for original Provider countries) and included discussion on retroactive coverage of data derived e.g. from historic specimens collected in the 19th century or earlier.

hardistyar · February 16, 2021, 3:59pm

From: Barbara Magagna 03:36 PM
I would like to ask you if you are aware of the work done by RDA I-ADOPT WG. We focus on interoperable observable property descriptions and think our work should fit into the work you are doing and vice versa.

pmergen · February 16, 2021, 4:07pm

These legal aspects addressed by Dirk Neumann are indeed important and often overlooked. In the European Open Science Cloud study, next to FAIR the consultant introduced the concept of JUST (Judicious, Unbiased, Safe and Transparent), which helps to identify responsibilities in case of legal issues, without addressing them directly though.
Linked to comment: Making FAIR data for specimens accessible - #10 by hardistyar

JoeMiller · February 16, 2021, 4:15pm

@austinmast linked data with PIDs could better trace the data back to origins

Markus_B · February 16, 2021, 4:16pm

I was wondering about this too. How to treat outgoing links from the Digital Specimens and Extended Specimens to genomic data, e.g. sequences held in INSDC databases, which might (I say might!) have a CBD/Nagoya Protocol dimension in the future?

hardistyar · February 16, 2021, 4:17pm

Recall also that FAIR data does not always mean open data. The maxim is ‘as open as possible, as closed as necessary’. In Europe at least, data can be restricted on objective grounds that include intellectual property law, national security, the protection of endangered species, privacy and other regulations. I’m not familiar with other jurisdictions. Of course, whatever international agreements are made around access and benefit sharing will also have to be applied.
Here then we run into the need for technical mechanisms for access control to specimen data, which can only work on a wide-area basis and so might merit its own topic!

hardistyar · February 16, 2021, 4:18pm

@JoeMiller I prefer to say ‘data linked with PIDs’, which is different to ‘linked data’

hardistyar · February 16, 2021, 4:23pm

@dshorthouse It’s not possible to search kernel metadata, only to return it as part of resolving the PID for the object. Searchable metadata has to be stored elsewhere.
Nevertheless, it would be interesting to have your suggestions for what information should be returned at that level (as opposed to being found by a metadata search in the database of a Registration Agency - example: https://search.crossref.org/ or https://search.datacite.org/).
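The difference between resolving and searching can be sketched as follows. This is a minimal illustration, assuming two hypothetical in-memory stores; it does not reproduce the API of any real PID resolver or Registration Agency, and the PIDs, URLs and field names are made up.

```python
# Hypothetical in-memory stand-ins; not the API of any real PID or registration service.
PID_RECORDS = {
    "hypothetical-prefix/abc123": {"type": "DigitalSpecimen", "location": "https://example.org/ds/abc123"},
    "hypothetical-prefix/def456": {"type": "DigitalSpecimen", "location": "https://example.org/ds/def456"},
}

SEARCH_INDEX = [
    {"pid": "hypothetical-prefix/abc123", "scientificName": "Example species", "country": "NL"},
    {"pid": "hypothetical-prefix/def456", "scientificName": "Another species", "country": "BE"},
]


def resolve(pid: str) -> dict:
    """Resolution: needs the PID up front and returns only the kernel record."""
    return PID_RECORDS[pid]


def search(**criteria) -> list:
    """Discovery: filters the separately stored metadata and returns matching PIDs."""
    return [
        row["pid"]
        for row in SEARCH_INDEX
        if all(row.get(key) == value for key, value in criteria.items())
    ]
```

A typical flow would be `search(country="BE")` to find candidate PIDs, followed by `resolve(...)` on each hit. Note that `resolve` cannot answer a question such as "which specimens come from NL", because the kernel record carries no such field.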

jegelewicz · February 16, 2021, 4:30pm

Yes, and a collection manager with a biology degree has to wrangle with all of this often with no legal help at all. How can we also make this fair to the providers who have limited knowledge/resources?

dshorthouse · February 16, 2021, 5:33pm

I may have misunderstood what is meant by kernel metadata vs what is meant by searchable metadata. What is in the former? Are they human-readable? Do they have anything that resembles Darwin Core? Or, are Darwin Core-like terms solely in what you call searchable metadata that sits alongside & is separate from kernel metadata? We may risk getting lost in the weeds with this discussion, but I think this is critical to clearly express what precisely is the identity of the core item that accrues links vs the more fluid, human-readable metadata that gives the object value and shape.

pmergen · February 16, 2021, 5:56pm

[hardistyar]

Indeed, with the motto of EOSC, even closed data behind a login/password or paywall can be considered FAIR. Open and FAIR are often confused as synonymous, which they are not. Making the data FAIR is more likely to comply with legal requirements than making the data Open. But it potentially makes the I and R (interoperable and reusable) more difficult, notably if reuse is conditional and should be automated. As such the data are reusable, but you may not have the rights to do so …

sharif.islam · February 17, 2021, 8:51am

The “PID kernel information (PID KI) and PID Kernel Information Profile” are terms from the RDA output - RDA Recommendation on PID Kernel Information (see https://doi.org/10.15497/rda00031). The recommendation makes a distinction between the PID kernel information (PID KI) in the PID record itself – I think this is what @hardistyar was referring to above.

PID KI is for the machine so the recommendation is to keep PID KI as minimal as possible and contain only key-value pairs. One of the RDA principles also highlights:

Every attribute in a profile [PID Kernel Profile] depends only on the identified object and nothing else. Every attribute also depends on the object directly and not through another attribute.

A use case example we provide in our paper about FAIR digital objects (http://doi.org/10.5334/dsj-2020-050) is the following:

One use case that can exploit kernel information is submitting large number (millions) of specimen images in long term storage to a workflow for optical character/text recognition (OCR), making the results findable with a full-text search ( Cazenave et al. 2019 ). These images and OCR’d label texts will reside in an ecosystem with millions of other digital objects (also with research artefacts from different domains). Full resolution of each PID might not be feasible in such cases. So for quick machine interpretation processing appropriate kernel information will be vital.
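As a rough illustration of that use case, the sketch below selects candidates for an OCR workflow using only flat key-value kernel records, without resolving each PID to its full object. The record layout, attribute names (`type`, `mediaType`) and PIDs are assumptions made for this example; they are not taken from the RDA recommendation or any published kernel profile.

```python
from typing import Iterable, Iterator

# Hypothetical kernel records: flat key-value pairs only, one record per PID.
KERNEL_RECORDS = [
    {"pid": "hypothetical-prefix/img-1", "type": "SpecimenImage", "mediaType": "image/tiff"},
    {"pid": "hypothetical-prefix/doc-7", "type": "Publication", "mediaType": "application/pdf"},
    {"pid": "hypothetical-prefix/img-2", "type": "SpecimenImage", "mediaType": "image/jpeg"},
]


def is_flat(record: dict) -> bool:
    """True when every value is a simple scalar, i.e. the record holds no nested structure."""
    return all(isinstance(value, (str, int, float, bool)) for value in record.values())


def select_for_ocr(records: Iterable[dict]) -> Iterator[str]:
    """Yield PIDs of specimen images, deciding from kernel information alone."""
    for record in records:
        if not is_flat(record):
            continue  # skip anything that is not a plain key-value record
        media_type = str(record.get("mediaType", ""))
        if record.get("type") == "SpecimenImage" and media_type.startswith("image/"):
            yield record["pid"]
```

Because `select_for_ocr` is a generator, it can stream over a very large set of records without holding them all in memory, which matters when millions of objects are involved.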
