Making FAIR data for specimens accessible

dshorthouse · February 17, 2021, 12:51pm

Thanks @sharif.islam. You write that a PID kernel contains key:value pairs & that every attribute depends only on the identified object and nothing else. What I assume this means is that under no circumstances does PID kernel metadata change. It is the canonical identity of the thing. The “thing” here is the digital object itself and nothing more, inclusive of the physical specimen from which it was derived. Provenance is held elsewhere in the searchable (editable? static?) metadata. Have I mischaracterized this? What then is the verifiable thread (checksums?) that ties the searchable metadata to the kernel PID metadata for humans or machines to verify that the digital specimen object is unique and persistent with respect to its physical counterpart?

To be sure, these are technical matters, but they outline a socio-technical contract, ownership, and chains of responsibilities. Who is it that creates the kernel PID metadata? And, as a result of that action, do they assume responsibility for the unequivocal link between it and the physical specimen even if the latter were transferred to another museum? I suppose this would be comparable in spirit to what happens when a publisher is purchased by another. Although branding can be inferred by a DOI prefix, under these circumstances, the purchasing publisher must accept the fact they then become responsible for the prefix.

cweiland · February 17, 2021, 2:26pm

There are a lot of case-by-case decisions involved here, where the FAIR principles can provide only initial guidance. Primarily concerned are the accessibility and reusability of (specimen) data:

FAIR accessibility means that access conditions for both humans and machines have to be transparently specified.
FAIR reusability requires correspondingly clear (again: for humans and machines) descriptions of the license status.
There is a preference for CC licenses, but these are not set terms. So the FAIR principles might support quite a lot of restrictions, but they guide you to give comprehensive (machine-readable) information on why :-).
This of course does not provide sufficient guidance to formulate and implement the (specimen) data policies required - I think an extended survey/overview, resulting optimally in a kind of license application/construction kit/recommendation under interoperability aspects, would help (to some extent such recommendations are in place, e.g. Hagedorn [Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information]).
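
To make the "machine-readable" part concrete, here is a minimal sketch of a check that a record states its license, its access conditions and, when access is restricted, the reason why. The field names (`license`, `accessRights`, `restrictionReason`) and the function are assumptions made for illustration, not terms taken from the FAIR principles or from any existing profile.

```python
# Hypothetical field names; a real profile would define its own vocabulary.
REQUIRED_FIELDS = ("license", "accessRights")


def policy_gaps(record: dict) -> list:
    """Return the names of policy fields a machine could not find in the record."""
    gaps = [name for name in REQUIRED_FIELDS if not record.get(name)]
    # Restrictions are acceptable, but the reason should be stated explicitly.
    if record.get("accessRights") == "restricted" and not record.get("restrictionReason"):
        gaps.append("restrictionReason")
    return gaps


open_record = {
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "accessRights": "open",
}
restricted_record = {
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "accessRights": "restricted",
}

print(policy_gaps(open_record))
print(policy_gaps(restricted_record))
```

The second record is not rejected for being restricted, only flagged for not saying why.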

hardistyar · February 17, 2021, 4:24pm

@dshorthouse @sharif.islam Here are some answers to David’s most recent remarks:

PID kernel information (or PID record attributes, to use an alternative name) only changes occasionally. The most likely time is when the storage location of the identified digital specimen digital object changes. Then it is necessary to update the pointer to that. The PID record can also contain other pointers to other kinds of information, such as metadata, provenance, etc. but what and where depends on some design choices. For simplicity at the moment, let’s just assume there’s a metadata record associated with the digital specimen ‘thing’ (DS), as well as a trail of provenance and that this metadata record appears in a publicly searchable database.
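
As a rough illustration of that behaviour, the sketch below models a PID record whose identifier stays fixed while its primary pointer is updated when the object moves. The class, its attribute names, the identifier and the URLs are all made up for the example and do not reflect any particular PID system or kernel information profile.

```python
from dataclasses import dataclass, field


@dataclass
class PidRecord:
    """Hypothetical PID record: a fixed identifier plus pointers."""
    pid: str                                       # stays the same after minting
    location: str                                  # primary pointer to the digital specimen object
    pointers: dict = field(default_factory=dict)   # e.g. metadata, provenance
    history: list = field(default_factory=list)    # earlier locations, kept for audit

    def move(self, new_location: str) -> None:
        """Update the primary pointer when the storage location changes."""
        self.history.append(self.location)
        self.location = new_location


record = PidRecord(
    pid="20.5000.EXAMPLE/ABC-123",  # made-up identifier
    location="https://repo-a.example.org/ds/ABC-123",
    pointers={"metadata": "https://search.example.org/ds/ABC-123"},
)
record.move("https://repo-b.example.org/ds/ABC-123")
```

Only `location` changes; anything that cited the PID keeps resolving through the same identifier.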

The DS is inclusive of the physical specimen only by the fact that there is a maintained reference from itself to the physical specimen it represents. This reference will be some kind of identifier - the physicalSpecimenId - which most likely equates to the catalog number or barcode of the object in its collection. This may not be unique of course, so something else like institutionCode is also needed. The PID does not directly identify the physical object. There’s a further complexity from another layer of indirection that’s added by the existence of catalog records in a database that are publicly accessible. These records also have their identifiers.
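
A small sketch of why the catalog number alone is not enough: two collections can reuse the same number, so the reference from the DS to the physical specimen is modelled here as the pair of institutionCode and physicalSpecimenId. The PIDs, codes and lookup function are invented for illustration.

```python
from typing import NamedTuple


class PhysicalSpecimenRef(NamedTuple):
    """Reference from a digital specimen to the physical object it represents."""
    institution_code: str
    physical_specimen_id: str   # e.g. catalog number or barcode


# Made-up PIDs mapped to the physical specimens they represent.
digital_specimens = {
    "pid-1": PhysicalSpecimenRef("MUSEUM-A", "12345"),
    "pid-2": PhysicalSpecimenRef("MUSEUM-B", "12345"),
}


def find_pids(ref: PhysicalSpecimenRef) -> list:
    """Return the PIDs of all digital specimens that point at this physical specimen."""
    return sorted(pid for pid, r in digital_specimens.items() if r == ref)


def find_pids_by_catalog_number(number: str) -> list:
    """Ambiguous lookup that ignores the institution."""
    return sorted(
        pid for pid, r in digital_specimens.items() if r.physical_specimen_id == number
    )
```

Looking up `"12345"` on its own returns both PIDs, while the pair `("MUSEUM-A", "12345")` returns only one.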

The PID record and other elements can contain checksums so a verifiable thread can be maintained but that doesn’t prevent link rot, so responsibilities must be taken. This is the social contract. We see that already when it comes to assigning PIDs (e.g., DOI) to journal articles and datasets. The publisher (perhaps with assistance from an author) remains responsible for the accuracy of the metadata and for the reliability of the primary pointer to the object. When these change, a proxy - generally, a Registration Agency (RA) - will be instructed by the publisher to update the metadata record and the PID record. In the case of DOIs for journal articles and datasets, Crossref and DataCite are the RAs (proxies). So, the publisher creating the content also creates and maintains its metadata and primary link. The proxy creates and maintains the PID record and proper resolution to the pointers. But the proxy is not responsible for those cases where the publisher fails to inform that metadata and links have changed.
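
One way such a verifiable thread could look, as a sketch only: the PID record stores a SHA-256 checksum of the metadata record, and anyone can recompute it to see whether the two still match. The `metadataChecksum` key and the JSON canonicalisation used here are assumptions for the example, not part of any agreed profile.

```python
import hashlib
import json


def checksum(metadata: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches(pid_record: dict, metadata: dict) -> bool:
    """True if the metadata is the version the PID record vouches for."""
    return pid_record["metadataChecksum"] == checksum(metadata)


metadata = {"institutionCode": "MUSEUM-A", "physicalSpecimenId": "12345"}
pid_record = {"pid": "pid-1", "metadataChecksum": checksum(metadata)}  # hypothetical key

edited = dict(metadata, physicalSpecimenId="99999")
print(matches(pid_record, metadata), matches(pid_record, edited))
```

A mismatch shows that something changed but not who should repair it, which is where the responsibilities described above come in.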

When a specimen is transferred to another museum, responsibility for maintaining the integrity of the corresponding DS also transfers - unless, of course, that had been delegated previously to some third party.

cweiland · February 17, 2021, 4:33pm

I don’t understand why (all) PID KI record’s values must be immutable? I’d agree that the set of attributes chosen (the “profile”) shouldn’t change or should be limited to “releases” of a DS/ES profile and change frequency should be low.

dshorthouse · February 17, 2021, 5:29pm

This was wonderful & we’re starting to illuminate what might be the expectations required of collections (or their proxies) who publish data. To re-iterate what @austinmast asked elsewhere, institutions will need these expectations outlined such that they can gauge whether or not these might have financial & staffing implications.

dhobern · February 19, 2021, 3:59am

This is the fundamental issue and the one where we need to get it right from the beginning.

If we assume that the kernel metadata offers little more than a unique identifier, (possibly) an object class identifier, and ownership and provenance information, almost everything that really interests us is part of the searchable metadata or the payload.

Darwin Core has not encouraged us generally to be particularly rigorous about how we map properties to digital records/objects belonging to well-defined classes. Our overall philosophy is often that data publishers should be free to normalise or denormalise at their own convenience, with lots of variation from flat specimen records that include all relevant information as a set of text properties through to separately managed and identified records/objects for collectors, taxa, localities, projects, media, etc. That raises the question of whether we should allow the same flexibility in our digital specimen objects.

Part of the problem is that we really do want to filter our specimens via properties that are better considered metadata for one of these other objects - by genus, by geographic region, etc. Do we therefore include or duplicate these elements in the searchable metadata for the digital specimen? Or do we prefer just to include opaque identifiers for the associated objects? Is it reasonable to have a digital object that is really just a bag of relationships with other objects that then need to be resolved? If this is not reasonable, then where do we draw the line in terms of embedding content?

We need clear guidelines for these choices and those guidelines need to be in response to the real-world use cases we care about.

Note that some of these issues evaporate if we delegate important use cases to a genuinely robust and globally inclusive indexing service that instantiates the searchable views we need, regardless of whether the DOs include all indexed properties directly. GBIF has to a large extent played exactly this role to date and in my (not completely impartial) view needs to sit at the heart of the global architecture for specimen DOs. Focus on DOs can then be all about their efficiency, accuracy and trustworthiness as the authoritative master views of what is known about the specimen. GBIF, COL, BHL, INSDC, etc. can be the lenses that allow rapid discovery, filtering and retrieval of these objects or of the specific views of the objects that applications actually need.
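
A toy sketch of that division of labour, under the assumption that specimen objects hold only opaque identifiers for their taxa and a separate indexing step resolves them into a searchable view. The identifiers, the data and the function name are invented, and no real GBIF interface is implied.

```python
# Separately managed taxon objects (made-up identifiers and content).
taxa = {"taxon:1": {"genus": "Quercus"}}

# Digital specimens that carry only a reference to their taxon.
specimens = {
    "ds:1": {"taxonId": "taxon:1"},
    "ds:2": {"taxonId": "taxon:1"},
}


def build_genus_index(specimens: dict, taxa: dict) -> dict:
    """Resolve each specimen's taxon reference and group specimen PIDs by genus."""
    index = {}
    for pid, ds in specimens.items():
        genus = taxa[ds["taxonId"]]["genus"]
        index.setdefault(genus, []).append(pid)
    return index


genus_index = build_genus_index(specimens, taxa)
print(genus_index)
```

The specimens themselves never store the genus, yet "filter by genus" still works because the index holds the resolved view.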

JuttaBuschbom · February 19, 2021, 12:47pm

A foundation of rights-based considerations will be fundamental to our success of building a network or central repository of integrated DS/ES data. Thank you @pmergen and Barbara Magagna for pointing to the JUST and CARE principles (posts 9 & 10).

As scientists and conservation professionals we are excited about the prospect of globally integrated, high quality biodiversity data. However, once published and accessible, the data can be used for all kinds of purposes. Dirk Neumann mentions very realistic but problematic “use case”-scenarios associated with ABS (access and benefit sharing, cp. Nagoya Protocol).

Other problematic use cases can be constructed with regard to biodiversity crime (cp. UNEP-INTERPOL Report: The Rise of Environmental Crime (2016), see the Interpol webpage Our response to environmental crime). For many threatened, but economically highly valuable taxa with withheld location data in a public integrated data infrastructure, machine-learning algorithms likely will be able to identify locations based on, eg, occurrence records of associated taxa. Nevertheless, at the same time a successful DS/ES infrastructure is absolutely essential for effective wildlife, fisheries and timber forensics, plus conservation and monitoring in general.

With regard to ABS, I am quite hopeful that a just and - for the basic biodiversity sciences - practical solution can be found (I know, give it some years and I will be realistically disillusioned). A webinar from the Secretariat of the Convention on Biological Diversity (CBD) last week summarized policy options, which are right now on the table for DSI (Digital Sequence Information). It was pointed out that a combination of options might allow acceptance by the parties.

There are some options, I believe, which will take the burden of worrying about retrospective and future demands from the natural history collections and their institutions. See the graphic summary on slide 11 of the presentation from the last webinar found here https://www.cbd.int/abs/DSI-webinar/Dsi-Webinar3-Policy-options.pdf [can’t upload pdfs yet]. Certain combinations of accepted policies can even provide good rights-based frameworks for an integrated biodiversity data infrastructure.

While developing and implementing a global, integrated biodiversity data infrastructure, we need to build-in these social and ethical considerations. Since collection-based data are essential to the success of conservation, we will (have to) find solutions.

We can always have a look at what is done with regard to these questions in human (population) genomics and large-scale medical consortia. Cooperating with legal experts as @pmergen mentions sounds like a good idea.

pmergen · February 19, 2021, 1:30pm

Something I have noticed is that there is an attitude as if Nagoya (and to some extent DSI) would replace the other existing legislation. This is not so much a misunderstanding; rather, because they are the most discussed and visible at the moment, other legislation, on permits or CITES, tends to be overlooked.

Concerning tracking, the number of papers identified where sampling or shipping permits were not followed is increasing because of the recent attention on Nagoya and DSI, and other legal issues and permit breaches are then also detected.

This is something to be more attentive to in data publishing. It can be part of the requirements for accepting publications or for providing data, information, papers …, but the responsibility should stay with the owner or original source of the content.

JuttaBuschbom · February 20, 2021, 10:07am

@pmergen, thanks for pointing out that a CBD-focus is too limited and that permits should be intrinsically considered in the design of an infrastructure for biodiversity data.

Apart from those potentially arising due to other interests, permits and legal documentation can be or are important conservation tools. Their information should be naturally integrated into the infrastructure.

From a practical point of view, an integrated data infrastructure can support their management. It can also foster adherence to legal requirements, either via peer pressure (it just doesn’t look good to publish data without attaching the associated permits) or publication etc. requirements.

ylebras · February 20, 2021, 10:31am

@dshorthouse There is an RDA group related to this question (which content for PID kernel metadata): “Recommendation on PID Kernel Information | RDA” https://www.rd-alliance.org/group/pid-kernel-information-profile-management-wg/outcomes/recommendation-pid-kernel-information

If your question is related to the PID “searchable metadata” (not the kernel metadata), and the way to link the Digital Specimen to external info/data such as genomic data, it appears important to me to try reusing existing domain-oriented metadata standards that have such capabilities, for example using the Ecological Metadata Language (EML) to specify this “searchable metadata” (and I am thinking more precisely of the complete EML specification, not the limited GBIF EML profile).

ylebras · February 20, 2021, 10:45am

Agree with you, going into this kernel metadata vs “searchable metadata” discussion is maybe too complicated. Maybe we can “just” focus on “searchable metadata”, as this is more related to what people already know a little bit. And here, I strongly think we have to consider using an existing domain-related standard such as EML. It seems to me that by using it, Digital Specimens gain opportunities to be linked “more easily” with other 1/ domain-related (ecology) Research Objects (such as genomics data, GIS data, non-occurrence-oriented data… and also software, workflows, publications, protocols…) using EML terms, and with 2/ extra-domain (not ecology-related) ROs, notably using the EML annotation module, which facilitates the addition / inclusion of terminological resources.

ylebras · February 20, 2021, 10:55am

Totally agree with you. Notably, even if FAIR doesn’t mean open, openness facilitates FAIRness, for sure!

JuttaBuschbom · February 20, 2021, 11:11am

Following up on the topics of access and benefit sharing, permits, legal and rights-based principles, fundamentally these touch on ethical questions. In this consultation, this topic so far seems to be missing or at least doesn’t have its own thread.

Such an extensive endeavor as implementing the DS/ES concepts in an integrated global biodiversity infrastructure should be accompanied by a group of experts, stakeholders and interested biodiversity scientists, who address ethical issues of data sharing and publishing.

Earlier this year, the results of a large cancer consortium were published, including a comment describing and discussing their ethical considerations and strategies:
Phillipsetal2020.pdf (383.9 KB) We might learn from the problems that they encountered and adapt some of the solutions that they found.

Taking ethical considerations in a biodiversity context one step further, in my opinion, the collections community should arrive at taking a multispecies ethics stance, see eg. van Dooren et al_2016.pdf (461.5 KB)

Pragmatically, we need a clear set of values, which forms the basis for our communications and interactions with the public and from which we can develop arguments, why our actions as collection-based community are not destroying what we want to protect and conserve.

A colleague recently entered the need for such values into the strategy discussions of SPNHC’s Biodiversity Crisis Response Committee. She connected them with principles and values developed by the animal research community, eg. the 3 R’s: replace, reduce, and refine.

The FAIR principles are very close to research data. However, there is a continuous connection and it’s only a couple of short steps from them to more philosophical or very practical (eg. legal) ethical considerations. Ethical considerations and objectives should be consciously integrated into the development of a global biodiversity data infrastructure.

ylebras · February 20, 2021, 11:41am

Agree with you. Concerning “Our overall philosophy is often that data publishers should be free to normalise or denormalise at their own convenience”, it seems to me that a “metadata-oriented standardisation” (like using EML) is better for that than a “data-oriented standardisation” (like using Darwin Core), notably because a detailed metadata language like EML can be used as a pivot standard between raw heterogeneous datasets and standardized ones (such as DwC and others). Considering distributed international data infrastructures, GBIF is for sure one of the major ones to which we have to contribute. Another “similar” infrastructure that comes to my mind is DataOne, which focuses on standardisation through detailed metadata. Using DataOne as a first infrastructure allows GBIF to be used as a derived one, gaining from the pros of both systems and mitigating / balancing the cons.

JuttaBuschbom · February 20, 2021, 11:42am

@hardistyar could you explain if you mean
a) the links themselves have Permanent Identifiers (S?) or
b) the links use the PIDs of the two objects that are linked to create and uniquely identify the link?
Thanks!
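
To make the two options concrete, here is a small sketch. In (a) the link gets a freshly minted identifier of its own, and in (b) the link's identifier is derived deterministically from the PIDs of the two linked objects. Using UUIDs is only a stand-in chosen for the example; neither the scheme nor the function names come from the DS/ES proposals.

```python
import uuid


def mint_link_id() -> str:
    """Option (a): the link is an object in its own right with a new identifier."""
    return str(uuid.uuid4())


def derive_link_id(source_pid: str, target_pid: str) -> str:
    """Option (b): the identifier follows from the two PIDs (direction matters here)."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source_pid}|{target_pid}"))


a1 = mint_link_id()
a2 = mint_link_id()
b1 = derive_link_id("pid-specimen", "pid-sequence")
b2 = derive_link_id("pid-specimen", "pid-sequence")
```

With (b), anyone holding the two PIDs can recompute the same link identifier; with (a), the identifier has to be looked up in whatever registry minted it.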

JuttaBuschbom · February 20, 2021, 12:32pm

@Markus_B Am I misunderstanding the DS/ES concepts and their relationship to ABS under the Nagoya Protocol?

So far, I thought that genomic data - independent of their storage location - are part of the DS/ES concept, ie that they are one realization of the concept of “extended”. Other realizations of “extended” might be isotopes, 3D-scans, photos, proteomic data, etc.

Because this is my assumption, there is or can be no fundamental (legal) distinction between a physical specimen and its DSI (most/all definitions). Without a physical specimen, there is no DNA and DNA sequence information. Also, all DSI from a naturally occurring individual will have a physical origin and specimen, even if this is a hair/blood/feces sample from a critter still running around in the wild. An extended specimen is inseparable, no matter if the DNA-sequence, the tissue and the critter are located in three different countries.

The important distinction with regard to Nagoya occurs, when someone along the value chain generates benefits, specifically eg. they derive a commercial gain. This gain could be based already on the physical specimen, eg. an orchid with a rare flower mutation collected for a botanical garden that then was cloned via cell culture and brought to market; or it could be based on any derived information, eg. the specimens/individuals DNA, biochemistry, etc. The agent who generated a (commercial) benefit is of interest and consequence for the application of the Nagoya Protocol.

I am getting myself into trouble here, though. Linking a specimen with its DSI should not make a difference - or am I missing something?

JuttaBuschbom · February 20, 2021, 12:53pm

Correction: @austinmast pointed to the CARE principles (post 7). Sorry for mixing this up.

nickyn · February 22, 2021, 11:54am

Re

Group 3 questions: Improving engagement of participants
(12) How could the experience of users (e.g., of portals, search, etc.) be improved and their lives made easier?

Pointer to related comment under Analyzing/mining specimen data for novel applications

Markus_B · February 22, 2021, 1:17pm

@JuttaBuschbom I do not think you are misunderstanding anything. But let me give you my thoughts:

I agree that genomic data are part of the DS and ES concepts; that is how I understand the concepts and how they were presented (I hope I did not say anything in my previous post that suggested otherwise). What I meant was this: Potential future regulation of genomic data/DSI may very well attach to location of the data. So for me, the interesting question is, what if you provide an outgoing link to that location?
Now for the interesting part: Whether or not there can be a fundamental legal distinction between a physical genetic resource (be it a specimen or any part of it) and its DSI (the placeholder term chosen in the current discussions), is a matter of legal and political debate. The fact that they are tied together in a way that one cannot exist without the other (talking naturally occurring sequences) does not necessarily have a bearing on the kind of regulation they face. In many jurisdictions, eBooks are currently treated very differently from physical books, because they come with different properties leading to different economic consequences. This is what the parties to the CBD/NP have to decide in the upcoming years - regulate DSI and physical genetic resources the same way or differently and if so, how. That is all fluid at the moment and, unfortunately, as I said, a lot of it is politics and not grounded in scientific reasoning. Therefore I hope that scientists will have a big impact on the process.
You mention the distinction between commercial and non-commercial research (which presents many problems in a practical sense). Yes, if let’s say a commercial entity generates commercial benefits, those will have to be shared with the provider of the resource. But that is only the endpoint, and I think it is unproblematic. For me, the important distinction with regard to the Nagoya Protocol and the discussions surrounding DSI is the one about access to genetic resources. As a researcher in the EU, as soon as you access a physical genetic resource from Oct. 2014 onwards over which a state exercises sovereign rights, you are within the scope of Reg. 511/2014 (which transposes the Nagoya Protocol for the EU) and you need to follow all the legal jazz (PIC/MAT) that comes with it. For DSI, we do not have any of that at the moment.

So, if you are getting yourself into trouble then it’s positive trouble I would say, because these are important questions that I ask myself (and I assume, many others do too). For me, following the developments about the DS/ES is interesting because it is something that I have so far not seen reflected in the CBD/NP discussions. They are narrowly focussing on DSI and do not take the wider picture into account. Datafication is not just something that is happening with regard to DSI - all the biodiversity sciences are affected, even ecology is becoming a big data venture. But only genomic data generates such controversy.

Markus_B · February 22, 2021, 1:30pm

I think this goes to the heart of the matter. Be FAIR and CARE, as they say. The real issues (and not just the political interests masked as legal concerns I addressed in my reply to your post below) are the ones you describe here and that have also been summed up by Sabina Leonelli in Nature 574, 317-320 (2019), https://doi.org/10.1038/d41586-019-03062-w. The important question is: How do we define open science?
