Making FAIR data for specimens accessible

Following up on the topics of access and benefit sharing, permits, and legal and rights-based principles: fundamentally, these all touch on ethical questions. In this consultation, ethics so far seems to be missing, or at least doesn’t have its own thread.

An endeavor as extensive as implementing the DS/ES concepts in an integrated global biodiversity infrastructure should be accompanied by a group of experts, stakeholders and interested biodiversity scientists who address the ethical issues of data sharing and publishing.

Earlier this year, the results of a large cancer consortium were published, including a comment describing and discussing their ethical considerations and strategies:
Phillipsetal2020.pdf (383.9 KB)
We might learn from the problems that they encountered and adapt some of the solutions that they found.

Taking ethical considerations in a biodiversity context one step further, the collections community should, in my opinion, arrive at a multispecies ethics stance; see e.g. van Dooren et al_2016.pdf (461.5 KB)

Pragmatically, we need a clear set of values that forms the basis for our communications and interactions with the public, and from which we can develop arguments for why our actions as a collections-based community are not destroying what we want to protect and conserve.

A colleague recently raised the need for such values in the strategy discussions of SPNHC’s Biodiversity Crisis Response Committee. She connected them with principles and values developed by the animal research community, e.g. the 3 R’s: replace, reduce, and refine.

The FAIR principles stay very close to the research data themselves. However, there is a continuous connection, and it’s only a couple of short steps from them to more philosophical or very practical (e.g. legal) ethical considerations. Ethical considerations and objectives should be consciously integrated into the development of a global biodiversity data infrastructure.

Agree with you. Concerning “Our overall philosophy is often that data publishers should be free to normalise or denormalise at their own convenience”, it seems to me that a “metadata-oriented standardisation” (like using EML) serves that better than a “data-oriented standardisation” (like using Darwin Core), notably because a detailed metadata language like EML can be used as a pivot standard between raw heterogeneous datasets and standardised ones (such as DwC and others). Among distributed international data infrastructures, GBIF is certainly one of the major ones to which we have to contribute. Another “similar” infrastructure that comes to mind is DataOne, which focuses on standardisation through detailed metadata. Using DataOne as a first infrastructure allows GBIF to be used as a derived one, gaining from the strengths of both systems and mitigating or balancing their weaknesses.

@hardistyar could you explain if you mean
a) the links themselves have Permanent Identifiers (S?) or
b) the links use the PIDs of the two objects that are linked to create and uniquely identify the link?
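
To make the two options concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the function names, the `link:` prefix and the example PID strings are hypothetical and do not come from any DS/ES specification.

```python
import hashlib
import uuid


def link_with_own_pid(source_pid: str, target_pid: str) -> dict:
    """Option (a): the link is an object with its own, independently minted identifier."""
    # uuid4 stands in for whatever PID service would mint link identifiers (assumption).
    return {"link_id": f"link:{uuid.uuid4()}", "source": source_pid, "target": target_pid}


def link_from_endpoint_pids(source_pid: str, target_pid: str) -> dict:
    """Option (b): the link is identified by the PIDs of the two objects it connects."""
    # Hashing the ordered pair gives a deterministic identifier: the same two
    # endpoints always yield the same link_id, so no separate minting is needed.
    digest = hashlib.sha256(f"{source_pid}|{target_pid}".encode("utf-8")).hexdigest()[:16]
    return {"link_id": f"link:{digest}", "source": source_pid, "target": target_pid}


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(link_with_own_pid("pid:specimen-1", "pid:sequence-9"))
    print(link_from_endpoint_pids("pid:specimen-1", "pid:sequence-9"))
```

In this sketch, option (a) allows several distinct links between the same two objects, while option (b) allows at most one per ordered pair.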

@Markus_B Am I misunderstanding the DS/ES concepts and their relationship to ABS under the Nagoya Protocol?

So far, I thought that genomic data, independent of their storage location, are part of the DS/ES concept, i.e. that they are one realization of the concept of “extended”. Other realizations of “extended” might be isotopes, 3D scans, photos, proteomic data, etc.

Because this is my assumption, there is or can be no fundamental (legal) distinction between a physical specimen and its DSI (most/all definitions). Without a physical specimen, there is no DNA and DNA sequence information. Also, all DSI from a naturally occurring individual will have a physical origin and specimen, even if this is a hair/blood/feces sample from a critter still running around in the wild. An extended specimen is inseparable, no matter if the DNA-sequence, the tissue and the critter are located in three different countries.

The important distinction with regard to Nagoya occurs when someone along the value chain generates benefits, specifically when they derive a commercial gain. This gain could be based on the physical specimen itself, e.g. an orchid with a rare flower mutation that was collected for a botanical garden, then cloned via cell culture and brought to market; or it could be based on any derived information, e.g. the specimen’s/individual’s DNA, biochemistry, etc. The agent who generated a (commercial) benefit is of interest and consequence for the application of the Nagoya Protocol.

I am getting myself into trouble here, though. Linking a specimen with its DSI should not make a difference, or am I missing something?

Correction: @austinmast pointed to the CARE principles (post 7). Sorry for mixing this up.


Group 3 questions: Improving engagement of participants
(12) How could the experience of users (e.g., of portals, search, etc.) be improved and their lives made easier?

Pointer to related comment under Analyzing/mining specimen data for novel applications

@JuttaBuschbom I do not think you are misunderstanding anything. But let me give you my thoughts:

  • I agree that genomic data are part of the DS and ES concepts; that is how I understand the concepts and how they were presented (I hope I did not say anything in my previous post that suggested otherwise). What I meant was this: Potential future regulation of genomic data/DSI may very well attach to location of the data. So for me, the interesting question is, what if you provide an outgoing link to that location?

  • Now for the interesting part: Whether or not there can be a fundamental legal distinction between a physical genetic resource (be it a specimen or any part of it) and its DSI (the placeholder term chosen in the current discussions) is a matter of legal and political debate. The fact that they are tied together in such a way that one cannot exist without the other (talking about naturally occurring sequences) does not necessarily have a bearing on the kind of regulation they face. In many jurisdictions, eBooks are currently treated very differently from physical books, because they come with different properties leading to different economic consequences. This is what the parties to the CBD/NP have to decide in the upcoming years: whether to regulate DSI and physical genetic resources the same way or differently, and if differently, how. That is all fluid at the moment and, unfortunately, as I said, a lot of it is politics and not grounded in scientific reasoning. Therefore I hope that scientists will have a big impact on the process.

  • You mention the distinction between commercial and non-commercial research (which presents many problems in a practical sense). Yes, if let’s say a commercial entity generates commercial benefits, those will have to be shared with the provider of the resource. But that is only the endpoint, and I think it is unproblematic. For me, the important distinction with regard to the Nagoya Protocol and the discussions surrounding DSI is the one about access to genetic resources. As a researcher in the EU, as soon as you access a physical genetic resource from Oct. 2014 onwards over which a state exercises sovereign rights, you are within the scope of Reg. 511/2014 (which transposes the Nagoya Protocol for the EU) and you need to follow all the legal jazz (PIC/MAT) that comes with it. For DSI, we do not have any of that at the moment.

So, if you are getting yourself into trouble then it’s positive trouble I would say, because these are important questions that I ask myself (and I assume, many others do too). For me, following the developments about the DS/ES is interesting because it is something that I have so far not seen reflected in the CBD/NP discussions. They are narrowly focussing on DSI and do not take the wider picture into account. Datafication is not just something that is happening with regard to DSI - all the biodiversity sciences are affected, even ecology is becoming a big data venture. But only genomic data generates such controversy.

I think this goes to the heart of the matter. Be FAIR and CARE, as they say. The real issues (and not just the political interests masked as legal concerns that I addressed in my reply to your post below) are the ones you describe here, and they have also been summed up by Sabina Leonelli in Nature 574, 317-320 (2019). The important question is: How do we define open science?


I have some concerns about an international data resource that operates in a variety of intellectual property law jurisdictions. In the United States, data or “facts” are not subject to copyright protection, meaning copyright-based licenses like Creative Commons cannot be used to, for example, require attribution when using data. We also have the fair use doctrine, which I believe differs substantially from similar exceptions in the EU. It seems it might be a good idea to recruit a team of intellectual property law experts from the various jurisdictions to comment and review plans early in the development process to avoid hitting legal/licensing obstacles after significant investment of time/resources.

Do you have more specific concerns here @arountre? The adoption of Creative Commons licences by the GBIF community dates to 2014-2016—in large measure thanks to a community consultation much like this one (h/t @peterdesmet)—and the issues you raise have been raised previously without impact. Meanwhile, the recommendations of that consultation were informed and guided with the help of appropriately specialized counsel. Similar frameworks have since been implemented across other open science-friendly communities and platforms as well.

My understanding is that the case law in the U.S. (much of it led by the Electronic Frontier Foundation) has established the valid standing and status of Creative Commons licences. It would be good to know if you have specific counterexamples that suggest the need to revisit this.

The IPT we currently use to present data to GBIF only has CC licenses as choices. We do not hold the copyright to the data we serve (because data is not protected under copyright in the US; see Feist Publications, Inc. v. Rural Telephone Service Co. and others), and applying a Creative Commons license might be considered copyfraud because it gives the impression that the data are under copyright when they are not. Of course, we want to share the data and have it reused, but we also want proper attribution. This attribution requirement should probably be set through a license that does not rely on copyright.

I agree that Creative Commons licenses are useful and apparently valid when applied to copyrighted works. I am not a legal expert; I am only recognizing that an expert in US copyright law should be consulted, particularly if many new kinds of data will be distributed through the system.

Are you saying that you don’t have the rights to this data or simply asserting that it’s not copyrighted/copyrightable?

In the former case, by publishing it you’d run afoul of GBIF’s data publisher agreement.

In the latter, CC0 is not a licence but a public-domain waiver also described as “no rights reserved.” Would this not be appropriate?

Our legal advice has consistently taken account of the fact that we operate globally across different legal settings and jurisdictions, including the U.S.

I am saying that the data are not copyrightable. A CC0 declaration gives us no mechanism to require proper attribution.

Maybe we can take this offline, rather than playing this out in public? Happy to discuss via email at kcopas [at] gbif [dot] org. I don’t think this is insurmountable.

From the summary:

Discussion today provided some clarification on kernel metadata and its role in the main types of searches performed on specimen data. It also raised the question about the nature of the digital specimen object, and whether it should be “just a bag of relationships with other objects” or if metadata from the original specimen should be embedded in it.

Is it possible to represent these contrasting approaches in a rough sketch diagram? (I emphasise rough sketch, as a polished diagram tends to imply “already decided” rather than “still up for debate”).
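
Short of a diagram, the two approaches can also be contrasted as a rough data-structure sketch in Python. This is only an illustration under stated assumptions: the class names, the `metadata` and `relationships` fields and the helper function are all hypothetical, and nothing here reflects a decided design.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Relationship:
    predicate: str   # hypothetical relation name, e.g. "hasImage"
    target_pid: str  # identifier of the related object


@dataclass
class LinkOnlySpecimen:
    """Approach 1: the DS is "just a bag of relationships with other objects"."""
    pid: str
    relationships: List[Relationship] = field(default_factory=list)


@dataclass
class EmbeddedSpecimen:
    """Approach 2: metadata from the original specimen is embedded in the DS."""
    pid: str
    metadata: Dict[str, str] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)


def can_answer_locally(ds, field_name: str) -> bool:
    """True if a search on field_name can be served without resolving any link."""
    return field_name in getattr(ds, "metadata", {})
```

In this sketch, a search on an embedded field can be answered from the object itself, whereas the link-only object would first need one or more of its relationships resolved.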


This is what the INSDC databases use, but it is a complete disaster and will take years to clean up. We need a link to the physical specimen that can be validated.

I think the biggest worry for me is that applying licenses to data and images that do not have any existing rights gives the impression to many people that rights exist when in fact there are none.
Nevertheless, this is very widely done and I don’t think it is possible to put the genie back in the bottle, as they say.

This is the hot, inner core of DS upon which all other layers are built. If there’s a mess here as in INSDC, then we’ve effectively dropped the engine out of the car. As we’ve heard elsewhere, a DS is meant to represent a surrogate or a twin of the physical item. If ever there is a disconnect between a DS and the physical item because the physicalSpecimenID is malformed or not what the owner of the physical item uses locally, then we’ve effectively split the infrastructure. The annotation layers we’ve discussed elsewhere are rendered inaccessible to the very providers of the data.

The key is establishing who is responsible for creating and maintaining that critical link between the physicalSpecimenID and the unique identifier for the DS twin, and indeed who is responsible for birthing a DS. In Crossref’s world of DOIs, this would be the paying member publisher, which additionally commits to making landing pages accessible via those unique identifiers in a timely manner or faces penalties. I suspect there are few natural science organizations in the world that can commit to comparable financial and staffing requirements; we’ve already set the precedent that sharing specimen data is relatively inexpensive and low-maintenance. If there were proxy organizations to shoulder these requirements on behalf of natural science organizations, I’d fear that the indirection introduced would again result in undesirable drift between the physicalSpecimenID and the DS, just as it evidently has in the INSDC database.
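
As a rough illustration of the kind of check that would expose such drift, here is a minimal Python sketch. It assumes each DS record is a plain dictionary carrying a `physicalSpecimenID` and a hypothetical `ds_pid` key, and that the owning institution can supply the set of identifiers it actually uses locally; none of this is an existing API.

```python
from typing import Dict, Iterable, List


def find_disconnected(ds_records: Iterable[Dict[str, str]], local_ids: Iterable[str]) -> List[str]:
    """Return the DS identifiers whose physicalSpecimenID the owner does not recognise."""
    known = set(local_ids)
    return [
        rec["ds_pid"]
        for rec in ds_records
        # A missing, malformed or non-local identifier counts as a disconnect.
        if rec.get("physicalSpecimenID") not in known
    ]
```

Whoever ends up responsible for the link could run a check of this shape routinely, so that a disconnect is caught before annotations pile up on an orphaned DS.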


Not sure, I think that impression existed already before licenses became mainstream. Might actually be the opposite, that CC0 statements make it more clear than before that specimen data and images are public domain, which could also save us from some trouble with future developments in ABS and DSI.

This raises the question of whether there is ever just one DS for one specimen. It would be easy to imagine several that are duplicates for reasons of origin or opinion. The only thing that binds them together is the physical specimen.
Do we only mint a DS if we know a specimen exists or can we mint a DS from a specimen citation?
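
A minimal Python sketch of the "bound together by the physical specimen" idea, assuming DS records are plain dictionaries with a `physicalSpecimenID` and a hypothetical `ds_pid` key (the record layout is an assumption for illustration, not a proposed schema):

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_physical_specimen(records: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group DS identifiers under the physical specimen they claim to represent."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for rec in records:
        # Surrounding whitespace is stripped; any further normalisation is left open.
        groups[rec["physicalSpecimenID"].strip()].append(rec["ds_pid"])
    return dict(groups)


def specimens_with_multiple_ds(records: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Keep only the physical specimens that have more than one DS."""
    grouped = group_by_physical_specimen(records)
    return {sid: pids for sid, pids in grouped.items() if len(pids) > 1}
```

Note that this only works if every DS carries a physicalSpecimenID at all; a DS minted from a specimen citation alone would need some other key to be grouped.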