Making FAIR data for specimens accessible

I have some concerns about an international data resource that operates in a variety of intellectual property law jurisdictions. In the United States, data or “facts” are not subject to copyright protection, meaning copyright-based licenses like Creative Commons cannot be used to, for example, require attribution when using data. We also have the fair use doctrine, which I believe differs substantially from similar exceptions in the EU. It seems it might be a good idea to recruit a team of intellectual property law experts from the various jurisdictions to comment on and review plans early in the development process, to avoid hitting legal/licensing obstacles after a significant investment of time/resources.

Do you have more specific concerns here @arountre? The adoption of Creative Commons licences by the GBIF community dates to 2014-2016—in large measure thanks to a community consultation much like this one (h/t @peterdesmet)—and the issues you raise have been raised previously without impact. Meanwhile, the recommendations of that consultation were informed and guided with the help of appropriately specialized counsel. Similar frameworks have since been implemented across other open science-friendly communities and platforms as well.

My understanding is that the case law in the U.S. (much of it led by the Electronic Frontier Foundation) has established the valid standing and status of Creative Commons licences. It would be good to know if you have specific counterexamples that suggest the need to revisit this.

The IPT we currently use to present data to GBIF only has CC licenses as choices. We do not hold the copyright to the data we serve (because data is not protected under copyright in the US; see Feist Publications, Inc. v. Rural Telephone Service Co. and others), and applying a Creative Commons license might be considered copyfraud because it gives the impression that the data are under copyright when they are not. Of course, we want to share the data and have it reused, but we also want proper attribution. This attribution requirement should probably be implemented with a license that does not rely on copyright.

I agree that Creative Commons licenses are useful and apparently valid when applied to copyrighted works. I am not a legal expert; I am only recognizing that an expert in US copyright law should be consulted, particularly if many new kinds of data will be distributed through the system.

Are you saying that you don’t have the rights to this data or simply asserting that it’s not copyrighted/copyrightable?

In the former case, by publishing it you’d run afoul of GBIF’s data publisher agreement.

In the latter, CC0 is not a licence but a public-domain waiver also described as “no rights reserved.” Would this not be appropriate?

Our legal advice has consistently taken account of the fact that we operate globally across different legal settings and jurisdictions, including the U.S.

I am saying that the data are not copyrightable. A CC0 declaration gives us no mechanism to require proper attribution.

Maybe we can take this offline, rather than playing this out in public? Happy to discuss via email at kcopas [at] gbif [dot] org. I don’t think this is insurmountable.

From the summary:

Discussion today provided some clarification on kernel metadata and its role in the main types of searches performed on specimen data. It also raised the question about the nature of the digital specimen object, and whether it should be “just a bag of relationships with other objects” or if metadata from the original specimen should be embedded in it.

Is it possible to represent these contrasting approaches in a rough sketch diagram? (I emphasise rough sketch, as a polished diagram tends to imply “already decided” rather than “still up for debate”).
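In the same rough-sketch spirit, the two approaches from the summary could also be written down as data structures. This is only an illustration under my own assumptions: the class names, field names and identifiers below are hypothetical and not taken from any openDS or GBIF specification.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    relation: str   # illustrative relation name, e.g. "hasImage"
    target_id: str  # identifier of the related object


@dataclass
class BagOfRelationshipsDS:
    """Approach A: the DS holds only identifiers and links to other objects."""
    ds_id: str
    physical_specimen_id: str
    links: list[Link] = field(default_factory=list)


@dataclass
class EmbeddedMetadataDS:
    """Approach B: the DS also carries a copy of core specimen metadata."""
    ds_id: str
    physical_specimen_id: str
    metadata: dict[str, str] = field(default_factory=dict)
    links: list[Link] = field(default_factory=list)


def locally_searchable(ds) -> set[str]:
    """Return the fields a search can use without resolving any link."""
    fields = {"ds_id", "physical_specimen_id"}
    fields |= set(getattr(ds, "metadata", {}))
    return fields


# Hypothetical example values, for illustration only.
image = Link("hasImage", "img:001")
bag = BagOfRelationshipsDS("ds:1", "HERB-0001", [image])
embedded = EmbeddedMetadataDS(
    "ds:1",
    "HERB-0001",
    {"scientificName": "Quercus robur", "country": "DE"},
    [image],
)
```

The trade-off this makes visible: approach B can answer common searches directly from the DS, but the embedded copy has to be kept in sync with the source record, whereas approach A stays small and has to follow links for anything beyond identifiers.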


This is what the INSDC databases use, but it is a complete disaster and will take years to clean up. We need a link to the physical specimen that can be validated.

I think the biggest worry for me is that applying licenses to data and images that do not have any existing rights gives the impression to many people that rights exist when in fact there are none.
Nevertheless, this is very widely done and I don’t think it is possible to put the genie back in the bottle, as they say.

This is the hot, inner core of DS upon which all other layers are built. If there’s a mess here as in INSDC, then we’ve effectively dropped the engine out of the car. As we’ve heard elsewhere, a DS is meant to represent a surrogate or a twin of the physical item. If ever there is a disconnect between a DS and the physical item because the physicalSpecimenID is malformed or not what the owner of the physical item uses locally, then we’ve effectively split the infrastructure. The annotation layers we’ve discussed elsewhere are rendered inaccessible to the very providers of the data.

The key is establishing who is responsible for creating and maintaining that critical link between the physicalSpecimenID and the unique identifier for the DS twin, and indeed who is responsible for birthing a DS. In Crossref’s world of DOIs, this would be the paying member publisher, which additionally commits to making landing pages available via those unique identifiers in a timely manner, or faces penalties. I suspect there are few natural science organizations in the world that can commit to comparable financial and staffing requirements; we’ve already set the precedent that sharing specimen data is relatively inexpensive and low-maintenance. If there were proxy organizations to shoulder these requirements on behalf of natural science organizations, I’d fear that the indirection introduced would again result in undesirable drift between the physicalSpecimenID and the DS, just as it evidently has in the INSDC database.


Not sure, I think that impression existed already before licenses became mainstream. Might actually be the opposite, that CC0 statements make it more clear than before that specimen data and images are public domain, which could also save us from some trouble with future developments in ABS and DSI.

This raises the question of whether there is ever just one DS for one specimen. It would be easy to imagine several that are duplicates for reasons of origin or opinion. The only thing that binds them together is the physical specimen.
Do we only mint a DS if we know a specimen exists, or can we mint a DS from a specimen citation?

Hi Quentin,
Good question. The idea is to have one DS per physical specimen and to have some mechanisms in place to ensure that. One of these mechanisms would be to only allow the institution or private person responsible for a physical specimen (the access provider, not necessarily the owner) to create the DS, either themselves or by proxy. This means we only mint a DS if the specimen exists or (according to the identified responsible party) has existed. This raises the question, though, of what to do with specimens referenced in literature for which no existing owner can be identified, for instance because the collection does not exist anymore. These may need to be ‘adopted’ by an institution or infrastructure, which then becomes the responsible party.
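To make the mechanism a bit more concrete, here is a minimal sketch of a registry that mints at most one DS per physical specimen and only accepts requests from the responsible party or an authorised proxy. Everything in it is an assumption for illustration: the `DSRegistry` class, the `ds:` identifier format and the example institution names are hypothetical, not part of any agreed design.

```python
import uuid


class DSRegistry:
    """Toy registry: one DS per (responsible party, physicalSpecimenID)."""

    def __init__(self):
        self._ds_by_specimen = {}  # (responsible, physicalSpecimenID) -> DS id
        self._proxies = {}         # responsible -> set of authorised proxies

    def authorise_proxy(self, responsible, proxy):
        """Let `proxy` mint on behalf of `responsible`."""
        self._proxies.setdefault(responsible, set()).add(proxy)

    def mint(self, responsible, physical_specimen_id, requested_by):
        """Return the DS id for a specimen, creating it only on first request."""
        allowed = {responsible} | self._proxies.get(responsible, set())
        if requested_by not in allowed:
            raise PermissionError(
                f"{requested_by} may not mint for {responsible}"
            )
        key = (responsible, physical_specimen_id)
        if key not in self._ds_by_specimen:
            # Hypothetical identifier scheme; a real system would use a PID service.
            self._ds_by_specimen[key] = f"ds:{uuid.uuid4()}"
        return self._ds_by_specimen[key]


registry = DSRegistry()
first = registry.mint("museum-a", "HERB-0001", requested_by="museum-a")
registry.authorise_proxy("museum-a", "aggregator-x")
second = registry.mint("museum-a", "HERB-0001", requested_by="aggregator-x")
```

Here `first` and `second` are the same identifier, because the second request finds the existing DS instead of creating a duplicate. An ‘adopted’ orphan specimen would fit the same pattern, with the adopting institution recorded as the responsible party.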

@JuttaBuschbom I don’t mean either of those interpretations. The term ‘Linked Data’ has a specific technical meaning according to the four principles in Tim Berners-Lee’s 2006 Linked Data Design Note. I want to avoid people assuming we are talking in those technical implementation terms when we are really talking logically, i.e., talking about linking two things together that are related in some way so that we can do something. There are several ways of achieving that technically.

As with journal article and dataset publishing, it is a joint responsibility managed through a contract. An agent acting on the publisher’s behalf captures metadata and registers DOIs; and maintains the availability/functioning of those. The publisher is responsible for and motivated to maintain the availability of the article/dataset and accuracy of the metadata, and to provide updates about that to the agent. The publisher pays the agent by some mechanism - not always directly.

@dshorthouse and @hardistyar have expanded on this with some notes about the allocation of responsibilities for each part of a digital extended specimen in the sub-topic on structure and responsibilities of a #digextspecimen.

Who makes landing pages depends. Actually, many collections already make landing pages. Motivation and return on investment are important to consider. If collections-holding institutions want their data to be more widely used and to be more relevant to society, the economy, etc., so they can receive more income (either directly through use of data or indirectly through public funding), then new responsibilities must be taken on.

A robust agent/publisher contract would contribute to avoiding undesirable drift. But also, DS have their own lifecycle. Although this is associated with the lifecycle of the corresponding physical specimen it is also quite independent of it. We can expect presentations and uses of digital extended specimens to evolve in new directions.

Or for specimens that no longer exist because they have been lost or destroyed. In the latter case, the specimen could have been destroyed after the DS was created, or before, with lots of data about the specimen still being known. As I noted in another reply, the lifecycle of a DS can evolve independently of the lifecycle of a PS.

@Markus_B Thanks a lot for your detailed answer. It is helping me to understand why I find the ongoing legal debate about “DSI” in parts so detached from the reality of its wider context. I am recognizing just now that when I am discussing ABS policy solutions for DSI, I am already intuitively taking into account and talking about the whole set of possible DS/ES information.

DNA-sequence information gains value mainly by being associated with phenotypic or additional extended data (see e.g. the information sections proposed by @hardistyar in the subtopic thread Structure and responsibilities of a #digextspecimen).

DNA-sequence datasets in isolation are “only” of interest to genome scientists; otherwise they aren’t really that interesting, e.g. to society.

Once information about the geographic coordinates of the sample (-> physical specimen) is added to a DNA sequence, population genomicists start to get excited. Statistical approaches for reliably resolving population structure and diversity at increasingly smaller scales are currently a rapidly advancing scientific field. As a result, these datasets now intersect with legal, social, economic and conservation spheres in the form of forensics, certification and monitoring. Simply but pivotally, by adding geographic information, DNA-sequence information starts to collide with, e.g., the 200 billion annual profit sector of environmental crime. Now you have all kinds of interests at play.

These societal interests non-linearly expand and intensify with the association of phenotypic, ecological and environmental information - the core information of specimens in collections and of biodiversity records (ABCD: units - not sure about the appropriate term).

In most applied fields, biodiversity is recognized as a topic of fundamental importance, yet it is a nuisance topic - not unlike the situation in human medical R&D. Having a background of 10+ years in forest genetics, in my experience the majority of stakeholders there are not interested in biodiversity per se, they want to know if the timber they are buying is the real deal and not a cheaper substitute; how to breed “Spessart” or “Slavonian” high quality oaks; and which species and provenance mix will be able to hold up under and adapt to climate change.

All of this requires reliable geographic origin, phenotypic, ecological and environmental information: exactly the information the DS/ES infrastructure of the biodiversity sciences intends to provide through digitalisation (thanks for pointing out the crucial role of digitalisation). Suddenly, dusty old natural history collections aren’t a romantic, slightly backwards enterprise that is a financial sink. They, via their data, become a goldmine.

That extended data are a game changer is mentioned in this article by Powell 2021, part of a series on the 20th anniversary of the human genome:

Bahlo and others say that data federation efforts become even more important as the field pivots to digging deeper into phenotype data, which have grown in scope and complexity. “That data comes in all sorts of forms — environmental exposures, smoking status, medical imaging data,” says Bahlo.

Therefore, I completely agree with your point of view that all biodiversity information is, should be or will be subject to questions regarding access and benefit sharing, and hence discussions about ABS within the Convention on Biological Diversity. My guess is that the narrow focus on DSI can be explained by historical processes.

The information by the Secretariat of the CBD is that there will be a decision at the next Conference of the Parties (COP 15), tentatively this year. Likely, it will not be possible so late in the policy process to change the object from “DSI” to “biodiversity information - DS/ES information” in general.

However, given the time constraints imposed by global change, it seems crucial to consider already now, in parallel, the consequences of implementations of the DS/ES concept when discussing ABS policy options for DSI. A well-shaped set of options for DSI could then be more easily expanded to include all of biodiversity information in the future.

In return, it is necessary to already now consider the implications of the DS/ES concept and its implementation(s) for access and benefit sharing, as well as with regard to privacy concerns = the protection of sensitive data and proprietary information.

While the ethical foundations start out well in this topic, the discussion about applied consequences is moving quickly into the subject of topic 5 Analyzing/mining specimen data for novel applications.

@Markus_B Maybe part of the answer is that reality shows us that “open” will need to be defined on a case-by-case basis.

At the same time, we can build infrastructures that nevertheless provide the foundation for access and sharing. With such infrastructures in place, once you have gained permission for a specific use and a certain set of information, it is easy to get the data and work with it.

For this to work, you need to be able to find data that exists. Thus, at least metadata needs to be available to learn about existing data. This metadata, I think, will look different at different levels of sensitivity and proprietorship.
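As a rough illustration of what I mean by metadata looking different at different levels, here is a small sketch. The three tiers, the field names and the rounding of coordinates to one decimal place are all my own assumptions for the example, not an existing standard or GBIF practice.

```python
# Hypothetical full record; all values are made up for illustration.
FULL_RECORD = {
    "ds_id": "ds:42",
    "scientificName": "Quercus robur",
    "holder": "Example Museum",
    "decimalLatitude": 51.4967,
    "decimalLongitude": -0.1764,
    "collector": "A. Person",
}


def discovery_metadata(record, sensitivity):
    """Return the metadata exposed for discovery at a given sensitivity tier."""
    if sensitivity == "open":
        # Everything is visible.
        return dict(record)
    if sensitivity == "restricted":
        # Taxon and holder are visible; coordinates are generalized.
        view = {k: record[k] for k in ("ds_id", "scientificName", "holder")}
        view["decimalLatitude"] = round(record["decimalLatitude"], 1)
        view["decimalLongitude"] = round(record["decimalLongitude"], 1)
        view["access"] = "on request"
        return view
    if sensitivity == "closed":
        # Only the existence of the data and a point of contact are visible.
        return {
            "ds_id": record["ds_id"],
            "holder": record["holder"],
            "access": "contact holder",
        }
    raise ValueError(f"unknown sensitivity tier: {sensitivity}")


restricted_view = discovery_metadata(FULL_RECORD, "restricted")
closed_view = discovery_metadata(FULL_RECORD, "closed")
```

Even in the "closed" tier a user can still learn that the data exists and whom to ask, which is the minimum needed for findability.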

This guide published by GBIF, “Current Best Practices for Generalizing Sensitive Species Occurrence Data” by Chapman 2020, might be a starting point for how to design different categories of sensitivity and what the associated metadata could look like.

I recently saw this policy document from a Canadian project that deals with water/river basin data (DataStream Data Policy), which mentions the idea of Ethically Open Access. I think this might be an avenue to attempt to combine FAIR and CARE.

Ethically Open Access

Data are made available on an equal basis, fully, freely and openly in a timely way. Exemptions to this open data policy are allowable for ethical reasons.

Trying to combine FAIR and CARE will be challenging on both the technical side (we need more sophistication than just plain access control) and the non-technical side (legal and cultural). The conversations around Nagoya/CBD and the points raised by @MAFleming in this post (in the Extending, enriching and integrating section) are good starting points to think about a more coordinated discussion. A first step might be to create or use a venue where all the stakeholders can discuss the issues, maybe in RDA, where members of this community are also heavily involved. There is an International Indigenous Data Sovereignty IG and a Biodiversity Data Integration IG. Could discussions from this consultation feed into a cross-IG collaboration? CETAF could also be interested in this discussion.

We (NHMUK) regularly acquire object/specimen data for other institutions which can range from other London museums, the European Space Agency/NASA, and our government. For most of these contracts we get copyright signed over but not for all of them. Sometimes they retain complete ownership and we keep the data as a back-up. This can include clinical data associated with specimens (e.g. for neglected tropical disease research).

For the majority of this specimen data we would want to serve it as FAIR data. We would not necessarily publish it in a public portal, but it should be possible for other potential users to know that it exists.