When to assign a new DOI to an existing dataset?

Hi Camila @camisilver -

With interest, I read your questions and comments about how to assign DOIs to different versions of collection datasets.

Before answering your specific questions, I wanted to share some context.

GBIF has put great effort into helping to cite existing collections in publications by issuing DOIs for all registered collection datasets.

The main purpose of the GBIF DOIs is to credit the people and institutions associated with collections and to track the use of the valuable datasets they produce in scientific publications. This is why (correct me if I am wrong) Daniel is hesitant to issue a new DOI for a collection: GBIF Collection DOIs cite information about a collection (e.g., title, contributors, institution, IPT URLs), not a specific version of the data that a collection happens to make available at a given time.

Unfortunately, DOIs are, by design, not sufficient to reliably cite specific dataset versions, and more work is needed to adopt methods that complement the existing citation infrastructure.

I hope that GBIF will take note of projects like Software Heritage (https://softwareheritage.org) and complement its existing DOI-based citations with reliable (or “context-free”) data references that unambiguously identify specific dataset versions. One of the Software Heritage publications [1] provides an overview of identifier schemes. While their focus is on software preservation, you can make the same case for data: simply replace all occurrences of “software” and “source code” with “datasets” and “data records” respectively. I wish that I had read and cited their work in our recent publication [2] and talks [3,4] on the subject of reliable dataset references. Our publication and talks are more specific to biodiversity datasets (e.g., how to reliably cite >500M records in eBird).
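To make “context-free” a bit more concrete, here is a minimal Python sketch, assuming that a SHA-256 content hash is used as the dataset reference. The function names, the `hash://sha256/` prefix, and the tiny CSV snippets are illustrative assumptions only; this is not something GBIF or the IPT provides today.

```python
import hashlib


def content_reference(data: bytes) -> str:
    """Derive a reference from the dataset bytes alone (no registry needed)."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


def matches(data: bytes, reference: str) -> bool:
    """Check whether a copy of the data is the exact version that was cited."""
    return content_reference(data) == reference


# Two hypothetical versions of the same collection dataset.
version_1 = b"occurrenceID,scientificName\n1,Puma concolor\n"
version_2 = b"occurrenceID,scientificName\n1,Puma concolor\n2,Lynx rufus\n"

ref_1 = content_reference(version_1)
ref_2 = content_reference(version_2)

# The collection-level DOI would stay the same for both versions,
# while the content-based references differ.
print(ref_1 == ref_2)
print(matches(version_1, ref_1))
```

In this sketch the reference changes whenever a single byte of the data changes, so anyone holding a copy can check for themselves whether it is the version that was cited.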

Now, for your specific questions:

I would like to know when do you consider a new DOI should be created for an old dataset?

To be consistent with the GBIF DOI approach, I’d only mint DOIs when registering new collections.

What have you established for your node or institutions that a change is ‘scientifically significant’ to get a new DOI?

I am not quite sure what “scientifically significant” means in the context of the IPT guidelines. For instance, if I wanted to reproduce an analysis or study, I’d want to make sure to do this with the exact same copy of the data that was referenced. This is why I would find it “scientifically significant” to keep identifiers for each and every version that is published. However, I am sure that others have different interpretations of the phrase. I hope that the IPT guidelines will be updated to use more specific language around DOI usage.

Also, as I mentioned earlier, DOIs alone are not sufficient to reliably reference dataset versions.

I hope my perspective is useful to you, and I am curious to hear any remaining thoughts/comments,

thx,

-jorrit

Research Scholar, Ronin Institute

Global Biotic Interactions

References

[1] Di Cosmo, R., Gruenpeter, M., & Zacchiroli, S. (2019, June 20). 204.4 Identifiers for Digital Objects: The case of software source code preservation. OSF. https://www.softwareheritage.org/wp-content/uploads/2020/01/ipres-2018-swh.pdf

[2] Elliott, M. J., Poelen, J. H., & Fortes, J. (2020). Toward reliable biodiversity dataset references. Ecological Informatics.

[3] Poelen, J. H., Fortes, J., & Elliott, M. J. (2020). Reliable dataset identifiers are essential building blocks for reproducible research. OSF (includes video!)

[4] Poelen, J. H., & Boettiger, C. (2020). Reliable Data Use In R. OSF (includes video!)
