When to assign a new DOI to an existing dataset?

The IPT Dataset Versioning Policy establishes that a new DOI should follow a major version after a ‘scientifically significant change’. Beyond the first version and DOI assignment, when do you consider a new DOI should be created for an existing dataset? What criteria have you established at your node or institution for deciding that a change is ‘scientifically significant’ enough to warrant a new DOI?

Take for example a biological collection: dataset version 1.0 has 300 occurrences, and the following update adds 600 occurrences plus minor (format) updates to the initial 300. This is a major change that adds more scientifically relevant data without affecting the original records of the dataset. Would you request a new DOI?

What about national and thematic checklists?

**Some key points for this question from the IPT Dataset Versioning Policy**

  • A new major version leads to the creation of a new DOI, whereas a new minor version does not.
  • A new major version is assigned to the dataset (a) the first time it’s published, or (b) after it has been republished following one or more scientifically significant changes to the dataset. The publisher must decide what constitutes a scientifically significant change (see definition below for help).
  • A scientifically significant change (a) typically affects the majority of records in the dataset, and (b) could change the results of a scientific analysis using the dataset.
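To make criterion (a) concrete, here is a minimal sketch of how a publisher might turn "affects the majority of records" into a repeatable rule. The 50% threshold and the way added records are counted are my own assumptions for illustration, not part of the policy text:

```python
def suggest_version_bump(total_records, records_changed, records_added):
    """Illustrative heuristic (not the IPT policy itself): treat a change
    as 'scientifically significant' -- and thus a major version / new
    DOI -- when it touches or adds more than half of the existing records."""
    affected = records_changed + records_added
    if total_records == 0 or affected / total_records > 0.5:
        return "major"  # would trigger a new DOI
    return "minor"      # same DOI, new minor version

# The example from the question: 300 original records,
# all 300 reformatted plus 600 new records added
suggest_version_bump(300, 300, 600)  # -> "major"

# A small correction affecting 10 of 300 records
suggest_version_bump(300, 10, 0)     # -> "minor"
```

Of course, a purely count-based rule ignores criterion (b) (whether analysis results could change), which is exactly the part that is hard to automate.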

I found point (b) difficult to apply: most data-use cases involve multiple datasets rather than a single one, the effect of a dataset change may be significant for one use and not for another, and these uses are hard to foresee at publishing time.

Probably many of you already have a clear understanding of the topic, so please shed some light!


Hi Camila,

Thanks for raising this topic! I’m not sure I can provide a bullet-proof response, but I’m happy to share some thoughts.

I am somewhat reluctant about assigning new DOIs to existing resources—unless that resource completely changes. And in that case, I would probably consider leaving the “old” resource as it is—and creating a brand-new resource, if that option is feasible. Would adding 10 new records to a dataset constitute a ‘scientifically significant’ change? Probably not. Would adding 10,000 records? Perhaps, but is it worth adding a new DOI?

The benefit of having different DOIs for different versions is of course that users are able to cite a specific version of the dataset in a machine-readable manner. But in my opinion this can also lead to confusion. And, as you point out, the vast majority of users who access data through GBIF will download based on a specific taxonomic, geographic and/or temporal focus—and obtain data from multiple datasets—for which a unique download DOI is assigned. In this case, the citation of data will be very specific, as it will point to the contributing datasets and allow others to re-download the exact same data.

I’m actually not familiar with many IPT users who have taken advantage of versioning through new DOIs. I believe the option is only available for organizations that have their own DataCite credentials—which isn’t that many. But perhaps some of them can chime in here as well? Would be great to get some more insights!

Best regards,
Daniel


Hi Camila @camisilver -

With interest, I read your questions and comments about how to assign DOIs to different versions of collection datasets.

Before answering your specific questions, I wanted to share some context.

GBIF has put great effort into helping to cite existing collections in publications by issuing DOIs for all registered collection datasets.

The main purpose of the GBIF DOIs is to attribute the people and institutions associated with collections, and to track usage of the valuable datasets they produce in scientific publications. This is why (correct me if I am wrong) Daniel is hesitant to issue a new DOI for a collection: GBIF collection DOIs cite information about a collection (e.g., title, contributors, institution, IPT URLs), not a specific version of the data that a collection happens to make available at a given time.

Unfortunately, DOIs are, by design, not sufficient to reliably cite specific dataset versions, and more work is needed to adopt methods that complement the existing citation infrastructure.

I hope that GBIF will take note of projects like Software Heritage (https://softwareheritage.org) and complement their existing DOI-based citations with reliable (or “context-free”) data references that help unambiguously reference specific dataset versions. One of their publications [1] provides an overview of identifier schemes. While their focus is on software preservation, the same case can be made for data: simply replace all occurrences of “software” and “source code” with “datasets” and “data records” respectively. I wish that I had read and cited their work in our recent publication [2] and talks [3,4] on the subject of reliable dataset references. These works are more specific to biodiversity datasets (e.g., how to reliably cite >500M records in eBird).
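To illustrate what a “context-free” reference looks like in practice, here is a minimal sketch that derives an identifier from the dataset bytes themselves. The `hash://` URI shape follows the pattern used in the references above; the function name and parameters are my own for illustration:

```python
import hashlib

def content_id(path, algorithm="sha256", chunk_size=8192):
    """Compute a content-based identifier for a dataset file.
    Unlike a DOI, which points at a landing page that may change,
    this identifier is derived from the bytes themselves, so it can
    only ever reference one exact version of the data."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # read in chunks so large archives don't need to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return f"hash://{algorithm}/{h.hexdigest()}"
```

Anyone holding a copy of the file can recompute the hash and confirm they have exactly the cited version, without consulting a resolver—which is what makes such references a useful complement to, rather than a replacement for, DOIs.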

Now, for your specific questions:

I would like to know when do you consider a new DOI should be created for an old dataset?

To be consistent with the GBIF DOI approach, I’d only mint DOIs when registering new collections.

What have you established for your node or institutions that a change is ‘scientifically significant’ to get a new DOI?

I am not quite sure what “scientifically significant” means in the context of the IPT guidelines. For instance, if I wanted to reproduce an analysis or study, I’d want to make sure to do this with the exact same copy of the data referenced. This is why I would find it “scientifically significant” to keep identifiers for each and every version that is published. However, I am sure that others have different interpretations of the phrase. I hope that the IPT guidelines will be updated to use more specific language around DOI usage.

Also, like I mentioned earlier, DOIs alone are not sufficient to reliably reference dataset versions.

I hope my perspective is useful to you and am curious to hear your remaining thoughts/comments,

thx,

-jorrit

Research Scholar, Ronin Institute
http://ronininstitute.org/research-scholars/jorrit-poelen/

Global Biotic Interactions
https://globalbioticinteractions.org

References

[1] Di Cosmo, R., Gruenpeter, M., & Zacchiroli, S. (2019). Identifiers for Digital Objects: The case of software source code preservation. https://doi.org/10.17605/OSF.IO/KDE56

[2] Elliott, M. J., Poelen, J. H., & Fortes, J. (2020). Toward reliable biodiversity dataset references. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2020.101132

[3] Poelen, J. H., Fortes, J., & Elliott, M. J. (2020). Reliable dataset identifiers are essential building blocks for reproducible research. https://doi.org/10.17605/OSF.IO/AT4XE (includes video!)

[4] Poelen, J. H., & Boettiger, C. (2020). Reliable Data Use In R. https://doi.org/10.17605/OSF.IO/VKJ9Q (includes video!)


Thanks, @dnoesgaard, for your answer. Colombia is one of the nodes with DataCite credentials, so we assign DOIs at the IPT level; that’s also why we were interested in hearing external thoughts on the topic. Until now we have used only one DOI per resource, which, given your and Jorrit’s answers, I’m glad looks like a good decision.

Thanks, @jhpoelen, the GBIF context you provided is very helpful for the question raised. On the other hand, I was unaware of the limitations of DOIs for reliably citing specific dataset versions, and the references you mention open an interesting line of discussion.

I see two needs:

  1. For data use and reproducible research, having DOIs for each download covers the need to track a specific dataset (composed of parts of many resources), since the DOIs are maintained when the downloaded dataset is cited. So at the data-portal level, the current GBIF approach is (I believe) good enough.

  2. For data publishing at the IPT (or other publishing instance) level, we may need to improve the system to provide more reliable dataset identifiers, given Jorrit’s references. This is probably a very interesting topic to discuss in the context of data papers, where the paper refers to a very specific version of the data.

