Version control of a dataset

@efujioka @dnoesgaard Please note an earlier post related to this topic at When to assign a new DOI to an existing dataset? - #3 by jhpoelen .

If I understand it correctly, GBIF download DOIs describe the query, not necessarily the dataset associated with the download: the data might be deleted if the download DOI is not cited in a peer-reviewed publication after some time (six months?). The query can be re-run, but the results of a re-run are likely to differ, because the query is executed against the current snapshot of the data. Even if the data remains, there’s no clear relation between the DOI and the dataset (for a detailed analysis, see allow for tracking GBIF query/download DOIs · Issue #63 · bio-guoda/preston · GitHub ).
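To make this concrete, here's a minimal sketch assuming the public GBIF occurrence download API at api.gbif.org (the download key below is a made-up placeholder). If I read the API docs right, the download record stores the query (the "predicate") and some status fields, but no cryptographic fingerprint of the downloaded bytes:

```python
# A minimal sketch, assuming the public GBIF occurrence download API
# at https://api.gbif.org/v1/occurrence/download/{key} .
# The download key below is a hypothetical placeholder.
import requests

key = "0000000-000000000000000"  # hypothetical download key
resp = requests.get(f"https://api.gbif.org/v1/occurrence/download/{key}")
resp.raise_for_status()
record = resp.json()

# The record describes the *query* (predicate) and its status,
# not a content hash of the resulting archive, so the DOI points
# at a query description rather than at specific bytes.
print(record.get("doi"))
print(record.get("request", {}).get("predicate"))
print(record.get("status"), record.get("totalRecords"))
```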

Similarly, the GBIF dataset DOIs describe the dataset metadata, not the data associated with the dataset at any specific point in time.
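Again, a sketch under the same assumptions, this time against the public GBIF registry API (the UUID is a placeholder): the dataset record carries descriptive metadata, and as far as I can tell nothing in it pins the record-level content at a point in time:

```python
# A minimal sketch, assuming the public GBIF registry API at
# https://api.gbif.org/v1/dataset/{uuid} ; the UUID is a placeholder.
import requests

uuid = "00000000-0000-0000-0000-000000000000"  # hypothetical dataset UUID
meta = requests.get(f"https://api.gbif.org/v1/dataset/{uuid}").json()

# Descriptive metadata only: title, DOI, modification date, etc.
print(meta.get("doi"), meta.get("title"), meta.get("modified"))
```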

The (useful!) purpose of the GBIF DOIs is primarily to attribute the institutions, collections and people behind the data, not specific versions of the data.

Also, because downloads are compiled from GBIF-interpreted (or “mediated”) records, it is not straightforward to trace which exact version of the original dataset was used as a basis for the derived data.

I am open to discussing more systematic approaches to more reliably reference specific versions of datasets and their origin. See, e.g., Elliott et al. 2020 https://doi.org/10.1016/j.ecoinf.2020.101132 (disclaimer: I am one of the co-authors).
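The core idea there is content-based identifiers: name the data by a cryptographic hash of its bytes, so the same identifier always refers to the same content, independent of where (or whether) it is hosted. A minimal sketch, using the hash://sha256/ notation that Preston uses (the filename is a placeholder for any downloaded dataset archive):

```python
# A minimal sketch of a content-based identifier: a cryptographic
# hash of the bytes themselves, rendered in the hash://sha256/...
# notation used by Preston. The filename is a hypothetical placeholder.
import hashlib

def content_id(path, chunk_size=1 << 20):
    """Return a hash-based identifier for the file's exact bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return f"hash://sha256/{digest.hexdigest()}"

print(content_id("occurrence-download.zip"))  # hypothetical file
```

Unlike a DOI, such an identifier can be independently verified by anyone holding a copy of the data: if the hashes match, the content is byte-for-byte identical.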

Please do holler if I am misrepresenting or misunderstanding how the GBIF DOIs are designed to work: I am eager to learn and gain a better understanding of the GBIF infrastructure.