Version control of a dataset

Hello all,

This is my first post to the GBIF Community, so please be kind if I didn’t follow some rules or whatever.

I have a large collection of datasets maintained in our own IPT and published to GBIF. Some of the datasets grow and others may change (attributes added or deleted, format changed, etc.). According to the post “When to assign a new DOI to an existing dataset?”, a DOI is assigned when a dataset is registered, and it does not change when the dataset is updated or changed.

So, I wonder what the best practice for version control is. Is there a version number in the metadata of a GBIF dataset, similar to the one within IPT (emlVersion; IPT can also hold versionHistory)? How should researchers cite a particular version of a GBIF dataset? Does GBIF archive the old versions and make them accessible?

A typical situation I have in mind: a dataset gets registered with GBIF on, say, May 1st, 2021. I add new records to the same dataset and publish it in IPT every month, which is automatically reflected in GBIF (but the DOI remains the same). A researcher downloads the dataset at some point, say June 1st, 2021, uses it for an analysis, and publishes a journal article. The journal requires open access to the dataset, so the researcher wants to provide a URL to the June 1st version. The DOI alone can’t satisfy this, because the latest dataset has more records than the June 1st version.

As in the post I mentioned above, you may suggest registering the dataset as a new dataset to get a new DOI every time it grows. But that is very difficult to maintain and it would register redundant and duplicate records which would confuse the users and may cause invalid outputs like an overestimate of species abundance. So, I don’t think it’s a good idea.

When a user downloads data from GBIF, the specific download is assigned a new, unique DOI that refers to the dataset as it was at that specific time. This DOI can be cited in journals and provides open access to the data as of the time of download. Anyone accessing the DOI will also have the option of rerunning the query to obtain any records added since the original download.

The above applies whether a query is for a single dataset or, e.g., for a given taxon across numerous datasets.
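For anyone who wants to check what a download DOI points to, here is a minimal Python sketch that reads the metadata of one download from the public GBIF API and pulls out the DOI and a few dates. The endpoint path and the field names (`doi`, `created`, `totalRecords`, `eraseAfter`) are written from memory and should be verified against the GBIF API documentation; the helper names and the download key in the usage note are hypothetical.

```python
import json
from urllib.request import urlopen

GBIF_API = "https://api.gbif.org/v1"


def download_metadata_url(download_key: str) -> str:
    """Build the metadata URL for one occurrence download (assumed endpoint path)."""
    return f"{GBIF_API}/occurrence/download/{download_key}"


def fetch_download_metadata(download_key: str) -> dict:
    """Fetch the JSON metadata of a download. Needs network access."""
    with urlopen(download_metadata_url(download_key)) as resp:
        return json.load(resp)


def summarize_download(meta: dict) -> dict:
    """Keep only the fields that matter for citing a download.

    Field names are assumptions; .get() is used so that a missing
    field yields None instead of raising an error.
    """
    doi = meta.get("doi")
    return {
        "doi_url": f"https://doi.org/{doi}" if doi else None,
        "created": meta.get("created"),
        "total_records": meta.get("totalRecords"),
        "erase_after": meta.get("eraseAfter"),
    }
```

Usage would look like `summarize_download(fetch_download_metadata("0000000-000000000000000"))`, with the placeholder replaced by the key of a real download.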

Thank you very much for the very helpful information. I didn’t know that each download gets a unique DOI. I think that is what researchers need!

@efujioka @dnoesgaard Please note an earlier post related to this topic: “When to assign a new DOI to an existing dataset?” (#3 by jhpoelen).

If I understand it correctly, GBIF download DOIs describe the query, not necessarily the dataset associated with the download: the data might be deleted after some time (six months?) if the download DOI is not cited in a peer-reviewed publication. The query can be re-run, but the recreated result is likely to differ, because the query is run against the current snapshot of the data. Even if the data remains, there’s no clear relation between the DOI and the dataset (for a detailed analysis, see “allow for tracking GBIF query/download DOIs”, Issue #63, bio-guoda/preston on GitHub).

Similarly, the GBIF dataset DOIs describe the dataset metadata, not the data associated with the dataset at some point in time.

The (useful!) purpose of the GBIF DOIs is primarily to attribute the institutions, collections and people behind the data, not specific versions of the data.

Also, because downloads are compiled from GBIF-interpreted (or “mediated”) data, it is not straightforward to trace which exact version of the original dataset was used as a basis for the derived data.

I am open to discussing more systematic approaches to reliably referencing specific versions of datasets and their origin. See, e.g., Elliott et al. 2020 (disclaimer: I am one of the co-authors).

Please do holler if I am misrepresenting / misunderstanding how the GBIF DOIs are designed to work: eager to learn and/or gain a better understanding of the GBIF infrastructure.
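To make the idea of referencing a specific version concrete, here is a minimal sketch of a content-based identifier: hash the exact bytes of the downloaded archive and cite the hash alongside the DOI. This is only an illustration of the general technique, not a GBIF feature; the `hash://sha256/...` notation follows the style used by Preston as far as I understand it, and the function name and the file name in the usage note are hypothetical.

```python
import hashlib


def content_id(path: str, chunk_size: int = 1 << 20) -> str:
    """Return a content-based identifier for the file at `path`.

    The file is read in chunks, so large archives do not need to fit
    in memory. Any change to the bytes yields a different identifier.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return f"hash://sha256/{digest.hexdigest()}"
```

For example, `content_id("dwca-2021-06-01.zip")` computed on June 1st identifies exactly that archive; anyone who later obtains a file with the same identifier knows they have the same bytes, even if the DOI now resolves to a newer version.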

@jhpoelen, your points are very valid. A download DOI describes the query and may be linked to the dataset returned at the time of the query. While we guarantee to keep the data for 6 months, in reality only extremely large datasets are ever deleted.

Rerunning the query at a later time may produce a different result. Download datasets are indeed snapshots of data; apart from being linked to the parent datasets in GBIF, they cannot be reproduced at the record level, as occurrence IDs are not persistent.