Toward Reliable Biodiversity Dataset References

Hey Daniel -

Thanks for responding and taking the time to read the preprint.

Please see my comments below.

Thanks for posting that here, Jorrit. I had spotted the paper previously, but hadn’t had a chance to look at it in detail.

From the GBIF point of view, I think that an important aspect is missing from your paper. First of all, all datasets registered with GBIF are assigned DOIs, and even though you’re absolutely right that a DOI is only as good as the URL that it resolves to, I would venture that all GBIF dataset DOIs resolve to landing pages. Naturally, if a web server is down, so are all its resources. But aside from that, I’ve yet to find a DOI under our prefix (10.15468) that doesn’t resolve correctly. I’d be interested in seeing some of the raw results from your Preston tracker, though (I’m not familiar with the approach).

GBIF dataset/download DOIs and their associated landing pages were not used in our study. This was because we were interested in studying the origin and behavior of the raw source material of GBIF: the source datasets registered by participating institutions.

The findings in the preprint are derived from the Preston archives that are referenced in the preprint. One such reference points to the Biodiversity Dataset Archive (Jorrit Poelen) on the Internet Archive. Its description/README will point you to the raw Preston provenance logs related to the biodiversity datasets registered in GBIF. Also, see the bio-guoda/preston repository on GitHub (a biodiversity dataset tracker) for some provenance log examples.

With that being said, every single download event from GBIF.org is assigned its own unique DOI, which resolves to a landing page describing the details of the search criteria, the provenance of records (i.e. which datasets contributed records to the download archive), and finally a link to re-download the archive. These landing pages are permanent. The archive may be deleted if the download remains uncited, but we strive to keep them around for as long as possible. Any download cited in a study will be flagged and kept indefinitely.

While your paper touches on both DOIs and the concept of “download events”, I think that our approach solves some of the issues you raise—but is to some extent ignored in the paper?

Our results show that referencing datasets by URL leads to link rot and content drift. In addition, when using a URL as a reference, the reader has no way to verify whether the retrieved data has changed since it was initially referenced. Because GBIF dataset/download page DOIs still reference the underlying data by URL, these issues remain. Also, the “download events” in the context of our paper differ from GBIF download events in that we cryptographically link a dataset archive version to a URL to allow for content-based addressing of a dataset version.
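To make the idea of linking a URL to a content hash a bit more concrete, here is a minimal Python sketch. It is an illustration only, not Preston's actual implementation or log format: the `DownloadEvent` record, the function names, the example URL, and the `hash://sha256/...` identifier form are all assumptions made for this example. It shows how recording a SHA-256 hash at retrieval time lets a later reader detect content drift, which a URL alone cannot do.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DownloadEvent:
    """Hypothetical record tying a URL to the exact bytes retrieved from it."""
    url: str
    content_id: str
    retrieved_at: str


def content_id(data: bytes) -> str:
    # Derive an identifier from the content itself, not from its location.
    # The "hash://sha256/" prefix is an assumed notation for this sketch.
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


def record_download(url: str, data: bytes) -> DownloadEvent:
    # Capture which bytes the URL returned, and when.
    return DownloadEvent(
        url=url,
        content_id=content_id(data),
        retrieved_at=datetime.now(timezone.utc).isoformat(),
    )


def has_drifted(event: DownloadEvent, data_now: bytes) -> bool:
    # True if the bytes retrieved now differ from those originally recorded.
    return content_id(data_now) != event.content_id


# Usage with in-memory bytes standing in for a dataset archive.
original = b"occurrenceID,scientificName\n1,Puma concolor\n"
event = record_download("https://example.org/dataset.zip", original)

changed = original + b"2,Lynx rufus\n"
print(has_drifted(event, original))  # False: same bytes as recorded
print(has_drifted(event, changed))   # True: content behind the URL changed
```

The point of the sketch is that the reference (the content identifier) can be checked against whatever bytes are retrieved later, regardless of where they come from.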

As a disclaimer I should add that I have only read your paper once and I might have missed some details. I would love to hear your thoughts on my points.

I hope I was able to address your questions and comments. Please let me know if you have any remaining/additional concerns/ideas related to our preprint.

Thanks again for your feedback,

-jorrit