Toward Reliable Biodiversity Dataset References

jhpoelen · January 16, 2020, 10:43pm

Hey y’all -

Jen Hammock of the Encyclopedia of Life hinted (see https://gitter.im/EOL/eol?at=5e1fde41a50f33623f44fa87) that this EcoEvoRxiv preprint might be of interest to the GBIF community. Note that the preprint has not yet been peer-reviewed, but has been submitted. Curious to hear your take on it. (disclaimer: I am second author)

Elliott, M. J., Poelen, J. H., & Fortes, J. (2020, January 3). Toward Reliable Biodiversity Dataset References. EcoEvoRxiv. https://doi.org/10.32942/osf.io/mysfp

For those who do not like to click on DOI links, here is the abstract:

Toward Reliable Biodiversity Dataset References
Abstract

No systematic approach has yet been adopted to reliably reference
and provide access to digital biodiversity datasets. Based on
accumulated evidence, we argue that location-based identifiers such
as URLs are not sufficient to ensure long-term data access. We
introduce a method that uses dedicated data observatories to
evaluate long-term URL reliability.

From March through October of 2019, we took periodic
inventories of the data served by major biodiversity aggregators,
including GBIF, iDigBio, DataONE, and BHL. Over the period of
observation, we found that, for each network, 5% to 43% of
registered URLs were intermittently or consistently unresponsive,
0% to 63% produced unstable content, and 13% to 76% became
either unresponsive or unstable.

We propose the use of cryptographic hashing to generate
content-based identifiers that can reliably reference datasets. We
show that content-based identifiers facilitate decentralized archival
and reliable distribution of biodiversity datasets to enable long-term
accessibility of the referenced datasets.

dnoesgaard · January 17, 2020, 1:32pm

Thanks for posting that here, Jorrit. I had spotted the paper previously, but hadn’t had a chance to look at it in detail.

From the GBIF point of view, I think that an important aspect is missing from your paper. First of all, all datasets registered with GBIF are assigned DOIs, and even though you’re absolutely right that a DOI is only as good as the URL that it resolves to, I would venture that all GBIF dataset DOIs resolve to landing pages. Naturally, if a web server is down, so are all its resources. But aside from that, I’ve yet to find a DOI under our prefix (10.15468) that doesn’t resolve correctly. But I’d be interested in seeing some of the raw results from your Preston tracker (I’m not familiar with the approach).

With that being said, every single download event from GBIF.org is assigned its own unique DOI, which resolves to a landing page describing the details of the search criteria, the provenance of records (i.e. which datasets contributed records to the download archive), and finally a link to re-download the archive. These landing pages are permanent. The archive may be deleted if the download remains uncited, but we strive to keep them around for as long as possible. Any download cited in a study will be flagged and kept indefinitely.

While your paper touches on both DOIs and the concept of “download events”, I think that our approach solves some of the issues you raise—but is to some extent ignored in the paper?

As a disclaimer I should add that I have only read your paper once and I might have missed some details. I would love to hear your thoughts on my points.

Thanks,
Daniel

jhpoelen · January 18, 2020, 2:26am

Hey Daniel -

Thanks for responding and taking the time to read the preprint.

Please see my comments below.

Thanks for posting that here, Jorrit. I had spotted the paper previously, but hadn’t had a chance to look at it in detail.

From the GBIF point of view, I think that an important aspect is missing from your paper. First of all, all datasets registered with GBIF are assigned DOIs, and even though you’re absolutely right that a DOI is only as good as the URL that it resolves to, I would venture that all GBIF dataset DOIs resolve to landing pages. Naturally, if a web server is down, so are all its resources. But aside from that, I’ve yet to find a DOI under our prefix (10.15468) that doesn’t resolve correctly. But I’d be interested in seeing some of the raw results from your Preston tracker (I’m not familiar with the approach).

GBIF dataset/download DOIs and their associated landing pages were not used in our study. This was because we were interested in studying the origin and behavior of the raw source material of GBIF: the source datasets registered by participating institutions.

The findings in the preprint are derived from the Preston archives that are referenced in the preprint. One such reference points to the Biodiversity Dataset Archive hosted at the Internet Archive. Its description/README will point you to the raw Preston provenance logs related to the biodiversity datasets registered in GBIF. Also, see the bio-guoda/preston repository on GitHub (a biodiversity dataset tracker) for some provenance log examples.

With that being said, every single download event from GBIF.org is assigned its own unique DOI, which resolves to a landing page describing the details of the search criteria, the provenance of records (i.e. which datasets contributed records to the download archive), and finally a link to re-download the archive. These landing pages are permanent. The archive may be deleted if the download remains uncited, but we strive to keep them around for as long as possible. Any download cited in a study will be flagged and kept indefinitely.

While your paper touches on both DOIs and the concept of “download events”, I think that our approach solves some of the issues you raise—but is to some extent ignored in the paper?

Our results show that referencing datasets by URL leads to link rot and content drift. In addition, when using a URL as a reference, the reader has no way to verify whether the retrieved data has changed since it was initially referenced. Because GBIF dataset/download page DOIs still reference the underlying data by URL, these issues remain. Also, the “download events” in the context of our paper differ from GBIF download events in that we cryptographically link a dataset archive version to a URL to allow for content-based addressing of a dataset version.
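To make the verification point concrete, here is a minimal shell sketch; it is not Preston's actual implementation. It derives a hash://sha256/ identifier (the form that appears in Preston provenance logs) from the bytes of a local file and checks a file against a previously recorded identifier. The function names content_id and verify_content are made up for illustration, and the sketch assumes that either sha256sum or shasum is installed.

```shell
# Hypothetical helper: print a hash://sha256/ identifier for the file in $1.
content_id() {
  # Read the file via stdin so the tool prints "<hex digest>  -";
  # fall back to shasum on systems without sha256sum (an assumption).
  if command -v sha256sum >/dev/null 2>&1; then
    digest=$(sha256sum < "$1" | cut -d' ' -f1)
  else
    digest=$(shasum -a 256 < "$1" | cut -d' ' -f1)
  fi
  printf 'hash://sha256/%s\n' "$digest"
}

# Hypothetical helper: succeed only if the file in $1 still has exactly
# the content identified by the hash://sha256/ identifier in $2.
verify_content() {
  [ "$(content_id "$1")" = "$2" ]
}
```

Calling verify_content with a downloaded archive and the identifier recorded at citation time succeeds only if the bytes are unchanged, regardless of which URL or mirror the archive was retrieved from.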

As a disclaimer I should add that I have only read your paper once and I might have missed some details. I would love to hear your thoughts on my points.

I hope I was able to address your questions and comments. Please let me know if you have any remaining/additional concerns/ideas related to our preprint.

Thanks again for your feedback,

-jorrit

rdmpage · January 23, 2020, 2:54pm

I tweeted this preprint and there have been some responses indicating people are not sure what to make of this, e.g. https://twitter.com/dpsSpiders/status/1220336521917628419 @jhpoelen any thoughts?

jhpoelen · January 23, 2020, 11:21pm

@rdmpage Thanks for tweeting the preprint and letting me know of the responses. I dusted off my personal twitter account and replied via https://twitter.com/jhpoelen/status/1220482905379655682 -

GBIF data in figs 2,3 and table 1 of https://doi.org/10.32942/osf.io/mysfp … use the hashes of DwC-A datasets of institutional endpoints registered with GBIF. For related queries, see https://github.com/bio-guoda/preston-scripts/tree/master/query#capacity-queries … . Suggest to use https://discourse.gbif.org/t/toward-reliable-biodiversity-dataset-references/1637/4 … to discuss further. https://twitter.com/dpsSpiders/status/1220336521917628419 …

I am open to suggestions to help clarify the manuscript or respond to any remaining questions and concerns.

carrieseltzer · January 24, 2020, 6:26pm

I briefly reviewed the manuscript but I didn’t see any details about the underlying datasets accessed via the data networks. I’m (unsurprisingly) most interested in iNaturalist’s role in the GBIF churn. Is there a way to see what proportion of the GBIF URLs were from iNaturalist, and what proportion of them were unresponsive or unstable?

Since the iNaturalist community generates a living dataset, I would be unsurprised if there was high instability in the content of the URLs (i.e. identifications for an observation can change, other data elements can be corrected), and since content can be deleted there are also inevitably some unresponsive URLs (though hopefully that number is much lower).

jhpoelen · January 24, 2020, 7:51pm

@carrieseltzer Thank you for having a look at the manuscript and for providing comments and questions.

You can find all information related to the underlying datasets via the Preston data publications cited in Table 1 (see “A biodiversity dataset graph: GBIF, iDigBio, BioCASe” or the Biodiversity Dataset Archive hosted at the Internet Archive). These data archives include structured provenance logs of GBIF-registered iNaturalist data over time. The provenance logs can be queried using SPARQL or filtered with tools like “grep” to select graphs or lines associated with iNaturalist.

For instance, I imagine a crude way of selecting iNaturalist versions on a Mac/Linux terminal would be:

$ java -jar preston.jar ls --remote https://zenodo.org/record/3484205/files | grep "inaturalist" | grep "hasVersion" | uniq
<http://www.inaturalist.org/observations/gbif-observations-dwca.zip> <http://purl.org/pav/hasVersion> <hash://sha256/13b01131a4acdbdb11f3613a87c233b95ee0a4dfa9894f7ea093273c619c82e1> .
<http://www.inaturalist.org/observations/gbif-observations-dwca.zip> <http://purl.org/pav/hasVersion> <hash://sha256/0bbd0a6b0ed33dc50aa032ae97ce17915d204ced27354a8aaa96e816d2635767> .
…

SPARQL can be used to do more detailed searches (e.g., https://github.com/bio-guoda/preston-scripts/tree/master/query#capacity-queries ).

Alternatively, you can have a peek at https://hash-archive.org/history/http://www.inaturalist.org/observations/gbif-observations-dwca.zip to get a glimpse of changes in content produced by http://www.inaturalist.org/observations/gbif-observations-dwca.zip .

A quick glance tells me that the iNaturalist dataset URL registered with GBIF experiences content drift (i.e., its content is unstable) but is responsive (i.e., no link rot).
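The same check can be roughed out in a script. The sketch below assumes provenance statements shaped like the hasVersion lines in the grep output above (subject URL, predicate, hash://... object) and treats more than one distinct hash for a URL as content drift. The function names versions_of and has_content_drift are hypothetical and not part of Preston.

```shell
# Hypothetical helper: read provenance statements on stdin and print each
# distinct hash://... version recorded for the URL given as $1,
# in the order in which the versions first appear.
versions_of() {
  awk -v url="<$1>" '
    $1 == url && $2 == "<http://purl.org/pav/hasVersion>" &&
    $3 ~ /^<hash:/ && !seen[$3]++ {
      gsub(/[<>]/, "", $3)   # strip the angle brackets around the hash
      print $3
    }'
}

# Hypothetical helper: succeed if more than one distinct version was seen
# for the URL in $1, i.e. the URL served different content over time.
has_content_drift() {
  count=$(versions_of "$1" | wc -l | tr -d ' ')
  [ "$count" -gt 1 ]
}
```

Piping the output of the preston ls command into versions_of with http://www.inaturalist.org/observations/gbif-observations-dwca.zip as its argument would list the distinct versions observed for that URL. Detecting link rot would additionally require the statements that record failed retrievals, which this sketch does not inspect.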

Thanks again for taking the time to have a look at our preprint and let me know if you have any additional comments or questions.

-jorrit

carrieseltzer · January 24, 2020, 8:28pm

Thanks Jorrit! I’ve shared with my colleagues too. I don’t expect I’ll be able to do the data wrangling to explore much, but will report back if I do.

