10. Transactional mechanisms and provenance

I was pleasantly surprised to see the topic of “Transactional mechanisms and provenance” in this “Digital/Extended Specimen” consultation.

@hardistyar's quoted description above of this transactions-provenance topic neatly matches ideas centered around Preston (https://preston.guoda.bio / https://github.com/bio-guoda/preston), a transactional git-like dataset tracker that has been in use since 2018 and that I’ve referenced and discussed with many of you in meetings, discussions, prototypes, and email exchanges since then. (See https://github.com/bio-guoda/preston for some examples of publications, conference presentations, data publications, and forum/twitter posts.)

So, I’d suggest adding a reference to Preston (https://github.com/bio-guoda/preston) and the related publication (e.g., MJ Elliott, JH Poelen, JAB Fortes (2020). Toward Reliable Biodiversity Dataset References. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2020.101132) to the section “Information Resources” at the start of this discussion topic.

Please bear with me as I attempt to address @hardistyar's excellent discussion prompts point by point:

Transactional publishing (or data tracking using reliable content-based identifiers) allows for systematic tracking (and referencing) of datasets (and the records in them), regardless of where they happen to be located. Existing mechanisms (e.g., web APIs, data portals) usually rely on trusting a single web entity to follow the hard-to-maintain/verify “Cool URIs” principles (Cool URIs for the Semantic Web, https://www.w3.org/TR/cooluris/). For a suite of use cases enabled by systematic dataset content tracking, https://github.com/bio-guoda/preston might be of interest.
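
To make the contrast concrete, here’s a minimal sketch (not Preston’s actual code; the dataset bytes are made up) of how a content-based identifier is derived from the bytes themselves rather than from a location:

```python
import hashlib

def content_id(data: bytes) -> str:
    """Return a location-independent identifier for the given bytes."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()

dataset = b"occurrenceID,scientificName\n123,Apis mellifera\n"

# A location-based reference (a URL) breaks when the host moves or edits the
# file; the content-based reference below stays valid wherever a copy survives.
print(content_id(dataset))
```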

Yes, transactional publishing provides a solid basis for reliable data references, an essential ingredient of reliable and reproducible data integration and attribution. You’ll find that Preston uses standards-based techniques (RDF / Prov-O, sha-2 hashing, triple implosion) to implement transactions by linking provenance logs of biodiversity datasets in a decentralized storage architecture (pretty much applying the git architecture beyond the file system). We’ve been running the Preston infrastructure on existing run-of-the-mill storage solutions (Zenodo, Internet Archive, Software Heritage Library, rsync-ed mirrors, commodity external hard disks) to systematically track and version datasets/records registered in GBIF/iDigBio (and more) since 2018.
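
For a flavor of what such a linked provenance log looks like, here’s a toy sketch using PROV-O terms (activity name and payloads are illustrative; Preston’s actual serialization differs in detail):

```python
import hashlib

def sha256_id(data: bytes) -> str:
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()

previous_log = b"...serialized statements of the previous crawl..."
snapshot = b"...bytes of a dataset retrieved today..."

statements = [
    f"<{sha256_id(snapshot)}> <http://www.w3.org/ns/prov#wasGeneratedBy> <urn:example:crawl-1> .",
    f"<urn:example:crawl-1> <http://www.w3.org/ns/prov#used> <https://example.org/dataset.zip> .",
    # Pointing at the previous log's content hash is what makes the chain
    # tamper-evident, much like parent commits in git.
    f"<urn:example:crawl-1> <http://www.w3.org/ns/prov#wasInformedBy> <{sha256_id(previous_log)}> .",
]
print("\n".join(statements))
```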

In other words, we’ve shown that, by introducing transactional provenance, 404s (aka linkrot) and (un)expected changes (aka content drift) are no longer as much of an issue, without having to significantly change existing biodiversity infrastructures.
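
To illustrate the two failure modes in code, here’s a small sketch (not Preston’s code; the URL and expected hash would come from an earlier provenance log):

```python
import hashlib
import urllib.request

def check(url: str, expected_sha256: str) -> str:
    try:
        with urllib.request.urlopen(url) as response:
            digest = hashlib.sha256(response.read()).hexdigest()
    except OSError:
        return "linkrot: nothing resolvable at this location anymore"
    if digest != expected_sha256:
        return "content drift: the location resolves, but the content changed"
    return "ok: the location still serves the exact tracked content"
```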

I’d say that commonly used unique identifiers like UUIDs/DOIs/PURLs, while aspirationally unique at best, are useful as long as the provenance of the context in which they appear can be reliably referenced and accessed (see the sketch below).

So, I’d say that “unique” identifiers are useful, but not necessary.
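
As a hypothetical sketch of what that looks like in practice (all identifiers and hashes below are made up): anchor each identifier to the content hashes of the archived contexts it was observed in, and its meaning at any point in time stays recoverable.

```python
appearances = {
    # identifier -> content hashes of archived datasets it was observed in
    "doi:10.1234/example": [
        "hash://sha256/aaa111...",  # a dataset version where this DOI appeared
        "hash://sha256/bbb222...",  # a later version; its metadata may differ
    ],
}

def contexts(identifier: str) -> list[str]:
    """Return reliable references to every recorded context of an identifier."""
    return appearances.get(identifier, [])

print(contexts("doi:10.1234/example"))
```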

I think that a well-designed provenance/transaction mechanism can be independent of the kind of domain-specific content being tracked. I found that the Prov-O (PROV-O: The PROV Ontology, https://www.w3.org/TR/prov-o/) entity-activity-agent model provides the basic information elements to implement a content/format-agnostic (or, as the original git author Linus Torvalds put it, “dumb”) git-like provenance-transaction system.
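
Here’s a minimal sketch of those three building blocks, assuming the third-party rdflib package (the URIs are illustrative); note that nothing in the model cares what the tracked bytes contain:

```python
from rdflib import Graph, URIRef
from rdflib.namespace import PROV, RDF

g = Graph()
blob = URIRef("hash://sha256/abc123")                            # Entity: just bytes
crawl = URIRef("urn:uuid:00000000-0000-0000-0000-000000000000")  # Activity: a download
tracker = URIRef("https://example.org/tracker")                  # Agent: who/what did it

g.add((blob, RDF.type, PROV.Entity))
g.add((crawl, RDF.type, PROV.Activity))
g.add((tracker, RDF.type, PROV.Agent))
g.add((blob, PROV.wasGeneratedBy, crawl))
g.add((crawl, PROV.wasAssociatedWith, tracker))

print(g.serialize(format="nt"))
```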

I see mostly conceptual challenges in moving from a location-based data perspective (the Berners-Lee “Cool URI” dilemma) to a content-based perspective (information-centric networking, similar to Ted Nelson’s docuverse as proposed in his visionary Computer Lib/Dream Machines; see https://en.wikipedia.org/wiki/Computer_Lib/Dream_Machines).

Also, I see some socio-economic coordination challenges in deciding how to make sure many copies of relevant digital data are kept around. But similar conservation/curation challenges have been taken up by collection managers, curators, and librarians for centuries, and some of this institutional know-how can be carried over from the physical realm (e.g., books, specimens, artworks) to the digital one (e.g., files, digital datasets, digital images).

Over the last few years, we’ve successfully tracked data across major biodiversity networks without having to ask for any changes to be made. Admittedly, more efficient approaches can be imagined, but from what I can tell, versioned snapshots of existing APIs can go a long way.
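
For illustration, a sketch of such a snapshot of a paginated API (the endpoint and paging scheme are made up; not Preston’s code):

```python
import hashlib
import urllib.request

def snapshot(base_url: str, pages: int) -> list[str]:
    """Fetch API pages and record their content hashes as a reproducible snapshot."""
    hashes = []
    for offset in range(pages):
        with urllib.request.urlopen(f"{base_url}?offset={offset}") as response:
            data = response.read()
        hashes.append("hash://sha256/" + hashlib.sha256(data).hexdigest())
    return hashes

# Re-running the same crawl later and diffing the two hash lists reveals
# exactly which pages changed, without any cooperation from the API provider.
```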

Preston uses content-addressed storage to allow provenance and content to be kept separately, in a decentralized manner, without compromising the integrity of the versioned data corpus. We have examples of Zenodo data publications that include only the provenance (and their translation chains) while storing the (heavier) data objects elsewhere (see, e.g., http://doi.org/10.5281/zenodo.3852671).
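
A sketch of what such decentralized resolution can look like (the repository URL patterns are made up): because the identifier itself says what to expect, any surviving copy can be verified.

```python
import hashlib
import urllib.request

# Independent stores that might hold a copy (URL patterns made up).
STORES = [
    "https://repo-a.example.org/{sha256}",
    "https://repo-b.example.org/data/{sha256}",
]

def resolve(content_id: str) -> bytes:
    """Fetch bytes for a hash-based identifier from whichever store has them."""
    sha256 = content_id.removeprefix("hash://sha256/")
    for template in STORES:
        try:
            with urllib.request.urlopen(template.format(sha256=sha256)) as r:
                data = r.read()
        except OSError:
            continue  # this copy is gone; try the next one
        # Verify integrity against the identifier before trusting any copy.
        if hashlib.sha256(data).hexdigest() == sha256:
            return data
    raise LookupError(f"no verifiable copy of {content_id} found")
```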

Primary publishers/curators can reliably reference trusted datasets in their (cryptographically signed) provenance logs, similar to how traditional publishers bundle received manuscripts into journals. In other words, trusted institutions or publishers can curate high-quality data collections by publishing reliable references (and related provenance) to datasets they deem worthy of their cause and quality standards. They may even choose to keep copies of these referenced datasets around to help keep the data available a little longer.
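
As an illustration of a signed provenance log (a minimal sketch with the third-party cryptography package; not a description of any particular curator’s setup):

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

provenance_log = b"...serialized provenance statements of a curated corpus..."

private_key = Ed25519PrivateKey.generate()
signature = private_key.sign(provenance_log)

# Anyone holding the curator's public key can check authenticity; verify()
# raises InvalidSignature if the log was tampered with.
private_key.public_key().verify(signature, provenance_log)
print("signature verified")
```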

Because of the ability to store data “off-chain”, sensitive data can be reliably referenced in scientific papers or composite datasets without making the sensitive dataset openly available. An additional layer of access control can be introduced by providing a reference to an encrypted copy of the sensitive dataset. But this poses long-term archiving issues (e.g., losing the private keys will make the data inaccessible forever) that may outweigh the benefits of the added access controls.
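
A sketch of that encrypted-copy approach (again with the third-party cryptography package; the data and key handling are illustrative only):

```python
import hashlib
from cryptography.fernet import Fernet

sensitive = b"precise locality of an endangered orchid population"

key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(sensitive)

# This reference can appear in an open paper or composite dataset; the
# plaintext stays inaccessible without the key.
print("hash://sha256/" + hashlib.sha256(ciphertext).hexdigest())

# Caveat from above: lose the key and the referenced content is gone forever,
# which may outweigh the benefit of the added access control.
```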

I hope my comments show that transaction mechanisms and provenance are a pragmatic and proven way to overcome many of the data integration issues we all encounter daily, without having to move our digital heaven and earth.

Thank you for your attention and I hope my contributions to this topic will help advance our methods to make the most out of our digital biodiversity corpora now and in the future.

Curious to hear your thoughts and comments.

-jorrit