10. Transactional mechanisms and provenance

Moderators: Nelson Rios, Bertram Ludaescher, Andy Bentley, and James Macklin

Background

The prevailing mechanism for managing and sharing natural history collections data involves static publishing of point-in-time snapshots. This data publishing paradigm has been in place for nearly two decades and has allowed the collections community to mobilize over 1.3 billion specimen records. While simple to implement, this model lacks a suitable feedback mechanism to facilitate tracking the usage and integration of downstream products (enhancements, annotations, derivatives, publications, etc.) from third parties. Building on previous topics in Consultation 1 on Extending, Enriching and Integrating Data (Extending, enriching and integrating data) and a parallel discussion in the Structure and Responsibilities of the Digital Extended Specimen subtopic (Structure and responsibilities of a #digextspecimen), the goal of this topic is to evaluate transactional data publishing as a solution and its role in maintaining provenance, whereby data elements are published in near real time in the form of linked, immutable, append-only transactions.

In a transactional data publishing model, each edit to a data record is represented by a transaction published to a linked list. Similar to blockchains, transactions in the list are cryptographically related to all prior transactions, ensuring data integrity, immutability, and preservation of version history. Yet, unlike blockchains, transactions do not need to be bundled into blocks, nor are computationally intensive models of consensus necessary. When summed, transactions can be used to derive a given view of a data item. Transactions may represent the registration of a new data resource such as an institutional data publisher, accession of new specimens, modifications and annotations to existing or new fields (georeferencing history, morphological landmarks, identifications, etc.), transfers of material between institutions and individuals, or other specimen-related processes in need of tracking. Combining this model with an open distributed network would empower researchers to publish annotations to specimen records that have yet to be fully recognized by collections. The same network could allow collections to accept or reject third-party annotations while still preserving the derived view published by the third party. Furthermore, once synchronized with the network, biodiversity data aggregators would only need to ingest and index new transactions entering the network. The goal of the consultation is to identify real-world social and technical challenges of utilizing such a model and whether it may provide a suitable approach to overcome the limitations of contemporary data publishing systems.
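
To make the model sketched above a little more concrete, here is a minimal, hypothetical illustration (in Python) of an append-only, hash-linked transaction log in which each transaction references the hash of its predecessor, and a record view is derived by summing the transactions that mention it. The field names, action labels, and hashing scheme are illustrative assumptions only, not a proposed specification.

```python
import hashlib
import json
import time

def tx_hash(payload: dict) -> str:
    """Deterministic SHA-256 over a canonical JSON serialization."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

class TransactionLog:
    """Append-only list of transactions; earlier entries are never rewritten."""

    def __init__(self):
        self.transactions = []

    def append(self, subject_id: str, action: str, data: dict) -> dict:
        prev = self.transactions[-1]["hash"] if self.transactions else None
        body = {
            "subject": subject_id,   # identifier of the specimen/record
            "action": action,        # e.g. "register", "annotate", "transfer"
            "data": data,            # the asserted key/value content
            "timestamp": time.time(),
            "prev": prev,            # cryptographic link to the prior transaction
        }
        tx = dict(body, hash=tx_hash(body))
        self.transactions.append(tx)
        return tx

    def current_view(self, subject_id: str) -> dict:
        """Derive the current view of a record by summing its transactions."""
        view = {}
        for tx in self.transactions:
            if tx["subject"] == subject_id:
                view.update(tx["data"])
        return view

# Register a specimen record, then layer a third-party georeference on top.
log = TransactionLog()
log.append("spec:123", "register", {"locality": "5 mi N of Gainesville"})
log.append("spec:123", "annotate", {"decimalLatitude": 29.72, "decimalLongitude": -82.32})
print(log.current_view("spec:123"))
```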

Questions to promote discussion

  1. What use cases can transactional publishing solve that can’t be solved by existing mechanisms?
  2. Is transactional publishing a solution to the existing problem of data integration and attribution?
  3. Any transactional mechanism is going to rely on unique identifiers for all data objects. How do the issues surrounding Topic 7 relate to transactional publishing? Are identifiers necessary (at least in the way we normally think of them)?
  4. What pieces of information do we need to capture to extend specimen records (annotations, georeferences, publications, GenBank sequences, Nagoya-related information such as loans, gifts, and accessions)? Who made these (agents)? When were they made (dates)?
  5. What infrastructure components are missing and needed to make this a reality?
  6. What changes to existing infrastructure (CMSs, aggregators, publishers, data flow, etc.) are needed to make this a reality? Can the existing pieces support transactional publishing? Why? Why not?
  7. To accommodate the full breadth of digital assets, data storage would likely need to be separate from the transaction chain. What are the implications of “off-chain” data storage, and what mechanisms need to be in place to accommodate it?
  8. The goal is to have an open network enabling researchers to publish derivative data at will even if rejected by primary data sources. What measures need to be in place to ensure primary publishers maintain the authoritative views for a given record?
  9. How do specifics of Topic 8 (regulatory, legal, ethical topics) affect the open sharing of data and exposure of transactions in this fashion? How do we obscure data that should not be seen (threatened/endangered/CITES species, other sensitive data)?

Information resources

I’d like to pick up on this question, or at least my interpretation of it. What’s highlighted here is a blend of two tensions: (1) information (mostly) endogenous to the specimen record, i.e. lifted from the physical labels and perhaps interpreted in some way, and (2) information exogenous to the specimen record, expressed as linkages to other entities that act as value-added materials. The items in (2) may be made under the assumption of robust stability in the values of particular data elements expressed in (1). How might a transactional mechanism deal with drift in (1) that affects the veracity of (2)?

In other words, there may be chains of data enhancements made to “core” elements over time. When a DES is first released/published, it may be sparse, with only skeletal information about locality. It may later be enhanced with georeferenced latitudes and longitudes. That new enhancement may trigger the accrual of new linkages to yet other entities held external to the specimen. So far so good. However, as is often the case, such enhancements are refined even more and perhaps completely changed through correction of discovered error. If corrections are made to those initial georeferenced latitudes and longitudes, what then happens to the linkages made to other entities that were reliant on those values, created via other transactions? How does a transactional mechanism model the nestedness of actions? And does it afford any mechanism to automate the severing of links or the flushing of content should the required state of an item change?

By design it would not be possible to flush or revise prior content; otherwise it would violate the goal of immutability. One issue that we may have to address is how a record can be removed without destroying the historical record. There could be cases where an institution publishes a record only to realize they need to remove it for ethical, legal and/or privacy concerns. Encryption might be an option here to keep the record in play.
Generally, though, the fact that a prior attribute of a record is no longer accepted does not invalidate that it did exist at one time, and I think we would want to maintain that and the related linkages unless there was a real institutional concern as described above. It should be possible to develop tools/services that monitor for new transactions that impact linkages and address those impacts through the addition of more transactions.

So, building on your case, say we have a set of records with a locality description and no higher geography. Some third party (maybe an automated tool) georeferences those records and publishes the results to the network. Then say GBIF chooses to accept those results and determines higher geography from those coordinates via an automated reverse geocoding script that dumps the results back into the network. So now we have a set of transactions that define the official institutional records, a supplemental set defining new coordinates, and yet another supplemental set used to derive GBIF records. A researcher then relies on GBIF’s derived records to produce a species atlas for a given region, and eventually that end product gets linked to the source records.

Down the road someone discovers a pile of field notes and updates many of those coordinates, so they publish new coordinates to the records in question. Another tool (likely the same GBIF tool that published the reverse geocoding into the chain to begin with) sees the new transactions and publishes another set of reverse geocoding results. So we still have a bunch of records that are linked to the atlas that shouldn’t have been included in the atlas to begin with. At this point it is really up to the researcher to say “I need to publish a new atlas with newer data”. Even if they did, the old linkages should still be persisted.

How much of ALL the extensions (= linkages) and core data elements - these are the (1) and (2) distinctions in my post above - gets copied over alongside each new transaction such that the context is preserved? If the answer is in fact ALL, then ought we to worry about the environmental impact such an architecture creates, for precisely the same reason that Bitcoin is very bad?

The model for transactional publishing is fundamentally quite different from cryptocurrencies. A better analogy would probably be git rather than Bitcoin. The problem with git is that data aren’t immutable, as you can easily rewrite histories.
No data elements are copied in a transaction. Data would be stored “off-chain” in a mix of community and institutional repositories, with some replication included for added redundancy and verification. The chain itself would largely consist of some tracking metadata, the computed hashes, and the linkages among them. Over time this should significantly reduce storage and computational costs compared with the current IPT model.
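
As a rough sketch of what “off-chain” storage could look like (the structure and field names below are assumptions for illustration, not anyone’s actual format): the chain entry carries only a content hash plus a pointer to where the bytes currently live, so any repository can hold the content and any consumer can verify what it fetched against the hash.

```python
import hashlib
from typing import Optional

off_chain_store = {}  # stand-in for a community or institutional repository

def store_off_chain(content: bytes) -> str:
    """Content-addressed storage: the key is the SHA-256 of the bytes."""
    digest = hashlib.sha256(content).hexdigest()
    off_chain_store[digest] = content
    return digest

def chain_entry(subject_id: str, digest: str, location: str,
                prev_hash: Optional[str]) -> dict:
    """Lightweight on-chain record: tracking metadata and hashes only."""
    return {
        "subject": subject_id,
        "content_sha256": digest,   # enough to verify integrity on retrieval
        "location": location,       # where a copy can currently be fetched
        "prev": prev_hash,          # link into the transaction chain
    }

image_bytes = b"...imagine a large CT scan here..."
digest = store_off_chain(image_bytes)
entry = chain_entry("spec:123", digest,
                    "https://repository.example.org/objects/" + digest, None)

# On retrieval, recompute the hash and compare against the chain entry.
fetched = off_chain_store[entry["content_sha256"]]
assert hashlib.sha256(fetched).hexdigest() == entry["content_sha256"]
```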

In some cases the transactional record could get as large as, or larger than, the data element/digital object. I think the idea of transactions ensuring data integrity, immutability, and preservation of version history is fascinating, but I am worried about the extra overhead. Every workflow and execution needs to make a call to the transaction network. How do you envision this? And this will add another storage layer to maintain.

Above you mentioned the idea of a network where the transactions will be published. How do you envision this network? Would the network consist of distributed nodes maintained by a consortium? Some sort of quorum mechanism? What protocol could be used to sync the nodes? Maybe a global index/catalog service is needed to manage all the transactions?

We still need to store versioned datasets and digital objects (either copies of the entire object or a delta), correct? How will that reduce storage costs?

I worry about the overhead too. Another model that could work would be an annotation framework supported by a graph. The links we make between objects/entities within a DES concept are assertions. These assertions can be documented by evidence that provides context for why a DNA sequence, for example, is a derivative of the digital specimen in question (and, by twinning, of the physical specimen). All assertions related to a DS could be queried in the graph and analyzed/mined in ways that promote discovery and broader relationships to other objects. Yes, the graph could get very large, but technology has improved in this respect to allow for significant scaling and performance. The graph could also get “polluted” by many assertions related to the same objects and their relationships (the equivalent of versioning?), but this would happen in any implementation of transaction services. The W3C Annotation standard gives a good framework to work with, along with its extension to data that a group of us, led by the late, great Bob Morris, helped to define as part of the FilteredPush project (see https://doi.org/10.1371/journal.pone.0076093). Someone with more graph/semantic experience needs to weigh in here… Thoughts?
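
For what it’s worth, here is a minimal sketch of how one such assertion (a DNA sequence as a derivative of a digital specimen) might look using the W3C Web Annotation model, written out as a Python dictionary. The identifiers (ORCID, specimen and sequence URLs) are placeholders; only the vocabulary terms (@context, type, motivation, body, target, creator) come from the standard.

```python
import json

# Hypothetical identifiers; only the Web Annotation vocabulary is real.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "linking",
    "creator": "https://orcid.org/0000-0000-0000-0000",    # placeholder agent
    "created": "2021-09-01T12:00:00Z",
    "target": "https://example.org/digital-specimen/123",   # the digital specimen
    "body": {
        "type": "SpecificResource",
        "source": "https://example.org/sequences/ABC123",   # the derived DNA sequence
        "purpose": "linking",
    },
}

# Each such annotation is one edge (assertion) in the graph; evidence could be
# attached as additional bodies or via a separate annotation targeting this one.
print(json.dumps(annotation, indent=2))
```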

There would be additional overhead to maintaining transactions (there is no free lunch, and we are wanting to add significant capability). I don’t think there is any way around that, but I don’t see why it needs to be the limiting factor if we design things properly. Using the current IPT model, data publishers have to store full copies of their holdings for each version they publish, despite very small change sets. This problem could be eliminated using a transactional model where each transaction represents a pointer to the delta transform for a particular key/value pair (or set of pairs), which are then stored in an append-only distributed repository.

Data consumers (aggregators and power users) would use specialized tools (to be developed) to synchronize the chain and index either all or portions of the data contained within repositories. An initial sync might just call for skeletal metadata, which could then be used to identify which assets to retrieve in full. Once a consumer has synchronized the chain, they should only need to request the tail to bring their records up to date (lots of implementation details here that still need to be figured out). I don’t think the system should handle queries directly, as that would add significant overhead. Querying should probably continue to be handled by specialized indexes like GBIF. GBIF would likely retrieve everything for its indexing with complete version history, while specific annotation tools might synchronize the chain but use GBIF to query data, and only use the chain to verify data integrity, confirm their annotations have been committed, and monitor for new annotations from other tools.
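
A toy illustration of the delta and tail-synchronization idea described above (all field names and the integer sequence numbers are simplifying assumptions; a real chain would use hashes as in the earlier sketches):

```python
# Each transaction carries only the key/value changes (the "delta") it introduces.
chain = [
    {"seq": 1, "subject": "spec:123", "delta": {"locality": "5 mi N of Gainesville"}},
    {"seq": 2, "subject": "spec:123", "delta": {"decimalLatitude": 29.72,
                                                "decimalLongitude": -82.32}},
    {"seq": 3, "subject": "spec:456", "delta": {"scientificName": "Sciurus niger"}},
]

def tail(chain, last_seen: int):
    """Return only the transactions a consumer has not yet ingested."""
    return [tx for tx in chain if tx["seq"] > last_seen]

def apply(index: dict, transactions) -> dict:
    """Fold deltas into a consumer-side index (e.g. an aggregator's view)."""
    for tx in transactions:
        index.setdefault(tx["subject"], {}).update(tx["delta"])
    return index

# An aggregator that last synchronized at seq 1 only needs transactions 2 and 3.
index = {"spec:123": {"locality": "5 mi N of Gainesville"}}
apply(index, tail(chain, last_seen=1))
print(index)
```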

Regarding node architecture - lots of options here, but I would vote for a distributed node architecture as you suggest.

James - the W3C Annotation standard could have a role to play here, and we should evaluate it more deeply in these discussions. Would we adopt W3C annotations as the primary communication protocol, or would we just enable support for W3C annotations among other protocols? How tightly coupled is the standard to the web? Does the standard support digital signatures and some form of data integrity out of the box, or do we need to layer that in?

@nelson I guess a “delta” is the difference between two states?
This is what is actually done today in population genetics to deal with the Ancestral Recombination Graph (ARG). Those inferences/reconstructions are tree-based (-> graphs) and, as far as I remember, only store differences, referencing only the particular branches/parts of the tree that are involved. This allows massive reductions in storage requirements and computational costs (time).

The user story that you are unfolding, with discrete annotations that get combined with other changes, changed back, dropped, etc. all in a bedazzling multitude of combinations, reminds me of the accumulating effects of mutation and recombination over time.

Each little or big annotation, done manually by a human or automatically by a process/bot, can be imagined to correspond to a recombination event in a genome. The nucleus is the physical specimen; both hold everything together. The chromosomes are more or less distinct subsets of (meta)data (classes), e.g. georeferencing, genomics, morphology, etc. Most annotations will be within those subsets.

The analogy with recombination and the ARG might also help to reduce the complexity of the network evolving from the time-series of annotations. If all “searches”/(required) updates go through/are anchored by the digital representation of the physical specimen as a small-world hub, we are dealing more with a tree-graph than a highly interconnected network.

For example, a user reads something mentioned in an online copy of a publication and wants to have a look at the reconstructed genome or a histological image. The search process would not link directly from the publication to the genome reconstruction. It would first follow links within the literature part of the information tree down to the digital hub representing the physical specimen, and then back out along the genomics or anatomy/morphology part of the tree to the specific branch with the versions of the genome or image.

I was pleasantly surprised to see the topic of “Transactional mechanisms and provenance” in this “Digital/Extended Specimen” consultation.

@hardistyar’s quoted description above of this transactions-provenance topic neatly matches ideas centered around Preston (https://preston.guoda.bio, https://github.com/bio-guoda/preston), a transactional, git-like biodiversity dataset tracker that has been in use since 2018 and that I’ve referenced and discussed with many of you in various meetings, discussions, prototypes, and email exchanges (see the GitHub repository for examples of publications, conference presentations, data publications, and forum/Twitter posts).

So, I’d suggest adding a reference to Preston (https://github.com/bio-guoda/preston) and the related publication (e.g., MJ Elliott, JH Poelen, JAB Fortes (2020). Toward Reliable Biodiversity Dataset References. Ecological Informatics.) in the “Information resources” section at the start of this discussion topic.

Please bear with me as I attempt to address @hardistyar’s excellent discussion prompts point by point:

Transactional publishing (or data tracking using reliable content-based identifiers) allows for systematic tracking (and referencing) of datasets (and records in them), regardless of where they happen to be located. Existing mechanisms (e.g., web APIs, data portals) usually rely on trusting a single web entity to follow hard-to-maintain and hard-to-verify “Cool URIs” principles (Cool URIs for the Semantic Web). For a suite of use cases enabled by systematic dataset content tracking, https://github.com/bio-guoda/preston might be of interest.

Yes, transactional publishing provides a solid basis for reliable data references, an essential ingredient of reliable and reproducible data integration and attribution. You’ll find that Preston uses standards-based techniques (RDF/PROV-O, SHA-2 hashing, triple implosion) to implement transactions by linking provenance logs of biodiversity datasets in a decentralized storage architecture (pretty much following the git architecture beyond the file system). We’ve been running the Preston infrastructure using existing run-of-the-mill storage solutions (Zenodo, Internet Archive, Software Heritage Library, rsync-ed mirrors, commodity external hard disks) to systematically track and version datasets/records registered in GBIF/iDigBio (and more) since 2018.

In other words, we’ve shown that, by introducing transactional provenance, 404s (aka link rot) and (un)expected changes (aka content drift) are no longer as much of an issue, without having to significantly change existing biodiversity infrastructures.

I’d say that commonly used unique identifiers like UUIDs/DOIs/PURLs, while being aspirationally unique at best, are useful as long as the provenance of the context in which they appear can be reliably referenced and accessed.

So, I’d say that “unique” identifiers are useful, but not necessary.

I think that a well-designed provenance/transaction mechanism can be independent of the kind of domain-specific content that is being tracked. I found that the PROV-O (PROV-O: The PROV Ontology) entity-activity-agent model provides the basic information elements to implement a content/format-agnostic (or, as the original git author Linus Torvalds put it, “dumb”) git-like provenance-transaction system.
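
To illustrate how little domain knowledge the provenance layer needs, here is a sketch of the PROV-O entity-activity-agent pattern applied to a single georeferencing event, written as plain subject-predicate-object triples. The ex: identifiers are invented placeholders; the prov: terms are from the PROV-O vocabulary.

```python
PROV = "http://www.w3.org/ns/prov#"

triples = [
    # The georeferenced record (an entity) was generated by a georeferencing run (an activity)...
    ("ex:record-123-v2", PROV + "wasGeneratedBy", "ex:georeferencing-run-42"),
    # ...which used the previous version of the record...
    ("ex:georeferencing-run-42", PROV + "used", "ex:record-123-v1"),
    # ...and was carried out by some agent (a person or a bot).
    ("ex:georeferencing-run-42", PROV + "wasAssociatedWith", "ex:agent-jane"),
    # Derivation links the two entity versions directly.
    ("ex:record-123-v2", PROV + "wasDerivedFrom", "ex:record-123-v1"),
]

# Nothing above says anything about specimens, sequences, or images: the same
# pattern applies regardless of the content being tracked.
for s, p, o in triples:
    print(s, p, o)
```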

I see mostly conceptual challenges in moving from a location-based data perspective (the Berners-Lee “Cool URI” dilemma) to a content-based perspective (information-centric networking similar to Ted Nelson’s DocuVerse, as proposed in his visionary Computer Lib/Dream Machines).

Also, I see some socio-economic coordination challenges in deciding how to make sure to keep many copies of relevant digital data around. But similar conservation/curation challenges have been taken up by collection managers, curators, and librarians for centuries, and some of this institutional know-how can be carried over from the physical (e.g., books, specimens, artworks) to the digital realm (e.g., files, digital datasets, digital images).

Over the last few years, we’ve successfully tracked data across major biodiversity networks without having to ask for any changes to be made. Admittedly, more efficient approaches can be imagined, but from what I can tell, versioned snapshots of existing APIs can go a long way.

Preston uses content-addressed storage to allow provenance and content to be kept separately, in a decentralized manner, without compromising the integrity of the versioned data corpus. We have examples of Zenodo data publications that include only the provenance (and its translation chains) while storing the (heavier) data objects elsewhere (see e.g., http://doi.org/10.5281/zenodo.3852671).
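
A small sketch of why keeping provenance and content apart works with content addressing (the in-memory “repositories” below stand in for a Zenodo deposit, an institutional mirror, or a local disk; this is an illustration, not Preston’s actual code): because the identifier is the hash of the bytes, a copy obtained from any location can be verified on arrival.

```python
import hashlib

content = b"an example digital object referenced from a provenance log"
digest = hashlib.sha256(content).hexdigest()

# Several independent stores; only the second one actually holds this object.
repositories = [
    {},                  # e.g. a deposit that only carries the provenance log
    {digest: content},   # e.g. an institutional mirror holding the bytes
]

def resolve(sha256_hex: str) -> bytes:
    """Return the first verified copy found across the known repositories."""
    for repo in repositories:
        candidate = repo.get(sha256_hex)
        if candidate is not None and hashlib.sha256(candidate).hexdigest() == sha256_hex:
            return candidate
    raise LookupError("no repository holds verified content for " + sha256_hex)

assert resolve(digest) == content
```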

Primary publishers/curators can reliably reference trusted datasets in their (cryptographically signed) provenance logs, similar to how traditional publishers bundle received manuscripts into journals. In other words, trusted institutions or publishers can curate high-quality data collections by publishing reliable references (and related provenance) to datasets they deem worthy of their cause and quality standards. They may even choose to keep copies of these referenced datasets around to help keep the data available a little longer.

Because of the ability to store data “off-chain”, sensitive data can be reliably referenced in scientific papers or composite datasets without making the sensitive dataset openly available. An additional layer of access control can be introduced by providing a reference to an encrypted copy of the sensitive dataset. But this poses long-term archiving issues (e.g., losing private keys will make the data inaccessible forever) that may outweigh the benefits of the added access controls.

I hope my comments show that transaction mechanisms and provenance are a pragmatic and proven way to overcome many of the data integration issues we all encounter daily, without having to move our digital heaven and earth.

Thank you for your attention and I hope my contributions to this topic will help advance our methods to make the most out of our digital biodiversity corpora now and in the future.

Curious to hear your thoughts and comments.

-jorrit

I’ve been pondering “authority” a lot lately. Why do we think “primary publishers” should be the “authority”? Is that a requirement? I am beginning to wonder if we shouldn’t be far more open to others’ edits and additions WITHOUT the need for review and acceptance. It has been my experience that collections staff don’t have the time to reject or accept derivative data. I realize this is treading on sacred ground, but we need to start thinking outside the box if we are going to keep up with the fire hose of data aimed at us.

I think we first need to think about who “we” are. I don’t understand half of what is being discussed here and I am certain that there are a lot of people responsible for both the physical and digital objects we are discussing who are in the same boat. How are we supposed to sell this to admins (and when I say sell, I actually mean “get them to pay for it”)? All of this sounds very expensive.

Love the analogy to the ARG. Yes, the delta would represent a change in a given value for any attribute and would be linked to the relevant entity. In most cases the change is a replacement: value X becoming value Y, rather than a mathematical diff between X and Y.

Primary publishers do have a vested interest in maintaining the specimens and the data derived from them. Maybe we shouldn’t think of primary data publishers as “Authorities” but rather “Data Custodians”.

I see “transactions” from a slightly different perspective, though the perspective doesn’t change the technical data issues being discussed here. Viewed through the lens of the physical objects in collections, “transactions” are also events in the life of an object: edits to metadata, separation into component parts (e.g., endo- and exoparasites), processing of subsamples (cryopreservation of tissue, DNA extraction), data extraction (digital imaging, CT scans, sequencing). If we really re-think collecting events in the view of extended specimens that Joe Cook and I described in our PLoS Biology paper, then the separations of objects collected in one collecting event (plant, soil, herbivorous insect on the plant, parasite on the insect) are events in the life of the objects from that collecting event. These component objects are likely to move to different collections in different institutions, but the transactions that affect them should be entries in the same chain. I visualize it as a diary in which multiple authors can write. As long as they’re discoverable, can be indexed, and their relationships can be recovered, do they need to co-reside anywhere?

There might, however, be a need to categorize the kinds of relationship of a diary entry to the original object, or the original collecting event: physical subsample, processed subsample, analytical result, collected with, annotation, publication based on, etc.
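
A hypothetical sketch of what such categorized “diary entries” might look like as a data structure (the relationship vocabulary comes from the list above; everything else, including the field names, is invented for illustration):

```python
from dataclasses import dataclass
from typing import Optional

RELATIONSHIP_TYPES = {
    "physical_subsample", "processed_subsample", "analytical_result",
    "collected_with", "annotation", "publication_based_on",
}

@dataclass
class DiaryEntry:
    subject: str              # the object (or collecting event) this entry is about
    relationship: str         # one of RELATIONSHIP_TYPES
    related_object: str       # identifier of the derived or related object
    agent: str                # who performed or asserted the event
    date: str
    note: Optional[str] = None

    def __post_init__(self):
        if self.relationship not in RELATIONSHIP_TYPES:
            raise ValueError("unknown relationship type: " + self.relationship)

# The insect collected with the plant yields a DNA extract held elsewhere; both
# entries can sit in the same chain even though the objects live in different places.
entries = [
    DiaryEntry("event:plot7-2021", "collected_with", "spec:insect-88",
               "field team", "2021-06-02"),
    DiaryEntry("spec:insect-88", "processed_subsample", "tissue:insect-88-dna",
               "genetics lab", "2021-08-15", note="DNA extraction"),
]
```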

This is an interesting point and starts to get a little closer to a Wikipedia-style open model. One could imagine a “community” view of a record where anyone could make changes, with the ability to request moderation should an “edit war” surface. Edits could either be applied immediately or be community vetted, and services could be developed to provide the feed of suggested changes back to the source if desired.

This is one of the models we’ve pondered for enabling annotations in GBIF, whereby we’d add a fourth view of the record in addition to the current raw (whatever came from the source), verbatim (normalized view of the source data), and interpreted (view after automated cleanup) views. Enabling it as an opt-in service could be one way to manage concerns.
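
For concreteness, a sketch of what the four views of a single record might look like side by side (all values and the raw field names are invented; the proposed “community” view would sit alongside the existing ones rather than overwrite them):

```python
record_views = {
    # Whatever came from the source, byte for byte.
    "raw": {"lat": "29,72", "lng": "-82.32", "ctry": "US"},
    # Source data normalized to standard terms, values untouched.
    "verbatim": {"decimalLatitude": "29,72", "decimalLongitude": "-82.32",
                 "countryCode": "US"},
    # View after automated cleanup and type coercion.
    "interpreted": {"decimalLatitude": 29.72, "decimalLongitude": -82.32,
                    "countryCode": "US"},
    # Proposed opt-in view carrying community-supplied edits/annotations.
    "community": {"decimalLatitude": 29.7174, "decimalLongitude": -82.3249},
}

print(record_views["interpreted"])
```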

Have you any thoughts on how to get a sense of the desire for this kind of approach? You’ve mentioned keeping up with data changes is one concern, but I suspect technical capacity to adapt to changes (e.g. integrate new APIs) is another real constraint groups struggle with.

I think our treatment should go beyond the binary taxonomy of “original source” versus “everybody else”. “Users” are the people who have examined, published, subsampled, analyzed, scanned, sequenced, and generally added value to the original data. Perhaps the data and subsamples and digital representations of the original object might be termed “amendments” as opposed to “edits”, but I think of them all as entries in the diary of that object (or collecting event), many of them coming from people other than the original source (usually the collector or collection manager). That doesn’t mean the edits/amendments that come from many sources are exempt from questions, challenges, and corrections by others, but isn’t it worth distinguishing the people who have examined the object?
