10. Transactional mechanisms and provenance

Moderators: Nelson Rios, Bertram Ludaescher, Andy Bentley, and James Macklin

Background

The prevailing mechanism for managing and sharing natural history collections data involves static publishing of point-in-time snapshots. This data publishing paradigm has been in place for nearly two decades and has allowed the collections community to mobilize over 1.3 billion specimen records. While simple to implement, this model lacks a suitable feedback mechanism to facilitate tracking the usage and integration of downstream products (enhancements, annotations, derivatives, publications, etc.) from third parties. Building on previous topics in Consultation 1 on Extending, Enriching and Integrating Data and a parallel discussion in the Structure and Responsibilities of the Digital Extended Specimen subtopic (Structure and responsibilities of a #digextspecimen), the goal of this topic is to evaluate transactional data publishing as a solution and its role in maintaining provenance, whereby data elements are published in near real-time as linked, immutable, append-only transactions.

In a transactional data publishing model, each edit to a data record is represented by a transaction published to a linked list. As in blockchains, each transaction in the list is cryptographically linked to all prior transactions, ensuring data integrity, immutability and preservation of version history. Unlike blockchains, however, transactions do not need to be bundled into blocks, nor are computationally intensive consensus mechanisms necessary. When replayed in sequence, transactions can be used to derive a given view of a data item. Transactions may represent the registration of a new data resource such as an institutional data publisher, the accession of new specimens, modifications and annotations to existing or new fields (georeferencing history, morphological landmarks, identifications, etc.), transfers of material between institutions and individuals, or other specimen-related processes in need of tracking. Combining this model with an open distributed network would empower researchers to publish annotations to specimen records that have yet to be fully recognized by collections. The same network could allow collections to accept or reject third-party annotations while still preserving the derived view published by the third party. Furthermore, once synchronized with the network, biodiversity data aggregators would only need to ingest and index new transactions entering the network. The goal of the consultation is to identify the real-world social and technical challenges of utilizing such a model and whether it may provide a suitable approach to overcoming the limitations of contemporary data publishing systems.
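
To make the model concrete, the following is a minimal sketch of an append-only, hash-linked transaction list and of deriving a record's current view by replaying its transactions. It assumes a single in-memory log, and field names such as `record_id`, `action` and `agent` are illustrative only, not a proposed schema:

```python
import hashlib
import json
import time

def _hash(payload: dict) -> str:
    """Deterministic SHA-256 hash of a JSON-serializable payload."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

class TransactionLog:
    """Append-only list of transactions, each cryptographically linked to its predecessor."""

    def __init__(self):
        self.transactions = []

    def append(self, record_id: str, action: str, fields: dict, agent: str) -> dict:
        prev_hash = self.transactions[-1]["tx_hash"] if self.transactions else None
        tx = {
            "record_id": record_id,   # identifier of the specimen record
            "action": action,         # e.g. "register", "annotate", "georeference"
            "fields": fields,         # the asserted or modified data elements
            "agent": agent,           # who made the assertion
            "timestamp": time.time(),
            "prev_hash": prev_hash,   # link to the prior transaction
        }
        tx["tx_hash"] = _hash(tx)     # the hash covers prev_hash, chaining the log
        self.transactions.append(tx)
        return tx

    def derive_view(self, record_id: str) -> dict:
        """Replay all transactions for a record to produce its current view."""
        view = {}
        for tx in self.transactions:
            if tx["record_id"] == record_id:
                view.update(tx["fields"])
        return view

    def verify(self) -> bool:
        """Recompute every hash to confirm the chain has not been altered."""
        prev = None
        for tx in self.transactions:
            body = {k: v for k, v in tx.items() if k != "tx_hash"}
            if tx["prev_hash"] != prev or _hash(body) != tx["tx_hash"]:
                return False
            prev = tx["tx_hash"]
        return True
```

Corrections never rewrite earlier entries; they are added as new transactions whose effect appears the next time the view is derived, while the full history remains verifiable.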

Questions to promote discussion

  1. What use cases can transactional publishing solve that can’t be solved by existing mechanisms?
  2. Is transactional publishing a solution to the existing problem of data integration and attribution?
  3. Any transactional mechanism is going to rely on unique identifiers for all data objects. How do the issues surrounding Topic 7 relate to transactional publishing? Are identifiers necessary (at least in the way we normally think of them)?
  4. What pieces of information do we need to capture to extend specimen records - annotations, georeferences, publications, GenBank sequences, Nagoya-related information (loans, gifts, accessions)? Who made these (agents)? When were they made (dates)?
  5. What infrastructure components are missing and needed to make this a reality?
  6. What changes to existing infrastructure (CMSs, aggregators, publishers, data flow, etc.) are needed to make this a reality? Can the existing pieces support transactional publishing? Why? Why not?
  7. To accommodate the full breadth of digital assets, data storage would likely need to be separate from the transaction chain. What are the implications of “off-chain” data storage, and what mechanisms need to be in place to accommodate it?
  8. The goal is to have an open network enabling researchers to publish derivative data at will even if rejected by primary data sources. What measures need to be in place to ensure primary publishers maintain the authoritative views for a given record?
  9. How do specifics of Topic 9 (regulatory, legal, ethical topics) affect the open sharing of data and exposure of transactions in this fashion? How do we obscure data that should not be seen (threatened/endangered/CITES species, other sensitive data)?

Information resources

I’d like to pick up on this question, or at least my interpretation of it. What’s highlighted here is a blend of two tensions: (1) information (mostly) endogenous to the specimen record, i.e. lifted from the physical labels & perhaps interpreted in some way, and (2) information exogenous to the specimen record, expressed as linkages to other entities that act as value-added materials. The items in (2) may be made under the assumption of robust stability in the values of particular data elements expressed in (1). How might a transactional mechanism deal with drift in (1) that affects the veracity of (2)?

In other words, there may be chains of data enhancements made to “core” elements over time. When a DES is first released/published, it may be sparse with skeletal information about locality. It may later be enhanced with georeferenced latitudes and longitudes. That new enhancement may trigger the accrual of new linkages to yet other entities held external to the specimen. So far so good. However, as is often the case, such enhancements are refined even more & perhaps completely changed through correction of discovered error. If corrections are made to those initial georeferenced latitudes and longitudes, what then happens to the linkages made to other entities that were reliant on those values, created via other transactions? How does a transactional mechanism model the nestedness of actions? And, does it afford any mechanism to automate the severing of links or the flushing of content should the required state of an item change?

By design it would not be possible to flush or revise prior content; otherwise it would violate the goal of immutability. One issue we may have to address is how a record can be removed without destroying the historical record. There could be cases where an institution publishes a record only to realize they need to remove it for ethical, legal and/or privacy concerns. Encryption might be an option here to keep the record in play.
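
One way encryption could serve that purpose is a “crypto-shredding” style approach; this is my assumption rather than anything settled above. Only a hash of an encrypted payload is anchored in the chain, the key is held off-chain by the institution, and discarding the key effectively redacts the content while leaving the chain verifiable. A rough sketch using the Python cryptography package:

```python
import hashlib
from cryptography.fernet import Fernet  # third-party 'cryptography' package

# Encrypt the sensitive payload before it leaves the institution.
key = Fernet.generate_key()  # held off-chain by the publishing institution
ciphertext = Fernet(key).encrypt(b"sensitive locality details for a threatened species")

# Only the hash of the ciphertext is anchored in the transaction chain;
# the ciphertext itself sits in an off-chain repository.
tx_payload_hash = hashlib.sha256(ciphertext).hexdigest()

# Normal access: anyone holding the key can decrypt and verify against the chain.
plaintext = Fernet(key).decrypt(ciphertext)

# "Removal": the institution destroys the key. The transaction and its hash remain,
# so the chain stays intact and verifiable, but the content is no longer recoverable.
key = None
```
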
Generally, though, the fact that a prior attribute of a record is no longer accepted does not invalidate that it did exist at one time, and I think we would want to maintain that and the related linkages unless there was a real institutional concern as described above. It should be possible, however, to develop tools/services that monitor for new transactions that impact linkages and address those impacts through the addition of more transactions.

So, building on your case, say we have a set of records with a locality description and no higher geography. Some third party (maybe an automated tool) georeferences those records and publishes the results to the network. Then say GBIF chooses to accept those results and determines higher geography from those coordinates via an automated reverse-geocoding script that dumps the results back into the network. So now we have a set of transactions that define the official institutional records, a supplemental set defining new coordinates, and yet another supplemental set used to derive GBIF records. A researcher then relies on GBIF’s derived records to produce a species atlas for a given region, and eventually that end product gets linked to the source records.

Down the road someone discovers a pile of field notes and updates many of those coordinates, so they publish new coordinates to the records in question. Another tool (likely the same GBIF tool that published the reverse geocoding into the chain to begin with) sees the new transactions and publishes another set of reverse-geocoding results. So we still have a bunch of records that are linked to the atlas that shouldn’t have been included in the atlas to begin with. At this point it is really up to the researcher to say, “I need to publish a new atlas with newer data.” Even if they did, the old linkages should still be persisted.
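
To illustrate what such a monitoring service might look like, here is a rough sketch that reuses the toy `TransactionLog` from earlier in this topic; the field names (`decimalLatitude`, `depends_on`) and the agent name are illustrative assumptions, not a proposed schema:

```python
def republish_derived(log, derive_fn, agent="reverse-geocoder"):
    """Scan the log for records whose coordinates changed after our last
    derivation and append a fresh derived transaction for each of them.
    Earlier derived transactions (and anything linked to them, such as an
    atlas) are never removed; they are simply superseded by newer ones."""
    latest_coords = {}   # record_id -> most recent coordinate transaction
    latest_derived = {}  # record_id -> most recent derivation by this agent

    for tx in log.transactions:
        if "decimalLatitude" in tx["fields"]:
            latest_coords[tx["record_id"]] = tx
        if tx["agent"] == agent:
            latest_derived[tx["record_id"]] = tx

    for record_id, coord_tx in latest_coords.items():
        derived = latest_derived.get(record_id)
        if derived is None or derived["timestamp"] < coord_tx["timestamp"]:
            # New or corrected coordinates: derive higher geography again and
            # record which transaction the new assertion depends on.
            fields = derive_fn(coord_tx["fields"])
            fields["depends_on"] = coord_tx["tx_hash"]
            log.append(record_id, "derive_higher_geography", fields, agent)
```

Run periodically (or triggered by new transactions), this would catch the field-notes correction in the example above and publish updated derivations, while the atlas and its original linkages persist unchanged.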

How much of ALL the extensions (= linkages) and core data elements - these are the (1) and (2) distinctions in my post above - get copied over alongside each new transaction such that the context is preserved? If the answer is in fact ALL, then ought we to worry about the environmental impact such an architecture creates, for precisely the same reason that Bitcoin is so environmentally costly?

The model for transactional publishing is fundamentally quite different from cryptocurrencies. A better analogy would probably be git rather than Bitcoin. The problem with git, though, is that its data aren’t truly immutable, since histories can easily be rewritten.
No data elements are copied in a transaction. Data would be stored “off-chain” in a mix of community and institutional repositories, with some replication included for added redundancy and verification. The chain itself would largely consist of tracking metadata, the computed hashes and the linkages among them. Over time this should significantly reduce storage and computational costs relative to the IPT.
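
As a concrete illustration of that split (a minimal sketch; the `off_chain_store` dict simply stands in for whatever community or institutional repository would actually hold the objects):

```python
import hashlib

# Hypothetical off-chain repository: in practice a community or institutional
# store (object storage, a media server, etc.); a dict stands in for it here.
off_chain_store = {}

def publish_payload(payload: bytes) -> str:
    """Store the full digital object off-chain and return its content hash,
    which is all the transaction chain itself needs to carry."""
    digest = hashlib.sha256(payload).hexdigest()
    off_chain_store[digest] = payload
    return digest

def verify_payload(digest: str) -> bool:
    """Fetch the off-chain object and confirm it still matches the hash
    recorded in the chain, i.e. it has not been altered or substituted."""
    payload = off_chain_store.get(digest)
    return payload is not None and hashlib.sha256(payload).hexdigest() == digest

# A transaction then records only tracking metadata plus the payload hash:
tx_fields = {"payload_sha256": publish_payload(b"<full specimen record or media object>")}
```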

In some cases the transactional record could become as large as, or larger than, the data element/digital object itself. I think the idea of transactions ensuring data integrity, immutability and preservation of version history is fascinating, but I am worried about the extra overhead: every workflow and execution needs to make a call to the transaction network. How do you envision this? It would also add another storage layer to maintain.

Above you mentioned the idea of a network where the transactions would be published. How do you envision this network? Would it consist of distributed nodes maintained by a consortium? Some sort of quorum mechanism? What protocol could be used to sync the nodes? Maybe a global index/catalog service is needed to manage all the transactions?

We still need to store versioned datasets and digital objects (either copies of the entire object or a delta), correct? How will that reduce storage costs?

I worry about the overhead too. Another model that could work would be an annotation framework supported by a graph. The links we make between objects/entities within a DES concept are assertions. These assertions can be documented by evidence that provides context for why a DNA sequence, for example, is a derivative of the digital specimen in question (and, by twinning, of the physical specimen). All assertions related to a DS could be queried in the graph and analyzed/mined in ways that promote discovery and broader relationships to other objects. Yes, the graph could get very large, but technology has improved in this respect to allow for significant scaling and performance. The graph could also get “polluted” by many assertions related to the same objects and their relationships (the equivalent of versioning…?), but this would happen in any implementation of transaction services. The W3C Web Annotation standard gives a good framework to work with, along with its extension to data, which a group of us led by the late, great Bob Morris helped to define as part of the FilteredPush project (see https://doi.org/10.1371/journal.pone.0076093). Someone with more graph/semantic experience needs to weigh in here… Thoughts?
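
For illustration, an assertion like the DNA-sequence example above might be expressed along the lines of the W3C Web Annotation data model; the sketch below renders one as a Python dict, and every identifier and the wording of the evidence are placeholders rather than an agreed vocabulary:

```python
# A single assertion ("this sequence is derived from this digital specimen")
# expressed in the style of the W3C Web Annotation data model. Every identifier
# below is a placeholder; the evidence text is illustrative only.
assertion = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "linking",
    "creator": "https://orcid.org/0000-0000-0000-0000",            # placeholder agent
    "created": "2021-09-01T12:00:00Z",
    "body": [
        "https://www.ncbi.nlm.nih.gov/nuccore/XX000000",            # placeholder sequence record
        {"type": "TextualBody",
         "value": "Sequence derived from a tissue subsample of the specimen"},
    ],
    "target": "https://example.org/digital-specimen/ABC123",        # placeholder DES identifier
}
```

Each such assertion becomes a node/edge in the graph, and the evidence travels with the link, so the relationships around a digital specimen can be queried and mined together.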