Moderators: Nelson Rios, Bertram Ludäscher, Andy Bentley, and James Macklin
Background
The prevailing mechanism for managing and sharing natural history collections data involves static publishing of point-in-time snapshots. This data publishing paradigm has been in place for nearly two decades and has allowed the collections community to mobilize over 1.3 billion specimen records. While simple to implement, this model lacks a suitable feedback mechanism for tracking the usage and integration of downstream products (enhancements, annotations, derivatives, publications, etc.) from third parties. Building on previous topics in Consultation 1 (Extending, Enriching and Integrating Data) and a parallel discussion in the Structure and Responsibilities of the Digital Extended Specimen subtopic, the goal of this topic is to evaluate transactional data publishing, whereby data elements are published in near real time as linked, immutable, append-only transactions, as a solution and to assess its role in maintaining provenance.
In a transactional data publishing model, each edit to a data record is represented by a transaction published to a linked list. As in blockchains, each transaction in the list is cryptographically linked to all prior transactions, ensuring data integrity, immutability, and preservation of version history. Unlike blockchains, however, transactions need not be bundled into blocks, and no computationally intensive consensus model is necessary. Replayed in order, transactions can be used to derive a given view of a data item. Transactions may represent the registration of a new data resource such as an institutional data publisher, the accession of new specimens, modifications and annotations to existing or new fields (georeferencing history, morphological landmarks, identifications, etc.), transfers of material between institutions and individuals, or other specimen-related processes in need of tracking. Combining this model with an open distributed network would empower researchers to publish annotations to specimen records even before those annotations have been formally recognized by the holding collections. The same network could allow collections to accept or reject third-party annotations while still preserving the derived view published by the third party. Furthermore, once synchronized with the network, biodiversity data aggregators would only need to ingest and index new transactions entering the network. The goal of the consultation is to identify the real-world social and technical challenges of such a model and to determine whether it provides a suitable approach to overcoming the limitations of contemporary data publishing systems.
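To make the model concrete, here is a minimal sketch in Python of a hash-linked, append-only transaction log with view derivation by replay. All names (Transaction, TransactionLog, derive_view, the example identifiers) are hypothetical illustrations rather than a proposed standard, and the simple field-merge view derivation is deliberately simplistic.

```python
# A minimal sketch of a hash-linked, append-only transaction log.
# All names are hypothetical; payloads are assumed JSON-serializable.
import hashlib
import json
from dataclasses import dataclass


@dataclass
class Transaction:
    subject: str    # identifier of the data item being modified
    changes: dict   # field-level assertions, e.g. {"country": "Peru"}
    agent: str      # who published the transaction
    prev_hash: str  # digest of the preceding transaction in the log

    def digest(self) -> str:
        # Hashing the payload together with prev_hash chains this
        # transaction to its entire history.
        payload = json.dumps(
            {"subject": self.subject, "changes": self.changes,
             "agent": self.agent, "prev": self.prev_hash},
            sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()


class TransactionLog:
    """Append-only list of cryptographically linked transactions."""

    GENESIS = "0" * 64  # sentinel "previous digest" for the first transaction

    def __init__(self):
        self.transactions = []  # list of Transaction, append-only

    def append(self, subject: str, changes: dict, agent: str) -> str:
        prev = self.transactions[-1].digest() if self.transactions else self.GENESIS
        tx = Transaction(subject, changes, agent, prev)
        self.transactions.append(tx)
        return tx.digest()

    def verify(self) -> bool:
        # Recompute the chain; tampering with any transaction breaks a link.
        prev = self.GENESIS
        for tx in self.transactions:
            if tx.prev_hash != prev:
                return False
            prev = tx.digest()
        return True

    def derive_view(self, subject: str) -> dict:
        # Replaying transactions in order yields the current view of a
        # record; replaying a prefix yields any historical version.
        view = {}
        for tx in self.transactions:
            if tx.subject == subject:
                view.update(tx.changes)
        return view


# Example: a collection registers a record, a third party georeferences it.
log = TransactionLog()
log.append("specimen:example:12345", {"scientificName": "Rana sp."},
           "agent:collection-staff")
log.append("specimen:example:12345",
           {"decimalLatitude": -12.05, "decimalLongitude": -77.03},
           "agent:third-party-georeferencer")
assert log.verify()
print(log.derive_view("specimen:example:12345"))
```

Because every digest covers the digest of the transaction before it, rewriting any historical transaction breaks every subsequent link; this is what makes the log effectively immutable and append-only without blocks or a computationally intensive consensus mechanism, and replaying only a prefix of the log recovers any earlier version of a record.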
Questions to promote discussion
- What use cases can transactional publishing solve that can’t be solved by existing mechanisms?
- Is transactional publishing a solution to the existing problem of data integration and attribution?
- Any transactional mechanism is going to rely on unique identifiers for all data objects. How do the issues surrounding Topic 7 relate to transactional publishing? Are identifiers necessary (at least in the way we normally think of them)?
- What pieces of information do we need to capture to extend specimen records: annotations, georeferences, publications, GenBank sequences, Nagoya-related information (loans, gifts, accessions)? Who made these (agents)? When were they made (dates)?
- What infrastructure components are missing and needed to make this a reality?
- What changes to existing infrastructure (CMSs, aggregators, publishers, data flow, etc.) are needed to make this a reality? Can the existing pieces support transactional publishing? Why? Why not?
- To accommodate the full breadth of digital assets, data storage would likely need to be separate from the transaction chain. What are the implications of “off-chain” data storage, and what mechanisms need to be in place to accommodate it (see the content-addressing sketch after this list)?
- The goal is to have an open network enabling researchers to publish derivative data at will even if rejected by primary data sources. What measures need to be in place to ensure primary publishers maintain the authoritative views for a given record?
- How do specifics of Topic 8 (regulatory, legal, ethical topics) affect the open sharing of data and exposure of transactions in this fashion? How do we obscure data that should not be seen (threatened/endangered/CITES species, other sensitive data)?
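On the off-chain storage question above, one common mechanism is content addressing: a transaction carries only a cryptographic digest of a large payload (an image, a trait dataset, a bulky annotation), while the payload itself lives in external storage. A minimal sketch, with a plain dictionary standing in for whatever storage backend would actually be used:

```python
# A minimal sketch of content-addressed "off-chain" payload storage.
# The dict stands in for any real storage backend (institutional
# repository, object store, IPFS-like network); names are hypothetical.
import hashlib

off_chain_store = {}


def put_payload(payload: bytes) -> str:
    """Store a payload off-chain and return its content address."""
    address = hashlib.sha256(payload).hexdigest()
    off_chain_store[address] = payload
    return address


def get_payload(address: str) -> bytes:
    """Retrieve a payload and verify it still matches its address."""
    payload = off_chain_store[address]
    if hashlib.sha256(payload).hexdigest() != address:
        raise ValueError("off-chain payload no longer matches its hash")
    return payload


# The on-chain transaction stays small: it references the payload by
# digest, so the chain proves integrity without holding the bytes.
image = b"...binary image data..."
tx_changes = {"associatedMedia": put_payload(image)}
assert get_payload(tx_changes["associatedMedia"]) == image
```

Note what content addressing does and does not provide: it guarantees integrity (a retrieved payload can be verified against the digest recorded in the chain) but not availability, so replication and persistence arrangements are still needed. Conversely, a payload can be withheld while its digest remains public, which is one possible route for the sensitive-data cases raised in the last question.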
Information resources
- W3C Web Annotation Working Group (https://www.w3.org/annotation/)
- NASEM report – chapter 5 – starting on page 93 (https://www.nationalacademies.org/our-work/biological-collections-their-past-present-and-future-contributions-and-options-for-sustaining-them)
- BCoN Extended Specimen Network report (https://bcon.aibs.org/wp-content/uploads/2019/04/Extended-Specimen-Full-Report.pdf)
- BCoN Extended Specimen publication - Extended Specimen Network: A Strategy to Enhance US Biodiversity Collections, Promote Research and Education (BioScience)
- BCoN Data integration and Attribution workshop report (https://bcon.aibs.org/wp-content/uploads/2018/05/BCoN-Needs-Assessment-workshop-report-1.pdf)
- Webster 2017 – The Extended Specimen, especially chapters 1 and 13
- Page 2008 - Biodiversity informatics: the challenge of linking data and the role of shared identifiers (Briefings in Bioinformatics)
- Page 2008 - Visualising a scientific article (Nature Precedings)
- Van Rossum 2017 - Blockchain for Research
- Berendsohn & Güntsch 2012 - OpenUp! Creating a cross-domain pipeline for natural history data
- König et al. 2019 - Biodiversity data integration—the significance of data resolution and domain (https://doi.org/10.1371/journal.pbio.3000183)
- Thessen et al. 2017 - 20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration (PeerJ)
- Heberling et al. 2021 - Data integration enables global biodiversity synthesis (https://www.pnas.org/content/118/6/e2018093118)
- Franz et al. 2016 - Two Influential Primate Classifications Logically Aligned (Systematic Biology 65(4): 561–582, https://doi.org/10.1093/sysbio/syw023)
- Franz et al. 2019 - Verbalizing phylogenomic conflict: Representation of node congruence across competing reconstructions of the neoavian explosion (PLoS Comput Biol 15(2): e1006493, https://doi.org/10.1371/journal.pcbi.1006493)