10. Transactional mechanisms and provenance

If an annotator fixes a typo in recordedBy, are you suggesting they would need to republish the whole specimen record? Under a transactional model, the only things published should be a hash pointing to the previous recordedBy entity and the new value, along with a digital signature from the annotator. The previous recordedBy would then always persist but would not be included in the currently accepted view of the record.
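A minimal sketch of what such a transaction could carry, assuming hypothetical field names and using an HMAC as a stand-in for a real public-key signature:

```python
import hashlib
import hmac
import json

def annotate_recorded_by(previous_value: str, new_value: str,
                         annotator_id: str, signing_key: bytes) -> dict:
    """Build a transactional annotation: only the pointer to the old value,
    the new value, and a signature are published."""
    payload = {
        "target": "recordedBy",
        # Hash pointer to the superseded entity; the old value itself persists
        # elsewhere but is excluded from the currently accepted view.
        "previousValueHash": hashlib.sha256(previous_value.encode("utf-8")).hexdigest(),
        "newValue": new_value,
        "annotator": annotator_id,
    }
    message = json.dumps(payload, sort_keys=True).encode("utf-8")
    # HMAC is a placeholder; a real system would use the annotator's
    # public-key signature (e.g. Ed25519) so anyone can verify authorship.
    payload["signature"] = hmac.new(signing_key, message, hashlib.sha256).hexdigest()
    return payload

annotation = annotate_recorded_by("J. Smiht", "J. Smith",
                                  "https://orcid.org/0000-0000-0000-0000", b"demo-key")
```

The superseded value is never deleted; it simply stops being part of the accepted view, while anyone holding the appropriate key material can verify who asserted the change.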

@nelson. Depends who you’re calling the annotator in this scenario. I’m assuming, perhaps wrongly, that a data publisher does not itself make updates to a record as annotations but continues to republish its data via the usual routes, and that annotations made by others reside external to the data publisher in a centralized annotation store, with pointers back to individual records and specific attributes like recordedBy. I expect we’ll have a transition period where some providers continue to publish data via, e.g., traditional Integrated Publishing Toolkits, whereas others will use whatever infrastructure supports DES.

Regardless, we have previously mentioned “authoritative” views of metadata within DES – the originator of the record – and layers of enhancements on top. What concerns me is the relationship between wholesale or partially updated “authoritative” views from the originator of a record and the previous annotations that depended on particular values of attributes like recordedBy. If there are digital signatures to ensure that annotations remain relevant, then that’s fine and good. But when they’ve been rendered irrelevant, what do we do? It’s as much a technical problem as it is a sociological one. Many such decayed, mismatched digital signatures will eventually frustrate the annotator.

There has been some interesting discussion so far that seems to be getting at the building blocks of the DES system. So, in putting together a network of Digital Extended Specimens, what constitutes the building blocks of a “digital specimen”, and what types/classes of extensions do we need to keep track of and identify in order to make such a system work?
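As a thought experiment only (the classes and field names below are illustrative, not a proposed standard), the building blocks might reduce to a persistently identified core object plus typed, identified extensions that point back to it:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Extension:
    pid: str         # persistent identifier of the extension object
    kind: str        # e.g. "sequence", "image", "publication", "annotation"
    target_pid: str  # the digital specimen (or other extension) it extends

@dataclass
class DigitalSpecimen:
    pid: str                       # persistent identifier of the specimen
    authoritative_source: str      # originator of the record
    core_metadata: Dict[str, str]  # e.g. Darwin Core terms such as recordedBy
    extensions: List[Extension] = field(default_factory=list)
```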

And if I may, how do we maintain links between a digital specimen & the items in its extensions? Whose responsibility is that? Does the DES infrastructure include an army of bots that perpetually trawl links and fire off alerts to the parties that made them, so they can take action by either removing or repairing them against the “currently accepted view of the record”? We are all flooded with data quality issues in our core metadata, but here’s yet another managerial challenge that will undoubtedly require a human in the loop.

Please consider adding one or both of these articles to the “Information resources” at the top. They deal with issues of establishing and tracking congruence or non-congruence across evolving biodiversity data organization schemes (classifications, phylogenies). To me this is missing infrastructure.

Nico M. Franz, Naomi M. Pier, Deeann M. Reeder, Mingmin Chen, Shizhuo Yu, Parisa Kianmajd, Shawn Bowers, Bertram Ludäscher, Two Influential Primate Classifications Logically Aligned, Systematic Biology, Volume 65, Issue 4, July 2016, Pages 561–582, https://doi.org/10.1093/sysbio/syw023

Franz NM, Musher LJ, Brown JW, Yu S, Ludäscher B (2019) Verbalizing phylogenomic conflict: Representation of node congruence across competing reconstructions of the neoavian explosion. PLoS Comput Biol 15(2): e1006493. https://doi.org/10.1371/journal.pcbi.1006493

@hardistyar do you have the ability to add these to the resources above?

@abentley @nfranz done :slight_smile:

Greatly appreciated!

@SarahDavidson raised the same point about responsibility in topic 6.

Instead of focusing on “responsibility” (which will always be loaded with political, social, and governance issues), should we shift our focus to “interoperability”?

The idea here is inspired by Cory Doctorow’s and others’ writing on “Adversarial Interoperability” – “Adversarial interoperability is the technical term for a tool or service that works with (“interoperates” with) an existing tool or service—without permission from the existing tool’s maker (that’s the “adversarial” part).” I think Bionomia is a good example of this.

If anyone wants to build an army of bots to trawl links they can use the open protocol. A DES infrastructure does not necessarily need to provide all the services – it can provide the protocol and interoperable layers that can help others to build the tools (for example, a service to deal with data quality issues).
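For instance, here is a very small sketch of such a bot, assuming only that extension links dereference over HTTP; a production service would add rate limiting, retries, and whatever alerting the protocol defines:

```python
import urllib.error
import urllib.request

def check_links(links, timeout=10):
    """Return the links that no longer dereference, so the parties who made
    them can repair or retract them."""
    broken = []
    for url in links:
        try:
            request = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(request, timeout=timeout) as response:
                if response.status >= 400:
                    broken.append((url, response.status))
        except (urllib.error.HTTPError, urllib.error.URLError) as err:
            broken.append((url, str(err)))
    return broken
```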

I’m also in agreement that we need to be careful not to create another layer of data management and other organisational challenges. However, a certain level of cultural and organisational change is needed along with the new technical implementations.

I found this very interesting, with possible parallels to the ideas and pieces we are looking at developing: https://www.vero-nft.org/.

@hardistyar As you did for @nfranz, please add the following reference to the list of resources. If you are hesitant to do so, please do elaborate so I can learn more about your unique perspective on the topic.

Just to share some pragmatic examples of how an existing git-like system is already keeping track of biodiversity data archive versions (using Preston, Prov-O, and hash URIs) and is being used for data analysis studies and biodiversity informatics method development.
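For illustration only (a simplified shape, not Preston’s exact output), a download transaction can be captured as a few Prov-O-style statements tying together the retrieval activity, the source URL it used, and the content hash of what it produced:

```python
from datetime import datetime, timezone
from uuid import uuid4

PROV = "http://www.w3.org/ns/prov#"

def record_download(source_url: str, content_hash_uri: str):
    """Return simple (subject, predicate, object) triples describing one
    download transaction: what was retrieved, from where, and when."""
    activity = f"urn:uuid:{uuid4()}"
    now = datetime.now(timezone.utc).isoformat()
    return [
        (activity, f"{PROV}used", source_url),
        (content_hash_uri, f"{PROV}wasGeneratedBy", activity),
        (content_hash_uri, f"{PROV}generatedAtTime", now),
    ]

triples = record_download(
    "https://example.org/archive/dwca.zip",  # hypothetical IPT endpoint
    "hash://sha256/" + "0" * 64,             # placeholder content hash
)
```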

  1. Last Friday, after learning that my backup disk was full, I biked over to my local Target store to get a bigger hard disk to continue to keep a versioned, verifiable backup of biodiversity archives (e.g., DwC-As, EMLs) referenced by iDigBio, GBIF, DataONE, and various other data networks. This includes not only the versioned data, but also details of the download actions (aka download transactions/events) and associated archives. I was able to migrate the data to a new medium using run-of-the-mill tools like rsync. Because the archive is location agnostic and (cryptographically) verifiable, I was able to quickly copy most of the archive locally (hard disk → hard disk via local IO) and retrieve recent updates via my (crappy) consumer-grade DSL internet connection, without losing the ability to (locally) verify the integrity of the copy (a minimal sketch of this hash-based check follows after this list).

  2. A collaborator in Brazil uses specific versions to study how institutions and collections use DwC approaches to document species interactions (e.g., associatedTaxa, associatedOccurrences, ResourceRelationship). They copied versioned datasets of known provenance to an Amazon EC2 instance to run their analysis on Amazon’s servers. Fellow collaborators can reproduce their results using their own local copies of the exact same data (no need to pay Amazon for transferring data out of their infrastructure at ~$100 per TB).

  3. A collaborator in Florida was able to continue their work offline during the COVID lockdown by using a verifiably exact local copy (a clone on an external hard disk) of the versioned biodiversity datasets of known provenance, in combination with run-of-the-mill analysis tools and platforms running on their laptop.

  4. Some weeks ago, I found a collection unexpectedly dropping off the map. I traced the history and provenance of the datasets associated with the collection and found that the root cause was an IPT serving HTML (i.e., an error page) instead of the expected DwC-A zip archive, providing a well-documented example of a known IPT issue (“expected http status 404 on missing dataset, but got 302 (temporary moved) or 200 (ok) instead”, Issue #1427 · gbif/ipt · GitHub).

etc. etc.
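The verification mentioned in example 1 needs nothing exotic. A minimal sketch, standard library only, of the content-addressed check: any copy of an archive, whether made via rsync, an external disk, or a cloud transfer, verifies against the same hash URI.

```python
import hashlib

def hash_uri(path: str) -> str:
    """Compute a hash URI (hash://sha256/...) for a local file."""
    digest = hashlib.sha256()
    with open(path, "rb") as archive:
        for chunk in iter(lambda: archive.read(1 << 20), b""):
            digest.update(chunk)
    return f"hash://sha256/{digest.hexdigest()}"

def verify_copy(path: str, expected_uri: str) -> bool:
    """True if the local copy still matches its content identifier."""
    return hash_uri(path) == expected_uri
```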

I am sharing these practical examples in the hope of soliciting other practical examples from existing systems that use transactional mechanisms and provenance to solve biodiversity informatics challenges today (e.g., distribution of large datasets, citation of large datasets, change management, limited internet bandwidth, digital data preservation).

I’ve attached an example of the (bigger) medium I recently acquired to continue to keep copies of the biodiversity archives (and their provenance) catalogued by GBIF, iDigBio, and other biodiversity data networks. Let me know if you’d like to have a copy too …

A very good question. One reason to keep it is that when the next person stumbles upon that information somewhere, they can easily see that it has been determined to be erroneous and why, so they don’t try to add it back in.

Are the plant and pollinator really considered part of the same specimen? Where is the definitive definition of “specimen”?

I think they all “extend” the digital object and should be treated as the same “thing”.

Amen, we are already overworked

No, just the “different collection” part applies for those relationships, but the same principle holds true.

Perhaps not plant & pollinator but definitely bee and pollen.

I’m not quite getting the Digital Extended Specimen concept. I work as a scientist in a CentralBioHub lab, and I am a professional writer who loves writing articles on patient samples.