10. Transactional mechanisms and provenance

Yes, I’d agree keeping a log of changes to distinguish all the parties that contribute is highly desirable. The suggestion was more from the angle that if there are closed “source systems” that most people can’t edit but will continue to undergo internal changes, and an open environment that is available for editing and enriching the records, then we do have that separation to deal with (“source view” versus “community view”). I don’t think that means we couldn’t track the work going on and by whom, but there are multiple representations of the same entity.

@jhpoelen and others: Thanks for quoting many parts of my topic description and for the excellent prompting questions in this conversation, but I really should point out that they originate from the moderators of the topic @nelson, Bertram Ludaescher, @abentley, and @jmacklin. I mainly cleaned up the text and posted it here :slight_smile:.

I wish I had those thoughts! I think it would be interesting to see someone implement such a system so that we could all see what transpires. Although, the reality is that such a system exists (Wikidata) but it seems to me that we do not trust it, even if we use it.

add a 4th view of the record in addition to the current raw (whatever came from the source), verbatim (normalized view of source data), and interpreted (view after automated cleanup).

I’m not sure that adding more “views” is helpful. I am beginning to feel like we are creating multiple versions of reality and confusing everyone in the process. It seems like building a consensus would be a better path? In my mind, this is how the Wiki community (and iNat) work, leaving the community opinion as the current set of “facts”, subject to change with new information.

I like “data custodians”! Museums are generally considered custodians (not owners) of the objects in their collection - so this makes sense.

Me too. You cannot legally own data - so I’m told by lawyers. You can only ever be a custodian, guardian, steward, or controller of it. By what right could you or your institution then be regarded as the sole authority for or about that data? Even the latter case of being a data controller doesn’t generally confer rights as far as I know, but mainly obligations under, for example, data protection laws. Authority is a convention borne of necessity that has historically grown up as a consequence of having concentrations of expertise in specific collection-holding institutions, combined with the missions of those institutions to curate, research, and educate. Nevertheless (again, as far as I know), it has always been possible for other recognised experts to attach additional information to objects they are not custodians of.

I agree we must challenge the conventions, think outside the box, and design/deliver infrastructure that enables new, transformed, and combined physical/digital working practices, effective for and commensurate with the timescales on which collection-holding institutions are typically used to working, i.e., aiming to be fit for purpose for the next 100 years.

@nelson makes an assumption that Digital extended Specimens (DS) should be immutable digital objects. Choosing immutable versus mutable DS digital objects must be an explicit design choice, not an assumption, because that choice has all kinds of consequences. It determines what you store and how you store it. It determines how you design your principal digital objects (i.e., the DS) and the related objects (such as the different kinds of transaction object) that go along with them. It determines how you process objects, what it means to be a ‘machine-actionable’ object, and how you write the software programs that do that processing.

Do we want to keep every immutable object as it is transformed from one object version to the next? Do we want to keep every delta and rebuild the object on demand to the state it is needed in, e.g., at the time it was cited? Do we want a single mutable object that is always current, with access to the record of deltas and transactions that led to it or to any prior version of it? Each of these has different design implications for how the persistent identifier (PID) schemes of topic 7 must operate.
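To make the second option concrete, here is a minimal sketch of the "keep every delta and rebuild on demand" strategy. All names (SpecimenRecord, Delta, the field names) are illustrative, not a proposed standard: the point is only that a record can be reconstructed to the state it had at any cited moment by replaying its deltas.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Delta:
    timestamp: datetime
    changes: dict  # field name -> new value

@dataclass
class SpecimenRecord:
    pid: str
    base: dict                 # the original, immutable source state
    deltas: list = field(default_factory=list)

    def apply(self, changes: dict, when: datetime) -> None:
        """Record a change as a delta; the base is never overwritten."""
        self.deltas.append(Delta(when, changes))

    def state_at(self, when: datetime) -> dict:
        """Rebuild the record as it stood at `when`, e.g. when it was cited."""
        state = dict(self.base)
        for d in sorted(self.deltas, key=lambda d: d.timestamp):
            if d.timestamp <= when:
                state.update(d.changes)
        return state

# Hypothetical example: a re-identification recorded as a delta.
rec = SpecimenRecord("pid:123", {"scientificName": "Aus bus", "locality": "Site A"})
rec.apply({"scientificName": "Aus cus"}, datetime(2021, 6, 1))
# Rebuilding the state before June 2021 recovers the original identification.
```

The first option (keep every full version) trades storage for cheap retrieval; this one trades rebuild time for a compact, complete history; the third keeps only the head state plus the log.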

We must be careful not to put the cart before the horse by jumping immediately to specific technical solutions without first settling the proper model by which the ‘DES layer’ illustrated in the Background and context for phase 2 will function. Especially for infrastructure operating on very long timescales (as I mentioned above) the technologies can and will change.

I agree with @DESchindel that transactions are events in the life of an object (its diary as @DESchindel puts it) and that (like in the art world) the DS’s provenance is the history of those events. Logs and ledgers are the appropriate chronological way to record activities performed by agents on entities (the PROV model) and to visibly attribute those. But the logs/ledgers are separate from and sit alongside the detailed records of the events (loans, visits, annotations, interpretations, amendments, enrichments, etc.) themselves.
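The separation described above can be sketched as an append-only ledger whose entries point at detailed event records stored alongside it. This follows the PROV pattern of agents performing activities on entities; the class and function names here are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class LedgerEntry:
    entity: str      # PID of the digital specimen (the PROV entity)
    agent: str       # who acted, e.g. an ORCID (the PROV agent)
    activity: str    # e.g. "loan", "visit", "annotation", "amendment"
    detail_ref: str  # pointer to the full event record, kept separately
    timestamp: datetime

# The chronological ledger: visible attribution, separate from event detail.
ledger: list[LedgerEntry] = []

def record_event(entity: str, agent: str, activity: str, detail_ref: str) -> LedgerEntry:
    entry = LedgerEntry(entity, agent, activity, detail_ref,
                        datetime.now(timezone.utc))
    ledger.append(entry)  # append-only: the diary of the object
    return entry

# Hypothetical usage: an annotation event, with its detail stored elsewhere.
record_event("pid:123", "orcid:0000-0002-1825-0097", "annotation", "events/42")
```

The ledger answers "who did what, when, to which object"; the detail records behind `detail_ref` hold the substance of each loan, visit, or enrichment.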

The fundamental model we must define now is a FAIR Digital Object / [cloud]event model for the long term, one that is Web-compatible for the medium term yet technology-neutral; it definitely won’t be a solely Web-based model in the first place.

Very interesting Canonical Workflow Frameworks for Research (CWFR) meeting yesterday, with discussions at the intersection of workflows, FAIR Digital Objects, provenance, PIDs, and natural history collections.

One of the two presentations is available on Zenodo:
Stian Soiland-Reyes: RO-Crate, workflows and FAIR Digital Objects

Also of interest might be the Provenance Week conference, with a presentation (July 22) on provenance and natural history collections.

Update on the CWFR working group and its recent meeting:
until the end of July, the second presentation, the recording of the meeting, the collaborative document, and more materials can be accessed online here

I think there is a difference between data associated WITH the specimen and INFERENCES made based on the specimen. I agree that data associated with the specimen will have to be immutable, and any annotations or augmentations made to the specimen information will need to include versioning so as not to break that immutability. However, inferences made based on the specimen data, e.g., a niche model created using all records of a particular species, some of which have since been re-identified, are a different beast and would not need to be maintained by the DES system. I don’t see any issue with a record changing over time or being augmented, thus creating new linkages to other elements. I think that is the power of the system.

Yes, I think, as is being discussed in the legal and ethical thread, there will need to be an encrypted layer built into the system that would allow us to hide certain records behind a firewall, so to speak, with digital keys provided to those who have been given “authority” to access them. It would have to be a one-time access key to ensure that it is not shared or reused over and over again. This would allow records of sensitive species, CITES material, etc., or even parts of records (georeference information for sensitive plants or fossils), to be obscured from general view. A trickier scenario is when a record has previously been “released” and is subsequently hidden. As we all know, once it is out there, there is little chance of taking it back.
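The one-time key idea can be sketched in a few lines. This is illustrative only, not a security design (real systems would need encryption, expiry, and audit on top); the names and the sensitive-record example are hypothetical.

```python
import secrets

# Hypothetical store of obscured records, keyed by PID.
_restricted = {"pid:999": {"locality": "[sensitive site withheld]"}}
_issued_keys: dict[str, str] = {}  # token -> PID it unlocks

def issue_key(pid: str) -> str:
    """Grant an authorised party a single-use key for one restricted record."""
    token = secrets.token_urlsafe(16)
    _issued_keys[token] = pid
    return token

def redeem(token: str) -> dict:
    """Return the record once; the key is invalidated on use."""
    pid = _issued_keys.pop(token, None)  # pop = single use: cannot be reused
    if pid is None:
        raise PermissionError("invalid or already-used key")
    return _restricted[pid]

key = issue_key("pid:999")
record = redeem(key)   # succeeds exactly once
# redeem(key)          # a second attempt would raise PermissionError
```

The essential property is in `pop`: redemption and invalidation are a single step, so a shared or leaked key is worthless after first use.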

I think we need to start embracing community curation as a necessary component of modern collections management. We, as collection managers, cannot do it all and as such need to start letting others in to assist in the identification, augmentation and annotation of records that we are stewards of. Yes, this will be a little more work for the CM in terms of curating these and integrating them into the data life-cycle, but the payoff far exceeds the workload - and that is the selling point for anyone who asks - be they institutional administration, funders or the general public. I do agree that we are going to need to work very hard on capacity building to train and educate our workforce to be able to deal with the technical aspects of this part of our job but that is already happening to some degree. CMs can no longer live in a bubble and do their own thing. They are now obliged to publish their data to the outside world to keep their collection relevant (publish or perish) and learn the necessary skills to make that happen. Museum Studies students are in high demand as the next generation of collection managers for exactly that reason - they have these skills in abundance.

If the scope of the DES extends beyond museum collections, I think it would be important to consult on this with people who collect data or use them in implementing policy (e.g., field biologists/ecologists, curators, government agencies, tribal groups, and protected area managers) who aren’t necessarily participating in this infrastructure-focused consultation. I’ll attempt to represent some of what I have heard from these parties here.

Many ecologists are supportive of open data in general, but hesitant to make their data completely open, due to concerns that there are a lot of ways for complex, heterogeneous ecological data to be misunderstood by people who are not familiar with the biological systems and data collection methods. Some have seen their open data used by others without proper attribution, or have had to address stakeholder/PR/media confusion caused by data re-use that reported results they consider incorrect interpretations of data they collected.

For complex data, and for data being used for implementing management and policy, data collectors/owners/stewards often see it as critical that users discover and access a full dataset together with other contextual information about the work, clear instructions for attribution, etc. They can present this as they consider appropriate in a museum exhibit or on a web page. However, with APIs and automated data linking, it can become difficult or impossible for this context to be passed on to data users, and individual records can be accessed, mirrored, used in derivative works, and presented out of context.

Too much required transparency, or tools operating on the assumption that derivative works hold the same value as those from the people who collected the data, could reduce interest in and adoption of DES infrastructure. I think these stakeholders could provide valuable feedback on how tools could be designed to maximize participation and beneficial data use.

True, but that’s an oversimplification of the ecosystem we’re hoping to build. I’m assuming that the curation of links that make the ‘E’ in DES will be shouldered by the “community”. There is a maintenance burden to these links that cannot be the sole responsibility of collections data managers; the utility of links depends on the stability of the semantic content at both ends of a link. If the ground shifts at either end (because, e.g., the identifying metadata or features of a node have changed through innocent correction of a transcription error), the link ceases to be verifiable and ceases to be trusted. And you’ve deflated the motivation for building links when they’ve rotted through no fault of the party who built them.

Versioning of identifying metadata is a possible band-aid to achieve specificity when we cannot define what a specimen is, nor what its canonical feature set is. But how would we ever impose versioning of the nodes at both ends of a link? We wouldn’t. We couldn’t. What’s in it for an external entity to help us make the ‘E’ in our DES if our ‘S’ part is inherently unstable or versioned? Does anyone link to a specific version of a Wikipedia page? A specific version of a Wikidata item? A specific version of an identification on an iNaturalist observation? A specific version of a dataset published from an IPT? I expect there are rare edge cases where someone has linked to a specific version & not the head version, but they are not at all the norm.

I suppose what I’m saying here is that we desperately need to formally solidify what is the ‘S’ in DES before we can be serious about what is the ‘E’ and who builds it, whatever the provenance or transactional model. It’s telling that we’ve not yet done this even though organizations like GBIF have been in existence for decades and we have exchange standards to shuttle metadata.

Yes, I agree, an oversimplification, but just one example of many, each with many nuances. I also agree that we need to figure out what our units are. I still have a hard time deciding what represents a unit of what we want to extend. I think part of the problem is that there are so many variables regarding how the data are generated and by whom. I tried to conceptualize my thinking in this video but am still not sure I have it right - https://mediahub.ku.edu/media/t/1_49oo4dh4. Comments welcome.

I also think we wouldn’t need to version everything. It would only be for those things that may change over time like an annotation or a georeference or something. We wouldn’t necessarily need versioning for a publication or an image or CT scan that would stay stable over time.

If something is truly in error, would we or should we need to keep the erroneous version?

I am not sure that versioning is the correct terminology here. The basic unit would stay the same, as would the pointer to it, but all additions, corrections, and augmentations would simply be transactions on that unit that you would be able to trace. In the Wikipedia page example, you would still point at the page, but there would be a paper trail of all changes made to it. That is the S. Of more concern to me is what happens when different parts of a specimen get published by two different collections but are clearly parts of the same specimen - tissue ends up in one collection and the voucher in another (the same is true for host:parasite or plant:pollinator) - and they get published as two separate entities. What is the S in that scenario? Each individual piece, or the whole?

I think the power of the DES network as we have envisaged it addresses all of these “concerns” in that it will provide all of the contextual and related information in a completely transparent manner (where possible) in order to inform the correct use and interpretation of the data by all involved. It will also address the attribution side of things using the same mechanisms by providing clear provenance to data, a clear indication of who is using what and for what reasons, and the necessary attribution for contributions to data curation, publication, and dissemination. Currently, this is a minefield that is difficult to trace - especially for collections trying to advocate for their existence as critical research infrastructure.

These are very good questions & they address the very core of what we envision as participating in both ‘S’ and ‘E’ in a DES and what, if any, fine divisions exist between the two. As you write, there are instances where either is a suitable home for what we believe to be a ‘link’ and what we believe to be a ‘class’ of object.

If I have correctly interpreted what you’ve discussed here and elsewhere, you tend to think of the participant nodes at the other end of an ‘E’ link as being a class of object: a publication, a derivative, a CT scan, a 3D model, a sequence, etc. - an evidentiary object that enhances the circumscription or the knowledge space of ‘S’. To a degree, it need not matter what the metadata for the parent, proximate ‘S’ node in a DES is, because there is generally a gestalt appreciation for what the object is.

What I’m also trying to figure out is whether or not other kinds of links – I’ll call them ‘e’ – can likewise participate in this space. These little ‘e’ links are additive enhancements to the very metadata of ‘S’ but do not cleanly represent a join to a class of derivative object like a publication or a CT scan. For example, links to ORCID or Wikidata for collectors, links to an entry in Catalogue of Life for scientific namestrings, or links to GeoNames for localities are very different beasts. Some of these may be served directly by the publisher of the ‘S’ if we were to incrementally relax the schema of what an ‘S’ is to accommodate them. And we have a vehicle through TDWG and Darwin Core to do that. To varying degrees, all these potential linking enhancements to the metadata of ‘S’ are stable at the other end; they all bear URIs and there’s organizational commitment to ensure longevity.

However, the metadata in ‘S’ – the anchoring points for these small ‘e’ enhancements – are far less stable. In the absence of versioning, any local adjustment to the content of their cells (e.g., a collection manager “fixes” a locality string) risks breaking the truthiness of the little ‘e’ links to people, taxon concepts, places, or any other similar link made to textual, verbatim values in the metadata, if the publisher had neither created them nor assumed any responsibility for their maintenance. And it’s entirely possible that these little ‘e’ links may begin to stack up on one another like extensible stove pipes, each emerging from a single ember of coal. Persistent identifiers for collectors are a nice example: people have many of them.

Do we need to be more prescriptive in what class of link/object can participate in ‘E’, ones that are forgiving of or immune to shifting metadata that prescribes ‘S’? The little ‘e’ links I’ve described above lean more toward ephemeral annotations or the creation of new properties of ‘S’. Are these eligible for transactions in a DES just as we might have transactions on links created for the more evidentiary ‘E’?

Not necessarily. The E could be an annotation coming from a third party, an augmentation of the data through georeferencing by the collection, or, as you say, a link to an ORCID for the collector, determiner, etc.

I do like the “E”/“e” analogy, though, and wonder whether we could use it to somehow classify or better circumscribe the kind of extension: whether it is a product (E) or an annotation/augmentation/correction (e) to the original data. Or do we need to make the distinction at all, rather than just treating them all as different kinds of the same thing? I think where it does become a little trickier is in scenarios where you are linking something that has more to do with the taxonomic concept than with an individual specimen (in the case of a taxonomic authority), and I would see that as linking multiple DESs together to answer a broader question, as I put it in my video.

I think we can all agree that annotations (or at least supporting them) are good. At the very least, they’d demonstrate that there are people who do want to produce them.

I need help understanding the interplay & the implications for when annotations get built on top of shifting metadata. Here’s a common scenario: an annotator adds new metadata (e.g., an ORCID for a collector) into a pool of annotations; the source provider then changes the data (“oops, sorry, made a mistake in transcription - wrong name typed in recordedBy”) & republishes a new & improved ‘S’. Does that little ‘e’ annotation persist in the pool if it was never absorbed into the canonical source ‘S’? It was rendered illogical through no fault of the annotator. This is where I see transactions falling short - the events do not overlap in a short-lived, temporal sequence of cause and effect with clearly delineated, shared responsibilities.

Instead, it’s perhaps more fruitful to turn our attention to how to best atomize the anchor points of little ‘e’ links as computed hashes. One end resides in the source, the other identical end resides in the object of the annotation. The moment either of those ends diverge, the annotation is flagged as suspect & in need of repair. Other metadata elements in ‘S’ can be merrily updated from source without affecting the veracity of this little ‘e’ link. And so yes, we need a persistent identifier for the ‘S’ but we also need atomized anchor points for each of the key:value pairs of metadata with computed hashes as standard sockets into which we can plug an annotation. When the voltage changes, we can trip the circuit and pull the plug.
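The socket-and-plug idea above can be sketched concretely: each annotation carries a hash of the exact key:value pair it was made against, and validity is just hash equality. The field names and record content here are hypothetical; only the mechanism matters.

```python
import hashlib

def anchor_hash(key: str, value: str) -> str:
    """Compute the anchor: a hash of the exact key:value pair being annotated."""
    return hashlib.sha256(f"{key}={value}".encode("utf-8")).hexdigest()

# Hypothetical source 'S' metadata and a little 'e' annotation plugged into it.
source = {"recordedBy": "C. Darwin", "locality": "Down House"}
annotation = {
    "target_key": "recordedBy",
    "anchor": anchor_hash("recordedBy", source["recordedBy"]),
    "body": "orcid:0000-0002-1825-0097",  # the enhancement itself
}

def annotation_is_valid(src: dict, ann: dict) -> bool:
    """The annotation holds only while source end and anchor end still match."""
    current = anchor_hash(ann["target_key"], src[ann["target_key"]])
    return current == ann["anchor"]

# Updating an *unrelated* field does not trip the circuit...
source["locality"] = "Down House, Kent"
assert annotation_is_valid(source, annotation)

# ...but changing the anchored value diverges the hashes, flagging the
# annotation as suspect & in need of repair rather than silently wrong.
source["recordedBy"] = "Charles Darwin"
assert not annotation_is_valid(source, annotation)
```

Because the anchor is computed per key:value pair rather than per record, the rest of ‘S’ can be merrily updated without invalidating annotations that never depended on it.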