Structure and responsibilities of a #digextspecimen

This is a subtopic of topic 1 on making FAIR data for specimens accessible. It discusses the structure (logical and implementation) of a digital extended specimen and the allocation of responsibilities for the various parts of such a construct. It begins with this sketch from @hardistyar:

3 Likes

Sketch from @nickyn :

In terms of what we do now:

  • first “authoritative info” layer shared in aggregators like GBIF
  • data in “supplementary info” also shared but often unlinked (duplicate / derived specimens, etc.)

@hardistyar your diagram includes sections for images / extensions / permitted operations / payloads - could these interact with any of these three core “layers”?

I suppose this distinction is most relevant when looking at the boundaries for responsibilities & revision control: if an element in the core (the “what”) is updated, then the related info (if sourced via the taxon) must also change.

@dshorthouse : can you map your questions re responsibilities to this?

3 Likes

I have another diagram or two to add but I’ve not drawn them yet.

In @nickyn’s sketch the 3 main elements are disconnected, whereas in my sketch they are logically integrated under a single PID, even if the implementation is in several parts with separated responsibilities.

The PID kernel information (PID record attributes) contains all you need to direct you (or your software) to any constituent part. I named these _sections in my sketch and indicated how each relates to the primary, secondary and tertiary extensions of the extended specimen concept. We just have to decide where each part/_section lives, and who/what is responsible for it.

The permanent (immutable) connection to the physical specimen is made in the authoritative a_section. Otherwise, sections are mutable.
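
To make this concrete, here is a minimal, purely illustrative sketch (in Python, not part of any agreed openDS specification) of how PID kernel information might point a client at each _section; all attribute names, PIDs and URLs are placeholders:

# Hypothetical sketch of PID kernel information for a digital specimen.
# Attribute and section names are illustrative only.
pid_record = {
    "pid": "prefix/ds-123",                    # the single PID for the whole digital specimen
    "digitalObjectType": "DigitalSpecimen",
    "sections": {
        "a_section": {                         # authoritative info; immutable link to the physical specimen
            "pid": "prefix/ds-123-a",
            "location": "https://repo.example.org/authoritative/ds-123",
        },
        "i_section": {                         # images (primary extension)
            "pid": "prefix/ds-123-i",
            "location": "https://media.example.org/specimens/ds-123",
        },
        "s_section": {                         # supplementary info (secondary extensions)
            "pid": "prefix/ds-123-s",
            "location": "https://repo.example.org/supplementary/ds-123",
        },
        "t_section": {                         # related data (tertiary extensions)
            "pid": "prefix/ds-123-t",
            "location": "https://repo.example.org/related/ds-123",
        },
    },
}

A client would resolve the top-level PID, read this record, and follow the pointer to whichever section it needs.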

1 Like

“Yes but …” is the answer. In addition to my previous reply, the operations defined for the DS will help you get at what you need without necessarily having to know physically where that lives. So, if you want the images of the specimen, you just request them and the request_images operation would return them to you without you having to know whether they have been deposited in (for example) an institutional media management system or a national repository. Some of the images might be in one place, some in another.
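
As a rough illustration only (and assuming the hypothetical kernel-record layout sketched earlier), such an operation might look something like this; resolve_pid and the field names are assumptions, not a defined API:

# Hypothetical sketch of a request_images operation.
def resolve_pid(pid: str) -> dict:
    """Placeholder: look up a PID record (e.g. in the Handle System). Not implemented here."""
    raise NotImplementedError

def request_images(ds_pid: str) -> list:
    """Return the locations of all images of a digital specimen, wherever they are stored."""
    kernel = resolve_pid(ds_pid)
    image_record = resolve_pid(kernel["sections"]["i_section"]["pid"])
    # The caller never needs to know whether an image sits in an institutional
    # media management system or a national repository.
    return [image["location"] for image in image_record.get("images", [])]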

We have to define important operations we need, and these can have implementation repercussions. For example, an operation that applies a classifier to some images may need computational capacity close to where those images are - such as at a national-level facility.

1 Like

@hardistyar Yes, I think this is an important distinction to make, in that some of these resources may come from different sources and may involve different actors. For instance, some images or CT scans may be produced in-house as part of digitization efforts, whereas others may be produced by the research community as part of research on specimens requested on loan. The same is true of GenBank sequences and other “products”. It is also true that some of these products may have different identifiers (GUIDs, PIDs, DOIs, etc.) than the physical specimen, and as such it is important to make the association between these different elements in order to maintain the breadcrumb chain of custody and association - think DOIs for publications and datasets, GenBank numbers, etc.

Good catch. Revised version, where these are nested:
I suppose supplementary & related info may also sync with the institutional CMS.

2 Likes

And therein is the nub of the problem because different CMSs are more or less capable of doing that.

The sketch above expands on the previous sketches. It shows how all of the parts (a_, i_, s_, t_, etc.) that go to make up a #digextspecimen, or are related to it (such as annotations, loans/visits transactions, etc.), can be collated together under a single DOI for the digital specimen, with each part having its own PID to unambiguously identify it. Even as I look at the sketch after drawing it, I can imagine how some of it could be different, but it conveys the general idea of how things fit together.

The sketch below explains a little of the content of the PID record.

1 Like

Alex @hardistyar and I have been jotting down notes about the allocation of responsibilities for each part of a digital extended specimen. Below are a few of these, but recognize that this is incomplete.

Parts of an openDS & Expected Roles and Responsibilities

PID/Header Info (PID and its PID Record in the Handle System)

Created and maintained by a Handle Registration Agency (RA) acting as an agent on behalf of natural sciences data publishers. The RA captures the authoritative information provided by a data publisher, registers the PID, keeps the PID Record up-to-date, and maintains the connection between the PID/PID Record and the authoritative data. This authoritative information becomes part of the searchable metadata for the purpose of discovery and access.

a. Authoritative information
Generated by a natural sciences data publisher. Includes human-readable* data about the specimen. In the future, this information is expected to be compliant with minimum information requirements (MIDS) and is that commonly found today in Darwin Core with particular emphasis on terms like institutionCode, collectionCode, and catalogNumber that are expected to have stability. Other terms must also be present here that circumscribe the “who”, “what”, “where”, and “when” for the physical specimen. Part of the authoritative information must be an identifier (physicalSpecimenId) and institution (code) details that establishes the connection between the DS and its corresponding physical specimen. Both pieces of information are needed because physicalSpecimenId is not always unique.

Authoritative information cannot be adjusted by any agent except the authority, which is generally expected to be the custodial institution. The authority is expected to synchronize this authoritative information in near real-time with what resides in its local collection management system.

*Machine-readable data is also required, but who is responsible for preparing it needs further study.

A data publisher is responsible for telling the RA if/when the authoritative data changes, including when the specimen changes its physicalSpecimenId and/or moves to a different institution. In the latter case, this responsibility transfers to the new data publisher. Thus, the RA and the data publisher are contractually jointly responsible for persisting the DS/PS connection. Note that this is the same situation as for journal articles and datasets today: the publisher and the RA (Crossref, DataCite) are each responsible for one of the two elements - providing the correct metadata and registering the DOI - but they are then jointly responsible for dealing with changes.

i. Images
Images are, in general, the responsibility of the natural sciences data publisher. However, third parties are not precluded from creating images that should be appended to the DS.

s. Supplementary Information
Supplementary information can be added by any appropriately authorized person or organisation.

t. Related Data
Same as supplementary information above.

Hi David, I think it is close, but I do not fully agree with this view.

  1. I think that it is the responsibility of the data publisher (the custodial institution or person) to keep the PID record up to date, not the RA. E.g. if the custodial institution changes, like when a specimen is moved from one institution to another, then the institution needs to notify the RA to make the change or use available tools to make the change itself.
  2. Although the ‘original’ authoritative information should always be there, it should be possible to provide alternative views/versions, for instance a ‘cleaned’ view resulting from GBIF data quality checks. How authoritative these alternative views are is up to the user to decide, and we should guide the user in that decision with provenance data. It is not a matter of adjustment but a matter of providing different versions.
  3. MIDS can include institutionCode, collectionCode and non-resolvable catalogNumbers as well as resolvable counterparts, like a ROR for an institution or a DOI for a collection. When available, resolvable identifiers should have preference.
  4. catalogNumber may be a confusing term, as we need the identifier of the individual object, not what is described in Arctos as a catalogued object (which may be divided into multiple individual objects). In Arctos the catalogued object is also referred to as the specimen, but in ES/DS we use the term specimen for the individually curated object. A survey held during the last TDWG meeting showed a clear preference to see the individually curated object as the specimen, and to use this term from the gathering event onwards (when not yet accessioned).
  5. It looks like for the different sections we need both an authoritative part and a non-authoritative part. E.g. the images provided by the institution are authoritative and additional images provided by a researcher who has the object on loan are non-authoritative. This raises an issue with responsibility: the researcher then needs to mint a new DO for the image, and that image could be served from e.g. his own Flickr account, but I am not sure if that is a good idea. Perhaps there should be some requirement to use a certified repository in such a case (a sketch of such a mixed image section follows below).
  6. It might be good to mention that the image section would only contain images of the object. Other images or media, such as sound recordings of a frog that was collected or an image of a field notebook page, will be in the s or t sections.
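
As a purely hypothetical sketch of point 5 (field names are not a specification), an i_section mixing authoritative and third-party images might look like:

# Illustrative sketch of an i_section, distinguishing authoritative images from
# third-party contributions via provenance fields; all values are placeholders.
i_section = {
    "images": [
        {
            "pid": "prefix/img-001",
            "location": "https://media.institution.example/img-001.tif",
            "providedBy": "custodial institution",
            "authoritative": True,
        },
        {
            "pid": "prefix/img-002",
            "location": "https://certified-repository.example/img-002.tif",   # ideally a certified repository, not a personal account
            "providedBy": "researcher who had the object on loan",
            "authoritative": False,
        },
    ],
}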

@waddink @dshorthouse I agree with all of this, and the beauty of a transactional system is that it would allow for the variations outlined here. Original records could be updated/edited/augmented through transactions on the original record. These transactions would be exposed to everyone, and those CMSs that have the capability could update their records with this information, but they do not need to, as the authoritative record of connections would be kept by the broker system inherent in the Digital Extended Specimen concept. That way, anyone can contribute to a record with a transaction. A collection could add an image or a researcher could add an image. A collection could create a linkage to a publication, or the publisher could do that automatically through unique identifiers, etc.
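
To illustrate the idea only (structure and field names are hypothetical, not a proposed standard), transactions appended to a digital specimen record might look like:

# Illustrative sketch of an append-only transaction log for one digital specimen;
# every transaction is visible to everyone, and a CMS can replay the ones that
# target its own records if it wants to repatriate the information.
transactions = [
    {
        "transactionId": "txn-0001",
        "targetPid": "prefix/ds-123",
        "agent": "https://orcid.org/0000-0000-0000-0000",   # placeholder ORCID of the contributor
        "action": "addImage",
        "payload": {"imagePid": "prefix/img-002"},
        "timestamp": "2021-03-01T12:00:00Z",
    },
    {
        "transactionId": "txn-0002",
        "targetPid": "prefix/ds-123",
        "agent": "publisher-system",
        "action": "linkPublication",
        "payload": {"doi": "10.1234/example"},
        "timestamp": "2021-03-02T08:30:00Z",
    },
]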

1 Like

I have been thinking for some time about how this then fits into our existing ecosystem of data publishing and use. Some time ago I tried to articulate what our community looks like through an image created in Prezi, resulting from an early BCoN data integration exercise - BCoN diagram2 by Andrew Bentley. However, the more I think about it, the more I like the analogy of a tree in this image.

The roots are the data providers (collections in most cases). These then feed into a data store. Currently we have a multi-trunked tree where each aggregator holds its own cache of the data, which results in multiple different copies of the same data. Ideally we need to move towards a single data store that is not owned by any of the aggregators but is a stand-alone store, described by the Digital Extended Specimen transactional concept in @hardistyar’s and @nickyn’s diagrams above. This would provide a single access point for data coming from the collections below and would act as a broker for validating and storing unique identifiers for objects, as well as for integrating linkages to other related objects coming from numerous different sources.

The aggregators are then the branches of the tree and act as a filter and UI on the data store, providing data to the leaves of the tree, which are the end-users of the data. The aggregators can innovate and alter their UI to represent the data in numerous different ways to the end-user community, but the data store is always the authority for the data. As @waddink mentions, there would need to be a layer of abstraction on this data store to allow for the hiding of certain data behind a gatekeeper of sorts (threatened and endangered species, obscuring lats and longs for paleo collections, etc.), and the key to that gate could be provided to validated users who need access. Other sources of data (taxonomic authorities, GenBank, IsoBank, etc.) could also be linked to the data store in various ways.

@dshorthouse’s roles and responsibilities work obviously feeds into this, describing who fits in where and what social roles the various actors in this diagram have and should contribute. The social aspect of this is going to be extremely important, in that everyone will need to play the same game and agree to this system.
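
As a minimal sketch of the gatekeeper idea only (the sensitivity flag, permission name and rounding rule are all assumptions for illustration):

# Illustrative gatekeeper: generalise sensitive coordinates unless the
# requesting user holds a (hypothetical) permission to see them.
def apply_gatekeeper(record: dict, user_permissions: set) -> dict:
    """Return a view of the record, with sensitive fields obscured if necessary."""
    if record.get("sensitive") and "view_sensitive_localities" not in user_permissions:
        redacted = dict(record)
        for key in ("decimalLatitude", "decimalLongitude"):
            if redacted.get(key) is not None:
                # Round to ~0.1 degree so the exact locality is not exposed.
                redacted[key] = round(redacted[key], 1)
        return redacted
    return record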

4 Likes

@hardistyar Replying to Making FAIR data for specimens accessible - #57 by hardistyar
This subtopic seems better suited to my question, so I am replying here.

Thank you for your explanation and the link to the webpage by Tim Berners-Lee.

My question was at a logical/user level, I recognize now. The question is whether links themselves get an existence of their own. That is, do they get an identifier and “traits” (attributes, children - I am out of my depth here)? Thus, once you have a valid link, can this link have meaning of its own?

Concrete example: I am interested in implementing one or more chains of custody. These need to span the path of the specimen and its information from collection in the field, through lab work, genomics, isotope analyses, morphology-anatomy, etc. and statistical analyses, to official reporting for conservation applications, e.g. evidence accepted in court trials, management decisions, national planning, etc.

A while ago I started designing a graph visualizing the path of the CoC through the ecosystem of existing biodiversity, genomics, statistical, etc. software and applications. This is a first draft NetworkGraph_20200727.pdf (53.7 KB)

In this graph, the links themselves carry information: Does an API exist or does a user need to manually “carry” the result of one program to another software and import/open it there? Additional information might be: how often do users use this link? Was this link validated by an authority? Can some quality measure be attached to this link? Etc.

As I understand it, so far we are talking about the nodes or objects, that is, the information content and structure associated with a digital object. Will it be important to consider the links between these objects as entities in their own right, too?

3 Likes

@abentley Thanks for the beautiful visual way of thinking about the DS/ES infrastructure. I enjoy it a lot.

Here is my question: How does the community see the overall structure of an implemented DS/ES infrastructure?

  1. Will it be a centralized system, built around one or a couple of central data storage facilities, comparable to the INSDC network?

This might be what the tree represents.

  2. Will the infrastructure’s design theme be one of data federation, allowing a myriad of ways to store data (central INSDC-like cloud storage, proprietary nodes, subject-driven storage, policy-driven access points, etc.) in a multitude of ways (normalized vs. denormalized, preferred standards, etc.)?

In the latter case, some central indexing hub/facility/agency seems necessary to be able to find data (cf. the decentralised backbone of the internet combined with an ICANN-like role, extended by in-depth indexing).

Here, you might not have one trunk, but several trunks. The image might change from one tree into a hedge, a digital sibling of Darwin’s “entangled bank”. In this picture the soil is the DS/ES concept - supporting and nourishing all parts of the hedge, from decomposing fungi to shrubs and trees, from tardigrades to birds and leopards.

Nevertheless, the tree image can work for a model of data federation, too, if the tree trunk is only representative of the abstract DS/ES concept that connects everything and binds it all together. Actual data storage will be with the data providers and/or the aggregators.

A look across the fence into the realm of human genomics suggests that after two decades the principle of data sharing has run into trouble. Here, it appears that data federation is considered as a way forward.

See the paper by K. Powell this January in a series by Nature commemorating the 20th anniversary of the human genome. The title and subtitle summarize its tenor quite well:
How a field built on data sharing became a tower of Babel -
The immediate and open exchange of information was key to the success of the Human Genome Project 20 years ago. Now the field is struggling to keep its data accessible.
https://www.nature.com/articles/d41586-021-00331-5

Thus, do we build a centralized system, incorporating from the beginning solutions for the problems that the biodiversity community and, e.g., human genomics have by now experienced? Or do we design and implement a federated system that requires its own set of solutions?

3 Likes

@abentley While I like the tree analogy as a representation of the potential linkage between the data providers, a data store, and the user community, I do think that some additional thought needs to be given to the data interchange between the data store and the data providers. To take the tree analogy one step further, the xylem (up) and phloem (down) structure of typical tree vasculature provides a very good representation of this idea.

While it’s clear that the collection owners will contribute info about the specimens upwards to the data store, how “nutrient-rich” will the information that flows downward to the owners be? After hearing about the idea of the data store, several of my colleagues were wary of it - once the data is “out there”, would the collection itself still be considered relevant? I would say “yes” - IF we can attain a reasonable structure for both attribution and annotation, and owners can decide what info they want to repatriate via that downward flow.

4 Likes

@Rich87 Yes, agreed. Whatever system is put in place needs to accommodate the transfer of information TO the collections as well as from them. The annotation thread is discussing this in more detail. The transactional system I am envisaging would accommodate that: all transactions would be completely transparent to everyone, so collections would be able to see not only additions to a record as linkages (GenBank sequences, publications, etc.) but also all annotations made to a record by any actor in the data chain. The collection CMS could then interact with the extended digital specimen system to repatriate those back to the CMS if necessary. The collection CMS will always hold the authoritative base record information for a specimen.

1 Like

To give a little more context to the tree diagram and comments from @Rich87, I have been thinking about the idea of evolving and dynamic schemas. I think there are industry standards and best practices used in enterprise systems that would be worth looking into for inspiration.

I understand that we will need to have policies and social contracts in place for when data publishers change something. But one of the ideas behind Digital Extended Specimens is the dynamic and evolving nature of the ecosystem. How do we address that in our schema structure? Do we need some sort of schema evolution and enforcement mechanism? Schema enforcement can reject any schema changes that aren’t compatible; this sets a high standard but is restrictive when it comes to dynamic changes.

For simplicity’s sake, let’s assume we have a data schema and an ingestion/harvesting system. We have the following schema for one of the supplementary items, and behind the schema there is an object repository where the values are stored:

{
  "properties": [
    {"name": "identifier", "type": "handle"}
  ]
}

And this schema has been exposed to thousands of clients and agents that are actively using it. After a while, the data provider adds a new field:

{
  "properties": [
    {"name": "identifier", "type": "handle"},
    {"name": "creationDate", "type": "date"}
  ]
}

What happens when the DigExtSpecimen system encounters the new object? A few options:

  1. The system enforces the current schema and rejects the changes.
  2. The system takes the data but rejects the new value.
  3. The system figures out the intended schema change and updates accordingly. At that point, do we have two versions of the schema running, with some clients using the old schema with new data and vice versa?

This might be more on the implementation side than modelling; however, we do need to consider these scenarios early on.
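
Purely as an illustration of the trade-off (the rule chosen here - accept added fields, reject removed or re-typed ones - is one possible policy, not a proposal):

# Sketch of a simple compatibility check over schemas shaped like the examples above:
# added fields are tolerated, while dropping or re-typing an existing field is rejected.
def is_compatible(old_schema: dict, new_schema: dict) -> bool:
    old_props = {p["name"]: p["type"] for p in old_schema["properties"]}
    new_props = {p["name"]: p["type"] for p in new_schema["properties"]}
    for name, old_type in old_props.items():
        if name not in new_props:
            return False    # dropping a field would break existing clients
        if new_props[name] != old_type:
            return False    # changing a type would break existing clients
    return True             # extra fields in the new schema are accepted

old = {"properties": [{"name": "identifier", "type": "handle"}]}
new = {"properties": [{"name": "identifier", "type": "handle"},
                      {"name": "creationDate", "type": "date"}]}
assert is_compatible(old, new)   # the added creationDate field is accepted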

2 Likes

@waddink What is said by @dshorthouse and myself about responsibilities for keeping the PID record up to date is the general principle and is correct. The RA is responsible for altering the PID record, while the publisher commits to notifying the RA of changes needed. Two aspects can affect this basic contractual agreement: i) the available tools by which notification and alteration take place, and ii) delegation of the RA’s responsibility to an Allocating Agent that is the publisher.

@JuttaBuschbom I see the trunk of the tree as being a cloud more or less enveloping the ideas of the ‘DOI for a DS’ sketch I introduced earlier. Within that cloud, different subsystems can be implemented in a centralised or distributed manner, or a combination of the two. The important facts are that i) there are access points/entities at the edge for users, institutions, aggregators, etc., and ii) topologies of implementation (central, distributed, federated, etc.) are separate from arrangements of governance (centralised, decentralised), which usually drive decisions about the former.

1 Like

@JuttaBuschbom Indeed, a link can have an object associated with it that carries all the information about the link and that object has its own definition and PID and would itself be FAIR, hence a ‘FAIR Digital Object’. Note that a Digital Extended Specimen is also a kind of FDO.
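
As an illustration only (the type and attribute names are invented for the example, not a defined FDO profile), such a link object could carry exactly the kind of information asked about above:

# Hypothetical sketch of a chain-of-custody link modelled as its own FAIR Digital
# Object, with its own PID and attributes; all values are placeholders.
link_fdo = {
    "pid": "prefix/link-042",
    "type": "ChainOfCustodyLink",
    "source": "prefix/ds-123",                              # e.g. the digital specimen
    "target": "https://doi.org/10.1234/derived-dataset",    # e.g. a derived dataset
    "hasApi": True,                     # or must a user carry the result across by hand?
    "validatedBy": "some authority",
    "usageCount": 57,
}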

Workflows for research give rise to chains of custody as you describe. Steps in the workflow represent your nodes and interfaces between steps represent the links. In current work on a canonical workflow framework for research (CWFR) (see the CWFR Position Paper on OSF), these links/interfaces are addressed exactly as I describe above, with a kind of FAIR Digital Object that describes the outputs of one step as the inputs to the next step of the workflow.

We’re currently seeking abstract proposals for a journal special issue on this topic. The deadline is only a few days away, I’m afraid, but if you think you can come up with a 1-2 page abstract as the basis of an article on CoC and what needs to be done from this CWFR angle, that would be quite interesting, I think. It has wider applicability than just our present domain of discourse.

1 Like

@abentley This is the scenario that the FilteredPush team prototyped over a decade ago. We used a centralized graph store and the simple semantics from the W3C Open Annotation standard, which we extended for annotating data. We then wrote some simple interfaces that communicated with the store, allowing access to annotations about a collection, agent, taxon, etc. These also supported round-tripping to the owners via generic web interfaces or embedding in CMSs or other software (e.g., Morphbank). However, at that time we realized that this solution was not really feasible without the PIDs, nor scalable if we got millions, billions, or trillions of annotations. And, here we are today still trying to make this work :wink:
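
For readers unfamiliar with that model, here is a minimal example in the spirit of W3C Open Annotation (now the Web Annotation data model); the identifiers and the comment text are placeholders:

# Minimal annotation in the style of the W3C Web Annotation data model,
# targeting a specimen record; all identifiers are placeholders.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "commenting",
    "creator": "https://orcid.org/0000-0000-0000-0000",
    "body": {
        "type": "TextualBody",
        "value": "Georeference looks off; the locality is about 50 km further north.",
    },
    "target": "https://example.org/specimens/ds-123",
}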

We now have better tech and more cyberinfrastructure but the core challenges still exist and will take a global community approach to conquer. I am very hopeful that the digital extended specimen momentum will be the catalyst!

1 Like