Structure and responsibilities of a #digextspecimen

I tried to picture the extended specimen concept and the digital specimen concept for comparison. Both concepts have a lot in common, but the DS concept focuses more on the technical implementation and provenance data, while the ES concept focuses more on the types of extension and on how to add value to the physical specimen. So they complement each other rather than differ. Both share a common vision of linking through PIDs and of extending the specimen with data that can be linked to a specimen directly as well as data that can be linked through a taxonomic name or gathering event. Both also take the view that current domain standards (DwC, ABCD) need to be used as a basis but would need to be extended. There are some differences in the original concepts, though:

  • ES has specimen media in its primary extension, while DS has only specimen images in its authoritative section and places other specimen media in the extended data (e.g. a sound recording made during the gathering).
  • ES links all data directly to the physical specimen. This would require current physical specimen identifiers to be changed to globally unique ones where they are not so already. DS links all data to the digital specimen and mints a new DS identifier for that purpose; it then tries to establish a link to the physical specimen based on a current, not always globally unique, physical specimen identifier (accession number) plus some other data (e.g. a collection identifier).
  • DS aims to use an identifier (Handle, DOI) that enables multiple resolution, e.g. the PID can link to a landing page hosted by an institution, a page in GBIF, and other places; ES did not work out this level of technical detail. The same goes for DS aiming to implement machine-actionable metadata through FAIR Digital Objects and PID Information Types (PIT). A rough sketch of what such a multi-location, typed PID record could look like is given below.
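
The handle prefix, field names and URLs in this sketch are placeholders of my own, not the actual Handle/DOI kernel layout or the DiSSCo PID record design:

```python
# Illustrative placeholder only, not a real PID record layout.
digital_specimen_pid_record = {
    "pid": "21.12345/abc-123-xyz",            # hypothetical Handle for the digital specimen
    "pidType": "DigitalSpecimenId",
    # Multiple resolution targets behind the same PID:
    "locations": [
        {"role": "landingPage", "url": "https://collection.example.org/specimen/abc-123-xyz"},
        {"role": "aggregator",  "url": "https://www.gbif.org/occurrence/0000000000"},
        {"role": "machine",     "url": "https://api.example.org/ds/abc-123-xyz.jsonld"},
    ],
    # Machine-actionable attributes, in the spirit of PID Information Types (PIT):
    "attributes": {
        "physicalSpecimenId": "ABC 123456",   # accession number, not globally unique
        "institutionCode": "XYZ",
        "objectType": "DigitalSpecimen",
    },
}

def resolve(record, role="landingPage"):
    """Pick a resolution target by role; a resolver service would do this server-side."""
    return next(loc["url"] for loc in record["locations"] if loc["role"] == role)
```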


And this is a picture of how a converged Digital Extended Specimen could look, based on my interpretation of the earlier comments in this consultation. It would look like a digital specimen with the extensions worked out in ES, with data not only linked to but also derived from the specimen in the ‘secondary’ part, and with a separate section for specimen images. All the components in the object can have their own PIDs that are linked to the DS PID. Taxonomic names and collection events would also need PIDs to be linked.
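
To make that interpretation concrete, here is a sketch of how such a converged object might be laid out. The section names, keys and identifiers are my own labels for the picture above, not an agreed schema:

```python
# Hypothetical layout of a converged Digital Extended Specimen (DES) object;
# all section names, keys and identifiers are illustrative placeholders.
digital_extended_specimen = {
    "pid": "21.12345/abc-123-xyz",                        # PID of the digital specimen
    "authoritative": {                                    # data curated with the physical specimen
        "physicalSpecimenId": "ABC 123456",
        "scientificName": "Rana pipiens",
        "gatheringEvent": {"pid": "21.12345/event-001"},  # gathering event with its own PID
        "acceptedTaxon":  {"pid": "21.12345/taxon-001"},  # taxonomic name with its own PID
    },
    "specimenImages": [                                   # separate section for specimen images
        {"pid": "21.12345/img-001", "accessUri": "https://images.example.org/img-001.jpg"},
    ],
    "extended": {
        "derivedFromSpecimen": [                          # e.g. sequences, CT scans, sound recordings
            {"pid": "21.12345/seq-001", "type": "DNASequence"},
        ],
        "linkedToSpecimen": [                             # e.g. publications, taxon treatments
            {"pid": "21.12345/pub-001", "type": "TaxonTreatment"},
        ],
    },
}
```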


This is a systems-like view, with my interpretation of the original ES Network, where linkages to extended data are made/stored in the Collection Management System (CMS) and the extended specimens are then uploaded to a cloud of aggregated data for access and annotation. In a digital extended specimen view this would be transformed into a ‘mini cloud’ for each specimen, where all the transactions take place and the digital specimen evolves independently of (but connected to) the physical specimen and the data stored in the CMS.


@hardistyar I think we mean the same thing; it is just a matter of wording. In the DiSSCo design principles we state that “Each DiSSCo Facility shall be responsible for managing their own NSIds”. Perhaps we should use ‘accountable’ for the custodial institution and ‘responsible’ for the RA role with respect to keeping a PID record up to date.

Great paper @JuttaBuschbom, thanks for sharing. I see the final-paragraph quote as both motivating and guiding:

“Data should be a living thing,” says Haussler. “I want to click on it and play with it immediately. That should be the motivation. If you don’t share your data, you can’t do that.”

And in this spirit I note another layer we need to figure out how to represent in our visualizations when we talk about the DS/ES concept: what we are enhancing as we go (for legacy specimens) compared with future collecting, where one would expect much richer data from the start. Indeed, elements such as sequences generated in the field, specimen images from the field, and atmospheric and habitat information may get into a database before the specimen itself reaches the collection shelf / bottle / cabinet / drawer.

How do we convey these differences in information expectations between these two situations? What do we put in place to ensure we stop growing the legacy pile? And how do we convey that in a diagram? @sharif.islam, your insights are getting toward where I am going (in part) with this when you reference the

idea of evolving and dynamic schemas.

Ecologists offer some of what they are hoping for (for future data) that supports the need for the work our community is doing, including planning for workflows that recognize we will (hopefully) not always be working to add information “after the fact.” See Box 1, “Guiding Questions to Examine the Sufficiency of Documentation Regarding the Past and Present Conditions of a Focal Place or Resources,” in Scott A. Morrison, T. Scott Sillett, W. Chris Funk, Cameron K. Ghalambor, Torben C. Rick, Equipping the 22nd-Century Historical Ecologist, Trends in Ecology & Evolution, Volume 32, Issue 8, 2017, Pages 578-588, ISSN 0169-5347.


@Debbie The great thing about a transactional method of publishing is that records can enter the system at any stage of completion and can be amended/augmented/edited/annotated/enhanced at any stage by any actor after the fact. This allows skeletal herbarium records to be published as is, while also allowing “complete” records to be published directly from the field or collection. It lets all actors play a part in the process, increases buy-in from those involved, and exposes the transactions and those who make them to everyone, thus increasing attribution and advocacy. Now we just have to figure out how to make it work and get everyone “playing the game”.
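
To illustrate the idea (the actor IDs, actions and fields below are hypothetical, not the API of any existing system), a transactional record is essentially an append-only log of attributed changes that is replayed to get the current state:

```python
# Toy sketch of transactional publishing: the record is whatever the log replays to.
transactions = [
    {"actor": "orcid:0000-0001-0000-0001", "action": "create",
     "data": {"catalogNumber": "ABC 123456"}},                          # skeletal record, published as is
    {"actor": "orcid:0000-0002-0000-0002", "action": "amend",
     "data": {"decimalLatitude": 38.95, "decimalLongitude": -95.26}},   # georeferenced later by another actor
    {"actor": "orcid:0000-0003-0000-0003", "action": "annotate",
     "data": {"identificationRemarks": "det. revised 2021"}},           # enhancement after the fact
]

def current_record(log):
    """Replay the log to get the current state; every change stays attributed to its actor."""
    record = {}
    for tx in log:
        record.update(tx["data"])
    return record

print(current_record(transactions))
```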


Yes @abentley, transactions are lovely, but they don’t address our expectations (which I hope we have) for data quality, data availability, and data fitness for future specimens and related data. Skeletal records can add expense (someone has to go in and update them), while on the positive side they are quick to generate and add accessibility too. But the cheapest, highest-quality data is captured from the start, when (or likely before) the specimens are ever collected. In essence, what I’m saying is that we need to add a layer to our visualizations that shows / captures our expectations for legacy versus new incoming specimens (already imaged, already having PIDs, already georeferenced, maybe already sequenced, already mapped to DwC, ABCD, or other relevant standards, etc.). It will not be linear (as in: first capture legacy data, then add imaging, georeferencing, etc.).


@waddink I agree that we should probably focus more on the similarities than the differences, and that most of the differences are semantic and have more to do with other factors at play than with the digital object concept itself. The Extended Specimen concept has an underlying focus on the physical specimen because of the other aspects of the Extended Specimen Network concept: collecting, digitization, infrastructure, and workforce training and education. The digital component of the Extended Specimen concept is essentially the same as the Digital Specimen concept and should be thought of in the same light: Digital Extended Specimens. Some assumptions made about the extended specimen are incorrect, e.g. that it relies solely on DwC, or that data linked to the specimen always has its own unique PID and is not part of the digital specimen record. I think at its core the concept needs to articulate our ability to link disparate but connected data generated from multiple sources, and to encapsulate the existing concepts of preparations, annotations, and enrichments/enhancements of specimen information using individual unique identifiers that can likewise be minted by multiple sources, producing a transactional system of interconnectedness, transparency, and authentication. I think this is fairly well articulated in the second diagram, although the piece that is missing for me is the possibility of multiple PIDs being minted and linked for the products linked to or derived from the specimen record, and our ability to align these using a brokering system (a rough sketch of what such brokering could look like follows below).
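
Just to sketch what I mean by aligning multiple PIDs through a broker (all identifiers and relation names below are placeholders, not an existing service):

```python
# Toy broker: a registry of typed links between PIDs minted by different sources
# (collection, sequence archive, image store, literature).
links = [
    ("21.12345/abc-123-xyz", "isSourceOf", "pid:sequence-archive/SEQ-0001"),
    ("21.12345/abc-123-xyz", "hasImage",   "pid:image-store/IMG-0042"),
    ("21.12345/abc-123-xyz", "citedBy",    "doi:10.0000/example"),
]

def products_of(specimen_pid, link_index):
    """Everything the broker knows to be linked to or derived from a given digital specimen."""
    return [(relation, target) for source, relation, target in link_index
            if source == specimen_pid]

print(products_of("21.12345/abc-123-xyz", links))
```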


@Debbie Transactions that are completely transparent to all fully meet those requirements/expectations and allow the most flexibility to accommodate all scenarios: fully digitized records added to the system by a single actor, as well as records brought together piece by piece by multiple actors working in unison to complete a record, thereby not relying on a single actor to do all the work and releasing some of the burden on collections personnel to digitize everything, be it legacy records or new incoming fieldwork. In some cases, transactional systems are the only way we will effectively make the connections necessary to extend specimen records, make them fit for use, and expand the possibilities of that use for other purposes and other actors.


Trying to reconcile and make sense of the information provided by @abentley (Structure and responsibilities of a #digextspecimen - #11 by abentley), @Rich87 (Structure and responsibilities of a #digextspecimen - #14 by Rich87) and @hardistyar (Structure and responsibilities of a #digextspecimen - #18 by hardistyar), as well as everything else everybody has posted, and with the help of Federated database system - Wikipedia and Data virtualization - Wikipedia, this is my current understanding:

In the implementation layer, there need to be wrappers between the data providers’ DBs and the data aggregators’ portals (DBs?), or between data providers’ DBs. Data providers and aggregators also need to offer UIs and/or APIs for end users (humans, apps, pipelines). All of this has so far developed mostly unsystematically with regard to a global scope. Regional initiatives exist, which can be incorporated into a global data infrastructure. Building connections is slow if unique wrappers/APIs need to be written specifically for each infrastructure component to which a connection is wanted.

Integration into one system becomes easier and faster if wrappers/APIs can be written against an agreed-upon Digital Extended Specimen (DES) schema. If many infrastructure components use this DES interface, one wrapper/API might ideally provide integration with many other partners in the infrastructure at once. At least this is how I imagine/understand it (see the sketch below).
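
A small sketch of what I mean by writing one wrapper toward the shared schema; the internal CMS field names and the DES keys are placeholders, not a standard:

```python
# Sketch of "one wrapper per provider, all written toward a shared DES schema".
def wrap_cms_record(cms_row: dict) -> dict:
    """Map one provider's internal CMS fields onto a DES-shaped dict."""
    return {
        "physicalSpecimenId": cms_row["accession_no"],
        "scientificName": cms_row["taxon"],
        "gatheringEvent": {"date": cms_row.get("coll_date")},
    }

# Any aggregator or peer that understands the DES schema can now consume this
# provider's data without a bespoke pairwise integration:
des_record = wrap_cms_record(
    {"accession_no": "ABC 123456", "taxon": "Rana pipiens", "coll_date": "1998-07-04"}
)
```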

Nevertheless, at the implementation layer, each connection needs to be coded individually, either to connect two specific partners directly, or to connect one partner to an aggregator, which then acts as an intermediary for connections to many other partners.

On top of this implementation layer there will then be a layer of information flows during use of the infrastructure. In contrast to the implementation layer, in which potentially all components are connected to all others, in the information-flow layer there is a central (or decentralized but synchronized) agency. This agency is pivotal in that it provides the data integration via data virtualization. Its core functionality is to issue PIDs for the digital objects and links that are involved in all transactions. These PIDs bind the data together and build the integrated infrastructure. All transactions have to “go through” the agency, that is, they have to be registered by this Registration Agency.
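
As a thought experiment of what “all transactions have to be registered” could look like (a toy, in-memory stand-in, not a real Handle/DOI registration API):

```python
import uuid

class RegistrationAgency:
    """Toy stand-in for the Registration Agency: mints PIDs and records what they identify."""

    def __init__(self):
        self.registry = {}                     # pid -> registered payload

    def mint_pid(self) -> str:
        """Stand-in for minting a resolvable identifier (e.g. a Handle)."""
        return f"21.12345/{uuid.uuid4().hex[:8]}"

    def register(self, payload: dict) -> str:
        """Register a digital object or a link between objects; return its PID."""
        pid = self.mint_pid()
        self.registry[pid] = payload
        return pid

ra = RegistrationAgency()
ds_pid = ra.register({"type": "DigitalSpecimen", "physicalSpecimenId": "ABC 123456"})
ra.register({"type": "Link", "from": ds_pid, "relation": "hasImage",
             "to": "pid:image-store/IMG-0042"})
```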

My understanding is that @hardistyar, @nickyn, @waddink and @dshorthouse focus on and discuss the details of the concept for the data virtualization schema, that is, the DES itself. This DES schema is the abstraction that is implemented/used by the infrastructure partners (data providers, aggregators, end users) to generate interfaces as prerequisites for the Registration Agency’s work. Subsequently, bringing the infrastructure to life, the Registration Agency applies the DES schema to bind the data together via transactions.


Yes! The current system leads to more issues than any data provider can handle.
