Extending, enriching and integrating data

Agreed. This is similar to how paleontology collections are made. We collect a larger sample that may then contain numerous taxa (mega- and microfossils); parts may be split off for lithology, sediment categorization, geochemistry, sedimentary structures, etc. I link these in my CMS to a locality. Numerous collecting events may go back to that locality and add to the data, or it may be linked to other collections. These are all things most aggregators lose with the current schemas.

@MikeWebster This is where I think the CMS has a large role to play in facilitating these kinds of connections, either through relationship fields and linkages or through common data fields (e.g. collecting event), as mentioned by @RogerBurkhalter below. However, the CMS provides the context for these linkages, and once that context is lost outside of the CMS itself, the system can break down fairly quickly. This is why we potentially need a broker service that can make, maintain and check these linkages. A blockchain-style system of transactions on records could potentially do that.

@abentley Yes, exactly – the links are laborious to create and easy to break, so they are very fragile. A service to make the links more easily (automated?), and to check and maintain them, would be ideal. I need to better understand the blockchain approach, but if it could help, it should be explored!

@MikeWebster @abentley @dorsa We’ve made provision for exactly what you describe in our proposals for “open digital extended specimens”, recognising the need to connect supplementary or secondary information that is derived directly from the collected specimen (an audio recording, tissue sample, DNA sequence) and also to connect tertiary information, which can best be described as ‘associated with the specimen but not derived from it’, e.g. habitat data, photographs of the locality, conservation status, etc. The technological tool for doing this is globally unique, unambiguous persistent identifiers (of which the DOI is a specific example). Every identified ‘thing’ has metadata associated with it to provide its context.
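The identifier-plus-metadata approach described above could be sketched as a simple record structure. This is only an illustration: the identifier strings, field names and type labels here are hypothetical, not an actual handle scheme or agreed standard.

```python
# Hypothetical sketch: a specimen 'digital object' whose persistent identifier
# links it to secondary extensions (derived from the specimen) and tertiary
# extensions (associated with, but not derived from, the specimen).

specimen = {
    "pid": "20.5000.1025/abc123",  # handle-style persistent identifier (illustrative)
    "type": "PhysicalSpecimen",
    "metadata": {"scientificName": "Genus species", "catalogNumber": "X-001"},
    "secondary": [  # derived directly from the specimen
        {"pid": "20.5000.1025/def456", "type": "DNASequence"},
        {"pid": "20.5000.1025/ghi789", "type": "AudioRecording"},
    ],
    "tertiary": [  # associated with the specimen but not derived from it
        {"pid": "20.5000.1025/jkl012", "type": "HabitatPhoto"},
    ],
}


def resolve(pid, registry):
    """Look up an identified 'thing'; its metadata provides the context."""
    return registry.get(pid)
```

The point is simply that every object, primary or extension, carries its own resolvable identifier, so links can be followed in either direction without depending on any single CMS.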

@MikeWebster @dorsa A blockchain is a digital ledger (record) of transactions shared among (and contributed to by) all participants such that they all have the same view of the record of transactions. It has no overall controller or master responsible for it. Every participant is equal.

For a simple analogy, imagine a group of people, each with a permanent marker pen, standing in front of a whiteboard. All can see what is there. All can write something there. No-one can erase anything.

Oh, and by the way, there’s a mechanism to stop a fight breaking out!
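The whiteboard analogy can be made concrete with a minimal sketch of an append-only, hash-chained ledger. This is a toy illustration of the core idea (anyone can write, no-one can erase, tampering is detectable), not a real blockchain implementation; it omits distribution and consensus entirely.

```python
import hashlib
import json


class Whiteboard:
    """Toy append-only ledger: each entry is chained to the previous one
    by a hash, so altering history is detectable by every participant."""

    def __init__(self):
        self.entries = []

    def write(self, author, text):
        """Anyone can add an entry; nothing is ever overwritten."""
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {"author": author, "text": text, "prev": prev_hash}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append({**body, "hash": digest})

    def verify(self):
        """Recompute the chain; any erased or edited entry breaks it."""
        prev = "0" * 64
        for e in self.entries:
            body = {"author": e["author"], "text": e["text"], "prev": e["prev"]}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev"] != prev or e["hash"] != digest:
                return False
            prev = e["hash"]
        return True
```

In a real blockchain the ledger is replicated across participants and a consensus mechanism (the "no fights" part) decides which writes are accepted; this sketch only shows why erasure is impossible without detection.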

@hardistyar Helpful analogy! Thanks…

@hardistyar I would love to learn more about this. And would it also allow for connections/links among specimens/data that are associated via the collecting event? For example, audio recordings or tissue samples from individuals that were not collected (i.e., no specimen), but were from same site/date as specimens that were collected? And also tertiary info, like biotic community information (e.g., lists of species encountered during surveys at same site/date)?

@hardistyar and to keep that analogy going, the whiteboard is the brokerage service that keeps all the records and connections and exposes them to the outside world. The brokerage service would register the record and provide or confirm the unique identifier associated with it. Thereafter any additions/edits/enrichments to that record would be held as transactions associated with that record - a new identification, georeference, image, CT scan, citation, GenBank sequence, etc. This then becomes the authoritative data store that is held as part of the common good for the community and that collections, aggregators, publishers, researchers, collectors and the user community feed into or off of.
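The register-then-enrich flow described above might look something like the following sketch. The `Broker` class, its method names and the transaction type labels are all hypothetical; the point is only that a record's current state is derived by replaying its transactions rather than stored as an overwritable snapshot.

```python
import uuid


class Broker:
    """Hypothetical brokerage service: registers a record, issues (or
    confirms) a unique identifier, then accumulates enrichments as
    typed transactions rather than overwriting the record."""

    def __init__(self):
        self.ledger = {}  # identifier -> ordered list of transactions

    def register(self, record, identifier=None):
        """Register a record; mint an identifier if one is not supplied."""
        identifier = identifier or str(uuid.uuid4())
        self.ledger[identifier] = [{"type": "registration", "data": record,
                                    "agent": "publisher"}]
        return identifier

    def add_transaction(self, identifier, tx_type, data, agent):
        """Append an enrichment, e.g. 'is determination of',
        'is georeference of', 'is GenBank sequence of'."""
        self.ledger[identifier].append(
            {"type": tx_type, "data": data, "agent": agent})

    def current_view(self, identifier):
        """Replay the transactions to produce the latest state."""
        state = {}
        for tx in self.ledger[identifier]:
            state.update(tx["data"])
        return state
```

Aggregators could then consume either the full transaction history (for provenance) or the replayed current view (for performance), which speaks to @dshorthouse's point below about keeping aggregator pipelines performant.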

I just posted this in the Structure and Responsibilities subthread but it is equally important here - Structure and responsibilities of a #digextspecimen - #11 by abentley

If present-day aggregators are lumped into what’s meant by “existing”, how might they prepare for digital extended specimens? The most significant pipeline for aggregators today uses Darwin Core Archives (DwC-A) as source data structures. Would we need entirely new aggregator pipelines? Is it technically possible to retain backward compatibility with DwC-A & to support it alongside digital extended specimens? Would a digital extended specimen need to be deconstructed to serve the needs of an aggregator that needs to remain performant (i.e. remote calls would be bad)? If observational-based resources like eBird or iNaturalist are not eligible to play in this space (they do not have specimen-based data), does this mean that an aggregator like GBIF may eventually have two distinct & divergent codebases?

@dshorthouse I still see a place for DwC as a standard for the data, and a role for aggregators as they currently exist. However, as I outlined in my tree analogy in the Structures and Responsibilities thread (Structure and responsibilities of a #digextspecimen - #11 by abentley), there would be a shift in the way DwC is used: objects would be described as transactions rather than snapshots. There would need to be some articulation of the transaction type (is citation of, is GenBank sequence of, is determination of, etc.), but the individual elements could and should still follow current DwC fields. There is also no reason why the initial object could not be an image, video or vocalization, to accommodate observation-type records. There are obviously a lot of details missing here, but this is the concept I envisage.

I suspect that for legacy data there would still need to be an initial seeding of the system with DwC-A files to populate it with historical records, but thereafter any changes to a record would be recorded as transactions rather than the publishing of a new snapshot. The hope is that we could plug into existing audit logs in CMSs to expose these transactions at the CMS level. However, for other transactions (annotations, sequences, citations, etc.) coming from other sources, we would need to get all of these actors engaged in the system and showcase the benefit to them of joining it.


Yes, I agree that DwC will continue to be immensely useful. However, I was looking at DwC-A specifically. Would we expect this zipped transport structure & its star schema to evaporate?

@dshorthouse Yes, as it pertains to a cached snapshot of an entire collection. There may still be some role for some form of DwC-A file in the new system, but it is as yet unclear how it would all function. The details of this system would still need to be worked out. I am not saying this is a mature system ready to go, or that this is the system we would eventually end up with; I am just trying to get a conversation started to see how this would work. Others may have more ideas regarding details, or other proposals.

@dshorthouse In thinking about this more, there is no reason why, if a collection is still only able to publish data as a snapshot, the broker couldn’t act as the transaction creator. It could compare the two snapshots and publish transactions for all changed records, rather than overwriting the existing records with new ones. This would allow for a hybrid system that accommodates “old school” snapshot publishing as well as new transactional publishing, with transactions also being created by other entities to link data from other sources.
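The snapshot-comparison step described above is essentially a diff keyed on a stable record identifier. A minimal sketch, assuming flat DwC-style records keyed by `occurrenceID` (the function name and transaction type labels are illustrative):

```python
def snapshot_to_transactions(old, new, key="occurrenceID"):
    """Compare two published snapshots (lists of flat DwC-style record
    dicts) and emit per-record transactions instead of overwriting the
    whole dataset. Hypothetical broker-side logic, not a real pipeline."""
    old_by_key = {r[key]: r for r in old}
    transactions = []
    for record in new:
        previous = old_by_key.get(record[key])
        if previous is None:
            # Record appears for the first time in the new snapshot.
            transactions.append(
                {"id": record[key], "type": "new record", "data": record})
        elif record != previous:
            # Emit only the fields that actually changed.
            changes = {f: v for f, v in record.items()
                       if previous.get(f) != v}
            transactions.append(
                {"id": record[key], "type": "edit", "data": changes})
    return transactions
```

A real broker would also have to handle deletions, records whose identifiers change between snapshots, and the multi-file star schema of a DwC-A, but the core "old school snapshot in, transactions out" translation is this simple in principle.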

I would like to make a contribution to the consultation relating to use-cases beyond trained practitioners of taxonomy and systematics, including collection managers. There is a global shared-heritage value to knowledge of the natural world, and a wider group of users (as both contributors and consumers) of biological data in GBIF and DiSSCo means greater support for the work and infrastructures of natural history.

As a preamble, it is important to note that even though I am not a biologist, I have worked as a humanities researcher and research manager inside several natural history museums, and have collaborated successfully with biologists in those contexts and in an interdisciplinary manner. I therefore understand the critical significance of specimen annotations in relation to species determinations and nomenclatural disputes. I also understand the problem of ‘distributed annotations’, which arises because multiple ‘duplicates’ end up in a number of different collections, where different research questions may have produced widely differing data that it would be valuable to collate and correlate.

As a humanities scholar, I am not able to go into depth on information architectures or indeed the finer points of taxonomic and/or genomic data. But I do know quite a bit about natural historical knowledge that is missing from most databases, be they natural history museum catalogues, large aggregators such as GBIF and JSTOR Plants, or initiatives such as Barcode of Life. The knowledge that I am referring to is not knowledge that most biologists have to date been very interested in, but many are realising that it is extremely valuable in understanding climate change, biodiversity loss, and the increasingly significant social, historical and cultural aspects of specimens and collections. This is the part of the information iceberg that is below the water, as opposed to the highly specialised and limited datasets that are currently aggregated specifically for taxonomic use.

This is knowledge that is locked up in manuscript documents, labels, records, field notebooks, colonial archives, letters and more — and it is valuable to biologists. It also takes the form of knowledge held as heritage understanding in communities all over the world from which centralised specimen collections originate. As a meta-issue, the conditions in which these kinds of knowledge have been produced over 500 years must also be a subject of study if we are to understand why we do science in the way that we do now, and what to do in order to mitigate climate change and biodiversity loss.

In the humanities and library/archive sciences, considerable efforts are under way to make these hidden bodies of information machine readable and accessible to computation. This is often being done in partnership with digital humanities colleagues, and as @Debbie points out, there is much that biodata researchers can learn from pilot projects in these humanities areas – not just in terms of the information that is being surfaced, but also in terms of the semantic solutions that are being forged.

Some examples of such projects are:

These natural history knowledge projects are all carried out by humanities scholars and digital humanities researchers working hand in hand (and sometimes with biologists), just as is the case for collaborations between biologists, collection managers, and biodiversity informaticians. It would be valuable for both groups to turn systematically to each other in order to share bodies of knowledge, data, and methods. This should happen in long-term, well-structured exchanges and collaborations; ultimately the fruits will be manifest in the co-design of ground-breaking data models and the pooling of highly heterogeneous knowledge that has huge interdisciplinary value.

Questions of data enrichment, DOIs, semantic alignment, and notions of extended/digital collection objects are also major drivers in historical and cultural collections management, as can be seen in the UK’s Towards a National Collection project (which also includes natural history) and research being done at the Getty Institute in Semantics and Name Authorities.

It cannot be the responsibility of GBIF alone to organise such intellectual and technical collaborations across the sciences and the humanities, but it is incumbent on scientific communities of practice and infrastructures such as GBIF, DiSSCo, and others to consider such collaborations in creating new infrastructures and data models. It would be wonderful to have a structured consultation on this. In the Letter of Intent for Collaboration in a Global and Open Process for Interoperable Enriched Specimen Information Models, we read that there is an ‘Aim to collaborate in a global process, open to participation from all stakeholders’.

‘All stakeholders’ would also include other significant holders of knowledge about the natural world, such as the communities of origin whence these collections came: a consultation worth having as well. Communities of origin in the localities from which natural historical collections have been made over some 600 years also have a keen interest in, and deep knowledge of, the biology and habitats in which they live. This knowledge can be historical as well as contemporary, and also has a place in attribution discussions (viz. WIPO TKN and the ethics behind Nagoya). Noting local names of species and places can have value in determinations and disputes. (See also this discussion initiated by @sking on decolonising collections data, to which I have made a contribution, this comment from @bsterner from the 2020 GBIF consultation, and this open question from @austinmast.)

Data infrastructures truly aiming to be FAIR and to CARE would incorporate both information and data models that are co-designed with, and co-populated by, biologists, communities of origin, humanities scholars and social science researchers. It would be wonderful to be able to make a contribution to such a collaborative project!


@MAFleming I totally agree that all of these data elements should form part of a broader discussion on extended specimens - from field notebooks to traditional knowledge to humanities collections and libraries. The idea behind such a system would be to allow the seamless linking of all these elements in cyberspace. Ideally, it should not matter what the resource is: we should be able to link it to a specimen record in much the same way as any “traditional” resource within our realm. I have begun to start doing this within my own CMS by linking scanned copies of our field notes to records. In some cases these field notes contain valuable information that is not found in traditional database fields - local names for species, environmental data (water temperature, vegetation, etc.), species seen but not collected, drawings of specimens and locations, etc. See here for an example of one of my records that links not only field notes but also images, GenBank sequences, citations and CT scans - https://ichthyology.specify.ku.edu/specify/bycatalog/KUIT/46/. I also agree that it is going to take buy-in from all actors in the data pipeline to create a system that is both supported by and works for all.

@jmheberling @abentley In DiSSCo’s work on open Digital (Extended) Specimens, we’ve distinguished between secondary and tertiary extensions in the sense of Lendemer et al., i.e. secondary extensions are data directly derived from the specimen, and tertiary extensions are data associated with but not derived from the specimen.

For enrich and extend, we have tended to use the terms in the sense of ‘we enrich a specimen with extensions’.

We discussed this with other NHMUK colleagues in our analytical and molecular labs. We need better mechanisms, both in our CMS and in specimen data, to link to rich analytical data (spectrographic, 3D and 2D image-stack data) along with device calibration data and reference data. According to Alex Ball, https://www.openmicroscopy.org/ are doing a good job of keeping track of device standards and metadata (in a way analogous to raw image formats). At the moment this data is usually stored separately and not provided in a well-linked way alongside specimens on our data portal (it’s not easy for everyone to find internally).

I don’t remember seeing it posted in the previous discussion, but regarding minimal information for a whole genome sequence, see https://www.nature.com/articles/nbt1360, with an example application of it here: https://gold.jgi.doe.gov/ (from Raju Misra).
