Summaries - 2. Extending, enriching and integrating data

JoeMiller · February 16, 2021, 9:56am

This is the compilation of daily summaries written by the topic facilitators. The goal of this page is to orient new readers to the topic. Please go to the thread to comment.

Go to Extending, enriching and integrating data

Summary of discussion so far - March 3

Discussion since our last summary has centered on the mechanics of how a transactional system of publishing would work and what role Darwin Core Archives (DwC-A) would still play in such a system. There was some great discussion in the Structures and Responsibilities thread regarding how the extended digital specimen concept fits into our current landscape and suite of actors. It is clear that for the extended digital specimen concept to function as intended, we will need some system that exposes the transactional nature of changes and linkages to records over time. These changes and linkages could come from multiple sources: the CMS level, the aggregator level, and end users creating products from the use of specimens or digital records. Although a transactional system of publishing that exposes all edits, additions, and extensions to a record would require a shift in the way we currently publish our data, there could still be a place for Darwin Core and the archives we currently create. It all hinges on the broker system and how much of the current functionality such a system could take on. We would be interested in hearing from bioinformaticians who have a better understanding of how such a system could work in practice. We are also still interested in examples of integration in action in your current CMS or beyond.
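
To make the idea of a transactional record history a little more concrete, here is a minimal sketch in Python. It is not a proposal from the thread and not how any existing CMS, aggregator, or broker works. The class names (`Transaction`, `TransactionLog`), the field names, the source labels ("cms", "aggregator", "end_user"), and the identifier "specimen-001" are all assumptions made up for illustration. Each change or linkage is appended as a transaction that carries the hash of the previous one, so the history of a record can be read back as an audit log.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Transaction:
    record_id: str   # hypothetical specimen record identifier
    source: str      # assumed labels: "cms", "aggregator", "end_user"
    kind: str        # e.g. "identification_change", "annotation", "loan"
    payload: dict    # the change or linkage itself
    timestamp: str   # ISO 8601 string
    prev_hash: str   # digest of the previous transaction, "" for the first

    def digest(self) -> str:
        # Hash a canonical JSON form of all fields.
        body = json.dumps(
            [self.record_id, self.source, self.kind,
             self.payload, self.timestamp, self.prev_hash],
            sort_keys=True,
        )
        return hashlib.sha256(body.encode("utf-8")).hexdigest()


class TransactionLog:
    """Append-only list of transactions, chained by hash."""

    def __init__(self):
        self._entries = []

    def append(self, record_id, source, kind, payload, timestamp=None):
        prev_hash = self._entries[-1].digest() if self._entries else ""
        stamp = timestamp or datetime.now(timezone.utc).isoformat()
        tx = Transaction(record_id, source, kind, payload, stamp, prev_hash)
        self._entries.append(tx)
        return tx

    def history(self, record_id):
        # All transactions for one record, in the order they were appended.
        return [tx for tx in self._entries if tx.record_id == record_id]

    def verify(self):
        # Check that each entry points at the digest of the entry before it.
        # Note: this cannot detect a change to the most recent entry.
        prev = ""
        for tx in self._entries:
            if tx.prev_hash != prev:
                return False
            prev = tx.digest()
        return True


log = TransactionLog()
log.append("specimen-001", "cms", "identification_change",
           {"scientificName": "Aus bus"})
log.append("specimen-001", "aggregator", "annotation",
           {"note": "coordinates flagged"})
log.append("specimen-001", "end_user", "citation",
           {"product": "hypothetical publication"})

for tx in log.history("specimen-001"):
    print(tx.timestamp, tx.source, tx.kind)
```

In this sketch a Darwin Core Archive could still be produced as a snapshot by replaying the history of each record, which is one way the two approaches might coexist. Whether a broker would work this way is an open question for the thread.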

Summary of discussion so far - February 19

We are off to a great start in this thread. Thanks for your initial comments.

Two main themes have been discussed so far:

Examples of integration in action - @jegelewicz provided an example of the integration of GenBank sequence information in Arctos, while also highlighting the problem of GenBank submitters not citing specimen information correctly, which breaks the integration chain. She also posed the question of where best to integrate taxonomic information: at the CMS level, at the aggregator level, or both. It would be great to hear more examples of integration in action. How does your CMS facilitate the integration of authorities, products of research, and other entities?
Rethinking our current publishing mechanism and the benefits to data integration - @dshorthouse discussed the dichotomy inherent in attribution to datasets vs. attribution to specimens, and how our current publishing mechanism, a cached snapshot of the dataset, in some ways removes both the incentive and the infrastructure needed to cite individual records. @abentley highlighted the possibility of a more transactional method of publishing (à la blockchain) that would allow open publication of transactions of all kinds on individual records, including annotations, identification changes, loans, accessions, etc. This would not only showcase changes to individual records over time (an audit log) but also link research products to records and help comply with legal and ethical principles such as the Nagoya Protocol. @abentley also cautioned that such a shift in technological complexity may exclude or alienate more collections than it includes, because of the higher bar to publishing. We would be interested in hearing people’s thoughts on how such a transactional publishing mechanism could work (if at all), especially from the bioinformatics community, as well as what benefits could accrue and what data elements could be integrated using such a system. @hardistyar gave some great examples of such systems in action in other domains.

We are also interested in your thoughts on some of the other questions posed during the introductory session:

How do we engage and encourage the various data actors to buy into a system of data integration?
How do we integrate biocollections datasets with specimen datasets and/or occurrence records generated from other projects (i.e., surveillance and monitoring projects) as well as observation data and other kinds of data loosely related to natural history?
Should observation records be integrated with collections data? If so, how?

We are particularly interested in input from other actors in the data pipeline (researchers, aggregators, citation publishers, other end users, bioinformaticians) and from those in other parts of the world (Asia, South America, Africa, Australasia).

dorsa · February 25, 2021, 12:56pm

Sounds promising! As a data provider for the DORSA Orthoptera type specimen dataset, I am interested in the sustainability of multimedia data. Unfortunately, multimedia data integration seems to be complicated and has broken down several times within the lifetime of the DORSA dataset (DORSA - German Orthoptera Collections). There should be a GBIF mechanism to detect such defects, with subsequent notification of data providers (Systax) and stakeholders (in this case me, Klaus Riede, data generator). In addition, there should be a backup mechanism for endangered datasets.

abentley · February 25, 2021, 3:59pm

@dorsa I have copied your comment over to the thread so that all will see it. I have responded there.

dorsa · February 25, 2021, 6:13pm

Thank you, but which thread? This is sort of confusing…

nickyn · February 25, 2021, 7:09pm

Hello Klaus
Andy means the main discussion thread for this topic at: Extending, enriching and integrating data (this current thread is a summary)
Thanks for your contribution
