Extending, enriching and integrating data

Moderators: Andy Bentley, Jen Zaspel, Mike Webster, and Keping Ma


Background

Biological collections are generating a wealth of data through digitization initiatives across multiple disciplines and taxonomic groups. These data are published from collection management systems (CMSs) to numerous aggregators and through local portals, making them available to an ever-increasing end-user community. These disparate datasets are extremely valuable individually, but they become even more valuable when data sources are integrated at various levels to extend, enrich and connect them. Integrated datasets not only facilitate novel exploration and discovery of collections data by a much larger audience but also improve our ability to answer the pressing questions of our time, such as combating climate change and its effects on biodiversity, reducing the spread of disease and mitigating pandemics, and controlling and eradicating invasive species. The corollary is that this information provides attribution mechanisms and metrics that collections can use to advocate for their continued support and management. The extended specimen and open digital specimen concepts rely heavily on the integration of FAIR data.

To achieve this, we need data integration at many levels, from multiple sources, through the work of many different actors in the data pipeline. These actors include institutions, individual collections, data aggregators, publishers, accumulators of related data (GenBank, IsoBank, MorphoSource, Macaulay Library, etc.), suppliers of external data (taxonomic, geographic, satellite, etc.), the broad research community (traditional taxonomic/phylogenetic, biodiversity, surveillance and monitoring, ecological, conservation, etc.) and observational datasets (eBird, etc.).

Some of this data integration is already mediated by existing systems. Within the CMSs employed by collections, all preparations of a specimen (tissue, voucher, skeleton, skin, cleared and stained, etc.) often share the same catalog number and unique identifier, making the connections between them implicit. Implicit connections are also made between collection objects and the media files (images, video, sounds, field notes, etc.) attached directly to them through the data model. Similarly, some CMSs have fields or mechanisms to indicate relationships between objects in different collections, such as tissue:voucher, host:parasite, plant:pollinator, predator:prey, commensals, etc. However, there are numerous instances where these connections have not been made, for instance where preparations have different catalog numbers, or where linked specimens are housed in separate, disparate datasets at the same institution, at different institutions, or outside our immediate realm. In these circumstances, connections often need to be made outside the CMS by other means. Sometimes this is possible through the data alone, using common data found in collecting-event or locality fields. An example is the new GBIF Clustering tool, which creates these associations by matching various common fields of information (taxonomy, locality, collectors, date of collection, etc.). There are also the connections between objects in the collection and the products of research that reference those objects (citations, GenBank sequences, CT scans, images, etc.); these are often harder to make because of incorrect, incomplete or missing citations and the resulting lack of common fields. Such connections are important not only to promote reproducible research but also to provide metrics for collections attribution and advocacy. The final piece of the puzzle is connecting our data to external sources that add value and allow much broader questions to be answered, such as environmental, ecological, conservation, geographical, observational and other research data.
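As an illustration of this kind of field-based matching, here is a minimal Python sketch (not GBIF's actual clustering algorithm) that flags two separately published occurrence records as candidates for linking when enough shared Darwin Core fields agree; the record values and the threshold are invented for the example.

```python
# A minimal sketch (not GBIF's actual clustering algorithm) of flagging two
# separately published occurrence records as candidates for linking when enough
# shared Darwin Core fields agree. Record values and the threshold are invented.

def normalize(value):
    """Lower-case and strip a field so trivial formatting differences don't block a match."""
    return (value or "").strip().lower()

def match_score(rec_a, rec_b, fields=("scientificName", "recordedBy", "eventDate", "locality")):
    """Count how many of the chosen fields are present and agree between two records."""
    return sum(
        1 for f in fields
        if normalize(rec_a.get(f)) and normalize(rec_a.get(f)) == normalize(rec_b.get(f))
    )

def candidate_links(records_a, records_b, threshold=3):
    """Yield record pairs that share enough fields to be reviewed as possible duplicates or vouchers."""
    for a in records_a:
        for b in records_b:
            if match_score(a, b) >= threshold:
                yield a, b

# Hypothetical voucher and tissue records published in separate datasets.
voucher = {"scientificName": "Sebastes ruberrimus", "recordedBy": "A. Collector",
           "eventDate": "2001-07-14", "locality": "Off Punta Arenas"}
tissue = {"scientificName": "Sebastes ruberrimus", "recordedBy": "A. Collector",
          "eventDate": "2001-07-14", "locality": "Off Punta Arenas"}
print(list(candidate_links([voucher], [tissue])))
```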

These connections require both technological and social solutions: an underlying cyberinfrastructure and connectivity mechanisms, as well as best practices and standards adopted by, and with buy-in from, the numerous actors involved in the data lifecycle.

Effective integration of data also relies on a system of globally unique identifiers that effectively identify the various elements being integrated: institutions, collections, collection objects, collecting events, datasets, people, etc. A wide variety of identifiers is currently in use (GUIDs, LSIDs, PIDs, DOIs, ORCIDs, etc.) for these elements, with no mechanism for ensuring uniqueness or aligning them to enable effective data integration. To avoid distraction from the present topic, discussion of persistent identifier schemes is deferred to a separate consultation at a later stage.

This category differs from the Annotating specimens and other data thread in that we envisage annotations as opinions on, or additions to, existing specimen records (determinations, georeferences, etc.), rather than extensions or augmentations of a specimen record through the addition of new data elements (DNA sequences, citations, CT scans, images, vocalizations, duplicate specimens, linked specimens, etc.). However, there is some overlap: once annotations are made, they need to be reliably linked to the original record. Similarly, the Attributing work done (Data Attribution) thread will discuss attribution of people for the work they perform, but data integration for the advocacy and attribution of collections will rely on the same data infrastructure mechanisms needed for the data integration discussed here.

Information resources

Questions to promote discussion

  • How do we bring together all the existing, disparate mechanisms of data integration into a single system that works for all?
    • Can we rely on a combination of these existing mechanisms or do we need a stand-alone integration tool?
    • How and where should such combinations of value-added data be stored and curated and who should take the responsibility for that?
    • How do we engage and encourage the various data actors to buy into a system of data integration?
    • What value propositions can be used to promote this?
  • What existing data infrastructure technology elements may be important for data integration?
    • What gaps in this cyberinfrastructure need to be filled?
    • How do we mediate connecting all of this data to provide as rich a dataset as possible for use by the community while supporting the necessary cyberinfrastructure for data storage and dissemination?
  • How do we integrate biocollections datasets with specimen datasets and/or occurrence records generated from other types of projects (e.g., surveillance and monitoring projects), as well as observation data and other kinds of data loosely related to natural history?
    • What data should be connected to specimens? Where are the boundaries?
    • Should observation records be integrated with collections data? If so, how?
    • What use cases exist of data integration in action?

One important direction that is emerging is annotation made directly on new data elements. As digital media representations of specimens grow, a growing community of users is creating new data based on those media files (rather than on the physical specimen itself).

1 Like

@dcblackburn Thanks David. This comment probably belongs in the annotation thread but is also relevant to integration as I suspect a lot of similar comments will be - Annotating specimens and other data.

Arctos currently allows collections to use the taxonomy provided by WoRMS via their API. I wouldn’t say this is an actual integration, as the data from WoRMS is imported into Arctos, but it is fairly close. The process is not simple, as the structure of WoRMS taxonomic data has to be mapped to Arctos and maintained whenever changes occur at either end. Subgenera are a particular problem. See https://arctos.database.museum/name/Cancellaria%20solida#secclass and note how the subgenus causes issues with our method of creating a display name in the “WoRMS via Arctos” classification. WoRMS uses Cancellaria (Pyruclia) as the subgenus, but Arctos expects only Pyruclia, something we as a community have not had time to address.

In addition to complexity, the process requires computational resources if the Arctos “version” of WoRMS taxonomy is to remain up to date.
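For readers curious what that import step can look like, here is a hedged Python sketch that pulls records from the WoRMS REST service and normalizes the subgenus form WoRMS uses ("Cancellaria (Pyruclia)") into the bare form Arctos expects ("Pyruclia"). It is not the Arctos pipeline; the endpoint and response field names follow the public WoRMS REST documentation and should be verified before use.

```python
# A hedged sketch of pulling a name from the WoRMS REST service and normalizing the
# subgenus form ("Cancellaria (Pyruclia)" -> "Pyruclia") before loading it into a
# local classification. Endpoint and response field names follow the public WoRMS
# REST documentation; verify them before relying on this.
import re
import requests

WORMS_API = "https://www.marinespecies.org/rest"

def fetch_worms_records(name):
    """Query WoRMS for Aphia records matching a scientific name."""
    resp = requests.get(f"{WORMS_API}/AphiaRecordsByName/{name}",
                        params={"like": "false", "marine_only": "true"}, timeout=30)
    resp.raise_for_status()
    return resp.json() if resp.status_code == 200 else []  # WoRMS answers 204 when nothing matches

def strip_parent_from_subgenus(subgenus):
    """Convert 'Genus (Subgenus)' into the bare 'Subgenus' form a CMS may expect."""
    m = re.match(r".*\((?P<sub>[^)]+)\)\s*$", subgenus or "")
    return m.group("sub") if m else subgenus

if __name__ == "__main__":
    for rec in fetch_worms_records("Cancellaria solida"):
        print(rec.get("AphiaID"), rec.get("scientificname"), rec.get("status"))
    print(strip_parent_from_subgenus("Cancellaria (Pyruclia)"))  # -> "Pyruclia"
```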

1 Like

Again, not an exact integration, but

“Arctos constantly monitors GenBank, and will report potential uncited specimens under Reports/genbankMIA. We recommend working with the submitter to ensure that these records are properly submitted to GenBank (e.g., they should all link to Arctos from /specimen_voucher) and to the appropriate Arctos Curators.”
See How To Create GenBank Links

The above is a kind of annotation made by an Arctos script that is looking through GenBank for things that look like Arctos collections’ catalog numbers or GUID Prefixes.

When a specimen is cited properly at GenBank, a link back to the Arctos record is automatically generated on the GenBank side.

For an example of the reciprocal links formed between Arctos and GenBank see: https://arctos.database.museum/guid/MSB:Mamm:55245 and note that the GenBank page provides a link back to the Arctos record.

One of the issues with this process is that specimens are often cited improperly and therefore do not get noticed by the script.
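As a rough illustration (not the actual Arctos script), the sketch below shows how such a scan could be built on NCBI's public E-utilities: search the nucleotide database for a collection's GUID prefix and return candidate records for curatorial review. The exact indexing of specimen_voucher text at NCBI may differ, so the output only narrows down what a person must still check.

```python
# A rough illustration (not the actual Arctos script) of scanning GenBank via NCBI
# E-utilities for records that mention a collection's GUID prefix. The term is a plain
# text search; exact indexing of specimen_voucher text at NCBI may differ, so the
# results are only candidates for curatorial review.
import requests

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def find_possible_citations(guid_prefix, retmax=100):
    """Return nuccore UIDs whose records mention the given GUID prefix, e.g. 'MSB:Mamm'."""
    params = {"db": "nuccore", "term": f'"{guid_prefix}"', "retmode": "json", "retmax": retmax}
    resp = requests.get(EUTILS_ESEARCH, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

if __name__ == "__main__":
    # Hypothetical run for one GUID prefix; improperly cited specimens will not match,
    # and false positives are possible, which is why a human still reviews the report.
    print(find_possible_citations("MSB:Mamm"))
```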

2 Likes

Another way of phrasing this question is: “What organization’s purpose and business model is entirely dependent on links between disparate entities, and expects these to remain static & reference-able in perpetuity?” At the moment, when our ad hoc links to specimen data break or begin slipping into ambiguity, no organization claims fault & no project or institution suffers in a material way. This may be because there’s little return on investment for sharing data in the first place and, as a consequence, vanishingly small return on investment for linking data that may not persist.

Contrast this with the Crossref member agreements. If a member knowingly or unknowingly breaks the truthiness between link and metadata, there are real, financial consequences. Crossref has services that detect breakage & staff to police these mishaps. This builds trust in the whole network because there is guaranteed quality assurance. Crossref’s business model is almost exclusively dependent on links & the services that these afford.

So, before we point to an organization and suggest that it take responsibility for the integration of data, we may need to ask ourselves as publishers of primary specimen data in museums & collections: Are we ready to shoulder the social and financial responsibility for ensuring that the specimen-level data we publish today can always be found (or redirected) tomorrow in both a human- and machine-readable way? And, if museums or collections are not yet ready to shoulder any of that responsibility, is this a show-stopper?

2 Likes

Ah, yes, the age-old taxonomy issue!! This gets at a number of key data integration issues. The first is: where do we control taxonomic vocabularies? If we control them at the CMS and collection level, there are all sorts of management issues that go along with that: constantly changing the names associated with a collection has shelving repercussions, and it requires a messaging system that indicates what has changed so that the necessary shelving changes can follow. The alternative is to control taxonomy at the aggregator level, allowing CMSs to carry varying taxonomic classifications while maintaining a single authority at the aggregator. This is more problematic for disciplines with multiple authorities, like botany. The second big issue is integration rather than duplication. You don’t want to duplicate all of the taxonomic metadata (common names, authors, protected status, etc.) at the CMS level when it is already fleshed out at the authority level and could simply be linked. How best do you integrate all this information without duplicating it, and how do you then display it for all to see? This gets at the heart of a fully functional integrated system that allows you to enter the data pipeline at any position and see all the data components as a whole.
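To illustrate the "link, don't duplicate" idea, here is a small Python sketch in which the CMS stores only an identifier pointing at an authority and resolves the richer metadata at display time. GBIF's backbone name-matching service is used purely as one example authority; WoRMS or another source could fill the same role.

```python
# A small "link, don't duplicate" sketch: the CMS keeps only an authority identifier
# with the local record and fetches richer metadata at display time. GBIF's backbone
# name-matching service is used here purely as one example authority.
import requests

GBIF_MATCH = "https://api.gbif.org/v1/species/match"
GBIF_SPECIES = "https://api.gbif.org/v1/species"

def link_name_to_authority(name):
    """Return just the authority's usage key for a name, rather than copying its metadata locally."""
    resp = requests.get(GBIF_MATCH, params={"name": name}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("usageKey")  # None when the backbone has no confident match

def resolve_for_display(usage_key):
    """Fetch authorship and status on demand, so the local store remains a bare link."""
    resp = requests.get(f"{GBIF_SPECIES}/{usage_key}", timeout=30)
    resp.raise_for_status()
    data = resp.json()
    return {"scientificName": data.get("scientificName"),
            "authorship": data.get("authorship"),
            "status": data.get("taxonomicStatus")}

if __name__ == "__main__":
    key = link_name_to_authority("Cancellaria solida")
    print(key, resolve_for_display(key) if key else "no match")
```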

1 Like

Proper citation is at the heart of this frustrating scenario. It would be computationally easy to facilitate these linkages if all submitters to GenBank cited voucher or tissue materials correctly, with the correct acronyms for institution and collection. However, as we know, this is not the case by a long shot, which places the burden on collections to play Sherlock Holmes and make these connections manually. In a true integration scenario, I would like to see better standards in place at NCBI that would promote more uniform citation of the materials from which sequences were derived. Hooking into the GBIF collection registry to enforce institutional and collection acronyms would go a long way toward remedying this situation. However, this is also where the social contract comes in: researchers should collaborate with the providing collections during the sequence submission stage to ensure conformity, be more aware of collections advocacy, and understand that these linkages are necessary not only to promote reproducible science but also to support collections advocacy. The same is true for NCBI: they too need to understand the importance of being able to link back to a voucher specimen to confirm the identification and vouch for the integrity of the sequence.
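As a hedged sketch of what such a pre-submission check might look like, the code below parses the institution acronym out of a specimen_voucher string and confirms it is known to the GBIF Registry of Scientific Collections (GRSciColl). The free-text search on /grscicoll/institution is one way in; GRSciColl also offers more specific lookup services, and the voucher string shown is hypothetical.

```python
# A hedged sketch of a pre-submission check: parse the institution acronym out of a
# specimen_voucher string and confirm it is known to the GBIF Registry of Scientific
# Collections (GRSciColl). The voucher string and the parsing rule are hypothetical.
import requests

GRSCICOLL_INSTITUTIONS = "https://api.gbif.org/v1/grscicoll/institution"

def institution_code_is_known(code):
    """Return True if any GRSciColl institution result carries exactly this code."""
    resp = requests.get(GRSCICOLL_INSTITUTIONS, params={"q": code, "limit": 50}, timeout=30)
    resp.raise_for_status()
    return any(r.get("code") == code for r in resp.json().get("results", []))

if __name__ == "__main__":
    voucher = "KU:KUIT 4005"            # hypothetical specimen_voucher value
    inst_code = voucher.split(":")[0]
    print(inst_code, "recognized in GRSciColl:", institution_code_is_known(inst_code))
```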

2 Likes

I think this is key. As long as everyone is on their own, it will remain an issue.

OOOF, I hear this all the time and I think it is one of those collection practices that needs to change. With barcodes, one can track all of the stuff no matter what family it is in. This is one case where I advocate for letting technology do the work. Why have a single gallon jar on the same shelf with 20 8oz jars? I never let collections complain about space when I see things like that.

1 Like

Again, we need the people holding the purse strings to get in on this conversation.

1 Like

@dshorthouse Maybe a third party shoulders it for them. This is also relevant to topic 1, Making FAIR data for specimens accessible.

@dshorthouse I am not sure it is true that there is limited return on investment for linking data, or for publishing data in the first place. It has been shown that publishing data increases exposure and use of collections, through specimen loan requests and through publications that use the data directly. The return on investment for linking data has yet to be fully realized but has real-world consequences. A specific example is DNA sequences: if you link existing sequences to a tissue record, you can show that a specific gene or set of genes has already been sequenced from that tissue, avoiding unnecessary reuse of a limited resource. There are numerous other such examples.

I am also not suggesting that the responsibility for integration should fall on a single organization. The whole point is that integration should be facilitated by the network and all the actors involved, to reduce the effort needed to make the connections. This gets at the point of requiring the necessary cyberinfrastructure and social contracts to ensure that everyone is “playing the game” and making those connections through their everyday activities - publishing, citing, sequencing, imaging, CT scanning, etc. I think a system could be put in place with checks and balances similar to Crossref’s to verify the integrity of such linkages. Think blockchain!!

1 Like

True, but I’m being very particular and specific here. Of course publishing datasets affords visibility and all the goodies acquired via that meta level. What I’m narrowing in on here with a focused lens is the return on investment for the maintenance of specimen records (= digital specimen objects) as stand-alone entities independent from the arbitrariness of datasets. If there was return on investment at this fine level, no one would ever, ever change core pieces of metadata nor unique identifiers affixed to their specimen records. But as you may know, this happens all the time in GBIF-mediated data.

Not sure I understand. Can you explain further?

I suspect this might have been directed at me, apologies if not. I’m lost in the threads. :grinning:

Perhaps it helps to observe where we are now. Presently, many museums publish their specimen records as bundles within the context of a dataset, a Darwin Core Archive. This structure was born primarily out of a need for efficiency in transport. These datasets have now taken on an identity, importance, and branding – they receive DOIs. Nonetheless, they are artificial bundles subject to the whims of local administration. Datasets are often split or merged, republished & deleted outright from registries like GBIF. Increasingly, metrics and measures of reuse are tied to these datasets, but there are very few examples where metrics of reuse are tied directly to the individual specimens, as digital entities, held within those datasets. My supposition is that there is presently little incentive to maintain the identifiers & metadata of individual specimen records, because the benefits to the institution of sharing these data are not evident at these lower levels. As a consequence, the metadata and identifiers (e.g., institutionCode, collectionCode, occurrenceID) included alongside individual specimens often change, breaking any downstream links that might have been created. Those changes were made to accommodate the need for better branding of the dataset.

And so…

The Extended Specimen or Digital Specimen Object represents a significant shift away from datasets as the vehicle for sharing data and toward the specimen record as the vehicle. The latter requires very different socio-technical responsibilities & commitments. Are there incentives now, rather than a mere promise of benefits, that can help guide this shift?

5 Likes

We recently measured GBIF identifier loss over time: GBIF IDs are lost when datasets are deleted or when local identifiers change. The rate was lower last year, but many identifiers are still lost every year.

I explain this around the 7-minute mark in GBIF and the Converging Digital and Extended Specimens Concepts on Vimeo.

4 Likes

@dshorthouse Yes, that was meant for you. Sorry, I forgot to tag you, but thanks for the explanation. It is true that collections are published as datasets, but within those datasets individual occurrence records are identified through catalog numbers and unique identifiers. In the case of traditional specimen-based use of collections, there are numerous cases of research products being tied back to individual specimens through material-examined sections in publications and tissue/voucher fields in GenBank. Take this record in my tissue collection as an example - https://ichthyology.specify.ku.edu/specify/bycatalog/KUIT/4005/. You will note, through the DNA and citation buttons at the bottom of the form, numerous GenBank sequences and citations linked to this record. These links are then published to the aggregators - see the same GBIF record here: Occurrence Detail 656980275. However, because our current citation practices use institution code (sometimes), collection code, and catalog number instead of unique identifiers, the onus falls on collections staff to make these connections. Many hundreds of hours have been spent trolling the literature and GenBank to connect over 800 publications and over 17,000 GenBank sequences to my collection records. The Pensoft ARPHA writing tool provides a great example of how unique identifiers can be included in a citation, by automating a material-examined section with links to GUIDs that make those linkages more concrete and discoverable by machines. The same could be implemented in GenBank.
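To show how a machine could follow such a link, here is a small sketch that resolves the GBIF occurrence key cited above (656980275) back to the identifiers a collection would use to claim the citation, via the public GBIF occurrence API; it keeps only the identifier fields.

```python
# A small sketch of following such a link by machine: resolve a GBIF occurrence key
# back to the identifiers a collection would use to claim the citation.
import requests

def resolve_gbif_occurrence(gbif_key):
    """Fetch one occurrence and keep only the fields needed to tie a citation to a collection record."""
    resp = requests.get(f"https://api.gbif.org/v1/occurrence/{gbif_key}", timeout=30)
    resp.raise_for_status()
    occ = resp.json()
    return {k: occ.get(k) for k in ("institutionCode", "collectionCode", "catalogNumber", "occurrenceID")}

if __name__ == "__main__":
    print(resolve_gbif_occurrence(656980275))
```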

It is also true that once you start talking about data use, those lines are blurred even further. A researcher downloads a dataset from GBIF for use, and the download gets a DOI. However, the researcher may eventually use only a subset of that data in the publication - evidenced by the number of plant, mammal, bird and land-use citations connected to my fish collection in GBIF. This is sometimes due to general query parameters (all specimens from a country, etc.) and most definitely due to the lack of a breadcrumb trail of DOI-to-DOI linkage that would show how records were filtered, augmented, annotated and used in the eventual publication following the initial download. It is also difficult to tease individual records out of that download DOI and link them back to records in the collection. In order to include these in the metrics for my collection, I have added them to a Google Scholar profile for the collection - KU Ichthyology - Google Scholar. Ideally, individual records in a finalized DOI could be linked back to individual records in the collection using unique identifiers.

So, how do we solve this? I agree that it may require a shift in the way we publish data, which is why we have been exploring a blockchain-inspired, transactional method of publication rather than a cache-based snapshot approach (the openDS concept is similar). What if individual records were published and all actions on those records were published as transactions - a re-determination would be a transaction, a loan to a researcher would be a transaction, adding an image would be a transaction, a citation or GenBank sequence would be a transaction? These transactions could be published and made completely transparent to the user community, so that the onus for creating the linkages shifts from the collection to the community. The downside is that the necessary increase in cyberinfrastructure and complexity at the CMS level to track audit logs would be exclusionary rather than inclusive and would present an even bigger barrier to publication than currently exists. The big question is how we reduce that barrier so that everyone can, and will, play the game.
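Purely to make the idea concrete, here is a toy Python sketch of a specimen as an append-only transaction log; the event names, fields and identifiers are invented for illustration and are not an openDS or CMS specification.

```python
# A toy sketch of the transactional idea: a specimen is an append-only log of events,
# and the "current" record is a replay of that log. Event names, fields and identifiers
# are invented for illustration; this is not an openDS or CMS specification.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Transaction:
    action: str        # e.g. "redetermination", "loan", "add_image", "cite", "add_sequence"
    payload: dict
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

@dataclass
class SpecimenLedger:
    guid: str
    transactions: list = field(default_factory=list)

    def apply(self, action, **payload):
        """Append a transaction; nothing is overwritten, so the history stays auditable."""
        self.transactions.append(Transaction(action, payload))

    def current_view(self):
        """Replay the log to produce the record a portal or aggregator would display."""
        view = {"guid": self.guid}
        for t in self.transactions:
            view.setdefault(t.action, []).append(t.payload)
        return view

ledger = SpecimenLedger("KU:KUIT:4005")                                    # hypothetical GUID form
ledger.apply("redetermination", name="Sebastes ruberrimus")
ledger.apply("add_sequence", repository="GenBank", accession="XX000000")  # placeholder accession
ledger.apply("cite", doi="10.xxxx/example")                               # placeholder DOI
print(ledger.current_view())
```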

2 Likes

@dshorthouse, @abentley. It has been proven by Crossref and DataCite in the scholarly sector, by EIDR in the film/TV industry, and in other sectors that when data connections based on PIDs are made and the graph grows, new value accrues. What was not possible before becomes possible. Examples include Crossref’s Similarity Check plagiarism service for publishers and its Cited-by service.

If you want to begin to see what this could look like for extended digital specimens, explore the EIDR registry. Search for the title of your favourite film. Or, for something a little more interesting, have a look at the filmography of Hollywood director John Ford and see how, for any of his films, the EIDR registry can reveal much of what is databased about them, including where you can watch them, such as Amazon, Netflix, etc.

Most of what EIDR enables is exclusive to the workings of the film/TV production and distribution industry so we can’t see it. But we do see the benefit as consumers. EIDR helps the sector supply chain to function more effectively in a world that is now entirely digital and no longer reliant on celluloid - although, of course there are still many thousands of celluloid masters locked away in film vaults. EIDR supports accurate rights tracking and reporting down to the level of clips and composites across multiple languages/geographies, universal search and discovery, and detailed consumption metrics. It helps to ensure audiences see the correct language version in their local cinema or on their mobile phone, and that the right people get paid the correct amounts for their work. It’s easy to see parallels based on a similar registry for digital specimens in the natural science collections sector. Whilst we don’t broadcast specimens, we do carry out integrated analyses and synthesis based on the data they contain.

But as @abentley says, it needs everyone to play the same game. The rules and mechanics must be as simple as possible so players can engage easily and cheaply, building up their responsibility over time as it becomes apparent how their ROI increases.

A transactional method of publication, which the openDS concept represents, can achieve this. Open Digital Specimens are mutable objects to which operations (transactions) can be applied. Some of those operations attach things to the DS, like annotations, while others modify or improve the object content itself or make links between the DS and other DSs or to third-party data, such as sequence or trait data. As the number of openDS objects increases, services that assist with linking and exploit it increase in value to the community as a whole.

Whether a blockchain approach is appropriate or helpful depends on multiple design factors. The most important ones are governance model and storage/implementation model. What is needed to govern and implement a ‘cloud of digital specimens’? We should think about and discuss these first, probably in the Making FAIR data for specimens accessible topic.

Sorry I’m late to the game, and I may be (likely am) missing this distinction as I read through the topic, but what are the working meanings of “extending” vs. “enriching” vs. “integrating” data? The questions and discussion in this thread seem to deal with integration, at least by name. Does “enriching” = post hoc annotation, including adding new information derived from or about the specimen, whereas “extending” = including external data relevant to but distinct from the specimen record (e.g., environmental data)? Maybe it doesn’t matter.

1 Like