Extending, enriching and integrating data

Ah, yes, the age-old taxonomy issue!! This gets at a number of key data integration issues. The first is: where do we control taxonomic vocabularies? If we control them at the CMS and collection level, all sorts of management issues go along with that. The requirement to constantly change names associated with a collection can have shelving repercussions, and it necessitates a messaging system that indicates what has changed so that the corresponding shelving changes can be made. The alternative is to control for taxonomy at the aggregator level, allowing CMSs to have varying taxonomic classifications while maintaining a single authority at the aggregator. This is more problematic for disciplines with multiple authorities, like botany.

The second big issue is integration rather than duplication. You don’t want to have to duplicate all of the taxonomic metadata (common names, authors, protected status, etc.) at the CMS level when it is already fleshed out at the authority level and could simply be linked. How best do you integrate all this information without having to duplicate it, and how do you then display it for all to see? This gets at the heart of a fully functional integrated system that allows you to enter the data pipeline at any position and see all the data components as a whole.
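As a rough illustration of the link-rather-than-duplicate approach, here is a minimal sketch in Python. It assumes a hypothetical CMS record that stores only the name as used locally plus an identifier into an external authority (the public GBIF Backbone species-match API is used as one possible authority service), with richer taxonomic metadata resolved at display time instead of being copied into the CMS; the field names are invented for illustration.

```python
import requests

# Hypothetical CMS record: only the locally used name plus a pointer to an
# external taxonomic authority; no duplicated authorship/common-name fields.
cms_record = {
    "catalogNumber": "12345",
    "verbatimIdentification": "Poecilia reticulata",
    "taxonAuthorityID": None,  # filled in once, on ingest or at the aggregator
}

def resolve_against_authority(record):
    """Match the locally used name against one possible authority service
    (the GBIF Backbone species-match API) and store only its identifier."""
    resp = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": record["verbatimIdentification"]},
        timeout=10,
    )
    match = resp.json()
    if match.get("matchType") != "NONE":
        record["taxonAuthorityID"] = match["usageKey"]
    return record

def display_enrichment(record):
    """At display time, pull accepted name, authorship and status from the
    authority rather than keeping copies of them in the CMS."""
    if record["taxonAuthorityID"] is None:
        return {}
    resp = requests.get(
        f"https://api.gbif.org/v1/species/{record['taxonAuthorityID']}",
        timeout=10,
    )
    usage = resp.json()
    return {
        "acceptedName": usage.get("scientificName"),
        "authorship": usage.get("authorship"),
        "taxonomicStatus": usage.get("taxonomicStatus"),
    }

resolve_against_authority(cms_record)
print(display_enrichment(cms_record))
```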

1 Like

Proper citation is at the heart of this frustrating scenario. It would be computationally easy to facilitate these linkages if all submitters to GenBank cited voucher or tissue materials consistently, with the correct acronyms for institution and collection. However, as we know, this is not the case by a long shot, which places the burden on collections to play Sherlock Holmes in order to make these connections manually. In a true integration scenario I would like to see better standards in place at NCBI that would promote more uniform citation of the materials from which sequences were derived. Hooking into the GBIF collections registry to enforce institutional and collection acronyms would go a long way toward remedying this situation. However, this is also where that social contract comes in: researchers collaborating with the providing collections during the sequence submission stage to ensure conformity, becoming more aware of collections advocacy, and understanding that these linkages are necessary not only to promote reproducible science but also to advocate for collections. The same is true for NCBI. They too need to understand the importance of being able to link back to a voucher specimen to confirm the identification and vouch for the integrity of the sequence.
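A sketch of what such a submission-time check could look like, under some explicit assumptions: the specimen_voucher string follows the institution:collection:catalogNumber pattern, and the GBIF collections registry is queried through its GRSciColl lookup endpoint, whose exact response fields are assumed here rather than guaranteed.

```python
import requests

def parse_voucher(specimen_voucher):
    """Split a 'KU:KUIT:4005'-style specimen_voucher string into its parts."""
    parts = specimen_voucher.split(":")
    if len(parts) != 3:
        raise ValueError(f"Cannot parse voucher string: {specimen_voucher!r}")
    return dict(zip(("institutionCode", "collectionCode", "catalogNumber"), parts))

def codes_resolve_in_grscicoll(institution_code, collection_code):
    """Ask the GBIF collections registry (GRSciColl) whether the codes match a
    known institution and collection. The response field names used below
    ('institutionMatch', 'collectionMatch') are assumptions for illustration."""
    resp = requests.get(
        "https://api.gbif.org/v1/grscicoll/lookup",
        params={"institutionCode": institution_code,
                "collectionCode": collection_code},
        timeout=10,
    )
    result = resp.json()
    return bool(result.get("institutionMatch")) and bool(result.get("collectionMatch"))

voucher = parse_voucher("KU:KUIT:4005")
if not codes_resolve_in_grscicoll(voucher["institutionCode"], voucher["collectionCode"]):
    print("Warning: voucher codes do not resolve against GRSciColl; please check them.")
```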

2 Likes

I think this is key. As long as everyone is on their own, it will remain an issue.

OOOF, I hear this all the time, and I think it is one of those collection practices that needs to change. With barcodes, one can track everything no matter what family it is in. This is one case where I advocate for letting technology do the work. Why have a single gallon jar on the same shelf as twenty 8 oz jars? I never let collections complain about space when I see things like that.

1 Like

Again, we need the people holding the purse strings to get in on this conversation.

1 Like

@dshorthouse Maybe a third party shoulders it for them. This is also relevant in topic 1, “Making FAIR data for specimens accessible”.

@dshorthouse I am not sure it is true that there is limited return on investment for linking data or for publishing data in the first place. It has been shown that publishing data increases the exposure and use of collections through specimen loan requests and through publications that use the data directly. The return on investment for linking data has yet to be fully realized but has real-world consequences. A specific example is DNA sequences: if you link existing sequences to a tissue record, you can highlight that a specific gene or set of genes has already been sequenced for that tissue, avoiding further consumption of a limited resource. There are numerous other such examples.
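A toy sketch of that DNA-sequence example, using entirely hypothetical data: if GenBank accessions are linked to the tissue record along with the gene they cover, a simple lookup can flag that a requested marker already exists before any more of a finite tissue sample is consumed.

```python
# Hypothetical tissue record with its linked GenBank sequences.
tissue_record = {
    "tissueID": "KUIT-4005",
    "linkedSequences": [
        {"accession": "XX000001", "gene": "COI"},
        {"accession": "XX000002", "gene": "cytb"},
    ],
}

def already_sequenced(record, gene):
    """Return the accessions already covering the requested gene, if any."""
    return [s["accession"] for s in record["linkedSequences"] if s["gene"] == gene]

requested_gene = "COI"
hits = already_sequenced(tissue_record, requested_gene)
if hits:
    print(f"{requested_gene} already sequenced for {tissue_record['tissueID']}: {hits}")
else:
    print(f"No {requested_gene} sequence linked; a tissue subsample may be needed.")
```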

I am also not suggesting that the responsibility for integration should fall on the organization. The whole point is that integration should be facilitated by the network and all actors involved to reduce the effort needed to make the connections. This gets at the point of requiring the necessary cyberinfrastructure and social contracts to ensure that everyone is “playing the game” and making those connections through their everyday activities - publishing, citing, sequencing, imaging, CT scanning, etc. I think a system could be put in place with checks and balances similar to Crossref’s to verify the integrity of such linkages. Think blockchain!!

1 Like

True, but I’m being very particular and specific here. Of course publishing datasets affords visibility and all the goodies acquired via that meta level. What I’m homing in on here is the return on investment for the maintenance of specimen records (= digital specimen objects) as stand-alone entities, independent of the arbitrariness of datasets. If there were a return on investment at this fine-grained level, no one would ever, ever change the core pieces of metadata or the unique identifiers affixed to their specimen records. But as you may know, this happens all the time in GBIF-mediated data.

Not sure I understand. Can you explain further?

I suspect this might have been directed at me, apologies if not. I’m lost in the threads. :grinning:

Perhaps it helps to observe where we are now. Presently, many museums publish their specimen records as bundles within the context of a dataset, a Darwin Core Archive. This structure was borne primarily out of a need for efficiencies in transport. These datasets have now taken on an identity, importance, and branding – they receive DOIs. Nonetheless, they are artificial bundles subject to the whims of local administration. Datasets are often split or merged, republished, or deleted outright from registries like GBIF. Increasingly, metrics and measures of reuse are tied to these datasets, but there are very few examples where metrics of reuse are tied directly to individual specimens as digital entities held within those datasets. My supposition here is that there is presently little incentive to maintain the identifiers and metadata of individual specimen records because the benefits to the institution of sharing these data are not evident at these lower levels. As a consequence, the metadata and identifiers (e.g. institutionCode, collectionCode, occurrenceID) included alongside individual specimens often change, breaking any downstream links that might have been created. Those changes are typically made to accommodate the need for better branding of the dataset.

And so…

The Extended Specimen or Digital Specimen Object is a significant shift away from datasets as the vehicle for sharing data and toward the specimen record as the vehicle. Very different socio-technical responsibilities and commitments are required for the latter. Are there incentives now, rather than a mere promise of benefits, that can help guide this shift?

5 Likes

We recently measured GBIF ID loss over time: GBIF IDs lost due to dataset deletion or to changes in local identifiers. Last year’s losses were lower, but many IDs are still lost every year.

I explain this around the 7-minute mark of “GBIF and the Converging Digital and Extended Specimens Concepts” on Vimeo.
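One way such ID loss could be approximated from the outside, sketched here with hypothetical snapshot files: given two occurrence exports taken a year apart (tab-delimited, with a gbifID column), any gbifID present in the earlier snapshot but absent from the later one has been lost, whether through dataset deletion or a change to a local identifier that caused a new gbifID to be minted.

```python
import csv

def gbif_ids(path):
    """Read the set of gbifID values from an occurrence export
    (tab-delimited file with a gbifID column)."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["gbifID"] for row in csv.DictReader(f, delimiter="\t")}

# Hypothetical snapshot files taken a year apart.
ids_previous = gbif_ids("occurrences_2020.tsv")
ids_current = gbif_ids("occurrences_2021.tsv")

lost = ids_previous - ids_current    # IDs that disappeared between snapshots
gained = ids_current - ids_previous  # IDs newly minted (often the same records re-published)

print(f"gbifIDs lost since the previous snapshot: {len(lost)}")
print(f"gbifIDs newly minted: {len(gained)}")
```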

4 Likes

@dshorthouse Yes, that was meant for you. Sorry, I forgot to tag you, but thanks for the explanation. It is true that collections are published as datasets, but within those datasets individual occurrence records are identified through catalog numbers and unique identifiers. In the case of traditional specimen-based use of collections, there are numerous cases of research products being tied back to individual specimens through material examined sections in publications and tissue/voucher fields in GenBank. Take this record in my tissue collection as an example - https://ichthyology.specify.ku.edu/specify/bycatalog/KUIT/4005/. You will note, through the DNA and citation buttons at the bottom of the form, numerous GenBank sequences and citations linked to this record. These links are then published to the aggregators - see the same GBIF record here: Occurrence Detail 656980275. However, because our current citation practices use institution code (sometimes), collection code, and catalog number instead of unique identifiers, the onus falls on collections staff to make these connections. Many hundreds of hours have been spent trawling the literature and GenBank to connect over 800 publications and over 17,000 GenBank sequences to my collection records. The Pensoft ARPHA writing tool provides a great example of how unique identifiers can be included in a citation by automating a material examined section with linkages to GUIDs, making those connections more concrete and discoverable by machines. The same could be implemented in GenBank.
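A sketch of what a machine-actionable material citation could carry, with hypothetical values except for the catalog number and GBIF occurrence key mentioned above: alongside the human-readable institution:collection:catalogNumber triplet, a stable occurrenceID (GUID) is included so that a publishing pipeline or a sequence submission could make the link without any detective work; the field names in the derived submission record are invented for illustration.

```python
import json

# Hypothetical machine-actionable entry for a 'material examined' section.
material_citation = {
    "verbatimCitation": "KU:KUIT:4005",
    "institutionCode": "KU",
    "collectionCode": "KUIT",
    "catalogNumber": "4005",
    # Stable identifier for the occurrence record; a made-up GUID here.
    "occurrenceID": "urn:uuid:0f5e3c2a-1111-2222-3333-444455556666",
    "gbifOccurrenceKey": 656980275,
}

def sequence_submission_fields(citation):
    """Derive the fields a sequence submission could carry so that the voucher
    resolves unambiguously back to the specimen record (field names assumed)."""
    return {
        "specimen_voucher": citation["verbatimCitation"],
        "voucher_guid": citation["occurrenceID"],
    }

print(json.dumps(sequence_submission_fields(material_citation), indent=2))
```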

It is also true that once you start talking about data use, those lines are blurred even further. A researcher downloads a dataset from GBIF for use, and it gets a DOI. However, the researcher may eventually use only a subset of that data in the publication, as evidenced by the number of plant, mammal, bird, and land-use citations connected to my fish collection in GBIF. This is sometimes due to the general query parameters used (all specimens from a country, etc.) and most definitely due to the lack of a breadcrumb trail of DOI-to-DOI linkage that would show how records were filtered, augmented, annotated, and used in the eventual publication following the initial download of data. It is also difficult to tease individual records out of that DOI to link them back to individual records in the collection. In order to also include these in the metrics for my collection, I have included them in a Google Scholar profile for my collection - KU Ichthyology - Google Scholar. Ideally, individual records in a finalized DOI could be linked back to individual records in the collection using the unique identifiers.

So, how do we solve this? I agree that maybe it requires a mind shift in the way that we publish data, which is why we have been exploring a blockchain-inspired transactional method of publication rather than a cache-based snapshot approach (the openDS concept is similar). What if individual records were published and all actions on those records were published as transactions - a re-determination would be a transaction, a loan to a researcher would be a transaction, adding an image would be a transaction, a citation or GenBank sequence would be a transaction? These transactions could be published and completely transparent to the user community, so that the onus for creating the linkages shifts from the collection to the community. The downside is that the necessary increase in cyberinfrastructure and complexity at the CMS level to track audit logs would be exclusionary rather than inclusive and would present an even bigger barrier to publication than currently exists. The big question is: how do we reduce that barrier so that everyone can, and will, play the game?
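A minimal sketch of that transactional idea, with hypothetical transaction types and data: every action on a record (a re-determination, a loan, a new image, a citation, a GenBank link) is appended to an immutable log with attribution, and the current state of the record is simply a replay of that log.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Transaction:
    """One action on a specimen record, with attribution."""
    record_id: str
    action: str            # e.g. "redetermination", "loan", "image", "citation", "sequence"
    payload: dict
    agent: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

log: list[Transaction] = []  # append-only; nothing is ever edited in place

def publish(tx: Transaction):
    log.append(tx)

publish(Transaction("KUIT:4005", "redetermination",
                    {"scientificName": "Poecilia reticulata"}, agent="collection staff"))
publish(Transaction("KUIT:4005", "sequence",
                    {"genbankAccession": "XX000001", "gene": "COI"}, agent="sequencing lab"))

def current_state(record_id):
    """Replay the log to derive the record as it stands now."""
    state = {"recordID": record_id, "links": []}
    for tx in (t for t in log if t.record_id == record_id):
        if tx.action == "redetermination":
            state["scientificName"] = tx.payload["scientificName"]
        else:
            state["links"].append({**tx.payload, "via": tx.action, "by": tx.agent})
    return state

print(current_state("KUIT:4005"))
```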

2 Likes

@dshorthouse, @abentley. It has been proven by Crossref and DataCite in the scholarly sector, by EIDR in the film/TV industry, and in other sectors that when data connections based on PIDs are made and the graph grows, new value accrues. What was not possible before becomes possible. Examples of this include Crossref’s Similarity Check plagiarism service for publishers and its Cited-by service.

If you want to begin to see what this could look like for extended digital specimens, explore the EIDR registry. Search for the title of your favourite film. Or, for something a little more interesting, have a look at the filmography of Hollywood director John Ford and see how, for any of his films, the EIDR registry can reveal much of what is databased about them, including where you can watch them, such as Amazon, Netflix, etc.

Most of what EIDR enables is exclusive to the workings of the film/TV production and distribution industry, so we can’t see it. But we do see the benefit as consumers. EIDR helps the sector’s supply chain function more effectively in a world that is now entirely digital and no longer reliant on celluloid - although, of course, there are still many thousands of celluloid masters locked away in film vaults. EIDR supports accurate rights tracking and reporting down to the level of clips and composites across multiple languages/geographies, universal search and discovery, and detailed consumption metrics. It helps to ensure audiences see the correct language version in their local cinema or on their mobile phone, and that the right people get paid the correct amounts for their work. It’s easy to see parallels with a similar registry for digital specimens in the natural science collections sector. Whilst we don’t broadcast specimens, we do carry out integrated analyses and synthesis based on the data they contain.

But as @abentley says, it needs everyone to play the same game. The rules and mechanics must be as simple as possible so players can engage easily and cheaply, building up their responsibility over time as it becomes apparent how their ROI increases.

A transactional method of publication, which the openDS concept represents, can achieve this. Open Digital Specimens are mutable objects to which operations (transactions) can be applied. Some of those operations attach things to the DS, like annotations, while others modify/improve the object content itself or make links between the DS and other DSs or to third-party data, such as sequence or trait data. As the number of openDS objects increases, services that assist with linking and exploit those links increase in value to the community as a whole.
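A rough sketch of how those three kinds of operations might be distinguished on a digital specimen object; the object structure and operation names are assumptions for illustration, not the openDS specification.

```python
# Hypothetical digital specimen object.
ds = {
    "id": "prefix/placeholder-identifier",   # stand-in for a persistent identifier
    "content": {"scientificName": "Poecilia reticulata"},
    "attachments": [],   # things attached to the DS, e.g. annotations
    "links": [],         # links to other DSs or to third-party data
}

def attach(ds, annotation):
    """Attach something to the DS without changing its content."""
    ds["attachments"].append(annotation)

def modify(ds, key, value):
    """Modify or improve the object content itself."""
    ds["content"][key] = value

def link(ds, target, relation):
    """Link the DS to another DS or to third-party data (sequences, traits, ...)."""
    ds["links"].append({"relation": relation, "target": target})

attach(ds, {"type": "annotation", "note": "label transcription verified"})
modify(ds, "georeference", {"decimalLatitude": -1.05, "decimalLongitude": -78.4})
link(ds, "https://www.ncbi.nlm.nih.gov/nuccore/XX000001", relation="hasSequence")
print(ds)
```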

Whether a blockchain approach is appropriate or helpful depends on multiple design factors. The most important ones are the governance model and the storage/implementation model. What is needed to govern and implement a ‘cloud of digital specimens’? We should think about and discuss these first, probably in the “Making FAIR data for specimens accessible” topic.

Sorry I’m late to the game, and I may be (likely am) missing this distinction as I read through this topic – what are the working meanings of “extending” vs. “enriching” vs. “integrating” data? The questions and discussion in this thread seem to deal with integration, at least by name. Does “enriching” = post hoc annotation, including adding new information derived from or about the specimen, whereas “extending” = including external data relevant to but distinct from the specimen record (e.g., environmental data)? Maybe it doesn’t matter.

1 Like

@jmheberling Yes, they can be used somewhat interchangeably, but I see enriching as adding data to an existing record, e.g. adding a georeference or a new determination (which, yes, is similar to an annotation), whereas extending would be linking somewhat disparate information to a record, e.g. a GenBank sequence, citation, image, CT scan, etc. In the larger scheme of things, I don’t think the semantics of those two terms make much difference, as we are ideally looking for a system that can handle the integration/linking of all of these data elements and scenarios. Others may have different viewpoints.

1 Like

@abentley Thanks for this clarification. I agree it probably is semantic, but I also wonder whether the distinctions are important to make in the structure of the proposed system – differentiating data extensions that are “primary” vs. “secondary” vs. “tertiary” (sensu Lendemer et al. 2020, BioScience). I could envision primary extensions being treated differently, or even prioritized and stored by data publishers, compared to higher-layered data that resides elsewhere. The BCoN white paper suggests tertiary data be linked to external repositories, for instance, I thought.

Many digitization projects, at least in herbaria in the US, follow an “image first” workflow, where images are produced along with a very basic set of skeletal data (identification to genus or species, perhaps some level of locality info). Many records may remain in this partially digitized state. Would the ideal system welcome these data online before they are fully digitized (transcribed), and enable crowdsourcing of specimens by researchers with specific interests and/or mass transcription by the public? Like anything, it would require quality control. This is not particularly exciting in the area of enriching/integrating data, as it is core digitization, but it is important nonetheless, as specimen digitization is far from complete and must be part of extended/digital specimen conversations. Maybe this has already been considered in the many threads above or in the Annotation topic threads. Transcription of existing primary label content and annotation labels doesn’t fit well into the annotations topic either.
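A sketch of what an image-first skeletal record and a crowdsourced transcription step could look like; the field names, status values, and data are all hypothetical.

```python
# Hypothetical skeletal record produced by an image-first digitization workflow.
skeletal_record = {
    "catalogNumber": "HERB-000123",
    "scientificName": "Carex",          # identified to genus only at this stage
    "country": "United States",
    "imageURL": "https://example.org/images/HERB-000123.jpg",
    "digitizationStatus": "skeletal",   # skeletal -> transcribed -> reviewed
}

REQUIRED_FOR_FULL_TRANSCRIPTION = ("recordedBy", "eventDate", "locality")

def transcription_tasks(record):
    """List the label fields still missing, i.e. what a crowdsourcing or
    researcher-driven transcription task would need to fill in."""
    return [f for f in REQUIRED_FOR_FULL_TRANSCRIPTION if not record.get(f)]

def apply_transcription(record, fields, agent):
    """Merge transcribed values, keep attribution, and advance the status."""
    record.update(fields)
    record.setdefault("contributors", []).append(agent)
    if not transcription_tasks(record):
        record["digitizationStatus"] = "transcribed"  # still pending review/QC
    return record

print(transcription_tasks(skeletal_record))
apply_transcription(skeletal_record,
                    {"recordedBy": "A. Collector", "eventDate": "1923-06-01",
                     "locality": "Example County"},
                    agent="volunteer transcriber")
print(skeletal_record["digitizationStatus"])
```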

2 Likes

@jmheberling Yes, I think that is the beauty of a transactional system: any changes or additions to the skeletal record could be recorded as transactions, leaving a breadcrumb trail of modification that not only records all changes but also provides attribution for those doing the work. You can thereby also employ the strength of the community (scientific, citizen science, and collections) to assist in the digitization and annotation of those records.

1 Like

@jmheberling Yes, I think there would naturally be prioritization of low-hanging fruit in making the connections necessary for the ES/DS concept, and some of the secondary and tertiary connections may take more effort - both socially and technologically - but I don’t think there is, or should be, any distinction in the underlying technology necessary to make those connections. I think the system (whatever it ends up being) should be able to accommodate all manner of connections, i.e. it should be generic enough to handle all scenarios.
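One way a single, general-purpose linking structure could accommodate primary, secondary, and tertiary connections alike, sketched with hypothetical field names and values: every connection is the same kind of statement (subject record, relation, target, who asserted it, when), whether the target is a GenBank sequence, a publication, a CT scan, or an entry in an external trait database.

```python
from dataclasses import dataclass, asdict

@dataclass
class Link:
    """A generic, typed connection from a specimen record to anything else."""
    subject_id: str     # persistent identifier of the specimen record
    relation: str       # e.g. "hasSequence", "citedBy", "hasCTScan", "hasTraitData"
    target: str         # identifier or URL of the linked resource
    asserted_by: str    # person or system making the connection
    asserted_on: str    # ISO date

links = [
    Link("urn:uuid:0f5e3c2a-1111-2222-3333-444455556666", "hasSequence",
         "https://www.ncbi.nlm.nih.gov/nuccore/XX000001", "collection staff", "2021-03-01"),
    Link("urn:uuid:0f5e3c2a-1111-2222-3333-444455556666", "hasTraitData",
         "https://www.try-db.org/", "external linking service", "2021-03-02"),
]

# The same structure serves every scenario; services only need to index it.
print([asdict(link) for link in links])
```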

1 Like

@abentley Thanks for the response and information. That’s great to hear that the ideal system would be indifferent to / capable of all connections, but presumably they are quite different and therefore require different approaches/capabilities, whether primary or tertiary (or maybe not necessarily even that distinction). Perhaps not; you know far better than me! Some extensions, as I understand them sensu the ESN, require direct linkages or data to be directly associated with the record, presumably held at the level of the specimen database (i.e. another data/media field added to the specimen record, such as field images), while others may be broader aggregated information not specific to the specimen itself (e.g. species range) or information about the specific context of the given specimen but not derived from or unique to the specimen itself (e.g. climate data linking to PRISM or other climate database(s)), right? Others may be best placed in an external repository (e.g. the TRY trait database for plants), with the link(s) provided in the specimen record. I may not know what I’m talking about, but I would guess these different extensions/enrichments/integrations would require thinking through different informatics solutions. Hope that makes sense, is useful, and that I am not rambling :smiley: