Extending, enriching and integrating data

True, but I’m being very particular here. Of course publishing datasets affords visibility and all the goodies acquired at that meta level. What I’m narrowing in on, with a focused lens, is the return on investment for the maintenance of specimen records (= digital specimen objects) as stand-alone entities, independent of the arbitrariness of datasets. If there were a return on investment at this fine level, no one would ever, ever change core pieces of metadata or the unique identifiers affixed to their specimen records. But as you may know, this happens all the time in GBIF-mediated data.

Not sure I understand. Can you explain further?

I suspect this might have been directed at me, apologies if not. I’m lost in the threads. :grinning:

Perhaps it helps to observe where we are now. Presently, many museums publish their specimen records as bundles within the context of a dataset, a Darwin Core Archive. This structure was born primarily out of a need for efficiency in transport. These datasets have since taken on an identity, importance, and branding: they receive DOIs. Nonetheless, they are artificial bundles subject to the whims of local administration. Datasets are often split or merged, republished, or deleted outright from registries like GBIF. Increasingly, metrics and measures of reuse are tied to these datasets, but there are very few examples where metrics of reuse are tied directly to individual specimens as digital entities held within those datasets. My supposition is that there is presently little incentive to maintain the identifiers and metadata of individual specimen records because the benefits to the institution of sharing these data are not evident at these lower levels. As a consequence, the metadata and identifiers (e.g., institutionCode, collectionCode, occurrenceID) included alongside individual specimens often change, breaking any downstream links that might have been created. Those changes are made to accommodate the need for better branding of the dataset.

And so…

The Extended Specimen or Digital Specimen Object is a significant shift away from datasets as the vehicle for sharing data and toward the specimen record as the vehicle. Very different socio-technical responsibilities and commitments are required for the latter. Are there incentives now, rather than a mere promise of benefits, that can help guide this shift?


We recently measured GBIF ID loss over time: GBIF IDs lost due to dataset deletion or changes to local identifiers. Last year's total was lower, but many IDs are still lost each year.

I explain this around the 7-minute mark of GBIF and the Converging Digital and Extended Specimens Concepts on Vimeo.


@dshorthouse Yes, that was meant for you. Sorry, forgot to tag you, but thanks for the explanation. It is true that collections are published as datasets, but within those datasets, individual occurrence records are identified through catalog numbers and unique identifiers. In the case of traditional specimen-based use of collections, there are numerous cases of research products being tied back to individual specimens through material-examined sections in publications and tissue/voucher fields in GenBank. Take this record in my tissue collection as an example: https://ichthyology.specify.ku.edu/specify/bycatalog/KUIT/4005/. You will note, through the DNA and citation buttons at the bottom of the form, numerous GenBank sequences and citations linked to this record. These links are then published to the aggregators; see the same GBIF record here: Occurrence Detail 656980275. However, because our current citation practices use institution code (sometimes), collection code, and catalog number instead of unique identifiers, the onus falls on collections staff to make these connections. Many hundreds of hours have been spent trolling the literature and GenBank to connect over 800 publications and over 17,000 GenBank sequences to my collection records. The Pensoft ARPHA writing tool provides a great example of how unique identifiers can be included in a citation through automation of a material-examined section, with links to GUIDs that make those linkages more concrete and discoverable by machines. The same could be implemented in GenBank.

It is also true that once you start talking about data use, those lines are blurred even further. A researcher downloads a dataset from GBIF for use, and it gets a DOI. However, the researcher may eventually use only a subset of those data in the publication, as evidenced by the number of plant, mammal, bird, and land-use citations connected to my fish collection in GBIF. This is sometimes due to general query parameters (all specimens from a country, etc.) and most definitely due to the lack of a breadcrumb trail of DOI-to-DOI linkage that would show how records have been filtered, augmented, annotated, and used in the eventual publication following the initial download of data. It is also difficult to tease individual records out of that DOI to link back to individual records in the collection. In order to include these in my collection's metrics as well, I have added them to a Google Scholar profile for my collection: KU Ichthyology on Google Scholar. Ideally, individual records in a finalized DOI could be linked back to individual records in the collection using unique identifiers.

So, how do we solve this? I agree that it may require a mind shift in the way we publish data, which is why we have been exploring a blockchain-inspired, transactional method of publication rather than a cache-based snapshot approach (the openDS concept is similar). What if individual records were published and all actions on those records were published as transactions? A re-determination would be a transaction, a loan to a researcher would be a transaction, adding an image would be a transaction, a citation or GenBank sequence would be a transaction. These transactions could be published and made completely transparent to the user community, so that the onus for creating the linkages shifts from the collection to the community. The downside is that the necessary increase in cyberinfrastructure and complexity at the CMS level to track audit logs would be exclusionary rather than inclusive, and would present an even bigger barrier to publication than currently exists. The big question is how we reduce that barrier so that everyone can, and will, play the game.
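To make the transactional idea above a little more concrete, here is a minimal sketch of a specimen record as an append-only transaction log. Everything here is illustrative: the class name, the transaction types, and the hash-chaining (a lightweight, blockchain-inspired tamper-evidence mechanism) are assumptions, not part of any existing standard or CMS.

```python
import hashlib
import json
from datetime import datetime, timezone

class SpecimenLedger:
    """Append-only transaction log for a single specimen record.

    Each transaction records who did what and when, and is chained to
    the previous transaction by a hash, so the history is tamper-evident
    and every action carries attribution.
    """

    def __init__(self, occurrence_id):
        self.occurrence_id = occurrence_id
        self.transactions = []

    def append(self, tx_type, payload, agent):
        """Record an action (e.g. 'redetermination', 'loan', 'citation')."""
        prev_hash = self.transactions[-1]["hash"] if self.transactions else ""
        tx = {
            "type": tx_type,
            "payload": payload,
            "agent": agent,  # who gets attribution for the action
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prev": prev_hash,
        }
        # Hash the transaction content (before the hash field is added).
        tx["hash"] = hashlib.sha256(
            json.dumps(tx, sort_keys=True).encode()
        ).hexdigest()
        self.transactions.append(tx)
        return tx["hash"]

    def history(self, tx_type=None):
        """Return the breadcrumb trail, optionally filtered by type."""
        return [t for t in self.transactions
                if tx_type is None or t["type"] == tx_type]
```

A community member linking a citation would then simply append a `"citation"` transaction, and the collection could later filter the history by type to count reuse at the level of the individual record.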


@dshorthouse, @abentley. Crossref and DataCite in the scholarly sector, EIDR in the film/TV industry, and registries in other sectors have proven that when data connections based on PIDs are made and the graph grows, new value accrues. What was not possible before becomes possible. Examples include Crossref’s Similarity Check plagiarism service for publishers and its Cited-by service.

If you want to begin to see what this could look like for extended digital specimens, explore the EIDR registry. Search for the title of your favourite film. Or, for something a little more interesting, have a look at the filmography of Hollywood director John Ford and see how much the EIDR registry can reveal about any of his films, including where you can watch them, such as Amazon, Netflix, etc.

Most of what EIDR enables is exclusive to the workings of the film/TV production and distribution industry, so we can’t see it. But we do see the benefit as consumers. EIDR helps the sector's supply chain function more effectively in a world that is now entirely digital and no longer reliant on celluloid, although of course there are still many thousands of celluloid masters locked away in film vaults. EIDR supports accurate rights tracking and reporting down to the level of clips and composites across multiple languages/geographies, universal search and discovery, and detailed consumption metrics. It helps ensure that audiences see the correct language version in their local cinema or on their mobile phone, and that the right people get paid the correct amounts for their work. It’s easy to see parallels for a similar registry for digital specimens in the natural science collections sector. Whilst we don’t broadcast specimens, we do carry out integrated analyses and synthesis based on the data they contain.

But as @abentley says, it needs everyone to play the same game. The rules and mechanics must be as simple as possible so players can engage easily and cheaply, building up their responsibility over time as it becomes apparent how their ROI increases.

A transactional method of publication, which the openDS concept represents, can achieve this. Open Digital Specimens are mutable objects to which operations (transactions) can be applied. Some of those operations attach things to the DS, like annotations, while others modify or improve the object content itself, or make links between the DS and other DSs or to third-party data, such as sequence data or trait data. As the number of openDS objects increases, services that assist with and exploit linking increase in value to the community as a whole.
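One way to picture the mutable-object model described above is event sourcing: the current state of a digital specimen is derived by replaying its operations in order. This is a sketch under stated assumptions; the operation kinds (`set`, `annotate`, `link`) and record shape are invented for illustration and are not drawn from the openDS specification.

```python
def replay(operations):
    """Derive the current state of a digital specimen by replaying its
    operations in order. Three illustrative operation kinds:
      'set'      modifies the object content itself,
      'annotate' attaches an annotation to the object,
      'link'     connects the object to another DS or third-party data.
    """
    state = {"content": {}, "annotations": [], "links": []}
    for op in operations:
        if op["kind"] == "set":
            state["content"][op["field"]] = op["value"]
        elif op["kind"] == "annotate":
            state["annotations"].append(op["note"])
        elif op["kind"] == "link":
            state["links"].append(
                {"relation": op["relation"], "target": op["target"]}
            )
    return state

# Illustrative operation stream for one specimen (values are made up).
ops = [
    {"kind": "set", "field": "scientificName", "value": "Rana temporaria"},
    {"kind": "annotate", "note": "georeference checked"},
    {"kind": "link", "relation": "hasSequence",
     "target": "https://example.org/sequence/123"},
]
```

Because the operation log, not the derived state, is the record of truth, linking services can index the `link` operations directly, which is where the network effects mentioned above would come from.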

Whether a blockchain approach is appropriate or helpful depends on multiple design factors. The most important ones are governance model and storage/implementation model. What is needed to govern and implement a ‘cloud of digital specimens’? We should think about and discuss these first, probably in the Making FAIR data for specimens accessible topic.

Sorry I’m late to the game and may be (likely am) missing a distinction as I read through this topic: what are the working meanings of “extending” vs. “enriching” vs. “integrating” data? The questions and discussion in this thread seem to deal with integration, at least by name. Does “enriching” = post hoc annotation, including adding new information derived from or about a specimen, whereas “extending” = including external data relevant to but distinct from the specimen record (e.g., environmental data)? Maybe it doesn’t matter.


@jmheberling Yes, they can be used somewhat interchangeably, but I see enriching as adding data to an existing record, e.g. adding a georeference or new determination (which, yes, is similar to an annotation), whereas extending would be linking somewhat disparate information to a record, e.g. a GenBank sequence, citation, image, CT scan, etc. In the larger scheme of things, I don’t think the semantics of those two terms makes much difference, as we are ideally looking for a system that can handle the integration/linking of all of these data elements and scenarios. Others may have different viewpoints.


@abentley Thanks for this clarification. I agree it is probably semantic, but I also wonder whether the distinctions are important to make in the structure of the proposed system: differentiating data extensions that are “primary” vs. “secondary” vs. “tertiary” (sensu Lendemer et al. 2020, BioScience). I could envision primary extensions being treated differently, or even prioritized and stored by data publishers, compared to higher-layered data that resides elsewhere. The BCON white paper, for instance, suggests that tertiary data be linked to external repositories, I believe.

Many digitization projects, at least in US herbaria, follow an “image first” workflow, where images are produced along with a very basic set of skeletal data (identification to genus or species, perhaps some level of locality info). Many records may remain in this partially digitized state. Would the ideal system welcome these data online before they are fully digitized (transcribed), and enable crowdsourced transcription of specimens by researchers with specific interests and/or mass transcription by the public? Like anything, this would require quality control. Not particularly exciting in the area of enriching/integrating data, as this is core digitization, but important nonetheless: specimen digitization is far from complete and must be part of extended/digital specimen conversations. Maybe this has already been considered in the many threads above or in the Annotation topic threads. Transcription of existing primary label content and annotation labels does not fit well into the annotations topic either.


@jmheberling Yes, I think that is a beauty of a transactional system: any changes or additions to the skeletal record could be recorded as transactions, leaving a breadcrumb trail of modification that not only records all changes but also provides attribution for those doing the work. You can thereby also employ the strength of the community (scientific, citizen science, and collections) to assist in the digitization and annotation of those records.


@jmheberling Yes, I think there would naturally be prioritization of low-hanging fruit in making the connections necessary for the ES/DS concept, and some of the secondary and tertiary connections may take more effort, both socially and technologically. But I don’t think there is, or should be, any distinction in the underlying technology necessary to make those connections. The system (whatever it ends up being) should be able to accommodate all manner of connections, i.e. it should be general enough to handle all scenarios.


@abentley thanks for the response and information. It’s great to hear that the ideal system would be indifferent to, and capable of, all connections, but presumably those connections are quite different and therefore require different approaches/capabilities, whether primary or tertiary (or maybe not necessarily that distinction). Perhaps not; you know far better than me! Some extensions, as I understand them sensu the ESN, require direct linkages or data to be directly associated with the record, presumably held at the level of the specimen database (i.e. another data/media field added to the specimen record, such as field images). Others may be broader aggregated information not specific to the specimen itself (e.g. a species range), or information about the specific context of the given specimen but not derived from or unique to the specimen itself (e.g. climate data linking to PRISM or other climate databases), right? Others may be best placed in an external repository (e.g. the TRY trait database for plants), with the link(s) provided in the specimen record. I may be out of my depth here, but I would guess these different extensions/enrichments/integrations would require thinking through different informatics solutions. Hope that makes sense, is useful, and that I am not rambling :smiley:

@jmheberling Yes, there is that distinction between resources that link to a specific collection object (a citation or GenBank sequence) and resources that link to a broader concept (a taxonomic name for a distribution model). I see that as an issue of data in vs. data out. In the case of a citation or GenBank sequence (data in), you are linking external resources back to an occurrence record, whereas with a distribution model (data out) you are accumulating data into a package to push out to produce a model. However, in the envisaged system both scenarios should be equally supported, as the entire transactional system will be completely transparent and will allow for grouping objects together for a particular function through, for instance, a DOI. It would be great to hear others’ views on this.


There is a great discussion happening on a new thread that has implications for extending, enriching, and integrating data: https://discourse.gbif.org/t/structure-and-responsibilities-of-a-digextspecimen/2533/3. With diagrams too!

This is a comment that was made on the summary that I think belongs here: Summaries - 2. Extending, enriching and integrating data - #2 by dorsa


@dorsa Yes, I agree that the general principle of ensuring that links are maintained is an important part of any system, but I’m not sure that GBIF should be the mediator of this. Ideally, the system would be able to detect broken links and report them as part of its general infrastructure.


Thanks, found the new thread. Who else but @Gbif could detect broken links? But then the question is to whom to report: the data provider in the first place, of course, but imagine a data provider with problems. We probably need some sort of clearing house for data sustainability and integrity.

@dorsa Yes, exactly. We need some sort of independent broker that can mediate all of the records and links in the system. A blockchain-style, transaction-based system would, from what I understand, automatically provide such a broker as part of the system. That way, all actors in the data pipeline have a role to play in following the rules and providing the necessary linkages between items.
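As a rough sketch of what automated broken-link detection and reporting could look like, the snippet below checks each outbound link on a record and groups failures by publisher, so each data provider (or a clearing house) could be notified. This is a simplified stand-in for real infrastructure; the record fields and publisher grouping are invented for illustration, and the checker is injectable so it can be tested without a network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def is_alive(url, timeout=10):
    """Return True if the URL answers a HEAD request without an error."""
    try:
        req = Request(url, method="HEAD")
        with urlopen(req, timeout=timeout):
            return True
    except (HTTPError, URLError, ValueError):
        return False

def report_broken_links(records, check=is_alive):
    """Scan specimen records for dead outbound links.

    `records` is a list of dicts with 'publisher', 'occurrenceID', and
    'links' fields (an invented shape for this sketch). Failures are
    grouped by publisher so each provider can receive one report.
    """
    broken = {}
    for rec in records:
        for url in rec.get("links", []):
            if not check(url):
                broken.setdefault(rec["publisher"], []).append(
                    {"occurrenceID": rec["occurrenceID"], "url": url}
                )
    return broken
```

In a real system the scan would run periodically over the whole registry, and the report per publisher is exactly the kind of output a clearing house for data sustainability could act on.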


I would like to bring up the topic of integrating specimens with observational and other types of data. Much of the discussion here has centered, for obvious and good reasons, on extensions at the primary level (e.g., specimen metadata, including enrichments, and images, including CT scans). These can enrich the value of the specimen immensely, as has been nicely illustrated. But data extensions at the secondary level can as well, although they bring challenges because they are not always directly linked to a specific specimen. Consider a herpetologist on a field collecting trip. She might make an audio recording of a calling male frog (deposited in a media collection), then collect the frog itself (the specimen), take a tissue sample (to a frozen tissue collection), and collect ectoparasites from the animal (sent to the appropriate invertebrate collection). These are all samples that add value to the specimen itself and should be appropriately linked to it via whatever mechanism. But she might also take photos of the habitat or compile lists of other species encountered but not collected (observational data), which could be linked not just to that one specimen but to all that were collected on the same date at the same place. She might also record many other calling males that were not collected, and take tissue and parasite samples from frogs that were not collected. These should all go to appropriate repositories but be linked back to the specimens that were collected on that date and at that place, as they all add value to each other. Hence the need to extend data associated with specimens to that secondary level, and also the need to connect observational data with collections/specimen data. To my thinking, the thing that unites these data/specimens is the collecting event itself: all were collected at the same time and place. What I don’t have a good grasp on (because it is well outside my expertise) is what technological tools might help here.
Would love to hear thoughts on both the conceptual issue and the approaches/solutions.
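One conceptual starting point: Darwin Core already has an eventID term for exactly this kind of shared collecting-event hook. Below is a minimal sketch of grouping heterogeneous records (specimen, audio, tissue, observation) by such an event identifier so that everything gathered at the same time and place becomes mutually discoverable. The identifier values and record shapes are made up for illustration.

```python
from collections import defaultdict

def group_by_event(records):
    """Group specimen, media, tissue, and observation records by a
    shared collecting-event identifier (cf. Darwin Core's eventID),
    so that siblings from one time-and-place are linked together."""
    events = defaultdict(list)
    for rec in records:
        events[rec["eventID"]].append(rec)
    return dict(events)

# Illustrative records from one hypothetical collecting event.
records = [
    {"eventID": "evt-siteA-2021-06-14", "type": "PreservedSpecimen",
     "id": "frog-001"},
    {"eventID": "evt-siteA-2021-06-14", "type": "Sound",
     "id": "call-017"},
    {"eventID": "evt-siteA-2021-06-14", "type": "MaterialSample",
     "id": "tissue-042"},
    {"eventID": "evt-siteA-2021-06-14", "type": "HumanObservation",
     "id": "obs-113"},
]
```

With this grouping, the uncollected calling males (HumanObservation records) sit alongside the voucher specimen and its tissue sample, even though they live in different repositories, because the event is the shared anchor rather than the specimen.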