Analyzing/mining specimen data for novel applications

One “fruitful” approach could be to select recent “novel use” publication(s) (e.g. I’m thinking of the growing number of large-scale plant phenological ones, like @libby’s awesome red maple herbarium study in the US) and brainstorm, perhaps with the original study authors, on what would have made the study more efficient or better in some way. I’m thinking of phenological scoring tools that are built into herbarium data portals, as one wish that could ideally be enacted in the ES/DS framework. It could also serve as a case study in linking the data back to providers or linking to phenological databases or similar, thereby providing attribution to collections/collectors, supporting data transparency, and enabling reuse in other studies.


This is a great idea @jmheberling. Relatedly, we could think about scaling up studies – taking a study that was only possible on a small scale (due to the manual nature of data gathering/analysis) and seeing what it would take to ‘globalize’ it, ES/DS style.
The phenology annotations in that maple study only exist in the .csv file from that research and haven’t been ‘round-tripped’ back to any of those collections. :frowning:
Some of the pubs in this compilation (and there are no doubt dozens more – share others that you know of!) could be perfect for this: Biodiversity Crisis Response and Conservation Resources - SPNHC Wiki
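
On the round-tripping point above, here is a minimal sketch of what a first step could look like: splitting a study's annotation file into one batch per providing collection, so each batch could be sent back. The column names, institution codes, and catalog numbers are all hypothetical, not taken from the maple study, and actually delivering the batches would still need a mechanism that does not exist yet.

```python
import csv
import io
from collections import defaultdict

# Hypothetical study file: every column name and value here is made up.
STUDY_CSV = """institution_code,catalog_number,phenophase,scored_by
HERB_A,A-000123,flowering,study team
HERB_A,A-000456,leaf-out,study team
HERB_B,B-998877,flowering,study team
"""


def group_annotations_by_collection(csv_text):
    """Split study annotations into one batch per providing collection."""
    batches = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        batches[row["institution_code"]].append(
            {
                "catalog_number": row["catalog_number"],
                "phenophase": row["phenophase"],
                "scored_by": row["scored_by"],
            }
        )
    return dict(batches)


batches = group_annotations_by_collection(STUDY_CSV)
for code, annotations in sorted(batches.items()):
    print(code, len(annotations), "annotation(s) to return")
```

Keeping `scored_by` on every row is one simple way to carry attribution along with the annotation itself.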


The “return” to collections is interesting and important (and probably belongs less in this topic than in the Annotation topic thread), but it is an extra step for researchers, and its importance may be a difficult sell. More to the point, though, there is no mechanism to even do it right now anyway!! Hence this whole concept I guess :slight_smile: It is unlikely that every collection has the capacity to start including fine-scale phenological data, but I think a more realistic approach would be external databases that link to specimen records.

But to bring my blathering to a close…I think retroactive linking on recent publications like yours could be a good pilot approach to reveal what is really needed for specific use cases from the perspectives of study authors, other researchers, and collection managers.

I reckon there are parallels here with the experience in the world of biological simulation modelling, which is occupied (perhaps the word is co-inhabited?) by biologists and modellers. This is also a field that needs good mastery of at least two domains: biology and simulation modelling. Kinda like our challenge, where we need skills and experience in both taxonomy/collections management and digital data acquisition/storage/integration and delivery. In the simulation world you have some biologists that have learned to model and some modellers that have learned some biology…and both approaches can work and have certainly had successes. But some of the very best work (I would argue most) has been generated by teams that have had domain specialists for each skill set and then some boundary-spanning folks that like to get excited about both.

Based on that, I think maybe this is a profitable way forward for the transformation of our collections into truly digitally-enabled research infrastructure. If that’s true, then an interesting question becomes: “How do we get digital professionals excited about our emerging digital world?”. What exciting problems can we give them to chew on, and to demonstrate the utility of their exciting new tools… Because in a resource-limited environment, we need to convince them to spend their time, effort and indeed $$ in our world. So what’s the question, challenge, or problem that will attract them?

Which absolutely isn’t to say that we shouldn’t be working hard to provide everyone we can with better digital skills, and certainly new graduates are more and more digitally literate. But I agree it’s not reasonable or possible to be an expert at everything. So another question, I guess, is: what are the best new skill sets for current collections staff to focus on to make them most able to engage with digital professionals, and what is the base set of info we need to impart to our digital colleagues so they can work effectively in our world?


I would like to reply to some of the issues raised in the summary (sorry, I haven’t yet gone through to read the entire thread) with regard to annotating, accessing, and analyzing images. To me the answer lies in specialization: images and other media (videos, audio recordings) should go to a collection specialized for handling those types of data (e.g., A media-specific collection has the mission and focus to develop search and analysis tools (including machine learning), annotation tools, etc. A media collection also has the ability to constantly migrate the media forward to new technologies as they emerge, ensuring that the digital files will always be readable (how many of you have old computer or digital video files that you can no longer open?). The expertise and infrastructure needed to handle media is very different from that needed to handle physical specimens. By the same token, specialized collections to handle different types of specimens/data are pretty standard (e.g., botanical versus vertebrate versus invert versus paleontological, etc). Other specialized collections can be envisioned or already exist (e.g., to handle scans). By specializing collections but linking them together, some of the research challenges brought up can be addressed, at least in part. Of course, then the issue becomes one of making sure that the specimens/data are properly connected across collections.

…and this scaling-up idea then touches on the application of AI/ML for trait extraction and analysis, so we could move from measuring single species to whole-of-community responses across continents.

It’s a great point Joe. The specimen as the taxonomic point of truth is what makes all the other associated data that we “extend” it with (genomes, images, sounds, traits etc…) so valuable for whatever end use people make of it. So we do need to make sure that the fundamental taxonomic work is recognised, valued, attributed, and supported into the future.


Definitely, and substitute ‘another statistical skill’ or ‘any other specialized skill’ and I think this still holds true. It’s a lot to ask of researchers and data users to be able to do it all.

This speaks to appropriate workforce training as well. If a biologist can identify that their work would benefit from modeling and that they can collaborate with someone with those skills (or substitute other disciplines and skills), that’s a win in my book. Familiarity with tools and resources and how to access them (including enlisting collaborative work with others) is a big part of contemporary research.


This is a nice complement to @AYoung’s comment above re: specialization. Just as we can’t expect data users to be experts in everything from research to statistics, etc., maybe our expectations of collections are similarly steep, especially as we implement aspects of ES/DS.
@MikeWebster What would this look like? Would, for example, Macaulay Library store, maintain and provide access to wildlife media for all collections of this type? Perhaps a collection/institution might even prefer to provide some funding to be able to do this as opposed to all that would go into maintaining the media themselves (for reasons you list).

With the DS/ES and a PID for the image object, it no longer matters where the data is, so an image (the data part of the object) can be hosted by the collection-holding institution but also by a specialized institution, a national infrastructure, or an international service. Different versions of the image can even be hosted from different places: from an original TIF archived by a national infrastructure on tape, available by email request, to a derived JPG image available from multiple locations.
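
As a rough illustration of the idea above, here is a minimal sketch of one PID pointing at several representations held in different places. The PID string, host names, and field names are all hypothetical and do not follow any particular DS/ES specification; a real resolver would query a PID service rather than an in-memory dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    media_type: str  # e.g. "image/tiff" or "image/jpeg"
    host: str        # who serves this copy (hypothetical host names below)
    access: str      # "online" or "on-request"


# Hypothetical registry: one image object, three copies in three places.
PID_REGISTRY = {
    "pid:example/image-0001": [
        Representation("image/tiff", "national-archive.example", "on-request"),
        Representation("image/jpeg", "collection.example", "online"),
        Representation("image/jpeg", "aggregator.example", "online"),
    ],
}


def resolve(pid, media_type=None, online_only=False):
    """Return the representations of a PID that match the requested filters."""
    matches = PID_REGISTRY.get(pid, [])
    if media_type is not None:
        matches = [r for r in matches if r.media_type == media_type]
    if online_only:
        matches = [r for r in matches if r.access == "online"]
    return matches


for rep in resolve("pid:example/image-0001", media_type="image/jpeg", online_only=True):
    print(rep.host)
```

The user only ever cites the PID; which host ends up serving the bytes is a detail that can change over time.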

Yes, something along those lines. We are doing this now, at a certain level: audio recordings archived at ML are associated with physical specimens at other institutions, with reciprocal links in each other’s databases. Here is one example: The challenges come from this being a very manual process: we need to exchange the catalog numbers and enter them by hand, which often doesn’t happen. We have MANY recordings in ML with “specimen collected” in the database, but no corresponding specimen ID or even institution designator, leading to the need for a lot of detective work. An improved collecting workflow and/or automated process would help enormously. By the same token, the manual approach requires lots of person-power, and currently we don’t have the staff to do this at scale. But the concept is there and can be improved.
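
To make the "detective work" a bit more concrete, here is a minimal sketch of how part of it might be automated: recordings flagged as having a specimen collected, but with no specimen ID, are matched against specimen records on collector and date. All records and field names are made up for illustration and do not reflect the actual ML database; the output is a list of candidates for a person to review, not confirmed links.

```python
from collections import defaultdict

# Hypothetical records: identifiers, names, and dates are invented.
recordings = [
    {"recording_id": "REC-1", "collector": "A. Example", "date": "2019-05-02",
     "specimen_collected": True, "specimen_id": None},
    {"recording_id": "REC-2", "collector": "B. Sample", "date": "2020-06-10",
     "specimen_collected": True, "specimen_id": "MUS-77"},
    {"recording_id": "REC-3", "collector": "A. Example", "date": "2019-05-03",
     "specimen_collected": False, "specimen_id": None},
]
specimens = [
    {"specimen_id": "MUS-41", "collector": "A. Example", "date": "2019-05-02"},
    {"specimen_id": "MUS-77", "collector": "B. Sample", "date": "2020-06-10"},
]


def propose_links(recordings, specimens):
    """Suggest candidate specimen IDs for recordings that lack one."""
    by_key = defaultdict(list)
    for s in specimens:
        by_key[(s["collector"], s["date"])].append(s["specimen_id"])

    proposals = {}
    for r in recordings:
        # Only recordings flagged "specimen collected" with no ID need work.
        if r["specimen_collected"] and not r["specimen_id"]:
            key = (r["collector"], r["date"])
            proposals[r["recording_id"]] = by_key.get(key, [])
    return proposals


proposals = propose_links(recordings, specimens)
print(proposals)
```

An empty candidate list is still useful, since it flags the recordings that need a human to chase down the specimen.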


I have been a part of a team that used images of Late Carboniferous fusulinids for AI/CNN studies. These included images from iDigBio and our local collection. It is a severely under-utilized resource and one that I feel institutions need to promote. The vouchered collections we hold are the authoritative specimens for those taxa and should be the ones used for AI/CNN identification references. The largest pitfalls I ran into using current images were persistent orientations of specimens in the images used, and convincing publishers that copyright (CC) was sufficient for this use. We also found very few images of the targeted taxa were available on iDigBio or GBIF (at the time). As we transition over from digital images to micro-CT scans of similar objects, virtually no scans can be found outside of those we have generated from our local collections. Using scans mitigates orientation issues while increasing compute time.


What comes to mind for me is the functionality that eBird has developed.

I think the report does a fine job of addressing traditional economic justifications for collections. While monetizing things is always the goal here, there are also economists who talk about the importance of intangible assets. I’ve come to see our collections, large and small, as incredible intangible assets too. I would love to see our community work with economists to develop this valuation of biodiversity collections and biodiversity science too.


So thinking of novel applications, as we continue to link specimens to images (common) and also add genetic data layers (less common but definitely increasing), a range of potentially interesting and very useful things become possible. For instance, if we “extend” our specimens with large(ish) SNP datasets it might be possible to reconstruct kinship estimates and infer specimen pedigrees across time and space… This would effectively turn the world’s collections into a giant quantitative genetics experiment with thousands of observations of individuals of known (estimated) relationship across a wide range of environments (GxE!). Why do this? Well, if this were true, collections could be used to explore genetic control (heritability and possibly even genetic architecture) of key traits and their response to environmental selection, which can then be incorporated into simulations, giving more sophisticated predictions of species responses.
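
For anyone who wants a feel for the starting point of such relatedness estimates, here is a minimal sketch of pairwise identity-by-state (IBS) sharing between two specimens. It assumes genotypes are coded as alternate-allele counts (0, 1, 2) with `None` for missing calls; the function name and example genotypes are hypothetical. IBS is only a crude similarity measure: proper kinship estimators also account for population allele frequencies, so treat this as an illustration rather than a method recommendation.

```python
def ibs_similarity(genotypes_a, genotypes_b):
    """Mean identity-by-state sharing (0 to 1) over SNPs called in both samples.

    Genotypes are alternate-allele counts (0, 1, or 2); None marks a missing call.
    """
    shared = [
        (a, b)
        for a, b in zip(genotypes_a, genotypes_b)
        if a is not None and b is not None
    ]
    if not shared:
        raise ValueError("no SNPs genotyped in both samples")
    # Two samples share 2, 1, or 0 alleles by state at each SNP.
    alleles_shared = sum(2 - abs(a - b) for a, b in shared)
    return alleles_shared / (2 * len(shared))


# Made-up genotypes for two specimens; the last SNP is missing in the first.
specimen_1 = [0, 1, 2, 2, None]
specimen_2 = [0, 1, 2, 0, 1]
print(ibs_similarity(specimen_1, specimen_2))
```

Skipping SNPs with a missing call in either sample matters here, since patchy genotype calls are to be expected when sampling older specimens.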

I think a key statement in the report is that collections should measure impact using metrics that are relevant to their hosting institutions. This is excellent advice I think. Which is not to say that there aren’t, as John suggests, a range of both tangible and intangible benefits to the public, industry and govt sectors; it’s just that different collections “live” in different worlds in terms of what they are expected to deliver and have different “core purposes”. For example, as primarily research collections at CSIRO (with no “front of house” activity), we have a very strong role in undertaking both fundamental and applied science to support areas like biosecurity, environmental monitoring and biodiversity management, but we have much more limited roles in STEM education and public outreach, which we generally deliver through collaboration with other institutions that are focused on those areas. This means that most of our performance indicators focus on these applied research outcomes and we generally look for our new applications of collections in these areas, as well as in how new technology can change how we operate in terms of collections management, such as digital loans, AI-assisted curation and born-digital data for field observations etc…

Have a look at the attribution thread. Would be great to get your additional thoughts on what metrics using extended digital specimens would be relevant for CSIRO.

Combining collections experience, a focus on population genomics, questions of statistical quality and reliability for biodiversity conservation and experience in a biodiversity field that is a major economic sector (forestry), my contributions build on the perspective that @AYoung opened up (Analyzing/mining specimen data for novel applications - #37 by AYoung).

We need such GxE investigations for basically everything, from questions connected to adaptation, management, breeding, and interpreting monitoring results to achieving the SDGs (food security, livelihoods, equality, creating safe environmental envelopes in which inhabitants can prosper).

One important prerequisite for conducting GxE approaches or genetic monitoring of biodiversity and for being able to interpret their results for global (=distribution range-wide) conclusions are reference datasets. My guess is that this is the only/main reason why early on such extensive efforts were raised and sustained to assemble every few years a more finely resolved version of a standardized, high quality, interoperable reference dataset of global human population diversity. Applied medical research needs them.

It’s worthwhile to consider building such reference datasets for key ecological and economic taxa as a foundational biodiversity data resource and its infrastructure and services.

Core human reference datasets are anchored by physical collections. Compare the cultured cell lines kept at the Foundation Jean Dausset (CEPH) in Paris for the Human Genome Diversity Panel (HGDP-CEPH) or the positively humongous UK Biobank collection and derived infrastructure.

When we talk about novel applications for collections, I believe we should be thinking UK Biobank and similar concepts around the world. In forestry (and fisheries, agriculture, …), humanity certainly needs such resources and infrastructures.

I am somebody who gets into flow when it comes to advanced, mindboggling statistical approaches for population-genetics and phylogenomics and their quality evaluations. At the same time, my experience has also been that the ability to successfully conduct such investigations critically depends on the existence and quality of the collection of the original physical specimens.

When I read about another methodological break-through in “big data” population-genomics (hey, ancestral recombination graph!), I think scientific collection, herbarium, museum.

To specify: I am not so interested in reconstructing another pan-genome for an established model organism. My interest is in biodiversity and in non-model organisms getting there.

Here, reference datasets or equivalents might be built over decades and/or simultaneously by a widely dispersed and heterogeneous group of researchers, professionals, engaged citizens, governmental officials, business owners/employees, etc. All those people and circumstances that have been contributing to and building natural history collections.

Among all the exciting applications of reference datasets, the one that I am most interested in is genetic forensics combating environmental and biodiversity crime.

First, it is the most challenging with the most extensive quality requirements. If you can do forensics, the quality - reliability, power, versatility - is there to do everything else. Thus, collections providing integral parts of forensic workflows builds recognition and societal value, in addition to the contributions and values that are already recognized for and associated with collections.

Second, there is (potentially) money in this for collections. The biodiversity part (illegal wildlife and timber trade, illegal fisheries) of environmental crime is estimated to be a global sector with up to 200 billion USD profits annually (UNEP and Interpol report 2016). If only a tiny fraction of this is invested in crime fighting, and collections are recognized as providing indispensable services, we are potentially still speaking about serious money. The reality of governmental funding for crime fighting might say otherwise; still, for collections it can be a source of self-generated financial income.

@jbates606 mentioned the intangible value of collections (Analyzing/mining specimen data for novel applications - #36 by jbates606). Intangible values are also highlighted in the Dasgupta review (Final Report - The Economics of Biodiversity: The Dasgupta Review - GOV.UK). Nevertheless, providing hands-on, tangible services to society and important economic sectors can give visibility and societal standing to collections and their work, and improve finances.

Add to this certification: the need for companies and the wish of consumers to validate supply chains - all depending on high-quality reference data with a suitable and sufficient taxonomic and geographical scope. At the within-species level, the currently best and most exciting example is the World Forest ID project of Kew Botanical Gardens, FSC, Agroisolab, the US Forest Service and WRI ( Above the species level, the barcoding infrastructures are of importance for certification and forensics.

What forensic and certification applications add to a reference data concept are, first, a focus on often sensitive and at risk data requiring a focus on access and privacy management, considerations of proprietary information and IT security.

Second, these samples and datasets require a full chain of custody, from the collection in the field, to storage and management, lab analyses/genomics/isotope analyses to reproducible statistical analyses and standardized reporting (WorkEnviron_20201128.pdf (2.4 MB) ).

Maybe not explicitly, though all of these requirements are part of the DS/ES specimen concept and FAIR data workflows or they result from their implementation.
