Analyzing/mining specimen data for novel applications

I would like to reply to some of the issues raised in the summary (sorry, I haven’t yet read the entire thread) with regard to annotating, accessing, and analyzing images. To me the answer lies in specialization: images and other media (videos, audio recordings) should go to a collection specialized in handling those types of data. A media-specific collection has the mission and focus to develop search and analysis tools (including machine learning), annotation tools, etc. A media collection also has the ability to constantly migrate the media forward to new technologies as they emerge, ensuring that the digital files will always be readable (how many of you have old computer files or digital videos that you can no longer open?). The expertise and infrastructure needed to handle media are very different from those needed to handle physical specimens. By the same token, specialized collections that handle different types of specimens/data are pretty standard (e.g., botanical versus vertebrate versus invertebrate versus paleontological). Other specialized collections can be envisioned or already exist (e.g., to handle scans). By specializing collections but linking them together, some of the research challenges brought up can be addressed, at least in part. Of course, the issue then becomes one of making sure that the specimens/data are properly connected across collections.


…and this scaling-up idea then touches on the application of AI/ML for trait extraction and analysis, so that we could move from measuring single species to whole-of-community responses across continents.


It’s a great point, Joe. The specimen as the taxonomic point of truth is what makes all the other associated data that we “extend” it with (genomes, images, sounds, traits, etc.) so valuable for whatever end use people make of it. So we do need to make sure that the fundamental taxonomic work is recognised, valued, attributed, and supported into the future.


Definitely, and substitute ‘another statistical skill’ or ‘any other specialized skill’ and I think this still holds true. It’s a lot to ask of researchers and data users to be able to do it all.

This speaks to appropriate workforce training as well. If a biologist can identify that their work would benefit from modeling and that they can collaborate with someone with those skills (or substitute other disciplines and skills), that’s a win in my book. Familiarity with tools and resources and how to access them (including enlisting collaborative work with others) is a big part of contemporary research.


This is a nice complement to @AYoung’s comment above re: specialization. Just as we can’t expect data users to be experts in everything from research to statistics, etc., maybe our expectations of collections are similarly steep, especially as we implement aspects of ES/DS.
@MikeWebster What would this look like? Would, for example, Macaulay Library store, maintain and provide access to wildlife media for all collections of this type? Perhaps a collection/institution might even prefer to provide some funding to be able to do this as opposed to all that would go into maintaining the media themselves (for reasons you list).

With the DS/ES and a PID for the image object, it no longer matters where the data lives: an image (the data part of the object) can be hosted by the collection-holding institution, but also by a specialized institution, a national infrastructure, or an international service. Different versions of the image can even be hosted in different places, from an original TIFF archived on tape by a national infrastructure and available on email request, to a derived JPG available from multiple locations.
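A minimal sketch of the idea that one PID can front several hosted versions of the same media object. All identifiers, field names, and host labels here are invented for illustration; they are not from any real DS/ES implementation.

```python
# Hypothetical media object: one PID, several hosted copies in different roles.
MEDIA_OBJECT = {
    "pid": "https://hdl.example.org/20.5000/abc123",   # invented handle
    "versions": [
        {"role": "archival-master", "format": "image/tiff",
         "access": "on-request", "host": "national-infrastructure"},
        {"role": "derived", "format": "image/jpeg",
         "access": "open", "host": "collection-institution"},
        {"role": "derived", "format": "image/jpeg",
         "access": "open", "host": "international-service"},
    ],
}

def open_access_copies(obj):
    """Return the directly downloadable versions of a media object."""
    return [v for v in obj["versions"] if v["access"] == "open"]

print(len(open_access_copies(MEDIA_OBJECT)))  # 2 (the two open JPG copies)
```

The point of the sketch: a resolver would pick a version by role/access, while the PID stays stable regardless of which institution hosts which copy.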


Yes, something along those lines. We are doing this now, at a certain level: audio recordings archived at ML are associated with physical specimens at other institutions, with reciprocal links in each other’s databases. Here is one example: The challenge comes from this being a very manual process: catalog numbers need to be exchanged and entered by hand, which often doesn’t happen. We have MANY recordings in ML with “specimen collected” in the database, but no corresponding specimen ID or even institution designator, leading to a lot of detective work. Improved collecting workflows and/or an automated process would help enormously. By the same token, the manual approach requires lots of person-power, and currently we don’t have the staff to do this at scale. But the concept is there and can be improved.
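One way the manual reciprocal-link step could be partly automated is by matching media records to specimen records on a normalized institution-code-plus-catalog-number key. The field names, catalog numbers, and record shapes below are hypothetical, purely to illustrate the matching idea.

```python
import re

def normalize(catalog):
    """Collapse 'KU 12345', 'ku-12345', 'KU12345' into one canonical key."""
    return re.sub(r"[\s\-_.]", "", catalog).upper()

# Hypothetical specimen catalog numbers held by other institutions.
specimens = {normalize(c): c for c in ["KU 12345", "FMNH-98765"]}

# Hypothetical media records; one has a usable specimen reference, one does not.
recordings = [
    {"ml_id": 101, "specimen_ref": "ku12345"},
    {"ml_id": 102, "specimen_ref": None},   # the "detective work" case
]

links = [(r["ml_id"], specimens[normalize(r["specimen_ref"])])
         for r in recordings
         if r["specimen_ref"] and normalize(r["specimen_ref"]) in specimens]
print(links)  # [(101, 'KU 12345')]
```

Real matching would need institution designators, collector names, and dates as well, but even this normalization step removes a large class of hand-entry mismatches.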


I have been part of a team that used images of Late Carboniferous fusulinids for AI/CNN studies. These included images from iDigBio and our local collection. It is a severely under-utilized resource and one that I feel institutions need to promote. The vouchered collections we hold are the authoritative specimens for those taxa and should be the ones used as AI/CNN identification references. The largest pitfalls I ran into using current images were the persistent orientations of specimens in the images used, and convincing publishers that the Creative Commons (CC) license was sufficient for this use. We also found that very few images of the targeted taxa were available on iDigBio or GBIF (at the time). As we transition from digital images to micro-CT scans of similar objects, virtually no scans can be found outside of those we have generated from our local collections. Scanning mitigates the orientation issues while increasing compute time.
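A common way to reduce the orientation problem in CNN training data (one option, not necessarily what this team did) is to augment each specimen image with its rotated and mirrored variants. A pure-NumPy sketch on a toy array; a real pipeline would do this inside the training data loader.

```python
import numpy as np

def orientation_augment(image):
    """Yield the 8 rotation/flip variants of an H x W (x C) image array."""
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degrees
        rotated = np.rot90(image, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # mirrored copy of each rotation
    return variants

img = np.arange(12).reshape(3, 4)            # toy 3x4 "image"
aug = orientation_augment(img)
print(len(aug))        # 8 variants per input image
print(aug[2].shape)    # (4, 3): the 90-degree rotation swaps height and width
```

Training on all eight variants makes the network far less sensitive to the orientation the specimen happened to be photographed in.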


What comes to mind for me is the functionality that eBird has developed.


I think the report does a fine job of addressing traditional economic justifications for collections. While monetizing things is always the goal here, there are also economists who talk about the importance of intangible assets. I’ve come to see our collections, large and small, as incredible intangible assets too. I would love to see our community work with economists to develop this kind of valuation of biodiversity collections and biodiversity science.


So, thinking of novel applications: as we continue to link specimens to images (common) and also add genetic data layers (less common but definitely increasing), a range of potentially interesting and very useful things becomes possible. For instance, if we “extend” our specimens with large(ish) SNP datasets, it might be possible to reconstruct kinship estimates and infer specimen pedigrees across time and space… This would effectively turn the world’s collections into a giant quantitative genetics experiment, with thousands of observations of individuals of known (estimated) relationship across a wide range of environments (GxE!). Why do this? Well, if this were true, collections could be used to explore the genetic control (heritability and possibly even genetic architecture) of key traits and their response to environmental selection, which could then be incorporated into simulations to give more sophisticated predictions of species responses.
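To make the kinship idea concrete, here is a minimal sketch of one standard way to estimate relatedness from SNP data: a VanRaden-style genomic relationship matrix (GRM). This is a common method choice, not anything prescribed in the thread, and the toy genotypes are invented; real use would need thousands of loci, QC, and missing-data handling.

```python
import numpy as np

def grm(genotypes):
    """VanRaden GRM from an (n_individuals, n_loci) array of allele counts 0/1/2."""
    p = genotypes.mean(axis=0) / 2.0        # estimated allele frequencies
    z = genotypes - 2.0 * p                 # genotypes centered by 2p
    denom = 2.0 * np.sum(p * (1.0 - p))     # expected variance scaling
    return z @ z.T / denom

g = np.array([[0, 1, 2, 1],
              [0, 1, 2, 1],                 # identical genotype to individual 0
              [2, 1, 0, 1]])
k = grm(g)
print(np.allclose(k, k.T))                  # True: a GRM is symmetric
print(k[0, 1] > k[0, 2])                    # True: identical pair more related
```

Off-diagonal entries of the GRM estimate pairwise relatedness; feeding those estimates into pedigree reconstruction across dated, georeferenced specimens is what would turn the collections into the GxE “experiment” described above.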


I think a key statement in the report is that collections should measure impact using metrics that are relevant to their host institutions. This is excellent advice, I think. Which is not to say that there aren’t, as John suggests, a range of both tangible and intangible benefits to the public, industry, and government sectors; it’s just that different collections “live” in different worlds in terms of what they are expected to deliver and have different “core purposes”. For example, as primarily research collections at CSIRO (with no “front of house” activity), we have a very strong role in undertaking both fundamental and applied science to support areas like biosecurity, environmental monitoring, and biodiversity management, but much more limited roles in STEM education and public outreach, which we generally deliver through collaboration with other institutions focused on those areas. This means that most of our performance indicators focus on applied research outcomes, and we generally look for new applications of collections in these areas, as well as in how new technology can change how we operate in terms of collections management: digital loans, AI-assisted curation, born-digital field observations, etc.


Have a look at the attribution thread. Would be great to get your additional thoughts on what metrics using extended digital specimens would be relevant for CSIRO.


Combining collections experience, a focus on population genomics, questions of statistical quality and reliability for biodiversity conservation, and experience in a biodiversity field that is a major economic sector (forestry), my contributions build on the perspective that @AYoung opened up (Analyzing/mining specimen data for novel applications - #37 by AYoung).

We need such GxE investigations for basically everything, from questions connected to adaptation, management, breeding, and interpreting monitoring results, to achieving the SDGs (food security, livelihoods, equality, creating safe environmental envelopes in which inhabitants can prosper).

One important prerequisite for conducting GxE approaches or genetic monitoring of biodiversity, and for being able to interpret their results for global (i.e., distribution-range-wide) conclusions, is reference datasets. My guess is that this is the only/main reason why, early on, such extensive efforts were raised and sustained to assemble, every few years, a more finely resolved version of a standardized, high-quality, interoperable reference dataset of global human population diversity. Applied medical research needs them.

It is worthwhile to consider building such reference datasets for key ecological and economic taxa as a foundational biodiversity data resource, with its own infrastructure and services.

Core human reference datasets are anchored by physical collections. Compare the cultured cell lines kept at the Fondation Jean Dausset (CEPH) in Paris for the Human Genome Diversity Panel (HGDP-CEPH), or the positively humongous UK Biobank collection and its derived infrastructure.

When we talk about novel applications for collections, I believe we should be thinking of the UK Biobank and similar concepts around the world. In forestry (and fisheries, agriculture, …), humanity certainly needs such resources and infrastructures.

I am somebody who gets into flow when it comes to advanced, mind-boggling statistical approaches for population genetics and phylogenomics and their quality evaluations. At the same time, my experience has been that the ability to successfully conduct such investigations critically depends on the existence and quality of the collection of the original physical specimens.

When I read about another methodological breakthrough in “big data” population genomics (hey, ancestral recombination graph!), I think: scientific collection, herbarium, museum.

To be specific: I am not so interested in reconstructing another pan-genome for an established model organism. My interest is in biodiversity, and in getting non-model organisms there.

Here, reference datasets or their equivalents might be built over decades and/or simultaneously by a widely dispersed and heterogeneous group of researchers, professionals, engaged citizens, government officials, business owners/employees, etc.: all the people and circumstances that have been contributing to and building natural history collections.

Among all the exciting applications of reference datasets, the one that I am most interested in is genetic forensics combating environmental and biodiversity crime.

First, it is the most challenging application, with the most extensive quality requirements. If you can do forensics, the quality (reliability, power, versatility) is there to do everything else. Thus, collections providing integral parts of forensic workflows build recognition and societal value, in addition to the contributions and values that are already recognized for and associated with collections.

Second, there is (potentially) money in this for collections. The biodiversity part of environmental crime (illegal wildlife and timber trade, illegal fisheries) is estimated to be a global sector with up to 200 billion USD in profits annually (UNEP and Interpol report, 2016). If only a tiny fraction of this is invested in crime fighting, and collections are recognized as providing indispensable services, we are potentially still speaking about serious money. The reality of governmental funding for crime fighting might say otherwise; still, for collections it can be a source of self-generated financial income.

@jbates606 mentioned the intangible value of collections (Analyzing/mining specimen data for novel applications - #36 by jbates606). Intangible values are also highlighted in the Dasgupta review (Final Report - The Economics of Biodiversity: The Dasgupta Review - GOV.UK). Nevertheless, providing hands-on, tangible services to society and to important economic sectors can give collections and their work visibility and societal standing, and improve their finances.

Add to this certification: the need of companies and the wish of consumers to validate supply chains, all depending on high-quality reference data with a suitable and sufficient taxonomic and geographic scope. At the within-species level, the currently best and most exciting example is the World Forest ID project of Kew Botanical Gardens, FSC, Agroisolab, the US Forest Service, and WRI. Above the species level, the barcoding infrastructures are important for certification and forensics.

What forensic and certification applications add to a reference-data concept is, first, a focus on often sensitive and at-risk data, requiring access and privacy management, consideration of proprietary information, and IT security.

Second, these samples and datasets require a full chain of custody: from collection in the field, through storage and management and lab analyses (genomics, isotope analyses), to reproducible statistical analyses and standardized reporting (WorkEnviron_20201128.pdf (2.4 MB)).
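A chain of custody can be made tamper-evident by hash-chaining each event to the previous one. A stdlib-only sketch of that idea; a real forensic workflow would add digital signatures, trusted timestamps, and per-event metadata, none of which are shown here.

```python
import hashlib
import json

def add_event(chain, event):
    """Append an event whose hash covers the previous event's hash."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    chain.append({"event": event, "prev": prev,
                  "hash": hashlib.sha256(payload.encode()).hexdigest()})

def verify(chain):
    """Recompute every hash; any edited event breaks the chain."""
    prev = "0" * 64
    for link in chain:
        payload = json.dumps({"prev": prev, "event": link["event"]},
                             sort_keys=True)
        if hashlib.sha256(payload.encode()).hexdigest() != link["hash"]:
            return False
        prev = link["hash"]
    return True

chain = []
for step in ["field collection", "storage", "DNA extraction", "report"]:
    add_event(chain, step)
print(verify(chain))            # True: custody record is intact
chain[1]["event"] = "tampered"
print(verify(chain))            # False: the edit is detectable
```

The point is that each custody step, from field to report, becomes verifiable after the fact, which is exactly the property forensic and certification uses demand.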

Maybe not explicitly, but all of these requirements are part of the DS/ES specimen concept and FAIR data workflows, or they result from their implementation.


By generating demand for the data and knowledge produced by these basic efforts. When collections, collection institutions, and the collections community provide indispensable resources and services to society, their work becomes essential, whether that of the taxonomic expert, the collections specialist, the assistant, or the manager.

This sounds very much like inter- and transdisciplinary cooperation. I hadn’t connected collections with transdisciplinary work, but your thoughts and descriptions now remind me of the i2insights blog, which I find a great resource for information and inspiration.



Part of the answer might be modular and versatile software, with UIs and UX that can be easily modified and geared towards the specific needs and preferences of projects and users.

I am thinking of the software architectures of R and Nextcloud. Both provide a general-purpose default environment that can be modified (replacement of default modules) and extended (additional modules) with user-chosen modules.
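The replace-or-extend pattern behind R packages and Nextcloud apps can be sketched as a tiny plugin registry. Everything here (the `Platform` class, the `input-form` module name) is invented for illustration.

```python
class Platform:
    """Hypothetical host application with swappable, named modules."""

    def __init__(self):
        # Ship with a do-nothing default module, as the general-purpose base.
        self.modules = {"input-form": lambda record: record}

    def register(self, name, module):
        """Add a new module, or replace the default one of the same name."""
        self.modules[name] = module

    def run(self, name, data):
        return self.modules[name](data)

app = Platform()
print(app.run("input-form", {"obs": "raw"}))   # {'obs': 'raw'}: default module

# A project swaps in its own module without touching the platform core:
app.register("input-form", lambda record: {**record, "validated": True})
print(app.run("input-form", {"obs": "raw"}))   # {'obs': 'raw', 'validated': True}
```

The design choice that matters is that projects customize by registering modules, never by forking the platform, which is what keeps one familiar environment reusable across projects.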

For example, a new user in a citizen science project might start out with a highly simplified interface that allows them to quickly and intuitively add reports, field observations, etc. to the project’s dataset. This can result in high-quality, standardized data, even though the new user doesn’t (yet) fully understand the project, the input format, etc.

Still, over time a digital non-native might grow in confidence about what they are doing; they might then want to explore the digital environment more, have more options, be more in charge, and even start their own projects. Thus, the software also needs to allow different levels of power-user functionality. As an example, consider Inkscape: I only use the graphical interface, but all its functionality can be accessed via the command line, too.

At the level of project leaders, allow them all the freedom to design their own input interfaces and data structures, but provide guidance, e.g. via templates and context-dependent information, so that it is easy for them to make decisions that keep their project adhering to standards and fulfilling minimum data requirements.
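The "guidance via templates" step could be as simple as checking a project's records against a minimum-field set. The required fields below are an invented, loosely Darwin Core-flavoured example, not a real standard's definitive list.

```python
# Hypothetical minimum data requirement for an observation record.
MINIMUM_FIELDS = {"scientificName", "eventDate", "decimalLatitude",
                  "decimalLongitude", "recordedBy"}

def missing_fields(record):
    """Return the minimum fields a record still lacks (absent or empty)."""
    present = {key for key, value in record.items() if value}
    return sorted(MINIMUM_FIELDS - present)

record = {"scientificName": "Quercus robur", "eventDate": "2021-05-04",
          "decimalLatitude": 51.5, "decimalLongitude": -0.1,
          "recordedBy": ""}
print(missing_fields(record))   # ['recordedBy']
```

In a form designer, a check like this could run while the project leader builds a template, flagging custom forms that cannot produce standards-compatible records before any data are collected.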

If users and projects can reuse the same, already familiar platform for different purposes, e.g. new projects, by designing their own input forms and workflows, then users, whether newbies, power users, or project leaders, will come back, interact with the project, enter data, etc. There is no “oh god, another program that I need to learn first” moment.

Plus, if the software allows, they can see their data entries in the context of the existing dataset and of additional datasets (closely related ones, or those of personal interest). They immediately see the progress their additions bring to the project, and might thus recognize gaps that they can easily close.

The same considerations for architecture and functionality on the data-provider side apply on the data-user side, for data discovery, use, and export/interoperability.

