Analyzing/mining specimen data for novel applications

Moderators: Libby Elwood, Anna Monfils, John Bates, and Andrew Young



We are only just beginning to appreciate the role of natural history collections data in addressing today’s global wicked problems. The full range of data providers, curators, and end users has yet to be realized, and the data and analysis tools needed to actualize the potential of this emerging resource are in their infancy. Regional efforts will benefit from global partnership, yet a fully actualized global database remains a standing challenge for the natural science collections community.

By their very nature, specimen databases lend themselves to integration for interoperable functions. This is the crux of the extended and digital specimen concepts. Analytical tools that can combine and exploit both novel and traditional types of data are enabling new research on topics ranging from phenology, genotype-phenotype interactions, host-parasite interactions, epidemiology, broad comparative phylogenetic analyses, and human health applications to linkages between the biological and physical worlds across different time scales.

The complexities of creatively serving these data present an area of active advancement. However, data quality is critically tied to the resources and expertise available to institutions for cataloging specimens and for managing and maintaining databases. Education of the next generation of collections workers needs to encompass both traditional collections work and training to help build and manage these ever-evolving data assets.

This thread seeks to build further from known uses of collections data and to ask:

  • What are the emerging areas of research, management, policy, industry and education that are drawing on novel uses of collections data?
  • What approaches are needed to better serve data in a consistent manner for novel users and end uses?
  • How can we provide the training necessary for data gatherers, providers, curators and end users?
  • How can we better engage and support data providers and users from under-represented parts of the globe and improve worldwide access to data and training?
  • What tools are necessary to support novel applications?

This category is for those interested in discussing innovations in specimen data use and discovery, data mining, data analysis, impediments to broad-scale open use, data management, best practices for diversity, equity, inclusion, and justice, tools for data users, training across all levels, and expanding the community of data providers and users.

This category specifically deals with end users, with existing and potential needs and systems to sustain usage, and with planning for innovation and novel data usage.

This category is about realizing the full potential of the extended/digital specimen concepts by opening the innovative use of biodiversity data to the broadest community of users, for addressing a wide diversity of research questions.

Information resources

Additional relevant resources can be found here: Biodiversity Crisis Response and Conservation Resources - SPNHC Wiki

Questions to promote discussion

  • What commonalities exist among research projects that have successfully used collections data?
    • In what ways do existing databases, data portals, and tools support broad use of data?
    • How are current methods of outreach successfully engaging new communities?
  • How do the extended/digital specimen (ES/DS) concepts help to address making specimen data widely usable for novel applications? What are the low hanging fruits of the ES/DS, in terms of novel applications, that we can work on now? And what are the biggest obstacles to adoption of the ES/DS concepts?
  • How do we engage end-users in a virtuous cycle of improvement to fill gaps (e.g., taxonomic, spatial, temporal, institutional), add data layers (especially derived ones) and build linkages between the existing data?
  • What assumptions are we making when designing databases and data standards for a global community? How do we thoughtfully promote, enhance and implement improvements from new users?
  • How do we avoid losing sight of basic efforts (e.g., taxonomic research) that have to underpin using these data for novel applications?

For me the biggest untapped resource is the images we are creating in vast numbers. A number of projects have shown the potential of machine learning to extract information from specimen images, but these have only been conducted at a small scale.

The real power and novelty will come from being able to scan across all images for all collections. This would allow us to…

  • Find labels of a single type wherever they are
  • Find signatures of a single individual
  • Find printed stamps on any specimen
  • Search for a novel species using an image search (e.g. butterfly wing patterns)
  • Extract phenological data en masse
  • Validate all sorts of data, from accession numbers to identifications

The prevailing paradigm in image processing has been to create dedicated workflows that output data into a digital specimen. While this is useful, I am suggesting that we also have to turn this around: make all images accessible as a single corpus so that anyone can come along and do whatever processing they want across the whole of it.

This is, however, not how we currently store our images. Either you have to put all the images in the same place where the processing will occur, or you have to create a distributed system where processing occurs in parallel in different places across the whole corpus.


While updating the GBIF slide deck yesterday, I had occasion to note that the GBIF network is now sharing 41 million records with specimen images (and nearly 76 million in total).


Exactly, and this is what got me thinking: we need an avenue not just from the data to the image, but from the image to the data.


This also points to a need to better capture annotations to these images to build up the labels associated with particular media files. This could include presence of specific traits, measurements, etc.
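As a minimal sketch of what such annotation capture could look like (the record shape here is an assumption for illustration, not an existing standard), expert annotations accumulate against a media file identifier, and their bodies become the labels associated with that file:

```python
from collections import defaultdict

# Hypothetical annotation store keyed by media file identifier.
annotations = defaultdict(list)

def annotate(media_id, kind, body, by):
    """Attach one expert annotation (trait presence, measurement, etc.) to a media file."""
    annotations[media_id].append({"kind": kind, "body": body, "by": by})

annotate("media-123", "trait", "flowers present", "expert-a")
annotate("media-123", "measurement", {"wing_length_mm": 42.0}, "expert-b")

# The accumulated annotation bodies are the labels associated with this media file,
# ready to feed machine-learning training downstream.
labels = [a["body"] for a in annotations["media-123"]]
```

A real system would of course need persistent identifiers and provenance for each annotation, which is exactly what the annotation thread discusses.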


:smiling_face_with_three_hearts: reminds me that one of the very first @iDigBio working group workshops was DROID (Developing Robust Object to Image to Data Workflows).


:100: These will be key to capturing expert observations about what’s represented in an image, and will (as you already know @dcblackburn) feed machine learning so that scientists can reap the benefits of the growing pile of images / media relevant to their research needs.


One assumption is that all collections staff are concerned about their data and/or have appropriate technology skills.

These are generally the same people who, on any given day, are skinning a bird, checking a dermestid colony, performing integrated pest management, hunting down specimens for a loan or chasing a borrower to get specimens back, and supervising volunteers, interns, and, if they are lucky, an actual paid assistant. Perhaps they learned Python or R while in school, but perhaps those languages weren’t even in use when they were in school. Yet we seem to expect that collections staff can manage data in the ways often discussed at the level of aggregators, and that they all understand the technology and terms used by the biodiversity data community.

This feeds into:

How do we avoid losing sight of the other important tasks that collections staff do? Maintenance of the physical specimens (which underpin the data) should be part of the puzzle.




Yes, yes, and to make the link directly to the annotation thread, where “annotations” can include a variety of information: Annotating specimens and other data


Such a critical infrastructure, leadership, and professional development question! See my reference to this (and related issues) in Annotating specimens and other data - #29 by Debbie

This ties nicely to a thought that I had a while back. In an average collection that is now at least partially digitized, the collection manager faces curating two collections: the physical specimens and the digital records about those specimens. In most collections, that now includes adding specimens (both physical and digital) and trying to keep both collections updated, and in sync. In some cases, especially larger collections, one now faces the challenge of digitizing major portions of the collection while keeping track of any incoming annotations. Incoming staff will likely need more data manipulation skills; I know that what I knew how to do a number of years back is now virtually useless.


There is also the opportunity to extend specimens with images from collection events/localities and to link these through our digital databases. This is another place where data being “born digital” will be critical going forward. In other words, it is much easier for the collector to make the digital connections than for someone else to try to do it after the fact.


I think it’s important to keep in mind here that ‘single corpus’ doesn’t mean having all the images in one place. That’s expensive and has all kinds of governance and jurisdictional implications.

Images can be physically distributed but should exist as a single logical body. A single access point, or several access points with a common interface, allows operations to be performed across the corpus irrespective of the location of individual images. This means putting appliances in place at the local level, alongside the image stores, where operations can be invoked and executed from a remote source.

Keeping the images in different places also has cost implications if they’re to be processed together.

I think that indeed the first key step is standardized image access. Optimal allocation of where the storage sits and where the computing happens is a very difficult question to answer in a generic way: some use cases will function more optimally with centralization, while others can be applied locally.
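A minimal sketch of that “single logical body” idea (all class and field names here are hypothetical, not an existing API): images stay in their own stores, but a single corpus object with a common interface lets a caller run an operation across all of them regardless of location.

```python
from dataclasses import dataclass, field

@dataclass
class ImageRecord:
    identifier: str
    store: str                                  # where the image physically lives
    labels: dict = field(default_factory=dict)  # annotations extracted so far

class ImageStore:
    """One physical image store exposing its holdings through a common interface."""
    def __init__(self, name, records):
        self.name = name
        self._records = records

    def images(self):
        yield from self._records

class LogicalCorpus:
    """A single access point over many stores; the images never move."""
    def __init__(self, stores):
        self.stores = stores

    def map(self, fn):
        # In a real distributed system, fn would be shipped to run
        # alongside each store rather than pulling images to one place.
        for store in self.stores:
            for record in store.images():
                yield fn(record)

# Usage: find every image flagged as bearing a printed stamp, across all stores.
a = ImageStore("herbarium-a", [ImageRecord("A1", "herbarium-a", {"has_stamp": True})])
b = ImageStore("museum-b", [ImageRecord("B1", "museum-b", {"has_stamp": False})])
corpus = LogicalCorpus([a, b])
stamped = [r.identifier for r in corpus.map(lambda r: r) if r.labels.get("has_stamp")]
```

The design point is that callers see one corpus; whether `map` runs centrally or is pushed out to local appliances is an implementation decision hidden behind the interface.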


How do the extended/digital specimen (ES/DS) concepts help to address making specimen data widely usable for novel applications?

If we are trying to build a system of very interconnected data, then I would like to see some mechanism to search across ES/DS data using graph-style queries, i.e., emphasising and properly utilising the links between records as much as the attributes on the records themselves.


How to engage new communities / “Low-hanging fruit” for ES/DS system

We have a considerable amount of labelled data and motivated experts, and we could make progress by better packaging these up for use by computer science researchers, who are motivated to work on such problems but often have difficulty accessing data and expert users.
This could also help with the discussion point on how to engage end users in a virtuous cycle of improvement: by directing (human) data enhancement toward attributes that could be extracted to build training datasets for later application of machine learning techniques.
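A small sketch of that packaging step (record and attribute names are illustrative assumptions): identify records still missing the target attribute, so human enhancement can be directed there, and export the already-labelled remainder as training pairs.

```python
# Hypothetical enhanced specimen-media records; "flowering" is an example
# target attribute, not an existing standard term.
records = [
    {"media": "img-1.jpg", "annotations": {"flowering": True}},
    {"media": "img-2.jpg", "annotations": {"flowering": False}},
    {"media": "img-3.jpg", "annotations": {}},  # not yet enhanced by a human
]

# Direct human data enhancement to the gaps...
needs_enhancement = [r["media"] for r in records
                     if "flowering" not in r["annotations"]]

# ...and package the labelled records as (image, label) training pairs.
training_pairs = [(r["media"], r["annotations"]["flowering"])
                  for r in records if "flowering" in r["annotations"]]
```

Each round of machine-learning output that experts then verify feeds back as new labels, which is the virtuous cycle in miniature.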


I’m intrigued… what would graph style queries look like? Is anyone currently doing this?

It might be worth mentioning that the next generation of the GBIF API will be GraphQL. This is already used in the pilot GBIF Hosted Portals, e.g. Data - Legume Data Portal.
I know you are probably thinking more expressively than this, but it might play a part.


Example GraphQL query

So: search for the first 10 occurrences that have recordedById filled in. For each of those, try to expand the identifier to get the person’s name, image, and birth date.

Also do a breakdown of the most frequent datasetKeys. For each dataset, expand the dataset title, and for each of those datasets, tell me who the publisher is, their name, and which nodes endorsed them.

This is less powerful than SPARQL, but incredibly convenient when building UIs and exploring the APIs in general. I only get back the data I need per resource, and I can traverse foreign keys easily and expand those resources (going from datasets => publishers => nodes, etc.)
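The query described in that post might be sketched roughly as below. All field and argument names here are illustrative guesses rather than the actual GBIF GraphQL schema; the point is the shape of the query, with nested expansion of linked resources.

```graphql
# Illustrative sketch only; real GBIF GraphQL field names may differ.
{
  occurrenceSearch(isNotNull: recordedById, limit: 10) {
    results {
      recordedById
      person {              # expand the identifier to the person
        name
        image
        birthDate
      }
    }
    facet(field: datasetKey) {   # breakdown of most frequent datasets
      count
      dataset {             # expand each dataset key
        title
        publisher {         # ...and its publisher
          name
          endorsingNode { title }
        }
      }
    }
  }
}
```

The client receives exactly the fields requested, which is what makes this style convenient for UIs compared with assembling the same joins from several REST calls.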


The Smithsonian and the U.S. Department of Agriculture led an effort, commissioned by the Interagency Working Group on Scientific Collections (IWGSC), that resulted in this recent Economic Analyses of Federal Scientific Collections: Methods for Documenting Costs and Benefits.

A few of their Findings and Recommendations may be relevant to consider here, in terms of supporting and validating the use of collections in novel applications:
  • The services offered by a collection determine the benefits generated, such as:
    • Preserving and maintaining objects extends their useful life
    • Providing user access and data curation expands the pool of potential users
    • Education and outreach increase awareness, appreciation, and public support
  • Accessioning and preserving compete with other services for resources
  • Agencies have a choice of several methods for estimating and documenting the benefits generated by their collections
    • Approaches to documenting benefits should reflect the agency/collection mission
    • Choice of methods should also consider cost and effort, delays, assumptions, and the preferences of the primary audience of stakeholders
    • How do stakeholders view: surveys vs. program data? Qualitative vs. quantitative evidence? Retrospective vs. prospective impact?

In your experience, how do these findings align with those of non-federal institutions or with institutions outside the US?
(@Debbie - Looks like you had a role in this as well :slight_smile: )
