Analyzing/mining specimen data for novel applications

Exactly, and this is what got me thinking: we need an avenue not just from the data to the image, but from the image to the data.

1 Like

This also points to a need to better capture annotations on these images, to build up the labels associated with particular media files. These could include the presence of specific traits, measurements, etc.
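To make that concrete, here is one purely illustrative sketch (in GraphQL schema language, with every type and field name invented rather than drawn from any existing standard) of how such labels might be structured:

```graphql
# Hypothetical sketch only: none of these names come from an existing standard.
scalar DateTime

type Person {
  name: String
  orcid: String        # an identifier for attribution, if available
}

type MediaAnnotation {
  mediaId: ID!         # the image/media file this label applies to
  trait: String        # e.g. "flowers present" or "left femur visible"
  measurement: Float   # numeric value, when the annotation is a measurement
  unit: String         # e.g. "mm"
  annotator: Person    # the expert making the assertion
  created: DateTime
}
```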

2 Likes

:smiling_face_with_three_hearts: reminds me that one of the very first @iDigBio working group workshops was DROID (Developing Robust Object to Image to Data Workflows).

1 Like

:100: These will be key to capturing expert observations about what’s represented in an image, and will (as you already know @dcblackburn) feed machine learning so that scientists can reap the benefits of the growing pile of images/media relevant to their research needs.

2 Likes

One assumption is that all collections staff are concerned about their data and/or have appropriate technology skills.

These are generally the same people who, on any given day, are skinning a bird, checking a dermestid colony, performing integrated pest management, hunting down specimens for a loan, chasing a borrower to get specimens back, and supervising volunteers, interns, and (if they are lucky) an actual paid assistant. Perhaps they learned Python or R while in school, but perhaps those languages weren’t even in use when they were in school. Yet we do seem to expect that collections staff can manage data in the ways often discussed at the level of aggregators, and that they all understand the technology and terms used by the biodiversity data community.

This feeds into:

How do we avoid losing sight of the other important tasks that collections staff do? Maintenance of the physical specimens (which underpin the data) should be part of the puzzle.

5 Likes

dcblackburn

Yes, yes, and to directly make the link to the annotation thread, where “annotations” can include a variety of information: Annotating specimens and other data

1 Like

Such a critical infrastructure, leadership, and professional development question! See my reference to this (and related issues) in Annotating specimens and other data - #29 by Debbie

This ties nicely to a thought I had a while back. In an average collection that is now at least partially digitized, the collection manager faces curating two collections: the physical specimens and the digital records about those specimens. In most collections, that now includes adding specimens (both physical and digital) and trying to keep both collections updated, and in sync. In some cases, especially larger collections, one now faces the challenge of digitizing major portions of the collection while keeping track of any incoming annotations. Incoming staff will likely need more data-manipulation skills; I know that what I knew how to do a number of years back is now virtually useless.

4 Likes

There is also the opportunity to extend specimens with images from collection events/localities and link these through our digital databases. This is another place where data being “born digital” will be critical going forward. In other words, it is much easier for the collector to make the digital connections than to have someone else try to do it after the fact.

3 Likes

I think it’s important to keep in mind here that ‘single corpus’ doesn’t mean having all the images in one place. That’s expensive and has all kinds of governance and jurisdictional implications.

Images can be physically distributed but should exist as a single logical body. A single access point, or access points with a common interface, allows operations to be performed across the corpus irrespective of the location of individual images. This means putting in place appliances at the local level, alongside image stores, where operations can be invoked and executed from a remote source.
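As a purely hypothetical illustration (all names invented) of what such a common interface might look like, in GraphQL terms:

```graphql
# Sketch: one logical access point over physically distributed image stores.
type Query {
  image(id: ID!): ImageRecord          # same call regardless of where the bytes sit
}

type ImageRecord {
  id: ID!
  accessUri: String                    # resolves to whichever local store holds the file
  hostInstitution: String              # where the image physically lives
  derivative(maxWidth: Int): String    # an operation invoked remotely, executed locally
}
```

The point is that the caller never needs to know which store holds a given image; location becomes just another resolved attribute.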

1 Like

Keeping the images in different places also has cost implications if they’re to be processed together.

I think that indeed the first key step is standardized image access. Optimally allocating where the storage and the computing happen is a very difficult question to answer in a generic way. Some use cases will work better with centralization; others can be applied locally.

2 Likes

How do the extended/digital specimen (ES/DS) concepts help to address making specimen data widely usable for novel applications?

If we are trying to build a system of highly interconnected data, then I would like to see some mechanism to search across ES/DS data using graph-style queries, i.e. emphasising and properly utilising the links between records as much as the attributes on the records themselves.
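For example, here is a purely hypothetical sketch of the kind of link-following query I have in mind (every type and field name is invented):

```graphql
# Hypothetical: start from one digital specimen and walk its links outward.
{
  digitalSpecimen(id: "example-specimen-id") {
    scientificName
    collectedBy {                  # follow the link to a person record...
      name
      otherSpecimens(limit: 5) {   # ...and out again to their other collections
        id
        scientificName
      }
    }
    citedIn {                      # publications linked to this specimen
      doi
      title
    }
  }
}
```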

3 Likes

How to engage new communities / “Low-hanging fruit” for ES/DS system

We have a considerable amount of labelled data and motivated experts, and we could make progress by better packaging these up for use by computer science researchers - who are motivated to work on problems but often suffer from difficulties accessing data and expert users.
This could also help with the discussion point about how to engage end-users in a virtuous cycle of improvement, by directing (human) data enhancement toward attributes that could be extracted to build training datasets for later application of machine-learning techniques.
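As one entirely hypothetical illustration (invented names) of what such packaging could look like, again in GraphQL schema terms:

```graphql
# Sketch: expert labels packaged as training records, with attribution preserved.
type TrainingExample {
  mediaUri: String!    # where to fetch the image
  label: String!       # the expert-supplied class, e.g. "flowering"
  specimenId: ID       # link back to the specimen/annotation, for attribution
  labelledBy: String   # the expert, so credit can flow back
}
```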

2 Likes

I’m intrigued… what would graph style queries look like? Is anyone currently doing this?

It might be worth mentioning that the next generation of the GBIF API will be GraphQL. This is already used in the GBIF Hosted Portals pilot, e.g. Data - Legume Data Portal.
I know you are probably thinking of something more expressive than this, but it might play a part.

5 Likes

Example GraphQL query

So: search for the first 10 occurrences that have recordedById filled in. For each of those, try to expand the identifier to get the person’s name, image, and birthdate.

Also do a breakdown of the most frequent datasetKeys. For each dataset, expand the dataset title; and for each of those datasets, tell me who the publisher is, their name, and which nodes endorsed them.
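In rough GraphQL terms, that query might look something like the sketch below (field and argument names are illustrative; I have not checked them against the actual pilot schema):

```graphql
# Illustrative sketch; the real schema may name things differently.
{
  occurrenceSearch(filter: { recordedById: { isNotNull: true } }, size: 10) {
    results {
      recordedById
      recordedByPerson {        # expand the identifier (e.g. ORCID/Wikidata)
        name
        image
        birthDate
      }
    }
    facet {
      datasetKey(size: 10) {    # breakdown of the most frequent datasetKeys
        key
        count
        dataset {
          title
          publishingOrganization {
            title
            endorsingNode { title }
          }
        }
      }
    }
  }
}
```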

This is less powerful than SPARQL, but incredibly convenient when building UIs and exploring the APIs in general. I only get back the data I need per resource, and I can easily traverse foreign keys and expand those resources (going from datasets => publishers => nodes, etc.).

4 Likes

The Smithsonian and the U.S. Department of Agriculture led an effort, commissioned by the Interagency Working Group on Scientific Collections (IWGSC), that resulted in the recent report Economic Analyses of Federal Scientific Collections: Methods for Documenting Costs and Benefits.

A few of their Findings and Recommendations may be relevant to consider here, in terms of supporting and validating the use of collections in novel applications:
§ The services offered by a collection determine the benefits generated, such as:
− Preserving and maintaining objects extends their useful life
− Providing user access and data curation expands the pool of potential users
− Education and outreach increase awareness, appreciation, and public support
§ Accessioning and preserving compete with other services for resources
§ Agencies have a choice of several methods for estimating and documenting the benefits generated by their collections
− Approaches to documenting benefits should reflect the agency/collection mission
− The choice of methods should also consider cost and effort, delays, assumptions, and the preferences of the primary audience of stakeholders
− How do stakeholders view: surveys vs. program data? Qualitative vs. quantitative evidence? Retrospective vs. prospective impact?

In your experience, how do these findings align with those of non-federal institutions or with institutions outside the US?
(@Debbie - Looks like you had a role in this as well :slight_smile: )

3 Likes

One “fruitful” approach could be to select recent “novel use” publications (e.g., the growing number of large-scale plant phenology studies, like @libby’s awesome red maple herbarium study in the US) and brainstorm, perhaps with the original study authors, on what would have made the study more efficient or better in some way. I’m thinking of phenological scoring tools built into herbarium data portals as one wish that could ideally be enacted in the ES/DS framework. It could also serve as a case study in linking the data back to providers, or linking to phenological databases or similar, thereby providing attribution to collections/collectors, ensuring data transparency, and enabling reuse in other studies.

2 Likes

This is a great idea @jmheberling. Relatedly, we could think about scaling up studies: taking a study that was only possible on a small scale (due to the manual nature of data gathering/analysis) and seeing what it would take to ‘globalize’ it, ES/DS style.
The phenology annotations in that maple study only exist in the .csv file from that research and haven’t been ‘round-tripped’ back to any of those collections. :frowning:
Some of the pubs in this compilation (and there are no doubt dozens more; share others that you know of!) could be perfect for this: Biodiversity Crisis Response and Conservation Resources - SPNHC Wiki

5 Likes

The “return” to collections is interesting and important (and probably not for this topic, but more for the Annotation topic thread), but it is an extra step for researchers whose importance may be difficult to sell. More to the point, though, there is no mechanism to even do it right now anyway!! Hence this whole concept, I guess :slight_smile: It is unlikely that every collection has the capacity to start including fine-scale phenological data, but I think a more realistic approach would be external databases that link to the specimen records.

But to bring my blathering to a close… I think retroactive linking for recent publications like yours could be a good pilot approach, revealing what is really needed for specific use cases from the perspectives of the original study authors, other researchers, and collection managers.

1 Like