Annotating specimens and other data

Moderators: Joe Miller, Gill Nelson, James Macklin, and Rich Rabeler

Summaries - 3. Annotating specimens and other data

Background

Annotations are a way to convey information about a resource or about associations between resources (see PLOS paper). Common uses of annotations are to bring the scientific names of specimens up to date with current classification and nomenclatural concepts, to dispute or correct the identification of a specimen, and to comment on or correct locality, georeference, or other specimen information. Scientists and curators want to annotate specimens with the latest opinions and determinations, and they want to see what has been annotated in the past along with the status of those annotations. Collection managers want to be able to review such annotations and optionally accept them back into their systems as updates to information already there.
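
To make this concrete, here is a minimal sketch of what a structured digital annotation could look like, using the W3C Web Annotation data model as one candidate standard. The creator, timestamp, and comment text below are hypothetical placeholders; the target is a resolvable specimen record identifier:

```python
import json

# A minimal digital annotation sketched in the W3C Web Annotation data
# model (https://www.w3.org/TR/annotation-model/), serialized as JSON-LD.
# The creator, timestamp, and comment text are hypothetical placeholders.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "commenting",  # a human comment on the record
    "creator": "https://orcid.org/0000-0002-1825-0097",
    "created": "2021-04-01T12:00:00Z",
    "target": "https://arctos.database.museum/guid/UAM:Fish:1704",
    "body": {
        "type": "TextualBody",
        "value": "Determination appears out of date under current classification.",
    },
}

print(json.dumps(annotation, indent=2))
```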

Classic annotation, including taxonomic identifications, phenotypic observations, type status, etc., involves associating slips of paper with the physical specimen and is the key documentation path (provenance) by which knowledge moves through generations of researchers. With the advent of high-resolution images, however, this has quickly become less practical and unscalable. The problem is that the invaluable information in these annotations is often neither digitized nor available for researchers to discover without visiting a collection or taking a loan from it.

Annotation relies on round-tripping data, which we currently cannot do at scale. We can easily publish data to aggregators, making it publicly accessible to the larger community, but when users work with the data and have updates or suggestions, there is no simple way to send that information back to the collections, especially at scale. Many suggested annotations are currently sent to aggregators (iDigBio, ALA, GBIF), but the aggregators hold only a cached copy of the provider’s data, so they can neither make changes at the source nor easily share the annotations with the provider. This is a particular problem in botany, where duplicates of the same collection may be deposited in many institutions: how does one send the same information to many institutions?
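
To sketch the duplicates problem concretely: everything below is hypothetical (no shared duplicate index or provider inbox exists today); it only illustrates the fan-out step such a system would need:

```python
# Hypothetical fan-out sketch: one annotation on a botanical specimen must
# reach every institution holding a duplicate of the same gathering.
# Both helpers are stubs, not an existing service.

def find_duplicate_holders(collector: str, number: str) -> list[str]:
    # Stub: a real system would query a shared index of duplicates,
    # e.g. keyed on collector name plus collector number.
    return ["NY", "MO", "US"]

def deliver(institution_code: str, annotation: dict) -> None:
    # Stub: delivery could be a push API, a pull queue, or an email digest.
    print(f"queued for {institution_code}: {annotation['body']}")

def fan_out(annotation: dict, collector: str, number: str) -> None:
    for code in find_duplicate_holders(collector, number):
        deliver(code, annotation)

fan_out({"body": "georeference corrected"}, "Smith", "1234")
```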

For the purposes of this consultation we will address annotation in two phases. First, we would like to develop strong use cases for a digital annotation system. Second, we would like to hear about implementation mechanisms, ranging from centralized to distributed.

This thread differs from the Extending, enriching and integrating data category because it is not about linking new data types; rather, it focuses on enhancing existing data by correcting errors, refining or enhancing what is already recorded, and adding data that was not previously recorded or known. The addition and integration of data types such as traits, DNA sequences, and phylogenies will therefore be considered in the Extending, enriching and integrating data category, while this category will focus on the annotation of information after links are made and data is available…

Annotation is the implementation of a desire either to fill in information that doesn’t exist or to correct or enhance existing information.

Extension is making available the fields of communication for the needs and wants that will develop: the places that will eventually be annotated.


Information resources

Questions to promote discussion

Use cases - first phase

  • What is the value of a digital annotation?
  • “Round-tripping” challenges: Is it necessary, worthwhile, possible to “round-trip” data back to the owner/provider?
  • Social challenges: Exposing annotation histories (dirty laundry; privacy concerns); annotation “wars” (disagreement over a subject).
  • Text vs objects: Should we annotate images and other media as well?

Implementation - second phase

  • Is a global annotation store necessary or are there other models such as regional or local?
  • Technical challenges of implementation (i.e. distributed vs. centralized); the challenge of “pushing” data back to the provider in a form that they can easily assess and digest (CMSs are not built to do this).
  • What standards and provenance are required to implement an annotation network? (A minimal sketch of the provenance fields involved follows this list.)
  • Scaling issues: what is most important to track (determinations, georeferences…)?
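
As a starting point for the standards-and-provenance question above, here is a minimal sketch of the fields an annotation store might need to track. The field names are assumptions, loosely modeled on the W3C Web Annotation vocabulary, not a settled standard:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Minimal provenance sketch for an annotation network. Field names are
# assumptions loosely modeled on the W3C Web Annotation vocabulary.
@dataclass
class AnnotationRecord:
    target: str                    # resolvable identifier of the annotated record
    assertion: str                 # the suggested correction, comment, or new fact
    creator: str                   # ORCID for a human, agent name for a machine
    generator: str | None = None   # software that produced a machine assertion
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    status: str = "proposed"       # proposed -> accepted | rejected (with reason)
```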

As taxonomies change with new hypotheses for various taxonomic groups, (a) how do you propose consistently applying those taxonomies across multiple institutions’ data, and (b) how do you enforce consistency? An example: there are still some entomological collections that use the old orders Isoptera and Homoptera, perhaps because staff have not had time to update the collection. How do you “enforce” that their data is annotated when it is not being done internally? Will we have an annotation “police”?

2 Likes

I believe it is; otherwise, collections may continue to publish incorrect or outdated information.

Arctos has a built-in annotation system and could accept annotations from other sources, but this would require integration with GBIF, iDigBio, and others. That would be very useful for Arctos, but it would not solve the problem for the broader community. However, we would be happy to discuss how our system works, along with its shortcomings and benefits.

For an example of an annotated record see https://arctos.database.museum/guid/UAM:Fish:1704

2 Likes

The community is already annotating images and media files, and that should be integrated into the vision of “digital annotation” for extended specimens.

4 Likes

If we are building and implementing annotation systems, then yes, they need to support comment / observation / assertion (human or machine) about any piece of data (media or text). +1 @dcblackburn

Agree. Also, many “images” are actually text-based documents, like field notes and preparation catalogs, that provide information about a specimen that may or may not be included in the catalog record. Tagging could open up this information.

See https://arctos.database.museum/document/1969-ah-harris-catalog for an example of a field catalog tagged to specimens as well as collection events.

One issue I see: which media gets annotated? The media in the example above can easily be reused, but the tagging probably won’t propagate to the reuse, and the original will probably never be associated with the reuse unless there is some sort of proper citation. So there are definitely both technical and social barriers to overcome.

1 Like

I don’t see annotations as enforceable; however, those collections may find their data changed upon ingestion at any aggregator. I view an annotation as a report that data may be incorrect or out of date, signaling the need for review by the source provider.

However, as you point out, a lack of resources often means that annotations go unreviewed or uninvestigated. This is a community-wide resource issue that needs addressing on a global scale.

1 Like

@jegelewicz Do you have any Arctos material showing how the annotations work? Maybe a short screencast on how an Arctos user gets the annotation and decides what to do with it; that would be helpful to me. I made a video showing GBIF ingestion; it is on the landing page, or here: GBIF and the Converging Digital and Extended Specimens Concepts on Vimeo

Thanks for the post.

Excellent question! Fears of annotations pouring into the inbox of a collection manager have always been in the back of the minds of some CMs - I have actually heard that used as a reason not to post one’s data! Something to at least ponder going forward.

3 Likes

This is an intriguing part of the DiSSCo model. Annotations will occur at the European level in DiSSCo; if a collection wants those annotations, it needs to get them back from DiSSCo. The authoritative copy moves from the collection to a regional infrastructure. How do collections feel about that? Should it be done globally?

1 Like

Well, that is one thing we haven’t documented very well - and now I have a new to-do.

Here is a brief summary:
Someone creates an annotation in Arctos using the “Comment or report bad data” button that appears at the top of every catalog record. Once a month, annotations are sent via email to the collection’s data quality contact, but they may also be accessed at any time by those with appropriate access under the Reports tab.

Our challenge has consistently been that collections personnel do not have time to address the annotations - many of which are generated by scripts within Arctos.

1 Like

Welcome @kmenard2! A few issues to unpack here.

First, we need to distinguish between a local collection “finding” something in its own holdings using taxonomic names, and the “finding” of something in an aggregated pile of digital objects coming from many places, all of which may use different taxonomic views (not necessarily outdated, just different).

Second, what I’m hoping we see are CMSs that let you, the collection manager, link to the taxonomy of your choice (e.g. choose an identifier for a taxon name from the Catalog of Life and use that, instead of having to maintain your own taxonomic tables). But even if you do want to maintain your own tables locally …
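
(As a hedged sketch of that linking, using GBIF’s public species-match API as one concrete stand-in; a Catalog of Life lookup would be analogous. The point is storing a stable identifier rather than maintaining local name tables:)

```python
import requests

# Sketch: the CMS delegates taxonomy to an external authority instead of
# maintaining local taxon tables. GBIF's species-match API stands in here;
# a Catalog of Life lookup would be analogous.
def resolve_name(scientific_name: str) -> dict:
    resp = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": scientific_name},
        timeout=10,
    )
    resp.raise_for_status()
    match = resp.json()
    # Persist the stable identifier, not just the name string.
    return {k: match.get(k) for k in ("usageKey", "scientificName", "status")}

# e.g. see how the backbone currently places an outdated order name:
print(resolve_name("Isoptera"))
```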

Third, the aggregator point of view needs to support multiple taxonomic viewpoints/opinions and the ability to find digital objects by whatever name was applied. To me, it’s not about consistency; it’s about understanding that “finding” is somewhat different in the local context compared to the global aggregated access point. (Side note, and personal opinion: I think it’s vital, in any global system, to support finding objects by many possible “names” and to show the differences of opinion/understanding, so that the public sees a more realistic presentation of how science happens.)

Fourth, I’m using the broad concept of annotation (as in machine or human making an assertion, on anything – not just taxonomic identifications) and I note that you raise a salient point about “expectations.”

  • I don’t believe anyone is suggesting you “must” do anything or accept any annotations (no police).
  • Yes, staffing (and automated systems) to evaluate annotations is a known issue. In fact, it’s quite easy to generate lots of machine annotations (e.g. assertions that lat/lon values are flipped, that coordinates don’t fall in the country provided, that something is misspelled, or that the named collector wasn’t alive when the specimen was collected). So easy, in fact, that collection managers already report being overwhelmed and understaffed for addressing these (see SPNHC 2019 talk by G. Tocci). A minimal sketch of one such check follows this list.
  • Add to this that humans can also make assertions (e.g. “unusual leaf margins for this species”, “this might be a range extension”, “draw a polygon around part of an image to show an anatomical feature you’ve never seen before”, in addition to “I think this is species Y not X”). Note these are somewhat different from the machine annotations.
  • Keep in mind these annotations above, by humans, also contribute to future machine learning.
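
Here is the kind of trivially cheap machine check I mean (a sketch only; the input fields follow Darwin Core, but the output shape is my assumption):

```python
# Sketch of one cheap machine-annotation generator: flag records whose
# latitude and longitude look transposed. Input fields follow Darwin Core;
# the output shape is an assumption, not a standard.
def check_flipped_coordinates(record: dict) -> dict | None:
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is None or lon is None:
        return None
    # A "latitude" outside [-90, 90] paired with a plausible longitude is a
    # classic sign the two values were swapped at data entry.
    if abs(lat) > 90 and abs(lon) <= 90:
        return {
            "target": record.get("occurrenceID"),
            "assertion": "decimalLatitude and decimalLongitude appear swapped",
            "generator": "coord-sanity-check v0.1",  # a machine, not a person
        }
    return None

print(check_flipped_coordinates(
    {"occurrenceID": "urn:example:1", "decimalLatitude": 147.3, "decimalLongitude": -42.9}
))
```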

Annotations will add value, but also the human burden of evaluating them and deciding which (if any) to accept. Also, some CMSs will support ingesting these assertions; others will not.

Then the question becomes: where is the definitive digital specimen object record? And how do you discover whether that record has associated data linked to it that you don’t have in your local CMS, but that might be useful to you or to downstream users? In other words, where will we need to go to discover all information about a given specimen?

Whew! Congrats if you made it through all that – as you can see – I’ve a bit of passion for annotations. :blush:

5 Likes

I would still say that the authoritative copy resides with the provider and until they review and accept or reject (with explanation) an annotation, it is just a suggestion.

1 Like

The dream is very close for some, but there always remains the issue of differing methods for how classifications and name metadata are stored and used in any given system. Translating from one to another and keeping up with constant changes is a technical challenge.

1 Like

YES, but this is also a tall order, and it needs a solution that everyone can use, or searches will always miss “2nd instars” that are labeled “protonymph” or “second instar” instead. See the current discussion at Code Table Request - new age class categories for ectoparasites · Issue #3432 · ArctosDB/arctos · GitHub as we try to make sure this doesn’t happen in just a small sample of collections.
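
(A toy sketch of the term-mapping involved; whether these labels are truly equivalent for a given taxon is exactly what the linked issue is working out:)

```python
# Toy sketch: fold variant age-class labels onto one controlled term so a
# search for one label doesn't miss records filed under another. The
# equivalences below are illustrative only, not settled vocabulary.
AGE_CLASS_MAP = {
    "2nd instar": "second instar",
    "second instar": "second instar",
    "protonymph": "second instar",
}

def normalize_age_class(term: str) -> str:
    return AGE_CLASS_MAP.get(term.strip().lower(), f"unmapped: {term}")

print(normalize_age_class("Protonymph"))  # -> "second instar"
```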

1 Like

Hi Teresa,

I"m guessing many will share your take on the “authoritative copy” but I do wonder. For example, in a recent (just finishing up) grant, we have enhanced already published specimen records for Rhinolophid and Hipposiderid bats. We’ll be sharing these enhanced data with each collection. It’s already very likely that some will not be able to do anything with these data (e.g. limits of their CMS, or staffing, or permissions, or …). We’ve georeferenced, updated the taxonomy, likely extended some known ranges for some species, and gotten over 500 people into Bionomia so that these folks all have QID or ORCID. This information will be in Zenodo.

We’d love to see it ingested into local CMSs and the datasets republished. But alas, that may not happen in many cases. So it raises the question … what to do? How do we ensure this effort benefits local to global :question: and doesn’t get lost?

@jegelewicz - Thanks for the description. That sounds similar to what is available in a Symbiota portal. One can leave a comment on a record, and these can be accessed by CMs with appropriate permissions. The “piece” that you have that is not automatic in a Symbiota portal (as far as I know…) is having the portal send those annotations to a CM; if one checks for comments, one sees them, but if one doesn’t, they might sit for quite a while. Direct contact is still best.

1 Like

  • Advocate for more resources at the source.
  • Provide assistance with ingestion of annotations.
  • Annotate at the source.
  • Instead of “ingesting” source data, talk to it.

Each of these has challenges, but I don’t think that leaving the primary source data messy or out of date makes sense. People will still be getting information from it, and having multiple versions of the “same” information floating around in the world is only going to cause confusion. Never mind that corrections at one aggregator may not be the same as corrections at another.

1 Like

True, but the majority of the notification emails are ignored because there just isn’t time. See Annotation are not being used · Issue #1679 · ArctosDB/arctos · GitHub

Hi Deb:

Interesting situation that takes the concept one step further than most CMs might consider. You are right - the ideal would be to get the data back to the providers, but are the providers even acknowledged in the publication? I see the attribution problem rearing its ugly head as well…

2 Likes