Annotating specimens and other data

The Atlas of Living Australia (ALA) has had an annotation system from the beginning. Other Living Atlases might have it as well. I would like it to do much more, but what it does do, which is what most people want it to do, it does very well. It is used quite a lot. I personally do not do anything with the annotations myself, but I see them come through (as I subscribe to alerts on annotations) and I can see what people at the source collections do with them.

I think @dhobern’s description of a possible solution is very inspiring, but my heart sinks when I think about how much time, money and effort it would take. The most pertinent of @JoeMiller’s questions would then be “What is the use case that would convince a funding source?”. My guess is that the answer depends highly on the funding source! Maybe something really broad and vague like “Many data records will be corrected and enriched each month. Better data quality, better science, better decisions.”

I actually think it’s a bit risky for us to try and come up with more specific use cases for annotations. We’re such a specialised group, already heavily invested in this community, that I worry we can’t see the wood for the trees. I think a data annotation service would mainly be used by data users/researchers/policy makers and collection managers, not by us.

In agile development one tries to build the smallest, simplest possible thing as a first iteration, see how it gets used, and then take the next simple step to build and iterate further with the product usage in mind. This saves a lot of time and effort, as it’s often difficult to predict how people will use tech in the real world and how their needs will evolve.

E.g. Iteration 1 for us might be to encourage unstructured annotations at a dataset level, rather than at a record or field level: meaning a simple dataset feedback form, with functionality, storage and public display. This would actually still allow users to provide feedback at record and field granularity, as they could describe in free text (copying and pasting occurrenceIDs/collection numbers, etc.) what addition or change they think is necessary.

As @kcopas says, it should actually be possible to do this already with something like Hypothes.is. There’d need to be some kind of “Flag an issue or provide feedback using Hypothes.is” prompt on a dataset page, and there could be some functionality that uses the Hypothes.is API (https://twitter.com/lemayke/status/1075821724539777026) to provide the data owner with periodic notifications.
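
As a rough illustration, here is a minimal sketch of that notification glue against the public Hypothes.is search API; the dataset URL and the notification step are placeholders, and the actual workflow would still need designing.

```python
import requests

HYPOTHESIS_API = "https://api.hypothes.is/api/search"
# Hypothetical dataset page that carries the feedback prompt:
DATASET_URL = "https://www.gbif.org/dataset/EXAMPLE-UUID"

def fetch_annotations(page_url, since=None):
    """Fetch public Hypothes.is annotations targeting a given page."""
    params = {"uri": page_url, "sort": "created", "order": "asc", "limit": 50}
    if since:
        params["search_after"] = since  # timestamp of the last notification run
    resp = requests.get(HYPOTHESIS_API, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["rows"]

def notify_data_owner(annotations):
    """Placeholder: e-mail or message the dataset contact with new feedback."""
    for ann in annotations:
        print(f"{ann.get('user', 'unknown')}: {ann.get('text', '')[:120]}")

# Run periodically (e.g. from cron) to give curators a digest of new feedback.
notify_data_owner(fetch_annotations(DATASET_URL))
```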

Data curators could then choose whether to incorporate and/or respond to this information. It’d also be possible for data users to see the thread of feedback and make their own decisions based on it. It’s one simple step up from what we currently have (providing contact details so data corrections can be emailed to curators) as it makes the feedback public.

If this functionality gets used a lot and users are, for example, often providing feedback that is more like structured data at a field or record level, then we could invest the development time in building additional forms and functionality to do this properly. If data curators receive hundreds of pieces of feedback each month and are struggling to incorporate suggested changes, then maybe some kind of export functionality to facilitate round tripping would make sense. Or maybe we need to shift more towards a community data curation model. Maybe most feedback will end up being georeferencing corrections, in which case a specialised georeferencing annotation tool may be most useful. But my point is that right now I don’t feel we have a good enough grip on what we will actually need to make a call. Or maybe the LA users do, and they can advise us.

@NielsKlazenga maybe you can also give us an idea of the volume of annotations made among high use vs low use datasets? As @MatDillen says, it would be nice to understand how much this functionality gets used.

I’ve also been wondering whether AI and humans will have the same use cases and requirements for annotations. When it comes to machine annotations, I feel like we kind of have the start of this already with the GBIF data interpretation and the yellow data issue flags. So maybe it would actually make sense to build on this functionality?
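
For a sense of what those machine annotations already look like, here is a small sketch against the public GBIF occurrence API (the dataset key is a placeholder); every interpreted record carries an issues array, and the same endpoint can filter on a specific flag.

```python
import requests

GBIF_API = "https://api.gbif.org/v1/occurrence/search"
DATASET_KEY = "00000000-0000-0000-0000-000000000000"  # placeholder datasetKey

# Each interpreted occurrence carries an "issues" array: GBIF's machine annotations.
page = requests.get(GBIF_API, params={"datasetKey": DATASET_KEY, "limit": 20},
                    timeout=30).json()
for occ in page["results"]:
    if occ.get("issues"):
        print(occ["key"], occ["issues"])

# The API can also filter on one flag, e.g. rounded coordinates:
flagged = requests.get(GBIF_API,
                       params={"datasetKey": DATASET_KEY,
                               "issue": "COORDINATE_ROUNDED", "limit": 0},
                       timeout=30).json()
print("Records flagged COORDINATE_ROUNDED:", flagged["count"])
```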

There are 4,557 records with annotations in the ALA. While this may seem relatively low, we should be mindful that there aren’t bulk annotation tools (e.g. mark all these records on a map as suspicious in location or determination).

Great summary, @rukaya.

On the above point, that’s exactly the thought that @dshorthouse triggered for me when he shared a sneak preview of efforts by @jmacklin et al. to develop a dynamic UI for collectors.

Is there any reason to think that we couldn’t take this approach to link and display the annotation layer(s) alongside the verbatim fields? I have the col. and det. fields at the top of mind, not least because we may be able to use existing PID systems to disambiguate, but we will no doubt discover reasons to link other fields…?

This is a great use case, and there are so many researchers doing this NOW that the need to make annotations efficient, rather than retroactively piecing things together later, is actually quite urgent. Part of the issue is that phenology work is big and not limited to herbaria; herbarium-derived phenology is one small slice of the broader phenological research landscape. Specimen annotations are indeed already in a global phenological database, the Global Plant Phenological Data Portal. I don’t know which herbarium datasets are included there; presumably those from particular papers.

Having not used (or contributed to) the database, I don’t know much about it, but I don’t think the phenological scores are commonly (if ever) linked back to the specimen records in any way, even though they should be linkable.

This is a great use case that @jmacklin suggests: the data are complex but have a well-developed ontology and measurement protocols. It seems undecided, though, whether herbarium databases should hold this information (using Darwin Core extension fields seems clunky) or rather link to initiatives like the Global Plant Phenological Data Portal.

As it is now, it is likely that the same specimens have been scored for phenology multiple times, with the derived data not readily discoverable by the next researcher.

The same goes for other traits. For plants, is TRY a launch point for plant specimen trait annotations? To date, a very small fraction of its traits are herbarium-derived.

Another thought in regards to specimen-derived phenotypic information — should this system be connected to the Open Traits Network or similar initiatives?

This also makes the point that annotations are going to be so different across trait types, let alone taxa!

iNaturalist’s identification process is remarkably similar to the taxonomic annotation that has been happening on physical specimens in museums for centuries. I’m sure many are aware, but for those who aren’t: users can add different taxonomic determinations to uploaded photos. A set of criteria then determines whether an observation is “research grade” (a category which also attracts some disagreement), and it is these observations that get shared with GBIF.

Anyway – keeping all annotations in the digital record is a good idea; users can make their own judgements, and publishers could also declare their “truth” as a recommendation to users?

(sorry if this has been covered in comments above…so much to digest!)

Scientists and curators want to annotate specimens with the latest opinions and determinations, and they want to see what has been annotated in the past and the status of those annotations.

I’m interested in the initiator of the annotation: originally a scientist, though the results of their work may be accessed indirectly, through the published literature record. We will have a growing body of annotations that can be sourced from text-mining the published literature - these were scientist-initiated but may be introduced into the annotation system post-publication (perhaps many years post-publication).

Examples:

  • Digital nomenclators can supply annotations regarding the citation of specimens as types

  • Text mining taxonomic treatments can supply “treated in” annotations (to a published concept) and “grouped with” annotations (to other specimens). There is also a link that can be made between the specimens examined and the traits specified in the taxonomic description.

  • Specimens cited in taxonomic treatments may also include indications of phenological state.
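
To make the examples above concrete, here is a sketch of how such literature-sourced annotations might be serialized; every identifier and relation name is hypothetical.

```python
# Hypothetical literature-derived annotations as simple triples:
# (specimen identifier, relation, target identifier).
literature_annotations = [
    # From a digital nomenclator: the specimen is cited as a type.
    ("https://example.org/specimen/K000123456", "typeOf",
     "urn:lsid:ipni.org:names:000000-1"),
    # From a text-mined treatment: the specimen is treated in a published concept.
    ("https://example.org/specimen/K000123456", "treatedIn",
     "https://doi.org/10.00000/example-treatment"),
    # Cited alongside another specimen in the same treatment.
    ("https://example.org/specimen/K000123456", "groupedWith",
     "https://example.org/specimen/BM000654321"),
]
```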

In extending the web of data beyond the institutional boundary, the ES/DS system would allow the formation of groups of specimens cited in the same published concept, and the comparison of these with later (overlapping) treatments.

As published facts, should these kinds of annotations always be accessible from the specimen, regardless of whether they are accepted by local curators as worthy of altering the local view of the specimen metadata?

Thanks @rukaya - I don’t see what I wrote as a big challenge or as needing much time - it is really a set of small evolutionary steps building on what we already have.

I think that, if enough of us agree it’s what we want and need, most of the core functionality could be working within a year.

GBIF already has the necessary dataset management, specimen data index, citation tracking, …

Some of the collections that publish specimens will need assistance with understanding how to keep their specimen ids stable (basically following best practice for database versioning and record identifiers).

After that, I’d like to see a page that allowed anyone to annotate any record in any of three basic ways: 1) free text comments (as a discussion thread), 2) direct editing of the record fields in a form to propose corrections, 3) providing links to associated information and records.

Then we can work together on designing the social models and logical flows for community decisions on “research grade” views where an alternative view has been proposed, and we can start exploring how automated checking and machine learning could propose annotations that the user communities can either accept by default or review and then confirm or deny.
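
To make those three annotation types concrete, here is a minimal sketch of what a shared data model for them might look like; all names and fields are illustrative, not a proposed standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Annotation:
    """One annotation against a stable record identifier."""
    target_id: str                  # persistent id of the annotated record
    creator: str                    # ORCID or similar agent identifier
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    status: str = "proposed"        # proposed | accepted | rejected

@dataclass
class Comment(Annotation):          # 1) free-text discussion thread
    text: str = ""
    in_reply_to: Optional[str] = None

@dataclass
class FieldCorrection(Annotation):  # 2) proposed edit to one record field
    term: str = ""                  # e.g. a Darwin Core term such as "country"
    original_value: str = ""
    proposed_value: str = ""

@dataclass
class Link(Annotation):             # 3) link to associated information
    relation: str = "relatedTo"     # e.g. "citedIn", "sequencedAs"
    target_url: str = ""

# e.g. proposing a correction to a record's country field:
fix = FieldCorrection(target_id="https://example.org/specimen/ABC123",
                      creator="https://orcid.org/0000-0000-0000-0000",
                      term="country", original_value="Austria",
                      proposed_value="Australia")
```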

@jmheberling - thanks for your comments!

One thing your post brought up is an interesting point that we all need to consider: the breadth of the possible reach of our data. Before reading your post, I had not heard of either the Global Plant Phenological Data Portal or TRY - not surprising since I don’t work in either area. But we do need to be aware of how our data might wind up on those portals. Another one that I can see likely having at least some of our data is GlobalBioticInteractions (GloBI) …

I’m sure I’ve responded to more than 8 annotations for our collection on ALA, but that is all that is currently listed with assertions. Maybe they all just came in around the same time and felt like more. Most of the annotations I received identified errors that we were then able to correct in our CMS.
@NielsKlazenga, when we reindex our data on ALA, do the annotations stay?

That would be fantastic @JoeMiller! And not just because you’re talking Acacia :blush:

We’ve often discussed how great it would be to be able to link the specimen measurement data locked away in appendices of PhD/MSc theses to our specimens. We don’t currently have our CMS set up to handle this (although a table exists in the schema), nor the resources to find and input the data. But if we could easily link to the data like we do to sequence data in GenBank, then others could find it, reuse it for similar or different projects, and accurately acknowledge the data generator and the specimen/collection.

Our portal makes use of the Annosys system. Over the three years of the portal’s existence, we’ve had 91 annotations made this way on our herbarium specimens. This number is small enough for our curators to address them one by one, but as indicated by Arctos users in this discussion, at some level it would pose problems. If we could get similar numbers from as many places as possible, we would have better insight into how annotations can and cannot be processed realistically.

I would also like to widen the scope a bit and talk about crowdsourcing. The results of crowdsourcing tasks such as text transcription or species identification can be considered annotations in their own right. Even if the data model of these crowdsourced annotations interoperates nicely with the data model of our CMS (and it often doesn’t), these annotations still require some sort of validation step before they can be integrated into the ‘authoritative’ copy of the specimen data. This validation can take many forms:

  • No validation. Often with the idea that any data is better than none.
  • Trust of proven volunteers/experts, possibly after an initial ‘trial’ period which was validated manually.
  • Community consensus.
  • Consensus algorithms.
  • Manual validation of each contribution or a representative subset.

These different kinds of validation offer nice parallels for how we could deal with more generic annotations, ones not bound to the specimens we specifically offer for annotation in certain restricted ways on crowdsourcing platforms. Additionally, if we manage to classify annotations into different categories, which are processed (validated) in different ways, we may find some easy-to-implement low-hanging fruit. Annotations can be much more than just a free-text e-mail that needs to be read by human eyes.
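
As a toy illustration of the ‘community consensus’ and ‘trusted volunteer’ options above, here is a sketch of a trust-weighted-vote validator; the thresholds and trust scores are invented for the example.

```python
from collections import defaultdict

# Hypothetical trust scores earned during a validated 'trial' period.
TRUST = {"volunteer_a": 1.0, "volunteer_b": 1.0, "expert_x": 3.0}

def validate(contributions, threshold=0.66):
    """Accept the value whose trust-weighted share of votes exceeds the threshold.

    contributions: list of (user, proposed_value) pairs for one field of one record.
    Returns the accepted value, or None to fall back to manual validation.
    """
    weights = defaultdict(float)
    for user, value in contributions:
        weights[value] += TRUST.get(user, 0.5)  # unknown users get a low default
    total = sum(weights.values())
    best_value, best_weight = max(weights.items(), key=lambda kv: kv[1])
    return best_value if best_weight / total >= threshold else None

# Two trusted volunteers agree, one unknown user disagrees: consensus reached.
print(validate([("volunteer_a", "1923-05-04"), ("volunteer_b", "1923-05-04"),
                ("anon", "1928-05-04")]))
```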

However, some data about a specimen are useful to researchers and users of the specimen data, and less so for the collection curator. And often the CMS does not have anywhere to store such data types. These data need somewhere else to “live” than “with the [“authoritative” data] provider”…

While the data may not need to reside in the CMS, the CMS should “know” where it is. Without that link there is a loss of meaning for anyone who only has one piece of the puzzle. But you also bring up an important point about data life. If we want to really make use of connected and “big” data, we need secure places for it to be stored. Not only secure, but accessible and connected places.

I think it is quite clear that annotations will need to live in their own “store”. This store could be global or regional, and it would need to be associated with tools/interfaces both to add annotations to the store and to extract knowledge from it in relevant formats.

The late Bob Morris led our community in generating an extension of the W3C Annotation Ontology standard that specifically addresses data annotation (see https://doi.org/10.1371/journal.pone.0076093), so we have a good foundation to build on. Annosys, referred to above, and Symbiota both use this standard in their implementations. In the standard, the concept of evidence for making an annotation is key, as this is the only information a data curator will have to judge whether to accept or reject it. However, if we bring in the concept of trust, then a particular annotator (agent) may also be judged on a history of producing high-quality annotations (or not).

As has been pointed out in several messages, if annotations become mainstream then we will quickly have scaling problems that affect not only the hardware/software implementations but, importantly here, the people who will be deciding whether an annotation is accepted into their CMS or other databases/systems. So clearly we will need to consider when annotations need to be brought into “authoritative” databases/systems, and the answer is likely to be highly variable, as it depends on resources at the source. Lots to consider :wink:
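
For a feel of what one such annotation might carry, here is a sketch loosely following the annotation model’s body/target/creator pattern, with the evidence and agent-trust concepts mentioned above bolted on; the specific keys and values are illustrative only, not taken from the standard.

```python
# One data annotation, loosely following a body/target/creator annotation model,
# plus the "evidence" concept from the data-annotation extension discussed above.
annotation = {
    "type": "Annotation",
    "motivation": "editing",
    "creator": {
        "id": "https://orcid.org/0000-0000-0000-0000",  # placeholder agent
        "trustScore": 0.92,  # illustrative: derived from annotation history
    },
    "target": {
        "source": "https://example.org/specimen/ABC123",  # stable specimen id
        # "FieldSelector" is an invented selector type for this sketch:
        "selector": {"type": "FieldSelector", "field": "dwc:locality"},
    },
    "body": {"value": "Mt. Kosciuszko, NSW", "purpose": "correcting"},
    "evidence": "Gazetteer lookup plus the collector's field notebook, p. 14.",
}
```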

If we expect annotation to provide better data consistency and quality across institutions, I fail to see how we can avoid some kind of enforcement. However, I also believe that full transparency of annotations and indicators of quality can provide a first-level approach.

I keep thinking about @JoeMiller 's dream, one I share too (though not for Acacia per se :wink:): Annotating specimens and other data - #43 by JoeMiller re: annotating herbarium specimens with derived trait data to enable many questions across space, time, and phylogeny, plus within species. As a functional trait ecologist who stumbled into an herbarium, this seems to be a (mostly) untapped area with huge promise.

As of right now, if you extract trait data from a specimen, where does it go? What are the options? I think mostly the data are tucked away in supplemental info, only to be referenced for data transparency or specific re-analyses, with or without links to the specimens from which they were derived. I asked this question on Twitter a while back and got some great responses: https://twitter.com/jmheberling/status/1336398032787697666?s=20

The DwC Measurement or Fact extension or similar seems to be one route. It is ideal in that it associates the data derived from the specimens directly with the other specimen data/metadata. The problem, however, is reuse, and standardization across many different trait types is a huge nightmare. Last I heard, the California Phenology Network, a US NSF-funded digitization network that is actively digitizing specimens and scoring a subset of species for phenological status (also mentioned earlier in this thread by @jmacklin: Annotating specimens and other data - #44 by jmacklin), is storing these annotations as JSON in this field (I think). That is awesome and very useful when looking at a particular specimen record. BUT it is difficult for integrative projects where you would want these data easily leveraged across many species, including both specimens and observational data.
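
For illustration, one phenology score expressed with Darwin Core MeasurementOrFact terms might look something like this (all values invented):

```python
# One trait annotation as a Darwin Core MeasurementOrFact record (values invented).
phenology_score = {
    "occurrenceID": "urn:catalog:EXAMPLE:herbarium:12345",  # link to the specimen
    "measurementType": "reproductive condition",
    "measurementValue": "flowering",
    "measurementMethod": "visual scoring of imaged specimen",
    "measurementDeterminedBy": "J. Researcher",
    "measurementDeterminedDate": "2021-03-15",
    "measurementRemarks": "open flowers present on >50% of branches",
}
```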

Another option, which seems particularly popular in paleontology (?) (from my limited perspective!), is MorphoBank. From what I can see it can serve as a DRYAD of sorts for complex phenotypic datasets and gives a DOI to link dataset(s) back to specimen(s). Archiving is great, but again, I think it is difficult to enable large-scale or integrative reuse down the road.

For plants alone, TRY includes >2,000 traits, each with different units, protocols of measurement, etc.: how many leaves (and which leaves?) were measured, and so on. Trait annotations get messy very quickly. I struggle to even envision the ideal.

These annotations will be used well outside the collections world, so connections to (likely taxon-specific and/or trait-type-specific) repositories are a must, and that would require collaboration with existing communities (many of which are not related to natural history collections at all).

I’m sure there are many great minds working on this but hopefully I don’t sound too uninformed here.

SHARED persistent object/specimen identifiers would be the “glue” linking information stored in different places - and aggregators (such as GBIF) might offer the place to discover additional information (annotations) provided by others.

What if we started with persistent object identity for the specimens and other things we care about?

The first step, before publishing ANY data, would be to pre-register a persistent identifier for each specimen (which later on cannot be changed) [and which is not the occurrenceID, see below]. Following this train of thought, the source information from the collection about its own specimens would in practice just be another (yet highlighted) annotation.

Many Occurrence records are not about specimens. And many Occurrence records about specimens might still continue to be published in GBIF before getting the appropriate persistent “post”-registered specimen identifier. However, e.g. DiSSCo could require all specimens to have these persistent “pre/post-registered” identifiers BEFORE being counted as part of the “European Collection”.

Another important aspect that I think is often overlooked is that occurrence data records are a heavily denormalized view, combining information about many different things that are identified by their own persistent identifiers. An “Occurrence” includes and describes things identified by materialSampleID, organismID, locationID, taxonID, recordedByID, identifiedByID, datasetID, collectionID, institutionID, …! All these things have their own identifiers and can and should be described in many other places than (only) “inside” the denormalized Darwin Core Occurrence “envelopes”.

Apropos, the occurrenceID is the identifier for the entire denormalized Darwin Core Occurrence view, and the materialSampleID would be the actual specimen identifier. The occurrenceID could thus rather be viewed as an annotation identifier.

An aggregator such as GBIF could even follow (resolve, dereference) some of these identifiers to fetch information that enriches the “occurrence data view”, fills in “missing” information, and validates suspicious pieces of information (e.g. outliers) published inside the “Occurrence” view.
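
Here is a sketch of that enrichment idea, assuming a resolver that can dereference each embedded identifier; the record, identifiers, and resolver are all hypothetical.

```python
# A denormalized Occurrence "envelope" bundling many separately identified things.
occurrence = {
    "occurrenceID": "https://example.org/occ/42",               # id of the whole view
    "materialSampleID": "https://example.org/specimen/ABC123",  # the actual specimen
    "locationID": "https://example.org/loc/example-site",
    "taxonID": "https://example.org/taxon/acacia-dealbata",
    "recordedByID": "https://orcid.org/0000-0000-0000-0000",
}

def dereference(identifier):
    """Hypothetical: fetch the authoritative description of the identified thing,
    e.g. via an HTTP GET returning JSON-LD from whoever maintains the identifier."""
    return {}  # stub

# An aggregator could enrich or validate the denormalized view, field by field:
for term, identifier in occurrence.items():
    if term == "occurrenceID":
        continue  # this identifies the envelope itself
    description = dereference(identifier)
    # ...merge: fill in "missing" values, flag outliers against the source, etc.
```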

The source collection’s CMS view of the specimen data would always remain, and would be as relevant and valuable as ever, even if an aggregator such as GBIF were to “listen” to other sources of information about the things “inside” the data records published from the source.

Pilot data annotation system at GBIF-Norway (2016)

I had been meaning (but got swamped this past week) to post some slides presenting a previous data “annotation” system for the Norwegian museum collections from a few years back. It unfortunately went dormant during staff roll-over at the GBIF node, but it exists on GitHub and might perhaps (?) come back online again sometime (in a new form).

The pilot annotation system is presented in screenshots on slides 26 to 33 in this slide set.