There are 4,557 records with annotations in the Australian ALA (Atlas of Living Australia). While this may seem relatively low, we should be mindful that there aren’t bulk annotation tools (e.g. mark all these records on a map as suspicious in location or determination).
Great summary, @rukaya.
On the above point, that’s exactly the thought that @dshorthouse triggered for me when he shared a sneak preview of efforts by @jmacklin et al. to develop a dynamic UI for collectors.
Is there any reason to think that we couldn’t take this approach to link and display the annotation layer(s) alongside the verbatim fields? I have the col. and det. fields at the top of mind, not least because we may be able to use existing PID systems to disambiguate, but I suspect we will discover reasons to link other fields…?
This is a great use case, and there are so many researchers doing this NOW that the need to make annotations efficient, rather than retroactively piecing things together later, is actually quite urgent. Part of the issue is that herbarium-derived phenology work, while big, is not limited to herbaria and is only one small slice of the broader phenological research landscape. Specimen annotations are indeed currently in a global phenological database, the Global Plant Phenological Data Portal. I don’t know which herbarium datasets are included there, presumably those from particular papers.
Having not used the database (or contributed to it), I don’t know much about it – but I don’t think the phenological scores are commonly (if ever) linked back to the specimen records in any way, though they should be linkable.
This is the great use case @jmacklin suggests – the data are complex but have well-developed ontologies and measurement protocols – though it seems not yet settled whether herbarium databases should hold this information (using Darwin Core extension fields seems clunky) or rather link to initiatives like the Global Phenology Data Portal.
As it is now, it is likely that the same specimens have been scored for phenology multiple times, with the derived data not readily known to the next researcher.
Same goes for other traits – for plants - is TRY a launch point for plant specimen trait annotations? To date, a very small fraction of the traits are herbarium-derived.
Another thought in regards to specimen-derived phenotypic information — should this system be connected to the Open Traits Network or similar initiatives?
This also makes the point that annotations are going to be so different across trait types, let alone taxa!
The identification process on iNaturalist is remarkably similar to the taxonomic annotations that have been happening on physical specimens in museums for centuries. I’m sure many are aware, but for those who aren’t – users can add different taxonomic determinations to uploaded photos. A set of criteria then determines whether an observation is “research grade” – a category which also has some disagreement – and it is these observations that get shared to GBIF.
Anyway – keeping all annotations in the digital record is a good idea; users can make the judgement, and publishers could also declare their “truth” as a recommendation to users?
(sorry if this has been covered in comments above…so much to digest!)
Scientists and curators want to annotate specimens with the latest opinions and determinations, and they want to see what has been annotated in the past and the status of those annotations.
I’m interested in the initiator of the annotation: originally a scientist, though the results of their work may be accessed indirectly, through the published literature record. We will have a growing body of annotations that can be sourced from text-mining the published literature - these were scientist-initiated but may be introduced into the annotation system post-publication (perhaps many years post-publication).
Examples:

- Digital nomenclators can supply annotations regarding the citation of specimens as types.
- Text mining taxonomic treatments can supply “treated in” annotations (to a published concept) and “grouped with” annotations (to other specimens). There is also a link that can be made between the specimens examined and the traits specified in the taxonomic description.
- Specimens cited in taxonomic treatments may also include indications of phenological state.
In extending the web of data beyond the institutional boundary the ES/DS system would allow the formation of groups of specimens cited in the same published concept, and the comparison of these with later (overlapping) treatments.
As published facts, should these kinds of annotations always be accessible from the specimen, regardless of whether they are accepted by local curators as worthy of altering the local view of the specimen metadata?
Thanks @rukaya - I don’t see what I wrote as a big challenge or needing to take much time - it is really a set of small evolutionary steps against what we already have.
I think that, if enough of us agree it’s what we want and need, most of the core functionality could be working within a year.
GBIF already has the necessary dataset management, specimen data index, citation tracking, …
Some of the collections that publish specimens will need assistance with understanding how to keep their specimen IDs stable (basically following best practice for database versioning and record identifiers).
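A minimal sketch of that best practice, under my own assumption of UUID-based identifiers (nothing prescribed here): the identifier is minted once when the record is created and survives every later edit, while the mutable content carries its own version and timestamp.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class RecordIdentity:
    """Minted once when the specimen record is created; never changed or reused."""
    occurrence_id: str = field(default_factory=lambda: f"urn:uuid:{uuid.uuid4()}")


@dataclass
class SpecimenRecord:
    identity: RecordIdentity  # stable across all future edits
    scientific_name: str      # mutable content, versioned below
    version: int = 1
    modified: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def update(self, **changes) -> None:
        """Apply an edit as a new version; the identifier never changes."""
        for key, value in changes.items():
            setattr(self, key, value)
        self.version += 1
        self.modified = datetime.now(timezone.utc).isoformat()


record = SpecimenRecord(RecordIdentity(), scientific_name="Acacia dealbata")
stable_id = record.identity.occurrence_id          # publish this; it never changes
record.update(scientific_name="Acacia mearnsii")   # version 2, same identifier
```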
After that, I’d like to see a page that allowed anyone to annotate any record in any of three basic ways: 1) free text comments (as a discussion thread), 2) direct editing of the record fields in a form to propose corrections, 3) providing links to associated information and records.
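To make those three modes concrete, here is one possible shape for the payloads – every field name, type label, and identifier value below is an illustrative assumption, not an agreed schema:

```python
# Illustrative payloads for the three proposed annotation modes.
# All names and identifiers are placeholders, not a standard.

TARGET = "https://example.org/specimen/123"        # stable record identifier
CREATOR = "https://orcid.org/0000-0000-0000-0000"  # placeholder ORCID

comment_annotation = {        # 1) free-text comment in a discussion thread
    "type": "Comment",
    "target": TARGET,
    "creator": CREATOR,
    "body": "Locality looks doubtful; the coordinates fall in the sea.",
}

edit_annotation = {           # 2) proposed correction to a record field
    "type": "FieldCorrection",
    "target": TARGET,
    "creator": CREATOR,
    "field": "recordedBy",
    "currentValue": "Smith, J.",
    "proposedValue": "Smith, Jane",
}

link_annotation = {           # 3) link to associated information/records
    "type": "Link",
    "target": TARGET,
    "creator": CREATOR,
    "relation": "hasSequence",
    "related": "https://example.org/genbank/XX000000",  # placeholder accession
}
```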
Then we can work together on designing the social models and logical flows for community decisions on “research grade” views where an alternative view has been proposed, and we can start exploring how automated checking and machine learning could propose annotations that the user communities can either accept by default or review and then confirm or deny.
@jmheberling - thanks for your comments!
One thing this brought up is an interesting point that we all need to consider – the breadth of the possible reach of our data. Before reading your post, I had not heard of either the Global Plant Phenological Data Portal or TRY – not surprising since I don’t work in either area. But we do need to be aware of how our data might wind up on those portals. Another one that I can see likely having at least some of our data is GlobalBioticInteractions (GloBI) …
I’m sure I’ve responded to more than 8 annotations for our collection on ALA, but that is all that is currently listed with assertions. Maybe they all just came in around the same time and felt like more. Most of the annotations I received identified errors that we were then able to correct in our CMS.
@NielsKlazenga when we reindex our data on ALA do the annotations stay?
That would be fantastic @JoeMiller! And not just because you’re talking Acacia
We’ve often discussed how great it would be to be able to link the specimen measurement data locked away in appendices of PhD/MSc theses to our specimens. We don’t currently have our CMS set up to handle this (although a table exists in the schema), or the resources to find/input the data. But if we could easily link to the data like we do to sequence data in GenBank, then others could find it, reuse it for similar/different projects, and accurately acknowledge the data generator and specimen/collection.
Our portal makes use of the AnnoSys system. Over the three years of the portal’s existence, we’ve had 91 annotations made this way on our herbarium specimens. This number is low enough for our curators to address them one by one, but as indicated by Arctos users in this discussion, at some volume it would become a problem. If we could get similar numbers from as many places as possible, we would have better insight into how annotations can and cannot be processed realistically.
I would also like to widen the scope a bit and talk about crowdsourcing. The results of crowdsourcing tasks such as text transcription or species identification can be considered annotations in their own right. Even if the data model of these crowdsourced annotations interoperates nicely with the data model of our CMS (and it often doesn’t), these annotations still require some sort of validation step to be integrated into the ‘authoritative’ copy of the specimen data. This validation can take many forms:
- No validation. Often with the idea that any data is better than none.
- Trust of proven volunteers/experts, possibly after an initial ‘trial’ period which was validated manually.
- Community consensus.
- Consensus algorithms.
- Manual validation of each contribution or a representative subset.
These different kinds of validation offer nice parallels for how we could deal with more generic annotations – annotations not bound to the specimens we specifically offer up, on crowdsourcing platforms, to be annotated in certain restricted ways. Additionally, if we manage to classify annotations into different categories, which are processed (validated) in different ways, we may find some easy-to-implement low-hanging fruit (see the sketch below for one such check). Annotations can be much more than just a free-text e-mail that needs to be read by human eyes.
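As one concrete instance of the “consensus algorithms” option in the list above, a minimal majority-vote sketch – the thresholds and the fall-back-to-manual behaviour are my assumptions, not any platform’s actual rules:

```python
from collections import Counter


def consensus(values, min_votes=3, min_agreement=0.75):
    """Majority-vote consensus over independent crowdsourced values.

    Returns the winning value if enough contributors agree closely enough,
    otherwise None (meaning: fall back to manual validation).
    """
    if len(values) < min_votes:
        return None
    top_value, top_count = Counter(values).most_common(1)[0]
    return top_value if top_count / len(values) >= min_agreement else None


# Three transcribers agree on the collector name, one disagrees:
print(consensus(["J. Smith", "J. Smith", "J. Smith", "I. Smith"]))  # J. Smith
# Only two contributions so far: not enough votes yet.
print(consensus(["flowering", "fruiting"]))                          # None
```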
However, some data about a specimen is useful to researchers and users of the specimen data and less so for the collection curator. And often the CMS does not have anywhere to store such data types. These data need somewhere else to “live” than “with the [“authoritative” data] provider”…
While the data may not need to reside in the CMS, the CMS should “know” where it is. Without that link there is a loss of meaning for anyone who only has one piece of the puzzle. But you also bring up an important point about data life. If we want to really make use of connected and “big” data, we need secure places for it to be stored. Not only secure, but accessible and connected places.
I think it is quite clear that the annotations will need to live in their own “store”. This store could be global or regional, and would need to be associated with tools/interfaces both to add annotations to the store and to extract knowledge from it in relevant formats.

The late Bob Morris led our community in generating an extension of the W3C Annotation Ontology standard that specifically addresses data annotation (see https://doi.org/10.1371/journal.pone.0076093), so we have a good foundation to build on. AnnoSys, referred to above, and Symbiota both use this standard in their implementations. In the standard, the concept of evidence for making an annotation is key, as this is the only information a data curator will have to judge whether to accept or reject it. However, if we bring in the concept of trust, then a particular annotator (agent) may also be judged based on a history of producing high-quality annotations (or not).

As has been pointed out in several messages, if annotations become mainstream then we will quickly have scaling problems that influence the hardware/software implementations but, importantly here, also the people who will be deciding whether an annotation is accepted into their CMS or other databases/systems. So clearly, we will need to consider when annotations need to be brought into “authoritative” databases/systems, and this is likely to be a highly variable answer as it is resource-dependent at the source. Lots to consider!
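For readers who have not seen the standard, a rough sketch of what such an annotation could look like, written here as a Python dict mirroring a W3C Web Annotation JSON-LD document. The `evidence` entry paraphrases the evidence concept from the Morris et al. extension rather than quoting its exact terms, and the body structure and all identifier values are placeholders.

```python
# Sketch of a field-correction annotation in the spirit of the W3C Web
# Annotation model. "evidence" paraphrases the Morris et al. extension's
# evidence concept and is not an exact term from that ontology.

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "editing",
    "creator": {"id": "https://orcid.org/0000-0000-0000-0000",  # placeholder
                "type": "Person"},
    "target": "https://example.org/specimen/123",
    "body": {"field": "country", "proposedValue": "Australia"},
    # The key concept: evidence is what a data curator uses to judge
    # whether to accept or reject the proposed change.
    "evidence": "Label image reads 'N.S.W.'; the current value 'Austria' "
                "is likely a transcription error.",
}
```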
If we expect annotation to provide better data consistency and quality across institutions I fail to see how we can avoid some kind of enforcement. However, I also believe that full transparency of annotations and indicators of quality can provide a first level approach.
I keep thinking about @JoeMiller 's dream, one I share too (though not for Acacia per se): Annotating specimens and other data - #43 by JoeMiller re: annotating herbarium specimens with derived trait data to enable many questions across space, time, and phylogeny, plus within species. As a functional trait ecologist who stumbled into an herbarium, this seems to be a (mostly) untapped area with huge promise.
As of right now, if you extract trait data from a specimen, where does it go? What are the options? I think mostly, the data are tucked away in supplemental info, only to be referenced for data transparency or specific re-analyses, with or without links to the specimens from which they were derived. I asked this question on Twitter a while back and got some great responses: https://twitter.com/jmheberling/status/1336398032787697666?s=20
The DwC MeasurementOrFact extension or similar seems to be one route. It is ideal in that it associates the data derived from the specimens directly with the other specimen data/metadata. The problem, however, is reuse, and the huge nightmare of standardizing many different trait types. Last I heard, the California Phenology Network, a US NSF-funded digitization network that is actively digitizing specimens and scoring a subset of species for phenological status (also mentioned earlier in the thread by @jmacklin: Annotating specimens and other data - #44 by jmacklin), is storing these annotations as a JSON file in this field (I think). That is awesome and very useful when looking at a particular specimen record. BUT it is difficult for integrative projects where you would want these data easily leveraged across many species, including both specimens and observational data.
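For concreteness, a single phenological score as a Darwin Core MeasurementOrFact extension row might look roughly like the following. The term names are real DwC MeasurementOrFact terms; the values, the scorer, and the identifier are invented.

```python
# One phenological score as a MeasurementOrFact extension row, linked back
# to the core specimen record. Term names are DwC; values are placeholders.

measurement_or_fact = {
    "occurrenceID": "https://example.org/specimen/123",  # link to core record
    "measurementType": "phenological status",
    "measurementValue": "flowering",
    "measurementDeterminedBy": "A. Botanist",            # hypothetical scorer
    "measurementDeterminedDate": "2021-02-15",
    "measurementMethod": "visual scoring of digitized specimen image",
    "measurementRemarks": "scored from the sheet image, not the physical specimen",
}
```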
Another option, which seems particularly popular in paleontology (?) (in my limited perspective!), is MorphoBank – from what I can see it can serve as a Dryad of sorts for complex phenotypic datasets, giving a DOI to link dataset(s) back to specimen(s). Archiving is great, but again, I think it is difficult to enable large-scale or integrative reuse down the road.
For plants alone, TRY includes >2,000 traits… each with different units, measurement protocols, and so on – how many leaves (and which leaves?) were measured, etc. Trait annotations get messy very quickly. I struggle to even envision the ideal.
These annotations will be used well outside the collections world so connections to (likely taxon-specific &/or trait-type-specific) repositories are a must and therefore would need collaboration with existing communities (many of which are not natural history collections related whatsoever)
I’m sure there are many great minds working on this but hopefully I don’t sound too uninformed here.
SHARED persistent object/specimen identifiers would be the “glue” linking information stored in different places - and aggregators (such as GBIF) might offer the place to discover additional information (annotations) provided by others.
What if we started with the persistent object identity for the specimens and other things we care about?
The first step before publishing ANY data would be to pre-register a persistent identifier for each specimen (one that cannot later be changed) [which is not the occurrenceID, see below]. Following this train of thought, the source information from the collection about their own specimens would in practice just be another (yet highlighted) annotation.
Many Occurrence records are not about specimens, and many Occurrence records about specimens might still continue to be published in GBIF before getting the appropriate persistent “post”-registered specimen identifier. However, e.g. DiSSCo could require all specimens to have these persistent “pre/post-registered” identifiers BEFORE being counted as part of the “European Collection”.
Another important aspect I think is often overlooked is that occurrence data records are a heavily denormalized view, combining information about many different things that are each identified by their own persistent identifier. An “Occurrence” includes and describes things identified by materialSampleID, organismID, locationID, taxonID, recordedByID, identifiedByID, datasetID, collectionID, institutionID, …! All these things have their own identifier and can and should be described in many other places than (only) “inside” the denormalized Darwin Core Occurrence “envelopes”.
Apropos, the occurrenceID is the identifier for the entire denormalized Darwin Core Occurrence view, and the materialSampleID would be the actual specimen identifier. The occurrenceID could thus rather be viewed as an annotation identifier.
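To illustrate the denormalization point, a toy Occurrence “envelope” showing how many separately identified things one record bundles together; every identifier value is a placeholder:

```python
# A single denormalized Darwin Core Occurrence "envelope". Each *ID names a
# separate thing that could be described authoritatively elsewhere; all
# identifier values are placeholders.

occurrence = {
    "occurrenceID": "urn:uuid:aaaa-...",      # the denormalized view itself
    "materialSampleID": "urn:uuid:bbbb-...",  # the actual physical specimen
    "organismID": "urn:uuid:cccc-...",
    "locationID": "https://example.org/location/42",
    "taxonID": "urn:lsid:...",
    "recordedByID": "https://orcid.org/0000-...",
    "identifiedByID": "https://orcid.org/0000-...",
    "datasetID": "urn:uuid:dddd-...",
    "collectionID": "urn:uuid:eeee-...",
    "institutionID": "https://ror.org/...",
    # ... plus the usual flattened copies of taxon, location and agent details
}
```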
An aggregator such as GBIF could even follow (resolve, dereference) some of these identifiers to fetch information to enrich the “occurrence data view” to fill in “missing” information, and to validate suspicious pieces of information (eg. outliers) published inside the “Occurrence” view.
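And a toy sketch of that enrichment idea; the `resolve()` and `enrich()` functions and the static registry are invented for illustration (a real aggregator would dereference the identifier over HTTP):

```python
def resolve(identifier: str) -> dict:
    """Dereference a persistent identifier to its authoritative description.
    Stubbed with a static lookup for illustration."""
    registry = {
        "https://example.org/location/42": {
            "decimalLatitude": -37.83, "decimalLongitude": 144.98},
    }
    return registry.get(identifier, {})


def enrich(occurrence: dict) -> dict:
    """Fill in missing coordinates from the record's locationID, if any."""
    if "decimalLatitude" not in occurrence and "locationID" in occurrence:
        return {**occurrence, **resolve(occurrence["locationID"])}
    return occurrence


sparse = {"occurrenceID": "urn:uuid:aaaa-...",
          "locationID": "https://example.org/location/42"}
print(enrich(sparse))  # now carries coordinates fetched via the locationID
```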
The source collection CMS view of the specimen data would always remain, and would be as relevant and valuable as ever, even if an aggregator such as GBIF were to “listen” to other sources of information about the things “inside” the data records published from the source.
Pilot data annotation system at GBIF-Norway (2016)
I was meaning (but got swamped this past week) to post some slides presenting a previous data “annotation” system for the Norwegian museum collections from a few years back. It unfortunately went dormant during staff roll-over at the GBIF node, but it exists on GitHub and might perhaps (?) come back online again sometime (in a new form).
The pilot annotation system is presented in screenshots on slides 26 to 33 of this slide-set.
I’ve had to take some days off from this discussion and @trobertson has already answered the question. As @GillBrown suggests, these numbers are a lot lower than the real numbers. I can find 344 annotations for AVH, which is fewer than I have made myself. The difference is bigger than can be explained by the changing of row keys (as we have not done much of that), so I suspect there is also an indexing issue, or maybe only annotations from the last few months are visible.
Nevertheless, as for @MatDillen with AnnoSys, the number of annotations we get through ALA are low enough not to overwhelm collections managers (at MEL we get 1 or 2 every day).
There are a few reasons why the number of annotations we get is relatively low, and ways in which we can make it easier for curators to deal with larger volumes of annotations:
- As already noted by @trobertson, we do not have tools to make bulk annotations. ALA has been hesitant to add them, as they also make it easier for people to make mistakes. There is, however, a web service with API keys, so it is possible for external applications to make bulk annotations (that is how I got so many).
- While it is possible for external applications to make annotations, the body of an annotation is a free-text string, so it cannot be processed by applications. For applications to process annotations, the body needs to be a JSON string, and we need standards or application schemas for different types of annotations (see the sketch after this list). When that is in place, CMSs can also be made able to process annotations, which will enable curators to deal with larger volumes of them.
- Due to technical issues in the ALA Collectory (registry), not all collections are able to verify annotations. This might lead to annotations for those collections eventually drying up.
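To make the JSON-body idea in the second point above concrete, a minimal sketch of a machine-processable annotation body with a per-type schema check – the type names and required fields are assumptions for discussion, not an existing ALA or TDWG standard:

```python
import json

# Hypothetical per-type schemas: which fields a valid body must carry.
REQUIRED_FIELDS = {
    "georeference-correction": {"decimalLatitude", "decimalLongitude"},
    "determination": {"scientificName", "identifiedBy", "dateIdentified"},
}


def parse_body(annotation_type: str, body: str) -> dict:
    """Parse a JSON annotation body and check it against its type's schema."""
    data = json.loads(body)
    missing = REQUIRED_FIELDS[annotation_type] - data.keys()
    if missing:
        raise ValueError(f"{annotation_type} body is missing: {missing}")
    return data


body = ('{"scientificName": "Acacia dealbata", '
        '"identifiedBy": "A. Botanist", "dateIdentified": "2021-03-01"}')
print(parse_body("determination", body))  # a CMS could now apply this edit
```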
That all being said, the number of annotations we get is still on the rise, so it may just take time for people to discover the feature. In the meantime, I think all our collections curators find the annotations they get quite valuable, as they enable us to fix errors in our data that we are unlikely to find ourselves.
An adjacent-sector example (provided by Raju Misra) of annotation of proteins, with tiered curation/trust of annotations: Biocuration in UniProt