Annotating specimens and other data

Two comments on the use cases suggested by James and Joe:

  • these are good examples of what can be done with “our” data that go beyond the “typical” uses we often think about.
  • they, especially Joe’s, might be very difficult to capture in a CMS. Given the problem of “too much to do” that has been raised already, I can see “extended” specimen information being incorporated somewhere “down the list”…

@Rich87 Yes, good point. This is the reason that we need a global annotation store that can house these assertions, no matter where they are generated. The assertions can then be searched/mined by researchers to address these research use cases. Thus, they do not rely on the information being stored in a CMS, nor do they rely on a given aggregator to store this information. However, a collection manager can also access the annotation store and query for information that has been added about their specimens. They can choose to download this information in a standardized form and “push” it into their CMS either manually or potentially at least semi-automatically. This is what we envisioned for the FilteredPush network. Of course, the reality here is that the annotation store would get very big, very quickly and we would need the cyberinfrastructure to support this :wink:
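To make the “pull it into your CMS” step a little more concrete, here is a rough sketch of what querying such a store and exporting the results for CMS import could look like. Everything in it is hypothetical - the endpoint, the filter names and the response fields are placeholders, since no such annotation store exists yet:

```python
# Hypothetical sketch only: the annotation store, its URL and its query
# parameters are placeholders for illustration - no such service exists yet.
import csv
import requests

STORE = "https://annotations.example.org/api"  # placeholder endpoint

def fetch_annotations_for_institution(institution_code, page_size=500):
    """Page through all annotations that target specimens held by one institution."""
    offset = 0
    while True:
        resp = requests.get(f"{STORE}/search", params={
            "targetInstitutionCode": institution_code,  # assumed filter name
            "limit": page_size,
            "offset": offset,
        })
        resp.raise_for_status()
        rows = resp.json().get("results", [])  # assumed response shape
        if not rows:
            break
        yield from rows
        offset += page_size

# Export to a flat file that a collection manager could review and then
# import (manually or semi-automatically) into their CMS.
with open("annotations_for_cms.csv", "w", newline="") as fh:
    writer = csv.DictWriter(
        fh, fieldnames=["occurrenceID", "motivation", "body", "creator", "created"]
    )
    writer.writeheader()
    for ann in fetch_annotations_for_institution("MICH"):
        writer.writerow({
            "occurrenceID": ann.get("target"),     # identifier of the annotated specimen record
            "motivation": ann.get("motivation"),   # e.g. correction, comment, identification
            "body": ann.get("body"),
            "creator": ann.get("creator"),
            "created": ann.get("created"),
        })
```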

Is this an area where we could seek to apply emerging but off-the-shelf tech, rather than having to invest in, build and maintain it ourselves? Thinking specifically of Hypothes.is, though other options may exist.


I see another question arising - can annotations placed in such a global store be edited and, if so, by whom?


I wonder if we have practical examples of annotation systems and storage solutions for them. In particular, it would be nice to see how and how much those annotations are actually used.


I would think annotations should not be edited by anyone other than the contributor. An additional annotation correcting or commenting on a previous one is how it is done in the collection itself.


@giln Exactly! Annotations would not be editable. Anyone can make a new annotation and assert evidence that a previous annotation is incorrect or more knowledge is available, etc. However, in FilteredPush we did have the concept of annotation “conversations” where information is iteratively improved through additional evidence being provided by other agents/machines (mostly related to data quality improvement).


Yes, I certainly think some proof of concepts or pilots would be valuable. There are several annotation systems out there with quite different approaches. We can dive more deeply into that conversation next week.


@jmacklin That’s my hope as well. I know of one instance where data submitted to an aggregator was georeferenced under the aggregator’s permissions, without the knowledge of the data owners - hopefully an aberration.

@Rich87 I was just considering what the equivalent in the physical sense would be to editing an annotation… In the extreme, this might mean making the piece of paper the annotation was on “disappear.” Sadly, I am sure this happens and this is where an image is invaluable (at least in the botanical case). But, I do recall from my CM days seeing an annotation crossed out and someone’s handwriting saying “NO!!!” beside it :grimacing:


@jmacklin @JoeMiller : And that (conveniently…) gets us back to imaging. Should one re-image a specimen each time it is annotated? The easy answer would be “yes”, but…

We’ve had that discussion here at MICH and, beyond taking the image, there is the matter of curating the image and making sure the “current” version is available wherever the images can be viewed. In a collection where getting the initial image taken is/was an accomplishment, re-imaging might be seen as not even possible.

Providing a stable URL to the information seems like a good option. Adding the URL to a CMS shouldn’t be that difficult, and neither should loading a big list of record or media identifiers with their associated URLs.

@Rich87 Re-imaging could be a major challenge for vertebrates, which may have been imaged in several views and stacked for maximum clarity. Adding the annotation text to the record’s label data should make the content of the annotation sufficiently available.

Not that different, I would say. The goal of subscribing to alerts for annotations is to be notified that there are annotations on the occurrence records you are interested in. The emails only contain links to the occurrence records that have been annotated and a link to a query that will find them all. You can safely discard the emails, as you can query for annotations at any time and you can also get them through the API.

I was making the suggested distinction when I said that I did not want fully automated ingestion of annotations into our CMS and that just having our curation team notified of annotations was enough. This is very much driven by the form that the annotations currently available to us take, and it also depends on the type of annotation.

For example, if we had a fully structured annotation with a suggested correction for a geo-reference, it would be very nice to be able to accept or reject the assertion within the CMS and have the CMS update the geo-reference – or, better still, create a new geo-reference. Attribution for geo-references is trivial, as we can use georeferencedBy for that, which our CMS has (as does @Rich87’s, as we use the same CMS).
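For what it’s worth, here is a minimal sketch of what such a structured geo-reference annotation could look like, loosely following the W3C Web Annotation model and using Darwin Core terms for the proposed values. The specimen identifier, ORCID and exact field layout are placeholders, not an agreed standard:

```python
# Minimal sketch of a structured geo-reference correction. The target URI,
# creator ORCID and overall field layout are illustrative assumptions;
# the dwc: terms themselves are real Darwin Core terms.
suggested_georeference = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "editing",                                 # a correction is being proposed
    "target": "https://example.org/specimen/MEL-1234567",    # placeholder specimen identifier
    "creator": "https://orcid.org/0000-0000-0000-0000",      # placeholder ORCID of the annotator
    "body": {                                                # proposed new values, as Darwin Core terms
        "dwc:decimalLatitude": -37.83,
        "dwc:decimalLongitude": 144.98,
        "dwc:coordinateUncertaintyInMeters": 250,
        "dwc:georeferencedBy": "A. N. Annotator",
        "dwc:georeferenceProtocol": "Georeferencing Quick Reference Guide",
    },
}

# A CMS that understands this shape could offer an accept/reject choice and,
# on accept, create a new geo-reference record with georeferencedBy
# carrying the attribution.
```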

If the annotation is a suggested identification, I will happily store it as a determination record, but one of our (other) botanists will have to look at the specimen to verify the identification for it to become the current determination (the one we deliver in the Occurrence Core), and the current determination will be attributed to the botanist who confirmed the identification. Once all our specimens are imaged, I would love people to be able to do online identifications, but I do not see any way of automating this, as specimens will have to be pulled out of cupboards, annotated and re-incorporated, and the work with the specimens will eclipse any amount of work that has to be done in the database.

With Bionomia attributions I get into trouble with our data model, as our CMS stores the Agent IDs at the Agent level, while the attributions are at the Collection Object or Identification level. Also, while I try to automate as much as possible, all matches are verified by our curation team and there is a (very small) minority of attributions that they think are incorrect or are not sure about, so I am not sure who to blame for the actions we take in our database. However, as long as the annotations will always be in Bionomia and will always be connected to the specimen, I do not think that is an issue. Also, while I cannot (and do not really want to) accommodate the Bionomia annotations in our collections database, I think it would be great if our records in AVH (ALA) could link to them.

There are also annotations that do not have to get back to the data curators. For example, in our online Flora, we use annotations when making the maps. These maps are based on occurrence data from the ALA. Dots on the maps have different icons depending on the value of establishmentMeans. The value for establishmentMeans comes either with the occurrence record or from assertions by our Flora editors. The latter are annotations. I do not think these have to go back to the curators of the source data sets at all, but aggregators like ALA could use them to improve the filter on establishmentMeans, which is important to many users, but for which the occurrence data is very incomplete.

We also use assertions that the occurrenceStatus is ‘doubtful’ to prevent dodgy-looking occurrences from displaying on the maps. These assertions mostly indicate that a specimen is probably misidentified, so the curators of the specimen data need to be made aware that these assertions have been made, so they can verify the identification. But even if nothing is done with the annotations at the source, they will still be on the AVH record alerting the user that there might be a problem with the record, so they can decide whether or not to include it in their analyses. Since earlier this week, the ALA Biocache has data quality profiles in which one of the options is to ‘Exclude records with unresolved user annotations’.
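Something like the following sketch captures the logic we apply when making the maps; the merge rule (an editor assertion overrides the value delivered with the record) and the symbol names are assumptions for illustration, not a description of our actual code:

```python
# Illustrative sketch: layering editor assertions over aggregator occurrence
# data when drawing Flora maps. Field names follow Darwin Core; the merge rule
# and the symbol names are assumptions for the example.
def map_symbol(occurrence, assertions):
    """Return a map symbol for one occurrence record, or None to hide it."""
    # Editor assertions take precedence over values delivered with the record.
    merged = {**occurrence, **assertions.get(occurrence["occurrenceID"], {})}

    # Records asserted to be doubtful (e.g. probably misidentified) are not plotted.
    if merged.get("occurrenceStatus") == "doubtful":
        return None

    # Icon depends on establishmentMeans, which is often missing at the source.
    means = (merged.get("establishmentMeans") or "").lower()
    if means == "native":
        return "filled_circle"
    if means in ("introduced", "naturalised"):
        return "open_triangle"
    return "open_circle"  # unknown establishmentMeans

# Example: a record with no establishmentMeans, enriched by an editor assertion.
occ = {"occurrenceID": "urn:uuid:1234", "establishmentMeans": ""}
asserted = {"urn:uuid:1234": {"establishmentMeans": "introduced"}}
print(map_symbol(occ, asserted))  # -> "open_triangle"
```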

I never said anything about severing the communication between source data and annotations. I do not see how you can. There is obviously a lot that can and needs to be improved both in CMS and annotations to enhance a positive feedback loop. However, a positive feedback loop does not mean just storing annotations that are made elsewhere in the source CMS, which is what I was talking about (and thought @JoeMiller was asking about). Also, while this feedback loop is important, I think annotations add great value to data regardless, so having an annotation store is always useful.

I have been rather heavily invested in one particular regional infrastructure, so I might be biased, but I think the more we increase the role of CMS and the more responsibilities we pile onto collection managers or collections database managers, the more we will be leaving smaller and under-resourced collections behind. So, the more responsibilities that can be pushed to shared infrastructures (incl. hosted CMS) rather than sitting with CMS and collection managers, the better.

Sorry, this has gone on way too long. Apologies to whoever has to summarise (or even read) this.


I’m swamped this week so pardon me if this is already well covered.

We have tried to create digital annotations for many years and systems have largely failed. I believe this has been because of 1) the round-tripping issue (users find data through portals and report issues there, but we have left it to other data systems to make the corrections and re-export the data) and 2) link rot. Together, these two factors have broken all our attempted solutions.

Round-tripping breaks all models that rely on the annotations being processed in the source database - in too many cases, this is impossible or too slow.

Link rot breaks efforts to solve this by leaving the annotations in a specialised repository - we can’t maintain the references back to the source records.

In both cases, there is also the issue that we are trying to implement a state machine that requires a robust transactional solution. We should know when an annotation has been processed, either by making the requested change or by rejecting it. At that point the annotation itself should show as something that is purely historical and no longer needing consideration. In cases where the person who created the annotation disagrees, there may be a need to reopen the issue. At some point mediation may be necessary.
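In code, that lifecycle might look roughly like the sketch below; the state names and allowed transitions are only illustrative, not an agreed vocabulary:

```python
# Minimal sketch of the annotation lifecycle described above. The state names
# and transitions are illustrative assumptions, not an agreed standard.
from enum import Enum

class AnnotationState(Enum):
    OPEN = "open"              # submitted, awaiting processing
    ACCEPTED = "accepted"      # requested change was made; now purely historical
    REJECTED = "rejected"      # change declined; now purely historical
    REOPENED = "reopened"      # creator disagrees with the outcome
    MEDIATION = "mediation"    # third party asked to resolve the disagreement

ALLOWED_TRANSITIONS = {
    AnnotationState.OPEN:      {AnnotationState.ACCEPTED, AnnotationState.REJECTED},
    AnnotationState.ACCEPTED:  {AnnotationState.REOPENED},
    AnnotationState.REJECTED:  {AnnotationState.REOPENED},
    AnnotationState.REOPENED:  {AnnotationState.ACCEPTED, AnnotationState.REJECTED,
                                AnnotationState.MEDIATION},
    AnnotationState.MEDIATION: {AnnotationState.ACCEPTED, AnnotationState.REJECTED},
}

def transition(current, new):
    """Move an annotation to a new state, enforcing the allowed transitions."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Cannot go from {current.value} to {new.value}")
    return new
```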

I think it is more helpful for us to think of a solution that allows us, in principle, to retrieve three different things as structured data (a rough sketch follows the list):

  1. The current view represented by the supplier of the data - normally to be considered the trusted version.
  2. The complete set of alternative views on offer by all parties interacting with the data - including the supplier/original view, suggested corrections and cleaned versions.
  3. The view that best represents the normal community interpretation of the record - in many cases the supplier view, but it may be that annotated versions or community consensus versions (see iNaturalist research grade) should take precedence.
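As a rough sketch (field names purely illustrative), a record exposing those three views might look like this:

```python
# Sketch of a record exposing the three views, assuming each view is simply a
# dict of Darwin Core terms. Names are illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SpecimenRecordViews:
    supplier_view: dict                                           # 1. the version published by the supplier
    alternative_views: List[dict] = field(default_factory=list)   # 2. all suggested corrections / cleaned versions
    community_view: Optional[dict] = None                         # 3. agreed "best" interpretation, if one exists

    def preferred(self) -> dict:
        """Default to the supplier view unless a community consensus view exists."""
        return self.community_view or self.supplier_view

record = SpecimenRecordViews(
    supplier_view={"scientificName": "Carex pensylvanica", "decimalLatitude": 42.28},
    alternative_views=[{"scientificName": "Carex lucorum", "proposedBy": "expert annotator"}],
)
print(record.preferred()["scientificName"])  # supplier view wins until a consensus view exists
```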

The existence of 2 and 3 does not steal agency from the supplier. Their version should always be available and judged in its own right. We should however have agreed models and processes that allow the community and even automated indexing services/bots to recommend an alternative view as the better version for wider use.

I am utterly convinced that this model is possible, but it depends on some changes in perspective:

  1. We need to treat the pool of specimen data as a global public good that we are all curating together
  2. We need a mechanism reliably and uniquely to identify every specimen and to be able to know when new versions of its data record are published
  3. We need an architecture that oversees the identities and versioning of these specimen records
  4. We need community models that allow the expert communities to work against these shared specimen identities and openly to develop the community view of the best interpretation of each record

DOs are a component in all this, but it will work best if we have a truly integrated indexing service that brings all the specimen records together, deals with the identity and versioning issues, throws exceptions and enforces handling when identity exceptions are detected, etc. I will say unreservedly that this role should be given to GBIF. Once that happens, the identifiers in the GBIF index and the identifiers used internationally to refer to specimens become one and the same thing. Then the benefits really start to flow:

  1. Annotations and alternative views of records can be attached to this GBIF-managed backbone.
  2. Our shared community can develop transactional annotation management against this backbone rather than against heterogeneous sources - delivering “research grade” views for every record.
  3. We can treat annotations less like emailing data publishers and more like Wikipedia talk pages - every entity can have a page that allows free comments to be added, LOD properties to be connected, structured edits to be proposed, reviewed and handled, etc.
  4. A consistent backbone-based approach would reduce the complexity associated with publisher applications accessing and using these annotations. A wealth of access services can be built around a common core.
  5. The existence of this backbone does not hinder the development of extended indexes and additional services by regional or specialist infrastructures.
  6. We can have DOI-style services at the specimen level to support citation, tracking, etc. We can encourage journals and others to cite accurately - the current GBIF data download citation processes could feed such citations directly onto specimens.
  7. [And this one really matters] GBIF can act as the guarantor that even low-tech institutions can play their part in a world of DOs. Publishing and versioning a collection dataset as DwC-A would be a perfectly viable way to benefit in all the ways I have listed.
  8. This model can be expanded to all occurrence records and to any other digital records we care to manage (COL species records, BHL publication records, etc.).

Re @dhobern point 7:

[And this one really matters] GBIF can act as the guarantor that even low-tech institutions can play their part in a world of DOs. Publishing and versioning a collection dataset as DwC-A would be a perfectly viable way to benefit in all the ways I have listed.

Maybe there’s a parallel here with the #citethedoi tracking: had this been left to individual organisations, we would have seen a big split between those who could have invested in this tracking and the many more smaller orgs who could not. We must be careful that the implementation of digital / extended specimens is equitable across less-resourced institutions & doesn’t cause a divide.


@NielsKlazenga : Excellent detailed summary! A lot of your points about the fate of different types of annotations are in line with what we try to do at MICH. One additional concurring thought:

  • the regional aspect certainly plays into the abilities (and interest) of the staff in dealing with annotations. In our case, we have a specialist on the flora of Michigan and Cyperaceae; if a suggested annotation comes in for a specimen either from Michigan or a member of the Cyperaceae, I pass it by him before accepting. A new determination for a specimen outside of anyone’s speciality would likely be accepted - with a note that either a duplicate at institution X was annotated by XX or just an indication of who suggested the annotation.

@Rich87 I should probably point out that our specimens (for the most part) have not been imaged, so the people who currently make annotations on our records have not seen the specimen, not even an image. It is not that we do not trust other people’s identifications or anything. We do not have our own botanists verify determinations on returned loans, or those made by visitors, either. Also, once we have much more of our specimens imaged, we’ll happily accept online identifications – with an identificationRemark that the identification is based on an image – but the work that is required in the (physical) collection will eclipse the work that needs to be done in the database, so it does not really matter at the moment how well CMS deal with this type of annotation. If online identifications really become a thing, which I hope they will, we will have to change our procedures, as currently our procedures around re-determination of specimens are based on the proposition that the specimen is out of the cupboard already.

Another point I would like to make, and have tried to make before, is that we should not let current obstacles in CMS and in the management of physical collections stand in the way of where we want to go with annotations (or anything really). If the annotations are useful to collections managers and sufficiently actionable by machines, CMS will follow. In the end all this should make the job of collection managers easier, not harder.

The Atlas of Living Australia (ALA) has had an annotation system from the beginning. Other Living Atlases might have it as well. I would like it to do much more, but what it does, which is what most people want it to do, it does very well. It is used quite a lot. I personally do not really do anything with the annotations, but I see them coming by (as I subscribe to alerts on annotations) and I can see what people at the source collections do with them.

I think @dhobern’s description of a possible solution is very inspiring, but my heart sinks when I think about how much time, money and effort it would take. The most pertinent of @JoeMiller’s questions would then be “What is the use case that would convince a funding source?”. To that, I would guess it depends highly on the funding source! Maybe something really broad and vague like “Many data records will be corrected and enriched each month. Better data quality, better science, better decisions.”

I actually think it’s a bit risky for us to try and come up with more specific use cases for annotations. We’re such a specialised group, already heavily invested in this community, that I worry we can’t see the wood for the trees. I think a data annotation service would be mainly used by data users/researchers/policy makers and collection managers, not by us.

In agile development one tries to build the smallest, simplest possible thing as a first iteration, see how it gets used, and then take the next simple step to build and iterate on it further with the product usage in mind. Saves a lot of time and effort, as it’s often difficult to predict how people will use tech in the real world and how their needs will evolve.

E.g. Iteration 1 for us might be to encourage unstructured annotations on a dataset level, rather than on a record or field level: meaning, a simple dataset feedback form, functionality, storage and public display. This would actually allow users to also provide feedback on record and field granularity levels, as they could describe in free text (copying and pasting occurrenceIDs/collection numbers, etc) what addition or change they think is necessary.

As @kcopas says, it should actually be possible to do this already with something like Hypothes.is. There’d need to be some kind of “Flag an issue or provide feedback using Hypothes.is” prompt on a dataset page, and there could be some functionality to use the Hypothes.is API https://twitter.com/lemayke/status/1075821724539777026 to provide the data owner with periodic notifications.
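As a sketch of what those periodic notifications might look like, the following uses the public Hypothes.is search API to pull recent annotations on a dataset page. The dataset URL is a placeholder, and the exact parameters and response fields should be checked against the current Hypothes.is API documentation before relying on them:

```python
# Rough sketch: poll the public Hypothes.is search API for annotations on a
# dataset page. The dataset URL is a placeholder; verify parameter and
# response field names against the current Hypothes.is API docs.
import requests

DATASET_PAGE = "https://www.gbif.org/dataset/your-dataset-uuid"  # placeholder URL

def fetch_recent_annotations(page_url, since_iso):
    """Return public Hypothes.is annotations on one page created after a given time."""
    resp = requests.get("https://api.hypothes.is/api/search", params={
        "uri": page_url,
        "sort": "created",
        "order": "desc",
        "limit": 50,
    })
    resp.raise_for_status()
    rows = resp.json().get("rows", [])
    return [r for r in rows if r.get("created", "") > since_iso]

# A scheduled job could call this weekly and email the data curator a digest.
for ann in fetch_recent_annotations(DATASET_PAGE, "2021-01-01T00:00:00"):
    print(ann.get("user"), ann.get("created"))
    print(" ", (ann.get("text") or "").strip()[:120])
```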

Data curators could then choose whether to incorporate and/or respond to this information. It’d also be possible for data users to see the thread of feedback and make their own decisions based on it. It’s one simple step up from what we currently have (providing contact details so data corrections can be emailed to curators) as it makes the feedback public.

If this functionality gets used a lot and users are e.g. often providing feedback which is more like structured data on a field or record level, then we could invest the development time in building additional forms and functionality to do this properly. If data curators receive hundreds of pieces of feedback from people each month and are struggling to incorporate suggested changes, then maybe some kind of export functionality to facilitate round tripping would make sense. Or maybe we need to shift more towards a community data curation model. Maybe most feedback will end up being georeferencing corrections, in which case maybe a specialised georeferencing annotation tool will be most useful. But my point is that right now I feel like we don’t really have a good enough grip on what we will really need to make a call. Or maybe the LA users do, and they can advise us.

@NielsKlazenga maybe you can also give us an idea of the volume of annotations made among high use vs low use datasets? As @MatDillen says, it would be nice to understand how much this functionality gets used.

I’ve also been wondering whether AI and humans will have the same use cases and requirements for annotations. When it comes to machine annotations, I feel like we kind of have the start of this already with the GBIF data interpretation and the yellow data issue flags. So maybe it would actually make sense to build on this functionality?
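For example, something like the sketch below could treat GBIF’s existing interpretation flags as machine annotations to review alongside human ones. The dataset key is a placeholder and the specific issue name should be checked against GBIF’s issue vocabulary:

```python
# Sketch of treating GBIF's interpretation flags as machine annotations,
# using the public GBIF occurrence search API. The dataset key is a
# placeholder; check the issue name against GBIF's issue vocabulary.
import requests

DATASET_KEY = "00000000-0000-0000-0000-000000000000"  # placeholder dataset UUID

resp = requests.get("https://api.gbif.org/v1/occurrence/search", params={
    "datasetKey": DATASET_KEY,
    "issue": "COUNTRY_COORDINATE_MISMATCH",  # one of GBIF's interpretation issue flags
    "limit": 20,
})
resp.raise_for_status()

# Each flagged record is, in effect, a machine-generated annotation that a
# curator could review in the same workflow as human annotations.
for occ in resp.json().get("results", []):
    print(occ.get("key"), occ.get("scientificName"), occ.get("issues"))
```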
