Bionomia: Indexing, displaying links to collectors & determiners

dshorthouse · October 26, 2021, 2:57pm

A recent discussion on Twitter prompted this topic. Let’s explore how Bionomia and GBIF might interact & who might use or benefit from sharing this information about some of the people linked to occurrence records.

There are presently 25M links created or displayed in Bionomia between people with either an ORCID iD (living) or a Wikidata Q number (deceased) and 22M occurrence records from 5547 datasets. The majority of these are created by 166 volunteers or by the collectors/determiners of specimens themselves, but a growing proportion are also obtained from source datasets that populate the newly ratified Darwin Core terms recordedByID and identifiedByID. Bionomia obtains specimen (or specimen-like) records from GBIF and refreshes these every two weeks. Here is one example download: Download. The GBIF informatics team created a custom download for this based on the Apache Avro file format. The metadata behind these 25M links indicate which action was carried out (collected, identified, or both), who made the link, when the link was created, and when it was modified. There are additional metadata derived from ORCID or Wikidata cached in Bionomia that may be valuable, such as aliases of the person and their birth/death dates (deceased, from Wikidata). There are also Frictionless Data downloads for each dataset. Here’s an example for the Canadian Museum of Nature Herbarium. Note that the datasetKey in the URL is the same as that generated by GBIF for its presentation there: Canadian Museum of Nature Herbarium.
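
To make the link metadata described above concrete, here is a minimal Python sketch of what one link record could look like. The class and field names are hypothetical and are not Bionomia's actual schema; only the kinds of information (action, who made the link, created/modified timestamps) and the two Darwin Core terms come from the post. The occurrence key and Wikidata Q number are borrowed from an example later in this thread, while `created_by` and `created` are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional


class Action(Enum):
    COLLECTED = "collected"
    IDENTIFIED = "identified"
    BOTH = "both"


@dataclass(frozen=True)
class Link:
    """One person <=> occurrence link (hypothetical field names)."""
    occurrence_key: int                  # GBIF occurrence the link refers to
    person_id: str                       # ORCID iD (living) or Wikidata Q number (deceased)
    action: Action                       # collected, identified, or both
    created_by: str                      # who made the link
    created: datetime                    # when the link was created
    modified: Optional[datetime] = None  # when it was last modified, if ever

    def darwin_core_terms(self) -> List[str]:
        """Darwin Core terms this link would populate."""
        terms = []
        if self.action in (Action.COLLECTED, Action.BOTH):
            terms.append("recordedByID")
        if self.action in (Action.IDENTIFIED, Action.BOTH):
            terms.append("identifiedByID")
        return terms


link = Link(
    occurrence_key=1055761406,
    person_id="Q3822242",
    action=Action.IDENTIFIED,
    created_by="example-volunteer",   # placeholder, not a real account
    created=datetime(2021, 10, 26),   # placeholder date
)
print(link.darwin_core_terms())
```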

Taking a prompt from @trobertson’s Twitter posts, here are questions we might explore:

1. Should GBIF be merging in the Bionomia links during indexing time?
2. If yes to the above, should GBIF additionally record metadata on the link such as who created it and when it was created? [Aside: perhaps not all links are created equal if we assign more weight to links created by collectors/determiners themselves or by individuals with an affiliation with an organization that publishes the dataset]
3. Do publishers to GBIF want these links between people and occurrence records? In what form should this take?
4. Do users of GBIF data want these links between people and records? In what form should that take?

trobertson · October 26, 2021, 6:45pm

Thanks, David, for starting this discussion. As you know, I am all for enriching GBIF-mediated data and making the most use of the output of the Bionomia contributors. Having it in the GBIF API and downloads should make it easier to consume for publishers, who (can) already download other enrichments, including GADM.org IDs, quality flags, and synonymy through the backbone.

On 2, my own feeling is that it would be best to capture this as e.g. “recordedByID=XYZ according to Bionomia 2021-10-10 (DOI)”. I am only an occasional contributor in Bionomia, so I don’t want credit but equally, I don’t want to be blamed or criticized for my work if I have made a mistake. Similar to my contributions in OpenStreetMap and Apache Foundation software projects, I feel I am contributing to something (the “project”) and we’re building that as a group. I feel to get “credit” for something which in many cases was a super quick exploration of existing open data is weak - and really it’s the sources that I used to make the connection that should be given the credit if anything.
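
As an illustration only, the statement format proposed above could be produced by a small helper like the following Python sketch. The function name and parameters are hypothetical; only the shape of the output string comes from the post, and "XYZ" is kept as the same stand-in identifier.

```python
from datetime import date
from typing import Optional


def format_annotation(term: str, person_id: str, snapshot: date,
                      doi: Optional[str] = None) -> str:
    """Build a statement like 'recordedByID=XYZ according to Bionomia 2021-10-10 (DOI)'."""
    text = f"{term}={person_id} according to Bionomia {snapshot.isoformat()}"
    # The DOI of the snapshot is appended only when one is supplied.
    return f"{text} ({doi})" if doi else text


print(format_annotation("recordedByID", "XYZ", date(2021, 10, 10)))
```

Note that this form credits the project and a dated snapshot, not the individual who made the link.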

dshorthouse · October 26, 2021, 7:34pm

We differ in perspective on the above. I tend to think of Bionomia as a utility rather than a typical project that generates new, primary statements. It is quite a bit different from iNaturalist, OpenStreetMap, or a specimen transcription service like Notes From Nature.

There is one authentication mechanism in Bionomia for a particular reason. One of the goals in sharing string <=> thing linkages of this kind (i.e. recordedBy = “N. Franz, J. Giron & A. Mazo” <=> recordedByID = “ORCID | ORCID | ORCID”, for Occurrence Detail 3346796240, formatting notwithstanding) is to additionally communicate the origins of that statement. If a consumer of that example sees that it was @nfranz, Jennifer Girón and/or Anyi Mazo-Vargas that made these linkages as a declaration that they collected this specimen, this link says more than if it were merely, “according to Bionomia 2021-10-10”. In fact, it may be according to the very collectors represented in that agent string. I tend to call these links “claims”.

On the other hand, there are indeed many other linkages in Bionomia that are not created by the collectors themselves and as such, I call these “attributions”. In these instances, perhaps it is acceptable not to share who made the link if it does little to communicate the origins of the statement. But, it does discount the effort involved in making a link. Sometimes, it requires considerable research.
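
The distinction between “claims” and “attributions” can be sketched in a few lines of Python. This is a simplification under stated assumptions: the function names are hypothetical, and it assumes the link's creator and the linked person are both identified by comparable identifiers. The ORCID iD used below is ORCID's own documented example iD, not a real collector.

```python
def bare_id(identifier: str) -> str:
    """Reduce 'https://orcid.org/0000-...' or a bare ID to its last path segment."""
    return identifier.strip().rstrip("/").rsplit("/", 1)[-1]


def classify_link(person_id: str, created_by: str) -> str:
    """'claim' if the linked person made the link themselves, otherwise 'attribution'."""
    return "claim" if bare_id(person_id) == bare_id(created_by) else "attribution"


# A living collector linking their own specimen: a claim.
print(classify_link("https://orcid.org/0000-0002-1825-0097", "0000-0002-1825-0097"))
# A volunteer linking a deceased person (Wikidata Q number): an attribution.
print(classify_link("Q3822242", "0000-0002-1825-0097"))
```

Under this rule, links to deceased people are always attributions, since only someone else can create them.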

That said, what’s missing from all this – and is a deficiency in Bionomia – is an additional statement of evidence for a link. It would eliminate the (assumed) need to defer to an authority. However, I suspect in the majority of cases, such statements of evidence for a link are near impossible to produce. It feels like turtles all the way down. Do we want any turtles at all and if so, which ones? I need more convincing that “according to Bionomia 2021-10-10” is the right sort of turtle.

siobhanleachman · October 26, 2021, 8:21pm

Just putting my two cents worth in - I believe attribution is important for some of the folk who do this work. I recognise that this will result in errors being pointed out and attributed to particular volunteers but I’m of the opinion this is a good thing. Particularly if this information gets fed back to the volunteers themselves and subsequently can help improve their workflow and as a result the quality of their work. In my opinion the statement “I feel to get “credit” for something which in many cases was a super quick exploration of existing open data is weak” is discounting the sometimes time-consuming and detailed research needed to ensure that the “super quick exploration” and subsequent linking is accurate. It also discounts the value of institutional, collector and dataset knowledge - that is, the attribution may be quick because the volunteer has put effort into gaining skills and knowledge about the particular dataset, collector, identifier or institution they are working on. Also, there are collection and data managers who are now undertaking this work in order to assist with the improvement of their datasets and the linking of the same. To not attribute this important curation/linking work is again to perpetuate the issue of some natural history work not getting the recognition it deserves. I am aware that this was one of the motivations for Bionomia Tracker being created - to ensure collectors and identifiers get well deserved recognition for their work and give folk the ability to visualise the same. But folk shouldn’t have to be in a paid position to be entitled to attribution. The fact that a lot of this attribution work is done voluntarily and without monetary compensation doesn’t, in my eyes, lessen its importance. And if it is important, the folk doing this work should be attributed.
Of course how to attribute is also very much up for debate and may feed into the issue of how to document the evidence or motivation for making the link itself.

DagEndresen · October 27, 2021, 2:31pm

Full support for advancing on enriching data records indexed in GBIF with annotations made by other sources than the original data publisher (such as Bionomia)!

Annotations made in Bionomia are sometimes corrected/included at the source by the original data publisher. And erroneous annotations are sometimes corrected by other Bionomia volunteers. When the source datasets react/respond to the annotation and add the respective recordedByID and/or identifiedByID to the source dataset, would the initial Bionomia annotation still be exposed in the GBIF portal? And a URL link back to Bionomia with the possibility to respond to an annotation that the GBIF portal visitor thinks is erroneous might be very useful?

(And, as a side-note, a feature request directed to Bionomia could be to include information that annotations made in Bionomia have since been included in the source dataset. And maybe even include functionality for multiple Bionomia users to express support and lack of support[!] in annotations made → enable users to flag suspicious annotations without going all the way to delete an annotation. Efforts made by the original source to include Bionomia annotations are efforts that Bionomia could highlight and encourage? Efforts made by other Bionomia users to fix erroneous annotations are often hard work that Bionomia might want to highlight and encourage?)

dshorthouse · October 27, 2021, 6:58pm

Thanks @DagEndresen for these questions and thoughts. It makes us think more deeply & precisely about what it is we’re trying to do here. I see two questions:

1. Do we want to facilitate the #roundtrip flow of annotations (= links)? [And if yes, who are the necessary parties and actors, details of synchronization notwithstanding?]
2. What is minimally necessary for GBIF to index & then display, depending on how we answer 1? [And then, where on GBIF is it most logical to present these?]

There are deficiencies in the Bionomia data model, perhaps there are deficiencies in the GBIF data model, and there are most definitely deficiencies in the exchange of such annotations, especially in how we adequately present these to publishers or consumers who may have very different expectations. And, multi-party push/pull synchrony while maintaining expected levels of local performance within the confines of our budgets – and absence of occurrence-level stable identifiers – hurts my brain.

What if instead of presenting such annotations on an occurrence page in GBIF we backed off a bit and did something akin to a Frictionless Data download off dataset pages? I assume this would be far less intrusive, easier to implement, and sidesteps the highly stochastic nature of annotations visible on an occurrence page.

DagEndresen · October 28, 2021, 8:29am

I, for one, would rather love to see annotations displayed directly on the occurrence pages in the GBIF portal - but somehow distinctly displayed as an enrichment - originating from Bionomia - and with #roundtrip if practically possible.

Full acknowledgment (!!) for your unfunded efforts with Bionomia (!!) and the urgent need for the emergence of persistent specimen (materialSampleID) identifiers!!! Maybe simply restrict the functionality of the #roundtrip back to Bionomia only for data records with stable and persistent materialSampleIDs?? …if possible to monitor such persistence at the GBIF portal side #featurerequest. Maybe we simply might need to pre-register (…somewhere) for a persistent materialSampleID (or maybe a “digital specimen identifier”) before publishing these inside the datasets on the IPT??

(Another possible and obvious discussion could be if the occurrence page is the appropriate place in the GBIF portal and if we [wishfully & dependent on eventual further development of the portal data model] rather would desire to display Bionomia annotations on a future specimen/material sample detail page instead).

sharif.islam · October 29, 2021, 11:58am

I see @trobertson’s point but agree with @dshorthouse and @siobhanleachman about attribution. Maybe there is a way to acknowledge the source and the agent (using PROV?).

I am also thinking about a “global” attribution framework that could be integrated with the GBIF API. There are other use cases for such attributions and credits.

I did a crude experiment with nanopub (using the tool nanobench). I published a statement saying that I helped with the following record in Bionomia.

The tool created an assertion and published it as a nanopublication (it is very simple so please ignore the lack of details). Each nanopub gets a PURL, is published as RDF statements, and is digitally signed and assigned to my ORCID iD. This way the assertion is available independently of GBIF or Bionomia.

trobertson · October 29, 2021, 12:38pm

The two primary uses would presumably be 1) to be able to find records by person IDs (requires fields like recordedByID set in the document in the search index) and 2) for people downloading records (including the publishers) to be able to group records by those IDs, and follow the links to metadata about the people if they so wish.

Edited to add: also on the occurrence detail page (of course)
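
Both uses come down to an index from person ID to records. Here is a minimal Python sketch of the grouping step, assuming records arrive as dictionaries with a hypothetical `key` field and that multiple IDs in recordedByID are pipe-separated, as in the “ORCID | ORCID | ORCID” example earlier in the thread. The record keys and ID values below are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_ids(value: str) -> List[str]:
    """Split a pipe-separated ID list, dropping empty parts and stray whitespace."""
    return [part.strip() for part in value.split("|") if part.strip()]


def index_by_person(records: Iterable[dict],
                    term: str = "recordedByID") -> Dict[str, List[int]]:
    """Map each person ID found in `term` to the keys of the records that carry it."""
    index = defaultdict(list)
    for record in records:
        for person_id in split_ids(record.get(term) or ""):
            index[person_id].append(record["key"])
    return dict(index)


# Invented sample records; the third has no person IDs at all.
records = [
    {"key": 1, "recordedByID": "https://orcid.org/0000-0002-1825-0097 | Q3822242"},
    {"key": 2, "recordedByID": "Q3822242"},
    {"key": 3},
]
print(index_by_person(records))
```

Passing `term="identifiedByID"` would build the same index for determiners.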

dshorthouse · October 29, 2021, 2:44pm

This is interesting, @sharif.islam. How might we additionally circumscribe sub:assertion to indicate that the prov:wasAttributedTo is particularly constrained to the action the subject undertook, i.e. collected or identified the object?

sharif.islam · November 2, 2021, 9:45am

Good point. I need to look at this more closely but I created a few more tests. Nanopub allows creation of customised templates. For this test, I picked a simple template describing an arbitrary triple.

Example 1:

I used gbif occurrenceid, dwc:identifiedBy, and wikidata identifier to create the statement.

sub:assertion {
<https://www.gbif.org/occurrence/1055761406> sub:dwc:identifiedBy <https://www.wikidata.org/wiki/Q3822242> .
}

Example 2:

Here I used occurrenceid, prov:wasAttributedTo and wikidata id.

sub:assertion {
<https://www.gbif.org/occurrence/1055761406> sub:prov:wasAttributedTo <https://www.wikidata.org/wiki/Q3822242> .
}

The provenance part of the graph here refers to the nanopub (not the identification or collection event) so it is attributed to me:

sub:provenance {
sub:assertion prov:wasAttributedTo orcid:0000-0001-8050-0299 .
}

We can probably create a new template to combine these assertions.
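
One possible shape for such a combined statement is sketched below in Python, which simply renders the assertion and provenance graphs as TriG text. This is not a nanobench template and not a complete nanopublication (the head and publication-info graphs are omitted). The function name is hypothetical and the `sub:` namespace is a placeholder. The sketch also swaps the template-generated `sub:dwc:identifiedBy` for the Darwin Core term `dwc:identifiedBy` itself, so that the assertion names the action (identified or collected) while the provenance graph names who asserted it.

```python
PREFIXES = {
    "sub": "http://example.org/nanopub#",     # placeholder for the nanopub's own namespace
    "dwc": "http://rs.tdwg.org/dwc/terms/",   # Darwin Core terms
    "prov": "http://www.w3.org/ns/prov#",     # W3C PROV-O
    "orcid": "https://orcid.org/",
}


def render_nanopub_graphs(occurrence_iri: str, predicate: str,
                          person_iri: str, asserter_orcid: str) -> str:
    """Render assertion + provenance graphs as TriG text (head and pubinfo omitted)."""
    lines = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    lines += [
        "",
        "sub:assertion {",
        # The predicate carries the action: dwc:identifiedBy or dwc:recordedBy.
        f"  <{occurrence_iri}> {predicate} <{person_iri}> .",
        "}",
        "",
        "sub:provenance {",
        # Who made the assertion, as distinct from who identified the specimen.
        f"  sub:assertion prov:wasAttributedTo orcid:{asserter_orcid} .",
        "}",
    ]
    return "\n".join(lines)


trig = render_nanopub_graphs(
    "https://www.gbif.org/occurrence/1055761406",
    "dwc:identifiedBy",
    "https://www.wikidata.org/wiki/Q3822242",
    "0000-0001-8050-0299",
)
print(trig)
```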

system · December 2, 2021, 7:46pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.

dshorthouse · May 9, 2022, 1:52am

In the spirit of driving links into collections management systems (the source), I’ve been convinced by @trobertson that there are several ways of tackling the problem. In the interim, it’s one of visibility; traffic to GBIF occurrence pages exceeds anything we can do alone. So, we’ve reopened the dialog in this thread and, time- and resource-permitting, let’s see how far we can push this with “recordedByID=XYZ according to Bionomia 2021-10-10” as an annotation from the Frictionless Data packages. Formatting alone will be a tough nut to crack. Here’s one such example @trobertson: https://bionomia.net/dataset/a4e7ff0a-f9c0-481a-88ed-5986ec86b24a/datapackage.json. Note that the datasetKey here is the same as that known at GBIF’s end. But, I expect there’ll be a suite of adjustments that need to be made at Bionomia’s end to make these more pleasing to consume. Open to whatever might need to be done.
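
Because the datasetKey is shared, a consumer on GBIF's side could derive the package location from a key it already holds. The Python sketch below assumes the URL pattern generalizes from the single example above, which is an inference and not a documented API. It reads only the `resources`, `name` and `path` fields defined by the Frictionless Data Package descriptor format. The function names and the sample descriptor are hypothetical and do not reflect Bionomia's actual resource list.

```python
import uuid
from typing import Dict, Optional

# Pattern inferred from the one example URL in the post.
BIONOMIA_DATAPACKAGE_URL = "https://bionomia.net/dataset/{key}/datapackage.json"


def datapackage_url(dataset_key: str) -> str:
    """Build the datapackage.json URL for a GBIF datasetKey (a UUID)."""
    key = str(uuid.UUID(dataset_key))  # raises ValueError if the key is not a UUID
    return BIONOMIA_DATAPACKAGE_URL.format(key=key)


def resource_paths(descriptor: dict) -> Dict[str, Optional[str]]:
    """Map resource name -> path from a Data Package descriptor."""
    return {res["name"]: res.get("path") for res in descriptor.get("resources", [])}


print(datapackage_url("a4e7ff0a-f9c0-481a-88ed-5986ec86b24a"))

# Hypothetical descriptor, only to show the shape that resource_paths() expects.
sample = {"name": "example", "resources": [{"name": "links", "path": "links.csv"}]}
print(resource_paths(sample))
```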
