Webinar 2: Recording estimated values in assertions (Guillaume Body)

The following question(s) were asked in the Collection Management Systems Webinar and will be answered here.

Guillaume Body: How do we record confidence intervals (and/or standard deviation, etc.) of estimated “Assertions”, e.g. population density estimated through statistics, or weight/length estimation?

Response:
Assertions include a parentAssertionID, which is meant specifically to enable Assertions about the Assertions themselves. The best way to explain how this works is with an example. Let’s start with an EntityAssertion, for which we are going to need an Entity.

Let the Entity be the Beaufort Island Emperor penguin colony in 2006 (doi:10.1007/s00300-007-0317-8) with the following properties:
entityID: BIP2006
entityType: dwc:Organism (dwc:Organism can have a scope that is at the population level)

The measurement of interest is the predicted count of adults from a remote sensing algorithm and might be as follows:
entityAssertionID: ea1
entityAssertionType: predicted adult count
entityAssertionValueNumeric: 1764
entityAssertionProtocol: see doi:10.1007/s00300-007-0317-8

The predicted adult count has a 95% confidence interval with lower and upper limits. The lower limit would be a new Assertion with the predicted adult count Assertion as its parent, as follows:
entityAssertionID: ea2
parentEntityAssertionID: ea1
entityAssertionType: lower 95% confidence value
entityAssertionValueNumeric: 1275
entityAssertionProtocol: see doi:10.1007/s00300-007-0317-8

The upper limit would be a new Assertion with the predicted adult count Assertion as its parent, as follows:
entityAssertionID: ea3
parentEntityAssertionID: ea1
entityAssertionType: upper 95% confidence value
entityAssertionValueNumeric: 2512
entityAssertionProtocol: see doi:10.1007/s00300-007-0317-8
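
The parent-child linkage above can be sketched in Python. This is only an illustration: the list-of-dictionaries layout and the helper name `children_of` are assumptions made for this example and are not part of the data model; the field names and values are taken from the records above.

```python
# The three Assertions from the example, held as plain dictionaries.
# The storage layout is an assumption made for illustration only.
PROTOCOL = "see doi:10.1007/s00300-007-0317-8"

assertions = [
    {
        "entityAssertionID": "ea1",
        "parentEntityAssertionID": None,  # top-level Assertion, no parent
        "entityAssertionType": "predicted adult count",
        "entityAssertionValueNumeric": 1764,
        "entityAssertionProtocol": PROTOCOL,
    },
    {
        "entityAssertionID": "ea2",
        "parentEntityAssertionID": "ea1",
        "entityAssertionType": "lower 95% confidence value",
        "entityAssertionValueNumeric": 1275,
        "entityAssertionProtocol": PROTOCOL,
    },
    {
        "entityAssertionID": "ea3",
        "parentEntityAssertionID": "ea1",
        "entityAssertionType": "upper 95% confidence value",
        "entityAssertionValueNumeric": 2512,
        "entityAssertionProtocol": PROTOCOL,
    },
]


def children_of(parent_id, records):
    """Return the Assertions whose parent is the Assertion with parent_id."""
    return [r for r in records if r["parentEntityAssertionID"] == parent_id]


# Look up the estimate, then collect the Assertions made about it.
estimate = next(r for r in assertions if r["entityAssertionID"] == "ea1")
bounds = {
    r["entityAssertionType"]: r["entityAssertionValueNumeric"]
    for r in children_of("ea1", assertions)
}

print(estimate["entityAssertionValueNumeric"], bounds)
```

A consumer that only wants the point estimate can ignore every record with a non-empty parentEntityAssertionID, while one that needs the uncertainty follows the parent link as shown.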

Thank you for this response, and for pointing out the existence of the parentAssertionID. I think it solves the issue and adds potential to the unified model.

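I would still suggest some changes, mainly to gain better control over the vocabulary. In each case below, the original record(s) are shown first, followed by my suggested replacement.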

entityAssertionID: ea1
entityAssertionType: predicted adult count
entityAssertionValueNumeric: 1764
entityAssertionProtocol: see doi:10.1007/s00300-007-0317-8

entityAssertionID: ea1
entityAssertionType: dwc:individualCount
entityAssertionValueNumeric: 1764
entityAssertionProtocol: see doi:10.1007/s00300-007-0317-8

entityAssertionID: ea1
entityAssertionType: dwc:lifeStage
entityAssertionValue: adult

The predicted adult count has a 95% confidence interval with lower and upper limits. The lower limit would be a new Assertion with the predicted adult count Assertion as its parent, as follows:
entityAssertionID: ea2
parentEntityAssertionID: ea1
entityAssertionType: lower 95% confidence value
entityAssertionValueNumeric: 1275
entityAssertionProtocol: see doi:10.1007/s00300-007-0317-8

The upper limit would be a new Assertion with the predicted adult count Assertion as its parent, as follows:
entityAssertionID: ea3
parentEntityAssertionID: ea1
entityAssertionType: upper 95% confidence value
entityAssertionValueNumeric: 2512
entityAssertionProtocol: see doi:10.1007/s00300-007-0317-8

entityAssertionID: ea2
parentEntityAssertionID: ea1
entityAssertionType: x_0.025
entityAssertionValueNumeric: 1275
entityAssertionProtocol: see doi:10.1007/s00300-007-0317-8

entityAssertionID: ea3
parentEntityAssertionID: ea1
entityAssertionType: x_0.975
entityAssertionValueNumeric: 2512
entityAssertionProtocol: see doi:10.1007/s00300-007-0317-8

By splitting the “predicted adult count” into its fundamental notions (“predicted”, “adult”, “count”), we reduce the vocabulary by avoiding combinations.
I am unsure whether we should stick to the original Darwin Core term names (“lifeStage”, “individualCount”) or write them in another form (“life stage”, “individual count”).
There is also an open question about “predicted” that I have not yet explored: whether it belongs in the metadata or in eventType (or is at least managed as a third category: “HumanObservation”, “MachineObservation”, “StatisticalAnalysis”).
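
To make the vocabulary argument concrete, here is a rough sketch of the arithmetic. The three term lists are hypothetical examples chosen for illustration, not a proposed vocabulary.

```python
from itertools import product

# Hypothetical term lists, for illustration only.
methods = ["observed", "predicted"]
life_stages = ["adult", "juvenile", "chick"]
measures = ["count", "density"]

# Combined terms such as "predicted adult count": one term per combination.
combined_terms = [" ".join(parts) for parts in product(methods, life_stages, measures)]

# Atomic terms: each notion is listed once and combined at the record level.
atomic_terms = methods + life_stages + measures

print(len(combined_terms))  # 2 * 3 * 2 = 12
print(len(atomic_terms))    # 2 + 3 + 2 = 7
```

The combined vocabulary grows with the product of the list sizes, while the atomic vocabulary grows with their sum, so the gap widens as more methods, life stages, or measures are added.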

For the confidence interval, we could also use a grammar x_p, where p is the probability (decimal mark: dot), which is generic to any quantile. I think it also comes from an international reference such as the International Statistical Institute (ISI) or an ISO standard.
By the way, the ISI has a multilingual glossary that can help us harmonize the statistics-related vocabulary ([ISI Glossary](https://www.isi-web.org/isi.cbs.nl/glossary/)).
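
As a minimal sketch of how the x_p grammar could be generated and read back, assuming an equal-tailed interval (which matches x_0.025 and x_0.975 for the 95% interval above): the function names and the limit of six decimal places are assumptions made for this example.

```python
import re


def quantile_label(p):
    """Build an x_p label for probability p, with a dot as the decimal mark."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    # Fixed-point formatting avoids exponent notation; six decimals is an assumption.
    text = f"{p:.6f}".rstrip("0").rstrip(".")
    return f"x_{text}"


def interval_labels(level):
    """Return the (lower, upper) x_p labels of an equal-tailed interval."""
    tail = (1 - level) / 2
    return quantile_label(tail), quantile_label(1 - tail)


def parse_quantile_label(label):
    """Return the probability p encoded in an x_p label."""
    match = re.fullmatch(r"x_(\d+(?:\.\d+)?)", label)
    if match is None:
        raise ValueError(f"not an x_p label: {label!r}")
    p = float(match.group(1))
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    return p


print(interval_labels(0.95))
print(parse_quantile_label("x_0.975"))
```

Because the probability is carried in the label itself, the same grammar covers medians (x_0.5), quartiles, or any other interval level without adding new vocabulary terms.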

This seemed like a relevant place to get some feedback on a similar idea. If there is anyone who is actively testing this out, I have a good dataset as a use case and I would be happy to try to apply it.

I have a remote sensing vegetation dataset in which an AI algorithm predicts a polygon for a tree, assigns a species, and estimates tree height and crown diameter, each with its own confidence value. Is there anything notable in this example that is worth discussing because it differs from the example above? I can’t think of anything, but it’s worth checking.

I know the new data model is still in development, so are there any suggestions for how I should publish these confidence values in the meantime?

Also, @tfroeslev, there is overlap here with the DNA community’s need to convey confidence of putative taxonomy. The difference is that the MIxS standard and DNA extension have specific terms where this is generally tossed in. But I’m thinking this is a good opportunity to explore a generalized way to describe confidence of assigned taxonomy, since this is a challenge that is common across DNA, imagery, acoustics, etc.