Absences and how they fit in the new model

For the OBIS community (and, I expect, many other communities), where a species is looked for and not found can be almost as important as where it is observed. The absence of ghost crabs on beaches can be an important indicator of beach ecosystem health. Locations of staghorn coral are important, but so are the places where they are no longer found. A trawl with nothing in the net is a very important data point. In our evaluations of the emerging data model we would like to ensure the concept of an absence is included. We are curious to know: are we losing explicit documentation of absences? Will absences only be inferred? I am curious to know others’ thoughts on this topic.

A few of us in the OBIS community would like to create an absence use case for consideration. If you would be interested to contribute to this use case, please let me know.

Hi @abbybenson
I am curious about this as well.
How/when do you figure we might be losing explicit documentation?
I assume that absence data would have to be specifically uploaded (Darwin Core occurrenceStatus = absent).

For what it is worth - in https://globalbioticinteractions.org/ - there’s a way to support and refute interaction claims.

So, when applied to interactions, you can express stuff like:

Cat eats mouse is supported by observation made on 2022-10-06 by Dr. Abbas in Cairo, Egypt.

Cat eats mouse is refuted by observation made on 1969-07-20 by Neil Armstrong on the moon. (cats and mice have not yet invaded the moon, and were not observed during the moon landing).

This method can also be used to document disagreements based on expert opinion.

I am sure that others have considered this topic and I am curious to hear their experience and inspect explicit examples/ use cases.

Related issues include: Refute interacton · Issue #552 · globalbioticinteractions/globalbioticinteractions · GitHub, and introduce presence/absence for interactions · Issue #562 · globalbioticinteractions/globalbioticinteractions · GitHub.

btw - what is the state of the “new” model? I am still curious to hear more about specific examples / real life use cases / responses to feedback (see e.g., Use Case: Biotic Interactions - Sottunga Island Melitaea cinxia Population Study - #2 by jhpoelen ).

I think Abby’s point is that unless presence=absent is explicitly encouraged and used, the vast majority of records will remain presence-only. Very few people report absences, and those unreported absences are lost data.

The Darwin Core comments for occurrenceStatus (Darwin Core quick reference guide - Darwin Core) currently read:

“For Occurrences, the default vocabulary is recommended to consist of “present” and “absent”, but can be extended by implementers with good justification.”

That’s flexible enough for Darwin Core, but any new data model needs to reflect the fact that “present” means “present”, but “not found” doesn’t mean “absent”.

Examples I’ve used when explaining the problem:

(1) You looked for elephants by day in a 2ha patch of African savanna. You didn’t see any. “Not found” means “truly absent”.

(2) You searched for an uncommon plant or animal on a sampling plot. Your search might have turned one up after many hours of looking, but you only had one hour on the sample plot, and you didn’t find the target. “Not found” means “possibly absent”.

(3) You searched for a plant or animal that’s small and cryptic, or seasonal, or condition-dependent. You looked very carefully and didn’t find any. You are far from confident that “not found” here means “absent”, because the target could be well-hidden, at the wrong life stage for an ID, or waiting for the next rain to emerge. “Not found” here needs to be qualified as “impossible to be sure that it’s absent”.

So yes, it would be valuable in a new data model to record “not found”, but that should not be equated with “absent”, and there should be scope (as in Darwin Core) for recording “almost certainly absent”, “possibly absent” and “impossible to be sure that it’s absent”.

@abbybenson I had a similar question (see Diversifying the GBIF data model - intro - #11 by DeboraArlt) but haven’t really got a satisfying answer. When I have thought about absences before, I have been into combining a presence data set with the checklist used for the inventory; this way you can infer absences. The alternative is reporting all absences explicitly, which to me seems a waste of storage space. I am definitely interested in working on a use case.
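A minimal sketch of that inference, assuming each inventory event used the same checklist and that presences arrive as (event, taxon) pairs. The function name, event identifiers and example data are hypothetical, not part of any GBIF or OBIS tooling.

```python
def infer_not_detected(checklist, event_ids, presences):
    """Map each event to the checklist taxa that have no presence record in it."""
    # Start every event with an empty set so events with no presences still appear.
    found = {event_id: set() for event_id in event_ids}
    for event_id, taxon in presences:
        found.setdefault(event_id, set()).add(taxon)
    targets = set(checklist)
    return {event_id: sorted(targets - taxa) for event_id, taxa in found.items()}


# Hypothetical example data
checklist = ["Acropora cervicornis", "Ocypode quadrata"]
event_ids = ["survey-1", "survey-2"]
presences = [("survey-1", "Ocypode quadrata")]

not_detected = infer_not_detected(checklist, event_ids, presences)
```

Note that the result is a list of taxa not detected per event. Whether that can be read as absence still depends on the protocol behind the checklist.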

Really interesting question whether absence is a waste of storage. I am not sure I totally agree, seeing how important it can be. And the inference can be tricky, since it is definitely not true absence data.
But no matter your view on that question, I find it interesting that absences do not seem to be included in downloads even when they exist.
I have just uploaded some historic literature data that include absences, and when I now look at it in GBIF, the absences are filtered out by default:

Thanks all,

Some of the discussion on this thread relates to existing standards, but I think the original question was targeting the exploration of a new model. In the model as it stands, I could foresee one way that absences may fit in as events (e.g. observations) targeting a feature of interest (e.g. locations) using a structured protocol where the absence of evidence could be explicitly recorded (how?) or inferred with sufficient context (i.e. a protocol that allows it).

@abbybenson @DeboraArlt - documenting one or more case studies is the best way to get started here to capture the needs and to map them to the model, or adjust the model. Thank you for your interest in preparing one, and it would be a really good addition to have.
Please note that the SCAR group has also started thinking about this so there may be an opportunity to collaborate on it.

Thank you all for the comments and discussion. Much appreciated! Tim is correct that my main concern is not with the current implementation of Darwin Core, which does allow for explicit documentation of absences and which the OBIS community does make use of, but instead the new data model.

It does seem that the next step is to document a case study. @DeboraArlt I’ll attempt to reach out to you here via a direct message to include you in the drafting of the case study. If anyone else would like to help draft the case study, please send me a direct message with your email.

If it helps for a case study, I’d be happy to trial mapping a dataset I uploaded a few months ago to the new model. This is one of the most absence-riddled datasets in GBIF and summarises 27 years of near-nightly light-trapping with a target set of species counted each night. More than 90% of these records are therefore absences (and now very helpfully summarised in event views and coloured red in occurrence views inside GBIF). The dataset is here:

Just for completeness, I also want to emphasise that - both under the old and the new model - we need to encourage the use of presence=absent only in the context of a sampling event (even if the sampling event is for one species) that can be used to tie the occurrence to a method which would have led to the observer recording an occurrence with presence=present if any of the species had been found.

I see from analysing ALA data that some datasets assign every ad hoc observation to its own sampling event. These data are hard to filter out afterwards, but they corrupt analyses that seek to use the increased statistical power of sampling events.

For sampling events and specifically for the presence flag, we need to do a good job (in multiple languages) of explaining why we look for such data and what a data publisher is asserting if they use these elements. If a contributor does not understand the biological and statistical implications of these assertions, the record should be published as a plain presence-only occurrence.

@dhobern and @trobertson. It’s excellent to see Tim write

“absence of evidence could be explicitly recorded (how?)”

because the opposite of “present” is not “absent” in a sampling event or in occurrence data, it’s “not found”, and absence of evidence is not evidence of absence. Both in Darwin Core and in any new model, if this distinction is not made then the data has only one use: for a statistical analysis with binary values, 1 = present and 0 = absent. The data should be useful for more than that.

@datafixer. The intention behind presence=absent is only clear for sampling event data, not for occurrence records individually. Within a single sampling event with an eventID and a reference to the sampling protocol, presence=absent means that the species was absent from the sample. It cannot mean that the species is absent from the site. In that context, it is a binary condition, but it can also be seen as the lower limit on a scale if the occurrence records in the sampling event include individualCount or organismQuantity and organismQuantityType.

Inference of “true” absence at the site is then left as a question for statistical analysis based on the data (observations and measurements) actually carried out at the site (or through modelling across space).
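As one illustration of what such an analysis might look like, here is a sketch of a simple Bayes update. It assumes a prior probability that the species is present at the site and a constant, independent per-visit detection probability. Both inputs and the function name are hypothetical, and real occupancy models are considerably richer.

```python
def presence_given_no_detection(prior_presence, detection_prob, n_visits):
    """Probability the species is present after n_visits visits with no detection."""
    if not (0.0 <= prior_presence <= 1.0 and 0.0 <= detection_prob <= 1.0):
        raise ValueError("probabilities must lie between 0 and 1")
    if n_visits < 0:
        raise ValueError("n_visits must be non-negative")
    # Present but missed on every visit.
    missed = prior_presence * (1.0 - detection_prob) ** n_visits
    # Truly absent, so never detected.
    absent = 1.0 - prior_presence
    if missed + absent == 0.0:
        raise ValueError("a non-detection is impossible under these inputs")
    return missed / (missed + absent)
```

With a detection probability of 1, a single non-detection drives the result to 0. With a detection probability of 0, the result stays at the prior however many visits are made, which is the "impossible to be sure" situation.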

@dhobern. Thank you for making the case to have different “presence” categories in sampling event and occurrence data. Unless I misunderstand you, you are saying that a data model for sampling events could have “recordedStatus” with the binary values “yes” and “no”. Sounds fine to me, and that avoids confusing this result with “occurrenceStatus” with its “present” and scope for alternatives to “absent”.

@datafixer I’m not sure I was advocating anything new, and as far as I can see, presence just needs the two values. Anything else will be too subjective and difficult ever to reinterpret. My comments were just based on how sampling event data already works in Darwin Core.

I believe we should never seek to document absence except in the context of some planned/standardised effort to collect data. If the observer for your elephant example above has a protocol for recording elephant numbers from 2 ha units and visits several or many of these recording elephant numbers, we need a way to state that (for some of those samples) no elephants were found. Rather than leaving this implicit in the fact of a sampling event (i.e. a survey of a 2ha site) having no records, we can make the lack of elephant detections explicit by saying that elephants were absent from this sample (i.e. presence=absent).

If, on the other hand, an observer with no such plan is at a site and spends time (for example) recording species in iNaturalist but sees no elephants, even if it seems surprising to them that they saw no elephants, they should not seek to record an absence, because there is no context to interpret such an asserted absence. Absence only makes sense in the context of a sampling methodology.

In other words, I do not believe it ever makes sense to label plain occurrence records (those not in a sampling event) with presence=absent. Absence in Darwin Core is always absence from a sample, not absence from a location/date.
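A rough sketch of how a publisher-side check for this rule could look, assuming flat records keyed by the Darwin Core terms occurrenceStatus, eventID and samplingProtocol. The function itself is hypothetical and is not an existing validator.

```python
def absence_problems(record):
    """Return reasons why an 'absent' record lacks sampling context (empty if none)."""
    status = (record.get("occurrenceStatus") or "present").strip().lower()
    if status != "absent":
        return []
    problems = []
    # An asserted absence needs a sampling event and a protocol to interpret it.
    for term in ("eventID", "samplingProtocol"):
        if not str(record.get(term) or "").strip():
            problems.append(f"absent record has no {term}")
    return problems
```

For example, `absence_problems({"occurrenceStatus": "absent"})` reports both missing terms, while a presence record always passes.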

@dhobern. “Absence only makes sense in the context of a sampling methodology… Absence in Darwin Core is always absence from a sample, not absence from a location/date.”

By implication, or explicitly? Here is the explicit explanation of “occurrenceStatus”:
“A statement about the presence or absence of a Taxon at a Location.” (List of Darwin Core terms - Darwin Core) Given that explanation for this term, I could enter “absent” in the elephant sense, or “possibly absent” in the inadequate search sense, or “uncertain” for the highly cryptic taxon search. “occurrenceStatus” is independent of methodology.

I don’t see occurrenceStatus as the same as recordedStatus = yes/no, which is what I thought Abby Benson wanted. Are you sure you want to allow the two to be conflated?

Donald’s description and explanation are exactly the same as mine and almost identical to offline descriptions I have provided to Bob on my views of occurrenceStatus. Perhaps the definition of occurrenceStatus needs to be updated (but also we have an opportunity to define it exactly as Donald has stated in the new data model), but that is exactly how everyone I interact with in providing that term uses it. No one I know uses it in the bigger sense of the entirety of a taxon at a location. Furthermore, how can it possibly mean that, when for an occurrence you must have a date associated with coordinates and a taxon? To me it must always be asserting whether that taxon was observed at that location on that specific date, and it doesn’t say anything about other dates.

I’m getting a little confused at this point. The title is “Absences and how they fit in the new data model”, and @abbybenson wanted to encourage absence data to be part of any new data model. Absences are data and they should be part of any sampling record.

There are, however, two kinds of sampling result that can be (and have been) described as “absence”. One is a “not found” result, which is an operational fact: the recorder either did or did not find the indicated taxon during a sampling event.

The second result is an assessment of the occurrence status of the taxon at the locality and on the date of the sampling event. This has nothing to do with other dates or with the general area. This assessment is an interpretation of the “not found” operational result, and as I hope I’ve demonstrated with examples, “not found” does not imply “not occurring at that locality on that date” (which is the same as “absent”). The taxon could be present at that locality on that date, but was missed in the sampling.

This assessment is not a guess as to whether or not the taxon was there. It is an assessment of the difficulty of deciding whether it was there. It is an assessment of the uncertainty of a “not found” result.

These two different sampling results - found/not found, and present/uncertainty of absence - should IMO be part of any new data model as separate categories. Darwin Core does not distinguish these two, with the result that “not found” is equated with “absent” in species distribution models and other statistical exercises, adding to the many other uncertainties in species distribution modelling.

An assessment of uncertainty is a requirement of biological recording. We’ve used it for IDs, locations and time/date. What is the objection to applying uncertainty to a sampling result?

Is there a particular objection to using occurrenceStatus for sampling uncertainty? I’ve used it in this discussion because occurrenceStatus is currently flexible enough not to be binary (“can be extended by implementers with good justification”) and it’s “A statement about the presence or absence of a Taxon at a Location”, which (to me, at any rate) covers “uncertain about presence or absence because sampling was time-limited” (for example).

I informally proposed “recordedStatus” (yes/no) as a term or category for the operational result found/not found, recorded/not recorded. Logically and in other ways this is NOT the same as present/absent, because absence of evidence is not evidence of absence. I wouldn’t object to an alternative term to “recordedStatus”, so long as it means the same.
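To make the separation concrete, here is a minimal sketch of a sampling result that keeps the operational fact and the assessment apart. The class names and the "not assessed" fallback are hypothetical; only the category labels come from this thread.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AbsenceAssessment(Enum):
    ALMOST_CERTAINLY_ABSENT = "almost certainly absent"
    POSSIBLY_ABSENT = "possibly absent"
    UNCERTAIN = "impossible to be sure that it's absent"


@dataclass(frozen=True)
class SamplingResult:
    taxon: str
    recorded: bool  # hypothetical recordedStatus: found (True) or not found (False)
    assessment: Optional[AbsenceAssessment] = None  # interpretation of "not found"

    def __post_init__(self):
        if self.recorded and self.assessment is not None:
            raise ValueError("an absence assessment only applies to a 'not found' result")

    @property
    def occurrence_status(self) -> str:
        if self.recorded:
            return "present"
        if self.assessment is None:
            return "not assessed"  # hypothetical label: not found, no interpretation given
        return self.assessment.value
```

A "not found" result never becomes "absent" on its own here. It stays "not assessed" until the recorder supplies an assessment.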

An expression of occurrenceStatusUncertainty is an interesting proposal. OccurrenceStatus could even be treated as a probability-like score expressed between -1 (certain absence) and 1 (certain presence). But this would be a major change from the current model, and would deserve its own discussion thread.

It seems to me that “not found” is implied whenever occurrenceStatus!=present. OccurrenceStatus=absent is a useful way to encode the common use case of “we looked for this and feel confident it is not here”. Current models have to infer absences, and this is a major limitation for them. Documentation of certain absence within a given spatiotemporal frame is extremely valuable as we continue to watch species go extinct.

A use case that provides a potential solution to the challenges of aggregating organism inventories with information about presence, abundance, and absence of detection is now ready for broader review [1]. The original data set used in the use case consists of target marine species from trawls in a larger campaign. Some of the targets include life stage characteristics of interest. The first version of the use case [2] was reviewed within the context of the OBIS and Humboldt Core Task Group. The current version accommodates challenges encountered when trying to apply a previous solution to data sets with distinct methods of encoding abundance (e.g., from eBird) and with distinct levels of complexity with respect to the organisms targeted (e.g., categories of organism sexes and lengths).

In summary, the solution encodes separately the targets and the results of inventories (protocols executed at a place and time) in such a way that abundances and absences of detection can both be assessed while minimizing the data that need to be stored. Discussion is welcome both here and as comments in the current use case document [1].
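As a loose illustration of that idea only, and not of the actual structures in the use case documents, the sketch below stores the targets once per inventory and stores counts only for taxa that were detected. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Inventory:
    event_id: str
    targets: frozenset  # taxa the protocol would have recorded if found
    counts: dict = field(default_factory=dict)  # only taxa actually detected

    def abundance(self, taxon):
        """Count for a target taxon; 0 if targeted but not detected; None if out of scope."""
        if taxon not in self.targets:
            return None
        return self.counts.get(taxon, 0)

    def not_detected(self):
        """Targets with no detection, derived rather than stored."""
        return sorted(self.targets - set(self.counts))


# Hypothetical example
trawl = Inventory("trawl-1", frozenset({"taxon A", "taxon B"}), {"taxon A": 3})
```

Here `trawl.abundance("taxon B")` is 0 because the taxon was targeted but not detected, while a taxon outside the targets yields `None` instead of an implied absence.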

[1] Use Case: Humboldt Extension Inventories, with Absence Data version 2

[2] Use Case: Humboldt Extension Survey and Absence data version 1
