Data models and standards for improved usability

From the white paper:

Over the last two decades there have been enormous efforts to mobilize biodiversity data, which have resulted in the availability of massive amounts of published data that can be readily discovered, accessed and freely used for onward applications. The process of building meaningful EBVs that can inform indicators needs data that can be reliably tracked across not just organism, space and time but also provenance; the latter includes relevant, complete and searchable metadata about the inventory process and the methods that produced those data. Much of the data shared through biodiversity data platforms lacks one or more of those four components, which limits or excludes its use in the creation of EBVs and biodiversity indicators. Furthermore, much of the data currently shared corresponds to incidental records and lacks any defined inventory or survey methods.

For EBVs that use multi-variable analyses to aggregate and homogenize data across species, space and time, a taxon name, an event date and a set of coordinates are not enough to account for any bias or deficiencies in the available data. One way to help overcome these biases is to publish occurrence and event records with metadata that describes the collection methodology and processes in as much detail as possible. However, this type of correction will only be useful for certain types of analyses. Species occurrence records that represent only the presence of a species (i.e. incidental records) will still not be useful for EBVs that require data enabling inference about species absence. For these EBVs, well-documented monitoring or inventory event data are needed.
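To make the distinction concrete, here is a minimal sketch contrasting the two record types. The field names are standard Darwin Core terms; all values are invented for illustration.

```python
# An incidental record: documents presence only. Nothing here says
# whether the species was looked for anywhere else, so absence
# cannot be inferred from records like this.
incidental_occurrence = {
    "basisOfRecord": "HumanObservation",
    "scientificName": "Parus major",
    "eventDate": "2021-05-14",
    "decimalLatitude": 52.37,
    "decimalLongitude": 4.89,
}

# A sampling-event record: the event documents a defined protocol
# and effort, so in-scope taxa that are *not* reported can be
# treated as inferred absences.
sampling_event = {
    "eventID": "transect-042-2021-05-14",
    "eventDate": "2021-05-14",
    "samplingProtocol": "fixed-width line transect",
    "samplingEffort": "2 observer-hours over 1 km",
    "decimalLatitude": 52.37,
    "decimalLongitude": 4.89,
}
occurrences_in_event = [
    {"eventID": "transect-042-2021-05-14",
     "scientificName": "Parus major",
     "occurrenceStatus": "present"},
    # "absent" is only meaningful because the event's protocol,
    # effort and scope are documented.
    {"eventID": "transect-042-2021-05-14",
     "scientificName": "Sitta europaea",
     "occurrenceStatus": "absent"},
]
```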

As more monitoring data becomes available, expanded best-practice guidelines should cover, among other things, how to share quality metadata containing details of the sampling methods employed, the scope of the survey, and descriptions and provenance of the collected data. To make this practical, biodiversity data platforms will need to review and amend current data-sharing standards and practices, and upgrade their infrastructures to host and display new types of data and data formats, as with GBIF's ongoing consultation to review its current data model. An example of a new standard under review for implementation is the Humboldt extension to Darwin Core (Guralnick et al. 2017, Sica & Zermoglio 2021). Furthermore, data-publishing institutions could be encouraged to create "sub-collections" of their data that meet these metadata requirements and to publish them separately from their larger corpus. An increased focus on publishing past and current monitoring and inventory datasets with the express purpose of supporting EBV and biodiversity indicator creation would require strengthened ties with the research and monitoring communities that produce those data.
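A hypothetical sketch of the "sub-collection" idea: partition a published dataset into records that carry full sampling metadata (publishable as an EBV-ready sub-collection) and the remaining corpus. The field names are Darwin Core terms, but the completeness rule itself is an assumption, not a published requirement.

```python
# One term per component the white paper names: organism, space,
# time, and provenance of the inventory process.
REQUIRED_FOR_EBV = (
    "scientificName",      # organism
    "decimalLatitude",     # space
    "decimalLongitude",
    "eventDate",           # time
    "samplingProtocol",    # provenance / method
    "samplingEffort",
)

def split_corpus(records):
    """Return (ebv_ready, remainder) based on metadata completeness."""
    ebv_ready, remainder = [], []
    for rec in records:
        if all(rec.get(term) not in (None, "") for term in REQUIRED_FOR_EBV):
            ebv_ready.append(rec)
        else:
            remainder.append(rec)
    return ebv_ready, remainder
```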

A key aspect is that the primary mechanisms for collecting useful data on biodiversity have changed over time. I see three overlapping eras:

  1. SPECIMENS - Prior to the middle of the 20th century, the vast majority of information we have on biodiversity comes from the work of collectors. A small workforce delivered data collected primarily from the most accessible locations (with no sampling methodology) but with very broad taxonomic scope. The model could never scale to support planetary-scale modelling, but it gives us our earliest useful data.
  2. HUMAN OBSERVATIONS - From the 20th century onwards, the vast bulk of our data is from field observations either by professional scientists (ecologists, etc.) or volunteer efforts (bird atlases, bird banding/ringing, citizen science, etc.). The taxonomic scope is often narrower than with specimen collection, but (for the taxa that can be recorded by amateur naturalists) data volumes can be very large (though often still with insufficient thought given to sampling methodology).
  3. MACHINE OBSERVATIONS - We are near the beginning of a third era, in which the simplest, most cost-effective and scalable way to collect biodiversity data will be through machine solutions: eDNA, AI processing of webcam, UAV and satellite images or of acoustic recordings, etc. Such methods are much more amenable to broad-scale sampling approaches and can (at least with eDNA) cover most organism groups.

The coverage and quality profiles of these three categories (and of the associated recording eras) are fundamentally different. Successful integration will require us to find ways to cross-calibrate these diverse signals.
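As a rough sketch of what a first step toward such cross-calibration might look like, records could be partitioned by era before each era's bias profile is modelled separately. The basisOfRecord values below are real Darwin Core vocabulary, but the mapping onto the three eras is my assumption (eDNA, for instance, is often published as MaterialSample rather than MachineObservation).

```python
ERA_BY_BASIS = {
    "PreservedSpecimen": "specimens",
    "FossilSpecimen": "specimens",
    "HumanObservation": "human_observations",
    "MachineObservation": "machine_observations",
    "MaterialSample": "machine_observations",  # e.g. eDNA samples
}

def partition_by_era(records):
    """Group records so each era's coverage and quality profile
    can be modelled (and eventually calibrated) separately."""
    eras = {"specimens": [], "human_observations": [],
            "machine_observations": [], "unassigned": []}
    for rec in records:
        era = ERA_BY_BASIS.get(rec.get("basisOfRecord"), "unassigned")
        eras[era].append(rec)
    return eras
```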

Is there not also work within each of these areas to improve data flows? For specimens, a large amount of data still remains undigitised: how can we incentivise the digitisation of these remaining collections? For human observations, how do we increase the taxonomic representativeness of data collection and make sure that it is collected in a more systematic manner? And for machine observations, are standards and formats sufficiently mature for the incorporation of these new data? I am thinking of a specific example: remote sensing data being used to identify individual tree species.

Yes, there is. It is likely to help us if we partition our strategies for data quality around approaches that best fit the characteristics of each data source, probably subdivided into smaller categories. I would like us to do a better job of documenting, for each dataset, which elements are associated with a high level of confidence and which may be less definite.

For a large natural history collection without an active curator for a particular group, it would be helpful to know that the species identifications simply document determinations from the middle of the 20th century and are likely not to reflect current taxonomy. For a collection maintained by the world expert in the group, by contrast, all specimen records may reflect the very best current knowledge. Treating these two collections as equivalent would be a mistake. Asking questions and collecting metadata during publication of the datasets could allow us to extract much more usable information from the data.
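An illustrative check along those lines, assuming the publisher has supplied dataset-level curation metadata: dateIdentified is a Darwin Core term, but "activelyCurated" and the 1970 cutoff are hypothetical, purely for illustration.

```python
def identification_confidence(record, dataset_meta):
    """Crude confidence label for a specimen's identification."""
    if dataset_meta.get("activelyCurated"):
        # E.g. a collection maintained by the world expert in the group.
        return "high"
    date_identified = record.get("dateIdentified", "")
    # Lexicographic comparison works for ISO-format dates. A
    # mid-20th-century determination in an uncurated collection is
    # unlikely to reflect current taxonomy; the cutoff is arbitrary.
    if date_identified and date_identified[:4] < "1970":
        return "low"
    return "unknown"
```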

Similar issues arise, for example, with the processing of camera trap images. Some of these are what I would consider HUMAN OBSERVATIONS: an automated process took the image, but the identification was made by a human viewing it. It is then useful to know whether that identification was crowd-sourced from the public or provided by a biologist with experience in the given habitat and region. Others may be MACHINE OBSERVATIONS generated by fully automated AI pipelines. This too should be documented, along with metadata on the algorithm and the training image set.
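A sketch of the distinction drawn above. The "identificationMethod", "algorithm" and "trainingDataset" fields are hypothetical metadata fields used here for illustration, not ratified standard terms.

```python
def classify_camera_trap_record(rec):
    """Label a camera-trap record by who or what made the identification."""
    method = rec.get("identificationMethod")
    if method == "ai_pipeline":
        # Fully automated identification: the record should also carry
        # provenance for the model and its training image set.
        if not (rec.get("algorithm") and rec.get("trainingDataset")):
            return "machine_observation_missing_provenance"
        return "machine_observation"
    if method in ("expert_review", "crowdsourced"):
        # The camera is automated, but the identification is human;
        # the method value records what kind of human made the call.
        return "human_observation"
    return "undocumented"
```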