Transparency of data and methodologies used in indicators

From the white paper:

Assessing whether the quality of primary biodiversity data meets the standards needed for further use in indicators is critical. Even if we were to satisfy the need for better-quality data, questions would remain about how homogeneous and repeatable the treatment of the same data can be in different contexts. The same data, drawn from multiple sources, are being used by distinct organizations or collaborations to build EBVs and indicators. Stakeholders developing a given EBV or indicator treat the data independently, apply their own filters and quality checks, and perform their own taxonomic harmonization, which may be more or less similar to the processes used by other stakeholders. Better consistency and transparency might be achieved if biodiversity data platforms could prepare and share species occurrence data in advance for EBV and indicator creation, as EBV-usable datasets, or make the workflows used to process the data publicly available. GBIF is exploring ways of assisting this process, for example through pre-filtered versions of GBIF-mediated data exported regularly to public cloud environments.
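The pre-filtering described above amounts to a documented, repeatable set of quality checks that every downstream user can inspect. As a minimal sketch: the field names echo Darwin Core terms, but the records, thresholds and function name below are illustrative assumptions, not an actual GBIF export schema.

```python
# Hypothetical pre-filtering step producing an "EBV-usable" occurrence set.
# Records and thresholds are invented for illustration.
records = [
    {"scientificName": "Apis mellifera", "decimalLatitude": 48.85,
     "decimalLongitude": 2.35, "coordinateUncertaintyInMeters": 30,
     "hasGeospatialIssue": False},
    {"scientificName": "Apis mellifera", "decimalLatitude": None,
     "decimalLongitude": None, "coordinateUncertaintyInMeters": None,
     "hasGeospatialIssue": False},
    {"scientificName": "Apis mellifera", "decimalLatitude": 0.0,
     "decimalLongitude": 0.0, "coordinateUncertaintyInMeters": 120_000,
     "hasGeospatialIssue": True},
]

def filter_for_ebv_use(rows, max_uncertainty_m=10_000):
    """Apply one documented, repeatable set of quality filters."""
    return [
        r for r in rows
        if r["decimalLatitude"] is not None
        and r["decimalLongitude"] is not None
        and not r["hasGeospatialIssue"]
        and r["coordinateUncertaintyInMeters"] is not None
        and r["coordinateUncertaintyInMeters"] <= max_uncertainty_m
    ]

clean = filter_for_ebv_use(records)
print(len(clean))  # 1: only the first record passes every filter
```

If such a filter were published alongside the exported dataset, two stakeholders building different indicators from the same export would at least start from an identical, traceable baseline.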

A second opportunity of equal importance is to improve the communication pipeline between data providers and data users. Data, and communications about these data, tend to flow in one direction: from local data collection and mobilization to scientists and policymakers, with little to no communication in the opposite direction. GBIF and other biodiversity data platforms have made commendable efforts to track downloads of data and, through the use of Digital Object Identifiers (DOIs), to report citations in published works back to data publishers. Improved communication builds trust across the data-provider network by reporting back to organizations and individuals at the local level about how their data are used. This could take many forms, including notifications that alert data publishers when their data have been used in the creation of EBVs, biodiversity indicators and other high-level policy documents, using tools similar to the GBIF citation widget. Another effective strategy could be to present specific examples of how high-quality data and associated metadata influence science and policy, as part of capacity-building activities and other public events. These possibilities will remain only possibilities, however, without greater transparency.

A third opportunity to work towards greater transparency and traceability across the entire information supply chain is to document all steps taken to create indicators. In this complex process, it is not uncommon for the processes and analyses used to generate these synthesized data and policy products to remain undocumented or hidden from public view. It is equally difficult to know exactly which data were used in these processes, and how. The CBD Secretariat and UNEP-WCMC are currently working on standardizing the metadata requirements for the proposed headline indicators (see the example for the Species Habitat Index, UNEP-WCMC 2021); to improve traceability further, this must include clear reporting of the datasets (DOIs) used and the data providers consulted.
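One lightweight way to meet that documentation requirement is to attach a machine-readable provenance record to each indicator release, listing the dataset DOIs and the processing steps applied. The structure below is a hypothetical illustration, not the proposed CBD/UNEP-WCMC metadata standard; all field names and DOI values are invented.

```python
import json

# Hypothetical indicator provenance record; every field name and value
# here is an illustrative assumption, not an actual standard or DOI.
indicator_metadata = {
    "indicator": "Species Habitat Index (example)",
    "version": "2021-1",
    "source_datasets": [
        {"doi": "10.15468/dl.example1", "accessed": "2021-05-10"},
        {"doi": "10.15468/dl.example2", "accessed": "2021-05-12"},
    ],
    "processing_steps": [
        "remove records with coordinate uncertainty above 10 km",
        "harmonize names against a published taxonomic backbone",
        "aggregate occurrences to 1 km grid cells",
    ],
}

print(json.dumps(indicator_metadata, indent=2))
```

Because every source dataset carries a resolvable DOI, such a record would let anyone walk back from the headline indicator to the exact data downloads and providers behind it.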

We need to consider the various dimensions that contribute to data quality and fitness for use in assessment and monitoring processes. Our goal is to understand what biodiversity actually exists or existed at a given time and place. Every measurement associated with each data record (coordinates, date, taxonomic identification, etc.) may be subject to precision and accuracy issues, and we generally focus on how to assess these aspects (precision and accuracy) for the data that finally make their way into GBIF and other aggregators. This assessment is difficult or impossible, since we lack an omniscient perspective. However, we can more readily explore other aspects.

When a user views or downloads data, how faithful are these data to what the original observer recorded? A large proportion of the data errors I find in GBIF/ALA indicate loss or corruption of information (reduced precision and/or accuracy) along the data chain. Better collection of metadata and processing instructions could enable aggregators to handle the data more faithfully. Improved taxonomic datasets would also help enormously.
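A check of the kind this implies could compare each record as published by the source with the version served by an aggregator, and flag any loss of precision along the data chain. The records and rules below are a hypothetical sketch, not an actual GBIF/ALA validation routine.

```python
# Hypothetical faithfulness check: compare a source record with the
# aggregated copy and flag information loss. Values are invented.
def decimal_places(value: str) -> int:
    """Count digits after the decimal point in a coordinate string."""
    return len(value.split(".")[1]) if "." in value else 0

original   = {"decimalLatitude": "48.85341", "eventDate": "2021-06-01"}
aggregated = {"decimalLatitude": "48.85",    "eventDate": "2021"}

issues = []
if decimal_places(aggregated["decimalLatitude"]) < decimal_places(original["decimalLatitude"]):
    issues.append("coordinate precision reduced")
if len(aggregated["eventDate"]) < len(original["eventDate"]):
    issues.append("event date truncated")

print(issues)  # ['coordinate precision reduced', 'event date truncated']
```

Running comparisons like this systematically, and publishing the results, would make the "loss or corruption along the data chain" visible rather than something individual users discover by accident.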

I also see issues around the suitability of some data sources for incorporation and aggregation, at least in the form in which they are aggregated today. Structural fit does not mean that the data should be included without question. We need clearer documentation of what our aggregated data models are intended to mean, and we need processes that involve intelligent consideration of whether including each dataset supports this intent. As an example, it is natural to organise machine detections of tagged individuals as sample events, but these are samples of tagged individuals, not of local biodiversity; importing such data into GBIF/ALA as sample events implies a suitability for purposes they cannot support.

Clearly, we also need to consider the relevance of the data for each user need. This too depends on clarifying and tightening the definition of what the aggregation pipelines do and what data commitments are made by the aggregator regarding the expected use of the aggregated data. Note that it might be appropriate for GBIF and other aggregators to offer separate views, with different data commitments, for use in biodiversity monitoring and e.g. for use as a global virtual natural history collection.

Overarching all this is the question of the transparency of the whole pipeline. If we wish to deliver FAIR data, we must be transparent concerning how data are modified and transmitted along the chain.

The result of increased focus on faithfulness, suitability, relevance and transparency would be elevated fitness for different uses and increased trust in the data and in products using the data.


One of the points raised on the webinar that surprised me relates to the transparency aspect @dhobern highlights.

Paraphrasing from the video (1hr 46m 20s) it was mentioned that:

The CBD accepts indicator-level data submitted in three forms:

1. a dataset supporting the report;
2. no supporting data, but a description of what data were used; or
3. a report with (effectively) no data.

The open data community has matured significantly over the last decade and it is now accepted practice to cite source data that is made openly available, something for which all the tools and infrastructure are largely in place.

Making data available allows for scrutiny and further research, raises trust in the conclusions and helps avert skepticism (“does climate change exist?”), and helps those providing the underlying data demonstrate the impact of their sharing to their own funders.

It left me wondering: at what point do we need to unite and move to a position that any report must have at least some supporting data available, released under a suitable open license, to be accepted? Even if those data were a compilation of opinions, if properly structured (spreadsheets, GIS, etc.) they could be refined and improved over time.