Transparency of data and methodologies used in indicators

We need to consider the various dimensions that contribute to data quality and fitness for use in assessment and monitoring processes. Our goal is to understand what biodiversity actually exists or existed at a given time and place. Every measurement associated with a data record (coordinates, date, taxonomic identification, etc.) may be subject to precision and accuracy issues, and we generally focus on how to assess precision and accuracy for the data that finally make their way into GBIF and other aggregators. This assessment is difficult or impossible, since we lack an omniscient perspective against which to check each record. However, we can more readily explore other aspects.

When a user views or downloads data, how faithful are these data to what the original observer recorded? A large proportion of the data errors I find in GBIF/ALA indicate loss or corruption of information (reduced precision and/or accuracy) along the data chain. Better collection of metadata and processing instructions could enable aggregators to handle the data more faithfully. Improved taxonomic datasets would also help enormously.
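
One way to make this kind of loss visible is to compare what the observer recorded with what the aggregator serves. The following is only a minimal sketch, assuming both versions of a record are available as coordinate strings. The field names `latitude` and `longitude` and the function names are hypothetical and do not correspond to any GBIF or ALA API.

```python
from decimal import Decimal


def decimal_places(value: str) -> int:
    """Count digits after the decimal point, keeping trailing zeros.

    Assumes a well-formed decimal string such as "149.1300".
    """
    exponent = Decimal(value).as_tuple().exponent
    return max(0, -exponent)


def faithfulness_issues(original: dict, aggregated: dict,
                        fields=("latitude", "longitude")) -> list:
    """List differences between the observer's record and the aggregated one.

    The field names are illustrative placeholders, not a real schema.
    """
    issues = []
    for field in fields:
        src, agg = original[field], aggregated[field]
        if decimal_places(agg) < decimal_places(src):
            # Digits were dropped somewhere along the data chain.
            issues.append(f"{field}: precision reduced ({src} -> {agg})")
        elif decimal_places(agg) > decimal_places(src):
            # Digits were added that the observer never recorded.
            issues.append(f"{field}: precision inflated ({src} -> {agg})")
        elif Decimal(agg) != Decimal(src):
            issues.append(f"{field}: value changed ({src} -> {agg})")
    return issues


original = {"latitude": "-35.2809", "longitude": "149.1300"}
aggregated = {"latitude": "-35.28", "longitude": "149.1300"}
print(faithfulness_issues(original, aggregated))
```

A check like this only covers coordinates held as strings. Dates and taxonomic identifications would need their own comparisons, and none of it is possible unless the original values are retained alongside the processed ones.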

I also see issues around the suitability of some data sources for incorporation and aggregation, at least in the form in which they are aggregated today. Structural fit does not mean that the data should be included without question. We need clearer documentation of what our aggregated data models are intended to mean, and we need processes that involve intelligent consideration of whether including each dataset supports this intent. As an example, it is natural to organise machine detections of tagged individuals as sample events, but these are samples of tagged individuals, not of local biodiversity. Importing these data as sample events in GBIF/ALA implies suitability for purposes they cannot support.

Clearly, we also need to consider the relevance of the data for each user need. This too depends on clarifying and tightening the definition of what the aggregation pipelines do and what data commitments the aggregator makes regarding the expected use of the aggregated data. Note that it might be appropriate for GBIF and other aggregators to offer separate views with different data commitments, for example one for use in biodiversity monitoring and another for use as a global virtual natural history collection.

Overarching all this is the question of the transparency of the whole pipeline. If we wish to deliver FAIR (findable, accessible, interoperable, reusable) data, we must be transparent about how data are modified and transmitted along the chain.

The result of increased focus on faithfulness, suitability, relevance and transparency would be elevated fitness for different uses and increased trust in the data and in products using the data.
