Filtering isn't cleaning

@waddink

I’m sorry if I gave you that impression. The records I’m concerned about are both the legacy records in the CMS and the new digitisations enabled by collaboration with DiSSCo. In both cases the institution has the primary responsibility for generating the data. In both cases there should be help available to the institution so that either the CMS dataset or the “proto-DES” dataset (not yet exposed anywhere) or both can get fixes before a DES goes public.

IMO it does not help if “more experts, citizen scientists and data-fixing services have access to it”, i.e. to institutional data wrapped in a DES. For 15+ years the dream of biodiversity informatics people has been: “We will expose the data on the Web and the community will identify any problems”. That dream is wildly optimistic and has failed spectacularly, for two broad reasons. First, the “community” generally lacks the skills to identify systemic issues, as opposed to problems in individual records, e.g. “this moss does not occur here, so either the ID or the location must be wrong”. Second, institutions do not fix their data quickly, as can easily be demonstrated by following the issues flagged by GBIF through successive versions of institutional datasets.

As I have argued before, DiSSCo represents a new opportunity to improve biodiversity data quality, but that opportunity will only be realised if DiSSCo focuses on quality at source. It doesn’t matter whether this is done for legacy data, for new digitisations of existing holdings or for entirely new records. It needs to be done before the record goes downstream and becomes part of a DES.

If the identification and fixing of record problems is left for “later”, by unspecified DES users with unspecified goals, then you are committing DiSSCo to the same failed quality model we suffer with today.

Please also understand that the potential for machine discovery of problems, whether in the raw material for a DES or in the DES itself, is very limited. Developing software that can replace even some of what data specialists do would be slow and expensive. It is cheaper and quicker to get people to clean data, and as 2025 approaches I suggest (again, as I have in previous years) that DiSSCo would be wise to organise this option. I asked in an earlier comment: “Are DiSSCo and the participating institutions adequately resourced for this?”

@waddink, this discussion so far has been about generalities, so may I ask some specific questions?

You work at Naturalis, and Naturalis shares a Mollusca dataset with GBIF. GBIF has flagged numerous records in that dataset with issues. In a quick look at the source archive yesterday I found a very large number of within-field, between-field and between-record data problems not flagged by GBIF. Of the 738,023 records in the dataset, 83% have a modified field dated in 2015.

  • Have the Mollusca dataset curators acted on GBIF flags?
  • Will the dataset be cleaned before DESes are built from it?
  • How will that cleaning be done?
  • If the dataset is not cleaned before DESes are built from it, what proportion of the data problems flagged by GBIF do you expect will be fixed in the CMS as a result of annotations or DES-type corrections?
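As an aside, the kind of quick profiling mentioned above (counting records by the year in their modified field) takes only a few lines of Python. This is a sketch, not the check I actually ran: the tab-separated layout and the column name `modified` are assumptions about a typical Darwin Core occurrence file, and the sample rows are invented, not Naturalis data.

```python
import csv
import io
from collections import Counter

def modified_year_counts(tsv_text, modified_col="modified"):
    """Count records by the year prefix of their modified value.

    Assumes tab-separated text with a header row, as in a typical
    Darwin Core archive occurrence file (an assumption for illustration).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    years = Counter()
    for row in reader:
        value = (row.get(modified_col) or "").strip()
        # Take the leading 4 digits as the year; anything else is "unknown".
        years[value[:4] if value[:4].isdigit() else "unknown"] += 1
    return years

# Tiny invented sample, purely to show the shape of the check.
sample = (
    "occurrenceID\tmodified\n"
    "A1\t2015-06-01\n"
    "A2\t2015-07-15\n"
    "A3\t2021-01-09\n"
    "A4\t\n"
)
counts = modified_year_counts(sample)
print(counts["2015"], counts["unknown"])  # → 2 1
```

The same pass over the real occurrence file would yield the 2015 proportion quoted above; grouping by other fields (collector, georeference source) is an equally cheap way to surface systemic, rather than per-record, problems.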
