Filtering isn't cleaning


[Image by author; thoughtful woman component by Evgenii Naumov from here.]

GBIF-mediated datasets are often very messy. Users can choose either to clean messy records or to filter them out of the dataset.

For some kinds of messiness, filtering can be done as part of a record search, but more targeted filtering can be done after download using an R package, other command-line tools, a table editor or a spreadsheet.

But filtering means that records are discarded, which is a drastic step and often unnecessary. For example, the Smithsonian neglected to enter an eventDate in a million records with date entries in other fields, and the records were flagged “Recorded date invalid” by GBIF. Cleaning those records by building eventDate from other fields would be better than throwing them out.
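A minimal sketch of that kind of repair, assuming Darwin Core-style year/month/day fields (the field names and record shape here are illustrative, not the Smithsonian's actual data):

```python
from datetime import date

def build_event_date(record):
    """Assemble an ISO 8601 eventDate from year/month/day parts
    when eventDate itself is blank. Returns an existing eventDate
    untouched, or None if the parts are missing or invalid."""
    if record.get("eventDate"):
        return record["eventDate"]
    try:
        y = int(record["year"])
        m = int(record["month"])
        d = int(record["day"])
        return date(y, m, d).isoformat()  # validates the combination
    except (KeyError, ValueError, TypeError):
        return None

# A record that would be flagged "Recorded date invalid" but is
# repairable from data within the record:
print(build_event_date({"eventDate": "", "year": "1987", "month": "6", "day": "3"}))
# → 1987-06-03
```

Invalid combinations (e.g. month 13) return None rather than a bad date, so genuinely unrepairable records can still be identified afterwards.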

Users of Atlas of Living Australia have long complained about data quality issues, and ALA’s response has been to filter out records by default in a records search. The default filters can be turned off or modified by users, but the default results sometimes reveal surprising failures.

The Western Australian Museum is a case in point. ALA as of 2023-08-22 excluded 49959 out of 54155 WAM fish records (92%; here) and 98198 out of 168494 WAM herpetology records (58%; here), in both cases mainly because of “location uncertainty”. Many of those exclusions are repairable using data within the record.

A little-discussed feature of filtering is that it disproportionately excludes records shared from collections. The table below compares HumanObservation and PreservedSpecimen occurrences indexed by GBIF, as of 2023-08-22. The numbers are the proportions of records with the indicated spatial data issue, multiplied by 100 000 to make them easier to compare:


There are similar results for other data issues. If you filter a mixed bag of occurrence records, you are far more likely to exclude museum and herbarium records than citizen-science ones, because collections have messier data.
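The comparison behind such a table can be sketched in a few lines. The records below are illustrative stand-ins for a parsed GBIF occurrence download; the issue labels (COORDINATE_ROUNDED, ZERO_COORDINATE) are real GBIF flag names, but the counts are made up:

```python
from collections import Counter

def issue_rate_per_100k(records, issue):
    """For each basisOfRecord, the number of records carrying the
    given GBIF issue flag per 100 000 records of that basis."""
    totals, flagged = Counter(), Counter()
    for rec in records:
        basis = rec.get("basisOfRecord", "UNKNOWN")
        totals[basis] += 1
        if issue in rec.get("issues", []):
            flagged[basis] += 1
    return {b: round(100_000 * flagged[b] / totals[b]) for b in totals}

sample = [
    {"basisOfRecord": "HumanObservation", "issues": []},
    {"basisOfRecord": "HumanObservation", "issues": ["COORDINATE_ROUNDED"]},
    {"basisOfRecord": "PreservedSpecimen", "issues": ["ZERO_COORDINATE"]},
    {"basisOfRecord": "PreservedSpecimen", "issues": ["ZERO_COORDINATE"]},
]
print(issue_rate_per_100k(sample, "ZERO_COORDINATE"))
# → {'HumanObservation': 0, 'PreservedSpecimen': 100000}
```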

Robert Mesibov (“datafixer”)


One point to consider is that while filtering might seem like a less-than-ideal step, it often stems from the complexities involved in handling such data. Building a clean dataset requires meticulous annotation, curation and even automated processes. The issue of missing eventDate is a good example. What do you do when eventDate is not in place? If eventDate is missing, can I use automated algorithms that attempt to infer it from other available data within the record? In some cases yes; in most cases, no. Without a clear solution in place, filtering becomes a pragmatic choice that at least allows the next step (publishing, citing, linking). It’s a balance between working with what’s feasible now and aiming for more comprehensive solutions in the future.


@sharif.islam, of course there is a balance between what is known now and what might be known in future. But that should not be a reason to filter rather than to clean. The argument “I can’t clean all the records, so I can’t clean any of them” is not only logically indefensible, it’s bad science.

“What do you do when eventDate is not in place? If eventDate is missing, can I use automated algorithms that attempt to infer it from other available data within the record? In some cases yes; in most cases, no.”

I would disagree with both your “most cases, no” and your omission of human effort (“can I use human effort or automated algorithms…”), and especially the latter. I’ve discussed that issue here: Cleaning GBIF-mediated data: "manual" vs "automated" is a false dichotomy and would be happy to talk over the details with you privately by email.

What you might like to discuss further in this thread is this question: Is it better to publish/cite/link low-quality data as soon as possible, or to publish/cite/link data when that data reaches best-possible quality after careful investigation? Filtering suits the first, cleaning the second.

It’s always easier, quicker and cheaper to clean data at source rather than downstream. Nevertheless, according to some of the collection managers I’ve corresponded with, the digitisation and mobilisation of collection records is and has been promoted (and funded) largely to “get the data out there ASAP”, with cleaning to be done “later” (no date specified), by persons and with funding sources unknown.

The result is that the ocean of collections-based biodiversity data is grossly polluted. Cleaning is left to users, who cannot easily return cleaned data to that ocean. I’m not surprised that filtering has become popular (cf ALA, rgbif), but it needs to be seen as a last resort when cleaning fails, not as a first choice for data use.


Yes, we need to add more effective #sharedknowledge options. How can human action be closer to the source (#wikidata? #genealogy models as being explored by NHM London)? What are ways we can improve this?

As @datafixer points out too, many of the “fixes” suggested or asserted by automated processes may also need to be checked, by staff who don’t currently exist or don’t have the time.

One suggestion that might be helpful is to begin adding #round-tripping to our digitization efforts as a standard-of-practice expectation. That is, write into our grants, grant programs and standards (e.g. the Specimen Data Mgmt Plans now required by NSF alongside our Data Mgmt Plans) some period (a year?) post digitization of a given collection specifically for managing automated and human-mediated annotations/assertions. IOW, put this in the plan(s) from the beginning. Getting data out the door isn’t the end; it’s really only the very start of the existence of these objects in a “public” space.

If the grant plan includes support for a post-sharing response to assertions (e.g. GBIF flags), there need to be clearly defined ways to make that response. I’ve previously made fun of the ineptitude of collections staff in handling digital fixes, but I recognise this is a serious problem. The grant ask should specify how staff will get help to do fixes.

Please note that the current divide between the way collections and other data are shared and the way they originate and are maintained (e.g. in a CMS) is a big one. The divide exists because the shared form, Darwin Core, was devised as a way to get around that huge diversity of original data structures. For many data holders, DwC is nothing like a subset of their “real” data. The latter have to be extensively modified to fit DwC standards. As a result, I think many data holders see DwC as “for external use only”, something to be built because the institution or platform is obliged to share data. Internally, it’s not needed.

This divide creates a conflict when data needs fixing. Should the data holder fix just the DwC dataset, to satisfy DwC users, or fix the source data, and check that the new DwC dataset has no more flags or other issues? The first choice is easier and quicker than the second, which makes round-tripping a little ambiguous.


@datafixer we certainly do need to be kind. It’s a tough situation many of them are in. And our world and demands for mobilizing data are growing and evolving. Looping in @sharif.islam I also note other communities have similar / parallel experiences to share and learn from.

See the IMLS Data Quality Evaluation Project, where the focus is on downstream issues but which is still relevant to this conversation.

@Debbie, many thanks for that link, which is superb. I couldn’t say better for biodiversity informatics what is outlined here for academic and business librarians.

Thanks to @Debbie for the IMLS report and to @datafixer for the input.

I have been thinking about what we can do in a stepwise/agile manner given our current constraints. A data literacy training and capacity-building framework and proposal would be ideal. I think GBIF and iDigBio are already doing some of this, and it is also in the roadmap of DiSSCo. However, we need something like curated and maintained training materials and tools that we can point to in order to improve data quality. Perhaps something as simple as a curated GitHub page (similar to the digitisation guides here) could be a good start?

At some point we did talk about organising a regular global online helpdesk/office hour for collection managers/curators but that requires some planning and commitment.


@sharif.islam, I recall your comment on this topic last November, and please consider my response at the time.

It would be difficult for already-busy collections staff to find time to train themselves using online materials. Resources of that kind already exist but are not focused on the needs of collection managers. I suggest the “roadmap of DiSSCo” needs both capacity building/off-loading (see here) AND a dedicated helpdesk to assist with problems as they arise.

As you say, this requires planning and commitment. It won’t happen if the goal is simply generating DESes, with data quality to be dealt with “later”.

As a reminder, although they’re not solely focused on collections, GBIF’s training courses remain current and have been used to support training in dozens of events around the world, working within our broader capacity enhancement framework. @larussell and @mgrosjean ran a half-day collections-specific adaptation at SPNHC earlier this year and could provide more details if it’s helpful.

Also, it seems at least tangentially apropos to mention that, even though it’s focused on nodes support, our technical support hour series will restart in September with a session on citation tracking from @dnoesgaard.


I’ve been following this conversation with a lot of interest, and I’d like to toss in a related idea, although I’m not sure how viable it is. Training and annotation/flags are two ways to attack this problem, and I wonder if “pull requests” might be another. Because no matter how perfect the training, or how well-staffed the organization, there will always be inconsistencies found by users.

It isn’t unusual for someone to improve the quality of the downloaded data instead of filtering it (although I agree that filtering is the default for most of us). @datafixer showed a good example of where this is possible, provided you have the resources to do it.

In GBIF, there is a mechanism for minting a DOI for derived data, but no mechanism to submit any code/QAQC work to the data originator for incorporation into the dataset. Likewise, issues can be flagged but not repaired by users. Here is an example of an issue that would have been a simple PR, but the person who found it couldn’t do more than flag it as an issue. So we have GitHub issues for DQ, but no equivalent of GitHub pull requests. It seems a shame to let this work go to waste.

This suggestion won’t fix all DQ problems, and there are some “improvements” that would be more appropriate as annotations. But I believe it’s worthwhile to take advantage of work that is already being done but has no mechanism for integration. In my vision, these improvements would only require data stewards to click buttons and review records, as on GitHub, which would reduce the necessary people-hours and technical training. I’m guessing the infrastructure I’m suggesting is more difficult to build than I can imagine, but I thought I would throw this out there for folks to poke holes in.
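To make the idea concrete, here is a sketch of what such a “data pull request” might look like. Everything here is hypothetical (the object shape, the function names, the guard against stale patches): it is not an existing GBIF or DiSSCo mechanism.

```python
def make_data_pull_request(occurrence_id, field, old, new, justification):
    """A hypothetical 'pull request' for one occurrence record:
    a self-describing patch a data steward could review and merge."""
    return {
        "occurrenceID": occurrence_id,
        "patch": {"field": field, "from": old, "to": new},
        "justification": justification,
        "status": "open",  # becomes "merged" or "rejected" after review
    }

def merge(record, pr):
    """Apply the patch only if the record still holds the expected
    old value, so stale patches are not silently applied."""
    if record.get(pr["patch"]["field"]) == pr["patch"]["from"]:
        record[pr["patch"]["field"]] = pr["patch"]["to"]
        pr["status"] = "merged"
    else:
        pr["status"] = "rejected"
    return record

rec = {"occurrenceID": "urn:x:1", "countryCode": "AU "}
pr = make_data_pull_request("urn:x:1", "countryCode", "AU ", "AU",
                            "trailing whitespace breaks country matching")
merge(rec, pr)
print(rec["countryCode"], pr["status"])  # → AU merged
```

The point of the old-value check is the same as a merge conflict on GitHub: if the record has changed since the fix was proposed, the steward is asked to look again rather than having the patch applied blindly.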


@sformel, that’s an excellent idea, and you could imagine DiSSCo data stewards specialising in this job according to tags assigned by the annotators or by DiSSCo staff, like “nomenclature”, “coordinates”, “dates” etc, similar to the tags GBIF currently assigns on their issues portal.


Hi @datafixer,
As you said, DiSSCo needs both capacity building and a dedicated helpdesk, and this is already planned. In fact a dedicated helpdesk has already been created, but so far it is only used for transnational access calls through ELViS in the Synthesys+ project, as DiSSCo will not have the resources to operate it until DiSSCo becomes operational (probably sometime in 2025). We have also already started implementing support for off-loading, at least for data after it has been loaded into a digital specimen, where experts (with the help of machine services) outside the institution are able to assist with improving data quality.

However, it is important to understand that in principle part of the digital specimen is the CMS data itself. Although for development we start with a copy, with an ingest route and public data only (similar to GBIF), we aim to move to a situation where each change in the CMS is reflected instantly in the digital specimen and vice versa (bidirectional, event-based synchronisation). That makes it a view on the CMS data (or the CMS a view on the digital specimen data) rather than a copy. This requires changes in the CMS systems, though, plus support for restricted-access data, and these things will take time to establish. A trust model also has to be established to allow changing the data using the annotations after a community curation process. Newly created specimen records, whether from a field trip or a digitisation workflow, can first be created as digital specimens and from there loaded into the CMS, for instance after an accessioning or digitisation process.

The digital specimen data in DiSSCo can be supplied to aggregators like GBIF once, for example, a certain quality or completeness is reached. In that sense DiSSCo acts as a traditional GBIF data publisher. There will still be differences between the CMS record and the digital specimen: the digital specimen is extended with additional scientific data that may not be supported by the CMS (hence it is a new object and gets its own PID), and the digital specimen will not contain asset management data (e.g. on which shelf in which building the specimen object can be found, or when a jar needs to be refilled with alcohol). We aim to serve records to GBIF at some point through the digital specimens rather than through the local CMSes, as this allows DiSSCo partners to serve better-quality and extended data. The new GBIF unified data model will help with this.

In principle the digital specimen could be served by the local CMS directly, but in DiSSCo, for reasons of high availability, extended data and community curation, we opted to serve these through a centralised data infrastructure. Small organisations may use the DiSSCo infrastructure itself as their CMS, but would then need separate solutions for their asset and loan management. Larger organisations are probably better served by a dedicated CMS solution that is digital-specimen compliant.

Kind regards,
Wouter Addink


@waddink, many thanks for this overview of the current planning at DiSSCo.

That’s a great idea, but until it’s implemented successfully for all participating collections, DiSSCo will be using copies of CMS data. A great deal of the DQ improvement could and should be done during this “intermediate” phase, and most of that work would be with legacy data. Are DiSSCo and the participating institutions adequately resourced for this?


A very interesting idea to use GitHub and pull requests as part of annotation and data quality checks.

There is an emerging discussion about ‘data contracts’, where a pull request can be seen as an agreement that states: ‘Here are the changes or corrections I am suggesting, here is the justification, and I have tested/verified it. Please review and then merge it into the dataset.’ Just thinking aloud here, these pull requests could be stored as annotations.

Within the DiSSCo sandbox, we are storing these ‘corrections’ as annotation objects with their own persistent identifiers. Thus, GitHub pull requests can be incorporated, something like this:

Some of these terms are derived from the W3C annotation data model.

  {
    "id": "Annotation PID",
    "type": "Annotation-Type-PID",
    "attributes": {
      // other attributes...
      "motivation": "correction", // or "suggestion" or "pull_request"
      "target": {
        "id": "target-PID",
        // other target details...
        "selector": {
          "type": "FieldValueSelector",
          "field": "field_selected"
        }
      },
      "body": {
        "type": "TextualBody/Other",
        "value": "Suggested correction for location: [link to pull request]",
        "reference": "Your Reference URL",
        "justification": "Your Justification for the correction"
      }
      // other attributes...
    }
    // other sections...
  }
This way we have a link to both the annotation value and its target, and a structured way to describe the annotation.


@sharif.islam, recording individual annotations (and corrections) in a structured way is of course a good idea, but while you are thinking about it, please consider

  • suggested corrections that apply to multiple records (> 1000s)
  • suggested corrections that involve multiple fields at the same time
  • suggested corrections that involve multiple records at the same time (“These records contradict each other”)
  • suggested corrections that identify strict duplicate records
  • suggestions without explicit corrections (“This can’t be right, please check”; “There is a missing-but-expected value here”; “One or more of these values is incorrect, please check”; see here)

Atomic annotations for individual values in individual fields are suited to TDWG-type “assertions” but not for the real-world checking that I and other data auditors do, and presumably will also do when DiSSCo and participants begin their own checking.

An alternative type of record-keeping is to have an original dataset version, a corrected or queried dataset version, and a diff that applies to the whole dataset. Each diff could be given a PID, and the new versions would be generated both by the data publisher (DiSSCo participant or DiSSCo staff) and by data checkers.

Yes, this is information-dense and requires substantial storage, but it is less complicated than tracking individual annotations/corrections.
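A whole-dataset diff of the kind described could be computed along these lines; the key field and the output shape are illustrative assumptions, not a DiSSCo specification:

```python
def dataset_diff(original, corrected, key="occurrenceID"):
    """Field-level diff between two dataset versions, keyed on a
    record identifier. The resulting list is the kind of object
    that could be stored whole and given its own PID."""
    orig = {r[key]: r for r in original}
    changes = []
    for rec in corrected:
        before = orig.get(rec[key], {})
        for field, value in rec.items():
            if before.get(field) != value:
                changes.append({key: rec[key], "field": field,
                                "from": before.get(field), "to": value})
    return changes

v1 = [{"occurrenceID": "1", "eventDate": "", "year": "1987"}]
v2 = [{"occurrenceID": "1", "eventDate": "1987-06-03", "year": "1987"}]
print(dataset_diff(v1, v2))
# → [{'occurrenceID': '1', 'field': 'eventDate', 'from': '', 'to': '1987-06-03'}]
```

Note that one diff naturally captures corrections spanning many records and many fields at once, which atomic per-field annotations handle awkwardly.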


@datafixer thanks for this list. I think we have foreseen all these cases except an annotation that targets multiple records (e.g. “these records contradict each other”). We may not be able to support that with our current model; however, that may not be an issue if you can annotate a digital specimen object with something like “this digital specimen contradicts those three digital specimens”. You would then not find that annotation when looking at one of the other three digital specimens, unless we build in a function that also shows annotations that mention the specimen PID but are not directly linked to it. But let’s start simple.

Kind regards,


@waddink, please note that was a “thinking out loud” list and incomplete. Recently I have been uncovering more of these anomalies in biodiversity datasets, and they are hard to both test for and describe.

So your current plan is not to build corrected DESes (with documentation of the corrections), but instead simply to add annotations with questions and corrections? How is that simpler than fixing, or round-tripping records with a request to fix, so that the DES is fixed before it needs annotating?

Please also note that I am not suggesting fix/versioning/diffs as the “late-stage” DES model for DiSSCo, which is what both you and @sharif.islam seem to be thinking about. I am suggesting that DiSSCo can save everyone a lot of work and greatly improve DES quality if you do as many fixes as possible before you build the DES, which was the subject of this comment.


@datafixer you should do data fixing as early as possible in the data lifecycle, and our aim is to create the digital specimen as early as possible in the data lifecycle to support that process.

You seem to think only of a legacy situation where the data is already in the CMS. In that case the data may not have been fixed yet because there is no capacity or expertise within the institution to do so. In that case the DES may still help, because more experts, citizen scientists and data-fixing services have access to it.

Once it is validated/fixed, it can then go to an aggregator like GBIF. If a GBIF user (or a GBIF validation process) discovers that there are still errors or other issues, that user can annotate them in the DES; the community or specimen provider can approve the annotation, after which it can be merged into the data and republished in GBIF.

In the case where a DES is created in a digitisation process, the DES will exist before the data goes into a CMS. In that case you usually only do a quality check like “is the image of sufficient quality?” before you create the DES, which at that point contains not much more than a PID and an image; services or people can then annotate it in the DES, for example with OCR output or georeferencing. There will be limited time for that, though, as you probably want the data in the CMS as soon as possible for management purposes, so further fixing may occur after the data is already in the CMS.

We aim to have a process like: (machine-)annotate, then validate the annotation, then change the data by merging the annotation, and store the diff. Some annotations are just remarks, though, and can stay as annotations.

Kind regards,


A multiple-field, multiple-record check would be an interesting case for a machine annotation service in DiSSCo. In that case, instead of doing the check once at the command line, you would write a script that does it and wrap that in a little code to make it available as an annotation service for any user in DiSSCo. The user could then select a number of specimens that should be compared and run your script on them with the click of a button, after which your script adds an annotation if anomalies were found.
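The core of such a service might boil down to something like this. The function, the chosen fields and the annotation shape are hypothetical sketches, not DiSSCo's actual API; the "motivation" value borrows the W3C annotation vocabulary mentioned earlier in the thread.

```python
def cross_record_check(specimens, fields=("recordedBy", "eventDate")):
    """Compare a user-selected set of specimen records that should
    agree (e.g. one collecting event) and return an annotation
    object if any of the given fields disagree, else None."""
    conflicts = {}
    for field in fields:
        values = {s.get(field) for s in specimens}
        if len(values) > 1:
            conflicts[field] = sorted(v or "" for v in values)
    if not conflicts:
        return None
    return {
        "motivation": "assessing",
        "targets": [s["id"] for s in specimens],
        "body": {"value": f"These records contradict each other: {conflicts}"},
    }

# Two records from what should be one collecting event, with
# contradictory dates:
batch = [
    {"id": "DS-1", "recordedBy": "R. Mesibov", "eventDate": "1987-06-03"},
    {"id": "DS-2", "recordedBy": "R. Mesibov", "eventDate": "1987-06-04"},
]
ann = cross_record_check(batch)
print(ann["targets"])  # → ['DS-1', 'DS-2']
```

Because the annotation lists all the conflicting specimens as targets, a display function could surface it from any of them, which addresses the "you would then not find that annotation" problem raised above.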