Filtering isn't cleaning

I’ve been following this conversation with a lot of interest, and I’d like to toss in a related idea, although I’m not sure how viable it is. Training and annotation/flags are two ways to attack this problem, and I wonder if “pull requests” might be another. Because no matter how perfect the training, or how well-staffed the organization, there will always be inconsistencies found by users.

It isn’t unusual for someone to improve the quality of the downloaded data instead of filtering it (although I agree that filtering is the default for most of us). @datafixer showed a good example of where this is possible, provided you have the resources to do it.

In GBIF, there is a mechanism for minting a DOI for derived data, but no mechanism for submitting code or QA/QC work to the data originator for incorporation into the dataset. Likewise, issues can be flagged by users but not repaired by them. Here is an example of an issue that would have been a simple PR, but the person who found it couldn’t do more than flag it as an issue. So we have the equivalent of GitHub issues for DQ, but no equivalent of GitHub pull requests. It seems a shame to let this work go to waste.

This suggestion won’t fix all DQ problems, and there are some “improvements” that would be more appropriate as annotations. But I believe it’s worthwhile to take advantage of work that is already being done but has no mechanism for integration. In my vision, these improvements would only require data stewards to click buttons and review records, as on GitHub, which would reduce the necessary person-hours and technical training. I’m guessing the infrastructure behind what I’m suggesting is more difficult than I can imagine, but I thought I would throw this out there for folks to poke holes in.