Cleaning GBIF-mediated data: "manual" vs "automated" is a false dichotomy

There is a growing literature on methods for cleaning GBIF occurrence records. A recurring proposition in these studies is that “manual” cleaning might be suitable for small datasets, but that “automated” cleaning is necessary for larger ones:

Manual data cleaning based on expert knowledge can detect these issues, but it is only feasible on small taxonomic or geographic scales, and it is time-consuming and difficult to reproduce. As an alternative, automated filtering methods to identify potentially problematic records have been proposed as a scalable option, as they are able to deal with datasets containing up to millions of records and many different taxa.

Zizka et al. (2020) No one-size-fits-all solution to clean GBIF. PeerJ 8:e9916 (https://doi.org/10.7717/peerj.9916)

Even for experts, identifying and cleaning such errors and data quality issues manually is in many cases impracticable, given that data sets may contain thousands to millions of records. Therefore, targeted DC [data cleaning] strategies based on well-explained instructions and automated DC tools that produce standardized and reproducible high-quality data, especially for inexperienced users, are in high demand.

Fuehrding-Potschkat et al. (in review) Influence of different data cleaning solutions of point-occurrence records on downstream macroecological diversity models. Ecology and Evolution (https://www.authorea.com/doi/full/10.22541/au.163420130.04455423)

“Manual” vs “automated” is a false dichotomy, because it suggests that there are only two possibilities:

  1. Detect data problems by eye, perhaps in a spreadsheet
  2. Pass occurrence records to a program, such as OpenRefine or one of the R tool suites, that’s designed to detect particular, well-defined data problems.

In fact, both data cleaning methods involve human (“manual”) and code (“automated”) processing. The first method, done well, will make use of the search, sorting and filtering routines that are coded into a spreadsheet program. The second method involves thinking about the errors discovered and fixing them appropriately (unless the tools are used only to discard problematic records, which is data filtering, not data cleaning!).

There’s a middle path between eyeballing data tables and feeding them to specialised computer programs, namely checking a data table on the command line with fast, simple and reliable utilities like shell tools and AWK. These utilities constitute a toolkit that can be used adaptively to check for any data problems, not just selected ones like (for example) invalid coordinates. The utilities work quickly and efficiently with any number of records broken into any number of data fields, and they have simpler syntax than R or Python.

To demonstrate, I’ll use several of the recipes from A Data Cleaner’s Cookbook that I apply regularly in checking GBIF datasets. The dataset to be checked is MEMO-Monitoring exotic mosquitoes in Belgium. This is an exceptionally tidy Darwin Core archive with an event.txt and an occurrence.txt. The tables are Darwin Core-compliant and at first glance the field entries are properly formatted and normalised. I chose this dataset because its quality is so high — maybe it doesn’t need cleaning?

The events file has a header line, 7350 records and 21 fields. With the fieldlist command I can generate a numbered list of the event.txt fields, and I’ll use those numbers in the field checks below. (See the screenshot below, the result of running the command fieldlist ev. I renamed event.txt to “ev” for checking, because “ev” is easier to type and my Linux computer doesn’t need filename suffixes.)

[Screenshot 1: fieldlist output for ev, listing the 21 numbered fields of event.txt]
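For readers without the Cookbook to hand, fieldlist can be sketched with stock shell tools. The function below is my paraphrase for a tab-separated table, not necessarily the Cookbook's exact code:

```shell
# Minimal "fieldlist": print a numbered list of the field names
# in the header line of a tab-separated file
fieldlist() { head -n 1 "$1" | tr '\t' '\n' | nl; }
```

Running fieldlist ev then prints one numbered field name per line, and those numbers can be fed to cut and AWK in later checks.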

Checking the contents of each field progressively with the tally command, I find everything looks OK except locality, where several entries have character encoding problems. In the tally below, each unique entry in locality is preceded by the number of records with that entry. (See the footnote for more on the mojibake.)

[Screenshot 2: tally of the locality field, with mojibake visible in several entries]
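tally can likewise be improvised from stock shell tools. This two-argument sketch (file, field number) is my paraphrase of the recipe, not the Cookbook's exact code:

```shell
# Minimal "tally": count the unique entries in field $2 of
# tab-separated file $1, skipping the header line; each entry
# is preceded by its record count
tally() { cut -f "$2" "$1" | tail -n +2 | sort | uniq -c; }
```

With the field numbers from fieldlist, checking a whole table is just a matter of stepping through tally ev 1, tally ev 2 and so on.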

Particularly useful are what I call one2many commands. When a data item in one field (or pair of fields) should have one and only one corresponding data item in a second field, one2many returns the exception, the rule and the number of records with each. Here I’ve found a problem: the same latitude/longitude pair is assigned to two different locality entries:

[Screenshot 3: one2many output showing a single latitude/longitude pair assigned to two different locality entries]
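A one2many check can be improvised in AWK. The sketch below assumes the coordinates are in fields 10 and 11 and locality in field 12 (illustrative numbers, not the actual MEMO field positions); it prints the record count, coordinates and locality for every coordinate pair mapped to more than one locality:

```shell
# For each latitude/longitude pair mapped to more than one locality,
# print record count, latitude, longitude and locality.
# Field numbers ($10, $11, $12) are illustrative.
awk -F '\t' 'NR > 1 {count[$10 FS $11 FS $12]++}
    END {for (k in count) {split(k, a, FS); pairs[a[1] FS a[2]]++}
         for (k in count) {split(k, a, FS);
             if (pairs[a[1] FS a[2]] > 1)
                 print count[k], a[1], a[2], a[3]}}' ev
```

The same pattern works in the other direction (one locality, several coordinate pairs) by swapping which fields build the composite key.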

Are the “Raeren cemetry” coordinates a mistake here? Apparently so: 66 other entries for that place name have different coordinates:

[Screenshot 4: the 66 other “Raeren cemetry” records, with their differing coordinates]

occurrence.txt has 20 fields and 10762 records.

[Screenshot 5: fieldlist output for occurrence.txt, listing its 20 numbered fields]

While eventDate is uniformly formatted as an interval date, YYYY-MM-DD/YYYY-MM-DD, dateIdentified is unfortunately a mixture of D/MM/YYYY and YYYY-MM-DDTHH:MM:SS.SSSZ formats. Checking needs to be flexible here, and AWK does the job nicely. It quickly finds the 22 of the 10762 records in which dateIdentified is earlier than the end date of the eventDate sampling interval:

[Screenshot 6: AWK command comparing dateIdentified with the end of the eventDate interval]

[Screenshot 7: the 22 records returned by the AWK check]
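A check along these lines might look like the following sketch. The field numbers ($5 for eventDate, $9 for dateIdentified) are illustrative, not the actual MEMO positions; both dates are normalised to YYYYMMDD strings for a numeric comparison:

```shell
# Flag records where dateIdentified is earlier than the end of the
# eventDate interval. Field numbers ($5, $9) are illustrative.
awk -F '\t' 'NR > 1 {
    split($5, ev, "/"); end = ev[2]; gsub(/-/, "", end)   # -> YYYYMMDD
    di = $9
    if (di ~ /^[0-9]+\/[0-9]+\/[0-9]+$/) {                # D/MM/YYYY
        split(di, d, "/"); di = sprintf("%04d%02d%02d", d[3], d[2], d[1])
    } else {                                              # ISO 8601 timestamp
        di = substr(di, 1, 10); gsub(/-/, "", di)
    }
    if (di + 0 < end + 0) print NR": "$5" vs "$9
}' occurrence.txt
```

Because AWK doesn't care which of the two date formats it meets, one pass over the table handles both.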

Note the anomalous “2010” in dateIdentified in the last screenshot. Is that a one-off error? Yes…

[Screenshot 8: follow-up check showing that the “2010” entry occurs in only one record]

…but it suggests that “sanity checks” might be worth doing for year, month and day values in the date fields.
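A rough sanity check on the D/MM/YYYY entries might look like the sketch below. The field number and the plausible year window are illustrative; adjust the window to the survey period:

```shell
# Flag D/MM/YYYY entries in dateIdentified (illustratively field 9)
# with impossible days or months, or years outside a plausible window
awk -F '\t' 'NR > 1 && $9 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ {
    split($9, d, "/")
    if (d[1] < 1 || d[1] > 31 || d[2] < 1 || d[2] > 12 ||
        d[3] < 2017 || d[3] > 2021) print NR": "$9
}' occurrence.txt
```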

Two kinds of work were demonstrated here. The “automated” (code) kind used shell tools and recipes from A Data Cleaner’s Cookbook. The “manual” (human) kind involved deciding what to check, checking the results, and thinking of further checks to pinpoint individual data problems. Both kinds of work are needed in data cleaning, and command-line utilities allow you to mix the two productively.

Robert Mesibov (“datafixer”); robert.mesibov@gmail.com

Further reading: People are the best data cleaners (blog post)


Note on mojibake. How did Liège become LiÃƒÆ’Ã‚Â¨ge, and how did Noidré, Marché and Büllingen acquire similar strings of junk characters? By passing repeatedly in and out of different encoding locales.

I’ll use Liège as an example. Suppose the word was first entered in a Windows program, where the character è is represented by a single byte with the hexadecimal value e8. The data file then went to a computer whose locale used UTF-8 encoding. In UTF-8, è is two bytes: c3 a8.

Next, the text file went back to a Windows program, which interpreted each byte in c3 a8 as a separate character: Ã¨.

Back to UTF-8, where each of the two characters in Ã¨ requires two bytes: so c3 83 c2 a8.

Back to Windows, where each of the four bytes became a single character: ÃƒÂ¨.

Back to UTF-8, where the four characters have the eight bytes c3 83 c6 92 c3 82 c2 a8.

Back to Windows, where the eight single bytes are interpreted as ÃƒÆ’Ã‚Â¨.

Finally, the data file is encoded in UTF-8 in the IPT harvested by GBIF. Liège now looks like LiÃƒÆ’Ã‚Â¨ge. The original è is now eight characters with the byte code c3 83 c6 92 c3 86 e2 80 99 c3 83 e2 80 9a c3 82 c2 a8.
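Assuming the Windows side used code page 1252, the corrupting round trips can be simulated with iconv, each pass mis-reading UTF-8 bytes as cp1252 and re-encoding the result as UTF-8:

```shell
# One corrupting round trip: UTF-8 bytes mis-read as Windows-1252,
# then re-encoded as UTF-8
printf 'Liège' | iconv -f cp1252 -t utf-8   # LiÃ¨ge
# Three round trips produce the eight-character mess described above
printf 'Liège' | iconv -f cp1252 -t utf-8 \
               | iconv -f cp1252 -t utf-8 \
               | iconv -f cp1252 -t utf-8   # LiÃƒÆ’Ã‚Â¨ge
```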

There’s no easy fix for mojibake confusion like this, unless you already know that Liège was the original word. A practical strategy is to check your data after every migration between computers running different operating systems, to make sure that nothing weird has happened.


Cool! I will come back to this post and use it to improve this dataset further! All I know is that I tried to have everything in UTF-8.