How to avoid using a spreadsheet when preparing GBIF data

Spreadsheets are commonly used to compile datasets for later harvest by GBIF. I know this as a data auditor, because many of the plain-text datasets I receive are stamped “Mangled by Excel” on the underside. The manglings include:

  • records breaking over several lines, thanks to unnoticed line breaks within data items

  • items that aren’t dates appearing as dates

  • dates appearing with unwanted format

  • leading zeroes disappearing

  • codes containing colons appearing as times

  • fields shifting left or right

  • items copied down or across too far

  • incremental fill-downs (“WGS84, WGS85, WGS86…”)

  • superfluous blank fields at the ends of rows

  • formula artifacts (“G188:O188”) within data items

  • field names confused because column headers were merged

  • multiple quotes appear from repeated CSV-ing and re-importing (“”“Collected at “””“”“The Springs”“”“”“”“”)

  • blank fields or lines from “prettifying” spreadsheet data with blank rows or columns

Data diseases like these can be avoided by careful spreadsheet use, but prevention is better than cure, and there are other ways to compile and check data tables. Three of them are considered here.

Modern CSV

All that most data compilers need is a simple, visual, table editor: a GUI program that displays data items neatly in separate cells. The program should also allow copying down and copying across; searching and replacing; editing, duplicating, moving and deleting data items; moving and sorting of one or more rows or columns; and filtering of data.

Ideally, the program would handle very large datasets, it would import/export data in a variety of text formats and encodings, and it would never, ever mess up dates.

This program exists. It’s called Modern CSV and there are free and “premium” versions for Windows, Mac and Linux. If you have never heard of Modern CSV, I recommend you visit its website:

https://www.moderncsv.com/

https://www.moderncsv.com/documentation/

Text editors

If your computer runs Windows, it can run Notepad++, which is a free (no charge) and sophisticated plain-text editor. In some ways Notepad++ is a hybrid of a word processor and a spreadsheet, and among its features are auto-completion of entries, row and column editing, row and column sorting and custom macros.

https://notepad-plus-plus.org/

https://npp-user-manual.org/

Notepad++ also allows easy conversion of line endings and character encoding, and there is a useful set of Notepad++ plug-ins:

https://github.com/notepad-plus-plus/nppPluginList/blob/master/doc/plugin_list_x64.md

I don’t run Windows, and my preferred text editor is the free (no charge) Geany, which is available for Windows, Mac and Linux. Geany’s most useful feature for me is its ability to open and edit very large data tables (100,000s of records). Column selection and copying down is possible in Geany but not as easy as in Notepad++.

https://geany.org/

There are also numerous plug-ins for Geany:

https://plugins.geany.org/

Command line

All my data auditing is done on the command line with plain-text versions of data tables (see next section). Although there are R and Python approaches to doing this, the speed, simplicity, reliability and flexibility of UNIX-type text-processing utilities and AWK make them the command-line tools of choice for data checking and cleaning.

It’s also possible to compile data tables with command-line help, but this is no faster than compiling in a spreadsheet. My preference has been to build a simple data-entry form as a GUI dialog with a shell script, which then appends each record (after validation) to a growing text file. (For a Linux example, see this blog post from 2014.)

Escaping a spreadsheet

Spreadsheet programs have export and “Save as” options but there is a much simpler way to convert your spreadsheet data to a plain text table. Select all the active cells in the table, copy them to the clipboard and paste them into a new, blank document in your text editor (see above). The result is a tab-separated (TSV), plain text version of your data. No data items will be quoted (apart from quotes within data items) and any leading apostrophes (used to force text formatting in the spreadsheet) will disappear.

Before doing this, you should ensure that there are no tab characters in the spreadsheet, otherwise these will create new, unwanted fields in the plain text TSV.

You can further prepare your data for general use by making sure that the file encoding is UTF-8 and that line endings are plain (linefeed only), not Windows-style (carrage return + linefeed). It is usually easier to do encoding conversions and line-ending changes in a text editor after copy/pasting than in a spreadsheet at the export or “Save as” stage.

Robert Mesibov (“datafixer”); robert.mesibov@gmail.com

[If you have a technical question arising from this post, please email me directly rather than create a thread in the comments.]

8 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.