Excel? No, thanks. CLI? Yes, please

It is a truth universally acknowledged that a biodiversity data worker will be tempted, at some time in their career, to compile or manage their dataset in Microsoft Excel.

This is an easily avoidable mistake. Sooner or later, even careful data workers will suspect that a spreadsheet is simply the wrong tool for these jobs, and if they don’t carefully check what they’ve done in Excel, well…

Let’s get back to basics: biodiversity data tables are plain text. You compile them with letters, numbers, punctuation and spaces, neatly arranged in rows and columns. A computer knows how to separate the rows because there’s an invisible “start a new line” character between the rows. A computer also knows how the columns are separated. There’s a visible character like a semicolon in a CSV, or an invisible tab in a TSV, that tells the computer “old column ends here, new column begins”.

This mass of structured text is complete [1]. It contains all the information you put in, plus all the necessary structure. It doesn’t need anything else to be useful or share-able. You don’t need the huge library of functions and the annoying auto-formatting of Excel to manage a plain text table, nor do you need to pay a rich American corporation for the privilege of laying out, examining and editing your own data.

There are fast, powerful and completely free table editors for working with biodiversity data, with each data item in its own little cell, just like in a spreadsheet. Search for “CSV editor” to find them, or accept these recommendations for “best in class”:

(Not a spreadsheet. This is a screenshot from Modern CSV.)

Using a table editor means that your plain text data always remains plain text. It isn’t gobbled up and obfuscated in a proprietary format, so that you have to open a spreadsheet program or export or “Save as…” to view the data. You don’t even need a table editor to do data-tweaking. You can open a plain-text dataset in any text editor to search for and edit data items, like expanding “DOI:10.3897/biss.9.147432” to “https://doi.org/10.3897/biss.9.147432”.

Table editors also allow you to easily choose the encoding and line ending style of your data table, and please note that for sharing, biodiversity data should always be UTF-8 encoded with plain (not Windows) line endings [2].

But table editors have their limits. They can do only very simple programmatic “cleaning”, and they can’t do the kind of intensive data checking outlined on the Darwin Core table checker website. (And neither can Excel!) For that you need a command line interface (CLI), like a Mac or Linux terminal application.

The good news is that you already have the raw material for CLI work. Text-processing tools for the command line originated in the UNIX era for working with plain text. Some of the tools have had 40 years of development and optimisation. They’re extraordinarily fast, reliable, easy to use and flexible.

The less-good news is that CLI work, unlike spreadsheet use, isn’t generally taught in upper-primary and secondary schools. If you haven’t worked on the command line before but would like to take advantage of its power and efficiency in managing biodiversity data, please email me for advice or training.

I’ll close with a favourite quote:

Have you ever noticed in the movies when the “super hacker,” — you know, the guy who can break into the ultra-secure military computer in less than 30 seconds — sits down at the computer, he never touches a mouse? It’s because filmmakers realize that we, as human beings, instinctively know the only way to really get anything done on a computer is by typing on a keyboard!

Most computer users today are familiar only with the graphical user interface (GUI) and have been taught by vendors and pundits that the command line interface (CLI) is a terrifying thing of the past. This is unfortunate, because a good command line interface is a marvelously expressive way of communicating with a computer in much the same way the written word is for human beings. It’s been said that “graphical user interfaces make easy tasks easy, while command line interfaces make difficult tasks possible” and this is still very true today.

(from William Shotts, The Linux command line)


Robert Mesibov (“datafixer”); robert.mesibov@gmail.com


  1. Plain text files don’t have a file header or signature. ↩︎

  2. If I had an Australian dollar for every dataset I’ve audited that arrived as an Excel file from someone who thought that .xlsx is a data exchange format, I’d have enough to fly to a tropical Queensland resort for a holiday. Add another dollar for every dataset that was exported as text from Excel and arrived in Windows-1252 encoding (or UTF-8 plus horrible mojibake) with Windows line endings, and I’d be staying for a week in the resort’s swankiest villa. ↩︎

5 Likes

Excellent, I use ModernCSV on Mac for months and strongly recommend it.
Thanks

A cool tool that I find helpful for browsing data files in the terminal is csvlens. Simple, no editing capabilities, but good to get a glimpse of things. hjkl and / are nice for vim users.

Also, just took a look through the darwin core table checker, wow do you have some nice helpful stuff on there!

@mark-pitblado, many thanks for the tip about csvlens, I’ll review it in my data blog. For viewing-only jobs I’ve been using csvlook from the csvkit package of utilities. Like csvlens, csvlook adds line numbers, which can be very helpful.