Excel? No, thanks. CLI? Yes, please

datafixer · February 19, 2025, 6:03am

It is a truth universally acknowledged that a biodiversity data worker will be tempted, at some time in their career, to compile or manage their dataset in Microsoft Excel.

This is an easily avoidable mistake. Sooner or later, even careful data workers will suspect that a spreadsheet is simply the wrong tool for these jobs, and if they don’t carefully check what they’ve done in Excel, well…

Let’s get back to basics: biodiversity data tables are plain text. You compile them with letters, numbers, punctuation and spaces, neatly arranged in rows and columns. A computer knows how to separate the rows because there’s an invisible “start a new line” character between the rows. A computer also knows how the columns are separated. There’s a visible character like a semicolon in a CSV, or an invisible tab in a TSV, that tells the computer “old column ends here, new column begins”.

This mass of structured text is complete ^[1]. It contains all the information you put in, plus all the necessary structure. It doesn’t need anything else to be useful or share-able. You don’t need the huge library of functions and the annoying auto-formatting of Excel to manage a plain text table, nor do you need to pay a rich American corporation for the privilege of laying out, examining and editing your own data.

There are fast, powerful and completely free table editors for working with biodiversity data, with each data item in its own little cell, just like in a spreadsheet. Search for “CSV editor” to find them, or accept these recommendations for “best in class”:

Modern CSV (Windows, Mac and Linux) - https://www.moderncsv.com/
EmEditor (Windows only) - https://www.emeditor.com/

(Not a spreadsheet. This is a screenshot from Modern CSV.)

Using a table editor means that your plain text data always remains plain text. It isn’t gobbled up and obfuscated in a proprietary format, so that you have to open a spreadsheet program or export or “Save as…” to view the data. You don’t even need a table editor to do data-tweaking. You can open a plain-text dataset in any text editor to search for and edit data items, like expanding “DOI:10.3897/biss.9.147432” to “https://doi.org/10.3897/biss.9.147432”.

Table editors also allow you to easily choose the encoding and line ending style of your data table, and please note that for sharing, biodiversity data should always be UTF-8 encoded with plain (not Windows) line endings ^[2].

But table editors have their limits. They can do only very simple programmatic “cleaning”, and they can’t do the kind of intensive data checking outlined on the Darwin Core table checker website. (And neither can Excel!) For that you need a command line interface (CLI), like a Mac or Linux terminal application.

The good news is that you already have the raw material for CLI work. Text-processing tools for the command line originated in the UNIX era for working with plain text. Some of the tools have had 40 years of development and optimisation. They’re extraordinarily fast, reliable, easy to use and flexible.

The less-good news is that CLI work, unlike spreadsheet use, isn’t generally taught in upper-primary and secondary schools. If you haven’t worked on the command line before but would like to take advantage of its power and efficiency in managing biodiversity data, please email me for advice or training.

I’ll close with a favourite quote:

Have you ever noticed in the movies when the “super hacker,” — you know, the guy who can break into the ultra-secure military computer in less than 30 seconds — sits down at the computer, he never touches a mouse? It’s because filmmakers realize that we, as human beings, instinctively know the only way to really get anything done on a computer is by typing on a keyboard!

Most computer users today are familiar only with the graphical user interface (GUI) and have been taught by vendors and pundits that the command line interface (CLI) is a terrifying thing of the past. This is unfortunate, because a good command line interface is a marvelously expressive way of communicating with a computer in much the same way the written word is for human beings. It’s been said that “graphical user interfaces make easy tasks easy, while command line interfaces make difficult tasks possible” and this is still very true today.

(from William Shotts, The Linux command line)

Robert Mesibov (“datafixer”); robert.mesibov@gmail.com

Plain text files don’t have a file header or signature. ↩︎
If I had an Australian dollar for every dataset I’ve audited that arrived as an Excel file from someone who thought that .xlsx is a data exchange format, I’d have enough to fly to a tropical Queensland resort for a holiday. Add another dollar for every dataset that was exported as text from Excel and arrived in Windows-1252 encoding (or UTF-8 plus horrible mojibake) with Windows line endings, and I’d be staying for a week in the resort’s swankiest villa. ↩︎

andre · February 19, 2025, 4:00pm

Excellent, I use ModernCSV on Mac for months and strongly recommend it.
Thanks

mark-pitblado · February 20, 2025, 12:43am

A cool tool that I find helpful for browsing data files in the terminal is csvlens. Simple, no editing capabilities, but good to get a glimpse of things. hjkl and / are nice for vim users.

Also, just took a look through the darwin core table checker, wow do you have some nice helpful stuff on there!

datafixer · February 20, 2025, 1:12am

@mark-pitblado, many thanks for the tip about csvlens, I’ll review it in my data blog. For viewing-only jobs I’ve been using csvlook from the csvkit package of utilities. Like csvlens, csvlook adds line numbers, which can be very helpful.

system · March 22, 2025, 11:12am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
How to avoid using a spreadsheet when preparing GBIF data Data Publishing	1	890	October 2, 2022
What I've learned from 500+ biodiversity data audits Data Publishing	0	229	April 25, 2024
How I check Darwin Core datasets Data Publishing	1	594	March 10, 2023
New website: Darwin Core table checker Data Publishing	0	359	December 21, 2023
A NBSP mystery Data Use	31	1038	March 27, 2024

Excel? No, thanks. CLI? Yes, please

Related topics