A NBSP mystery

Look at the two scientific names below. Do you see a difference between them?

nbsp1

We can’t see a difference, but a computer can. The spaces in the first name string are plain spaces. In the second string, some of the words are separated by an invisible formatting character called a “no-break space” or “non-breaking space” (abbreviated NBSP):

nbsp2
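
The difference can also be confirmed programmatically. Here is a minimal Python sketch; the name string is made up for illustration and is not taken from any of the datasets discussed, and `describe_spaces` is a hypothetical helper name.

```python
import unicodedata

# Hypothetical name strings, invented for this example.
plain = "Aus bus Smith, 1900"          # plain spaces only
nbsp = "Aus\u00a0bus Smith, 1900"      # NBSP between genus and species


def describe_spaces(text):
    """Return (index, Unicode name) for each whitespace character in text."""
    return [(i, unicodedata.name(ch)) for i, ch in enumerate(text) if ch.isspace()]


# The strings look identical on screen but do not compare equal.
same = plain == nbsp

print(same)
print(describe_spaces(plain))
print(describe_spaces(nbsp))
```

The Unicode names make the invisible difference explicit: a plain space is reported as SPACE and an NBSP as NO-BREAK SPACE.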

In the plain-text biodiversity datasets that GBIF shares, NBSPs are unnecessary because plain text isn’t formatted. Worse, NBSPs can cause data processing errors.

GBIF replaces NBSPs with plain spaces in processing, but they remain in the original Darwin Core archive files that GBIF harvests, and in verbatim.txt files. In the last 89 dataset checks I’ve done for Pensoft Publishers, 25 datasets (usually occurrence.txt files from an IPT) had NBSPs.

Strangely, in 20 of those 25 NBSP-infected datasets the NBSPs were exclusively or almost exclusively found in scientific name strings. For example, they appeared in the scientificName, acceptedNameUsage, originalNameUsage, verbatimIdentification and previousIdentifications fields.
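
A per-field tally of this kind can be sketched in Python. This is only an illustration, not the method used for the audits above. It assumes a tab-separated file with a header row, and the sample rows and the function name `nbsp_counts_by_field` are invented.

```python
import csv
import io
from collections import Counter

NBSP = "\u00a0"

# Invented sample data: tab-separated, header row first.
SAMPLE = (
    "occurrenceID\tscientificName\tlocality\n"
    "1\tAus\u00a0bus Smith\tNorth ridge\n"
    "2\tAus cus Jones\tSouth\u00a0ridge\n"
    "3\tAus\u00a0dus Lee\tEast ridge\n"
)


def nbsp_counts_by_field(tsv_text):
    """Count, for each field name, how many data items contain an NBSP."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    counts = Counter()
    for row in reader:
        for field, value in row.items():
            # Skip missing values and any overflow columns (stored as lists).
            if isinstance(value, str) and NBSP in value:
                counts[field] += 1
    return counts


print(nbsp_counts_by_field(SAMPLE))
```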

I formerly suspected that data compilers were copy/pasting taxon names into their datasets from sources where the names had been NBSP-formatted. However, when I checked the name lists cited by the compilers, none had NBSPs. There are also no NBSPs in widely used online name sources, such as Catalogue of Life and Wikipedia.

Another difficulty with my copy/pasting idea is that copy/pasting doesn’t always work. NBSPs are retained when pasting into word processor documents and spreadsheets, but are replaced by plain spaces when pasting into text editors.

So:

(1) In the datasets shared with GBIF, why are NBSPs mainly found in scientific names?

(2) Where do the NBSPs in those names come from?


Robert Mesibov (“datafixer”); robert.mesibov@gmail.com

@datafixer I know we have a lot of people in the US community who are more comfortable working in Excel than in text editors or IDEs. I suspect that pasting into Excel (especially maybe on Windows machines?) is probably at least a partial culprit. Have you had a chance to test/think about Excel specifically?

@sformel No, but I’ve tried copy/pasting text with NBSPs into LibreOffice Calc and Gnumeric. In both cases the NBSPs are replaced by plain spaces with “save as”, “export as” and copy/pasting out of the spreadsheet. I don’t know if that would also happen with Microsoft Excel. (I’m not a Windows user.)

NBSPs also cause problems in Excel. See this article for a good discussion on removing them.

Where do you think the NBSPs might be coming from?

One way non-breaking spaces can originate is when multiple columns in a spreadsheet such as Excel are combined to construct a single Darwin Core field. For me that applies to descriptions and habitat. Many of us who encourage students and faculty to share collection data have little background with the different programs, and find that explaining what the fields mean, in terms familiar to the people we work with, takes most of our time.

I tried to create a test file using CONCAT and TEXTJOIN, but the spaces came out as plain spaces. I got a chuckle out of it though, because usually the trouble is getting rid of them, and now I’m struggling to create them…

Anyway, I ended up coding it explicitly with char(160) for the delimiter. I don’t have an answer why these show up in scientific names, but I did want to share a simple find and replace that people can do in Notepad++ as part of their QAQC process prior to publishing.

image

Recommending this as part of the QAQC is a no-brainer. If the data are coming direct from a CMS with the NBSPs, that’s a different story. Any commonalities you see in the datasets that do share this problem (e.g. common publisher/infrastructure)?
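
For files that are awkward to open in an editor, the same replacement can be scripted. Here is a minimal Python sketch, assuming the file is UTF-8 encoded; the function name `replace_nbsp_in_file` and the file paths are hypothetical.

```python
NBSP = "\u00a0"


def replace_nbsp_in_file(src, dst, encoding="utf-8"):
    """Copy src to dst with each NBSP replaced by a plain space.

    Returns the number of NBSPs replaced.
    """
    # newline="" leaves the existing line endings (LF or CRLF) unchanged.
    with open(src, encoding=encoding, newline="") as fh:
        text = fh.read()
    count = text.count(NBSP)
    with open(dst, "w", encoding=encoding, newline="") as fh:
        fh.write(text.replace(NBSP, " "))
    return count


# Example call with hypothetical paths:
# replace_nbsp_in_file("occurrence.txt", "occurrence_clean.txt")
```

Writing to a separate output file keeps the original available for comparison, and the returned count shows how many NBSPs were replaced.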

@Barkworth Many thanks for that suggestion, but as @sformel points out (next comment), just combining columns with a formula (CONCAT, TEXTJOIN, &) will not automatically put a NBSP between the joined items instead of a plain space. In Excel you would need to deliberately specify the NBSP, for example with Alt+0160. Is that what you do when combining columns?

I was talking about things I had found in the past. It came up when I tried to upload data from a spreadsheet and some names that I thought were correctly written were rejected. I did not ask the person writing the checklist how the names had been generated, but I did discover that deleting the supposed space and then adding a space addressed the issue. It was some time ago.

@sformel Many thanks for recommending Notepad++ for editing text files, including those exported from or “saved as” from Excel. Also thanks for showing lines with plain line endings (LF) instead of Windows line endings (CRLF), which also cause data processing errors.

If you are using Geany text editor for Windows, Mac or Linux and your text file is UTF-8 encoded, you can search with the escape sequence “\u00a0”. In the screenshot, the first line of “file” is spaced with a NBSP, the second with a plain space. When it finds a NBSP, Geany shows it as a gray vertical bar (as does LibreOffice Writer).

screenshot

Very interesting, I’ve also come across them, but never found the culprit.

Reminds me of OCR using Cyrillic characters for Latin text. That’s a lot of fun too.

FWIW, I checked to see how RStudio interpreted the NBSP. Both read.csv and readr::read_csv read them in, and RStudio displayed them as spaces. But when you write the data back to csv, the NBSPs are still there. So, here is a line of R code that will find and replace all NBSPs with plain spaces in character columns:

df |> dplyr::mutate_if(is.character, ~ stringr::str_replace_all( ., "\\u00A0", " " ))

Microsoft Word has a tendency to insert NBSPs at the end of sentences, though I’ve never rigorously experimented to figure out when/why. That could explain why they crop up in some-but-not-all names with authorities. If (a) the text was laundered through Word at some stage; and (b) there’s an abbreviation with a period in the name, it wouldn’t surprise me at all to see NBSPs magically appear in there.
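
One rough way to test this idea on a dataset would be to tally the character that comes immediately before each NBSP. If Word were responsible in the way suggested, periods should dominate the tally. Here is a minimal Python sketch; the name strings and the function name `chars_before_nbsp` are invented for illustration.

```python
from collections import Counter

NBSP = "\u00a0"


def chars_before_nbsp(values):
    """Tally the character immediately preceding each NBSP in the given strings."""
    counts = Counter()
    for value in values:
        for i, ch in enumerate(value):
            if ch == NBSP:
                # An NBSP at position 0 has no preceding character.
                counts[value[i - 1] if i > 0 else "<start>"] += 1
    return counts


# Invented examples: one NBSP after an abbreviation, one after a genus name.
names = ["Aus bus L.\u00a0var. cus", "Aus\u00a0dus"]
print(chars_before_nbsp(names))
```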

Ah, another hint: many text editors with GUIs (even word-processy-oriented editors) can be induced to show whitespace, and will display NBSPs as a different-than-regular-space character. Further, if the editor doesn’t support useful escapes for searching, it’s often possible to deliberately find or create an NBSP in the text, then copy/paste that into the “Find” field. The NBSP you just created in the document then also serves as a positive control for your find/replace.

@pentcheff Thanks for contributing, and for the Word idea. I’ll ask the next Pensoft author with a NBSP problem if Word was involved.

@pieter See this section of the Cookbook, which shows Cyrillics buried in both a genus name and an author name from a real-world dataset. I regularly check for Cyrillic/Latin mixes in my auditing.
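
A check for such mixes can be sketched in Python by flagging any word that contains letters from both scripts. This is not the Cookbook method. It decides the script from the Unicode character name, which is an approximation, and the function names are hypothetical.

```python
import unicodedata


def scripts_in_word(word):
    """Return the set of scripts (Latin, Cyrillic) used by the letters in word."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("LATIN"):
                scripts.add("Latin")
            elif name.startswith("CYRILLIC"):
                scripts.add("Cyrillic")
    return scripts


def mixed_script_words(text):
    """Return the whitespace-separated words that mix Latin and Cyrillic letters."""
    return [w for w in text.split() if scripts_in_word(w) >= {"Latin", "Cyrillic"}]


# Invented example: the first letter is CYRILLIC CAPITAL LETTER A (U+0410).
print(mixed_script_words("\u0410us bus"))
```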

It’s pretty common. Another trick is to create a frequency table of all characters used in a body of text, and consider the long tail as errors. That way you’ll pick up on non-breaking spaces as well, even if you were unaware of them.
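
A frequency table of this kind takes only a few lines of Python. Here is a minimal sketch, with an invented sample string and a hypothetical function name `char_frequencies`; it lists the rarest characters first, with their code points and Unicode names.

```python
import unicodedata
from collections import Counter


def char_frequencies(text):
    """Return (count, code point, Unicode name) tuples, rarest characters first."""
    table = []
    for ch, n in Counter(text).items():
        # Control characters such as tab have no Unicode name.
        name = unicodedata.name(ch, "<unnamed>")
        table.append((n, f"U+{ord(ch):04X}", name))
    return sorted(table)


# Invented sample with one NBSP before the author name.
sample = "Aus bus\u00a0Smith"
for row in char_frequencies(sample):
    print(row)
```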

@pieter Create a frequency table of characters with their hex values, or with their Unicode representations:
https://www.datafix.com.au/cookbook/characters1.html#3
Tally specifically the invisible characters:
https://www.datafix.com.au/cookbook/characters3.html#1

The long tail of low-frequency characters may or may not include NBSPs, and in the hundreds of biodiversity datasets I’ve audited only a few of the long-tail visible characters were errors, as opposed to valid characters appearing rarely (think of non-ASCII characters in scientificNameAuthorship). These errors were typically elements of UTF-8/Windows-1252 or UTF-8/Mac OS Roman mojibake.
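
To see what such mojibake looks like, a UTF-8 string can be deliberately decoded with the wrong encoding. Here is a minimal Python sketch; the author name is invented and `mojibake` is a hypothetical helper. Note that the NBSP itself (bytes C2 A0 in UTF-8) turns into two characters when misread.

```python
def mojibake(text, wrong_encoding):
    """Encode text as UTF-8, then decode those bytes with the wrong encoding."""
    # Python's cp1252 codec raises an error on the few bytes it leaves
    # undefined; none of them occur in these examples.
    return text.encode("utf-8").decode(wrong_encoding)


# UTF-8 read as Windows-1252
print(mojibake("M\u00fcller", "cp1252"))
print(repr(mojibake("\u00a0", "cp1252")))

# UTF-8 read as Mac OS Roman
print(mojibake("M\u00fcller", "mac_roman"))
print(repr(mojibake("\u00a0", "mac_roman")))
```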

For CLI users interested in visualising NBSPs, I suggest either “charfindID”, which is described here and which shows the NBSP as a yellow-highlighted space, or the simple function “nbspvis”:

nbspvis() { sed 's|\xc2\xa0|\x1b[102m\xc2\xb7\x1b[0m|g'; }

which replaces each NBSP with a mid-height black dot on a green background. [“nbspvis” is demonstrated in tomorrow’s BASHing data 2 post (2024-03-01), along with other notes on NBSPs.]

Whoops. I just noticed that the Cookbook version of “charfindID” is out-of-date. I’ll fix that, and here’s the newest version:

charfindID() {
echo "ID | Field name | Data item"; awk -F"\t" -v char="$(printf "\\$2")" -v idfld="$3" 'NR==1 {for (i=1;i<=NF;i++) a[i]=$i} $0 ~ char {gsub(char,"\33[103m"char"\33[0m",$0); for (j=1;j<=NF;j++) if ($j ~ char) print $idfld FS a[j] FS $j}' "$1" | sort -t $'\t' -k2,2 -Vk1 | sed 's/\t/ | /g'
}

Bob, could you please share a sample file? Some of us can test in Excel for you (on a PC).

(Possibly a post of interest here? UTF Import Issues · Issue #4990 · OpenRefine/OpenRefine · GitHub)

@datafixer Perhaps ask these folks where they get their name lists from? (out of their database? if so, which database?) and something about their process? (Does it involve copy-n-paste?)