Look at the two scientific names below. Do you see a difference between them?
We can’t see a difference, but a computer can. The spaces in the first name string are plain spaces. In the second string, some of the words are separated by an invisible formatting character called a “no-break space” or “non-breaking space” (abbreviated NBSP):
In the plain-text biodiversity datasets that GBIF shares, NBSPs are unnecessary because plain text isn’t formatted. Worse, NBSPs can cause data processing errors.
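One way to see the kind of error involved is with a minimal Python sketch. The name below is a hypothetical placeholder, not taken from any dataset, and the variable names are only illustrative. The sketch assumes the NBSP is the Unicode character U+00A0.

```python
import unicodedata

NBSP = "\u00a0"  # no-break space, U+00A0

# Hypothetical placeholder name, not from any real dataset
plain = "Aus bus Smith, 1900"
with_nbsp = "Aus" + NBSP + "bus Smith, 1900"

# The two separators look alike but are different characters
print(unicodedata.name(" "))   # SPACE
print(unicodedata.name(NBSP))  # NO-BREAK SPACE

# Exact matching of the two name strings fails
print(plain == with_nbsp)  # False

# Splitting on a plain space gives a different first token
genus_plain = plain.split(" ")[0]
genus_nbsp = with_nbsp.split(" ")[0]
print(repr(genus_plain))  # 'Aus'
print(repr(genus_nbsp))   # 'Aus\xa0bus'

# Replacing NBSPs with plain spaces makes the strings match again
cleaned = with_nbsp.replace(NBSP, " ")
print(cleaned == plain)  # True
```

Note that Python's `split()` with no argument treats the NBSP as whitespace, so whether an NBSP causes trouble depends on how a given tool splits and compares strings.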
GBIF replaces NBSPs with plain spaces in processing, but they remain in the original Darwin Core archive files that GBIF harvests, and in verbatim.txt files. In the last 89 dataset checks I’ve done for Pensoft Publishers, 25 datasets (usually occurrence.txt files from an IPT) had NBSPs.
Strangely, in 20 of those 25 NBSP-infected datasets the NBSPs were exclusively or almost exclusively found in scientific name strings. For example, they appeared in the scientificName, acceptedNameUsage, originalNameUsage, verbatimIdentification and previousIdentifications fields.
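A per-field tally like this can be sketched in Python. This is only an illustration, not the method used for the checks described here. It assumes a tab-separated file with a header line, and the function name `count_nbsp_by_field` and the sample records are hypothetical.

```python
import csv
import io
from collections import Counter

NBSP = "\u00a0"  # no-break space, U+00A0


def count_nbsp_by_field(lines):
    """Count, for each field, the records whose value contains an NBSP.

    `lines` is any iterable of text lines from a tab-separated file
    whose first line holds the field names.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for record in reader:
        for field, value in record.items():
            # Skip missing values and any overflow columns (stored as lists)
            if isinstance(value, str) and NBSP in value:
                counts[field] += 1
    return counts


# Hypothetical sample data standing in for an occurrence.txt file
sample = (
    "occurrenceID\tscientificName\tlocality\n"
    "1\tAus\u00a0bus Smith, 1900\tSomewhere\n"
    "2\tAus cus Jones, 1950\tElsewhere\n"
    "3\tAus\u00a0dus\u00a0Brown, 2000\tNo\u00a0where\n"
)

counts = count_nbsp_by_field(io.StringIO(sample))
for field, n in counts.most_common():
    print(field, n)
```

For a real file, the same function could be given a file object opened with `open("occurrence.txt", encoding="utf-8", newline="")`, assuming the file is UTF-8 encoded.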
I formerly suspected that data compilers were copy/pasting taxon names into their datasets from sources where the names had been NBSP-formatted. However, when I checked the name lists cited by the compilers, none had NBSPs. There are also no NBSPs in widely used online name sources, such as Catalogue of Life and Wikipedia.
Another difficulty with my copy/pasting idea is that copy/pasting doesn’t always carry NBSPs across. NBSPs are retained when pasting into word processor documents and spreadsheets, but are replaced by plain spaces when pasting into text editors.
So:
(1) In the datasets shared with GBIF, why are NBSPs mainly found in scientific names?
(2) Where do the NBSPs in those names come from?
Robert Mesibov (“datafixer”); robert.mesibov@gmail.com