Example CSV attached.
file.csv (33.5 KB)
In the wild I have met several other space characters besides NBSP. From gnparser:
OtherSpace <- [ \t\r\n\f\v]
Testing. Notepad++ says there are 38 NBSPs in this datafile you shared.
Testing. Excel says there are 34 cells with the NBSP.
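The two counts need not agree: one tool may be counting NBSP characters and the other counting cells that contain at least one. That is only a guess at the reason for the difference here. A minimal Python sketch that reports both numbers, assuming the data can be read as ordinary comma-separated text (the function name `count_nbsp` and the sample rows are made up for illustration):

```python
import csv
import io

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

def count_nbsp(csv_text):
    """Return (total NBSP characters, number of cells containing at least one)."""
    total_chars = 0
    cells_with_nbsp = 0
    for row in csv.reader(io.StringIO(csv_text)):
        for cell in row:
            n = cell.count(NBSP)
            total_chars += n
            if n:
                cells_with_nbsp += 1
    return total_chars, cells_with_nbsp

# Invented sample: one cell holds two NBSPs, another holds one.
sample = (
    "scientificName,locality\n"
    "Aus\u00a0bus\u00a0Smith,Perth\n"
    "Cus dus,Near\u00a0Hobart\n"
)
print(count_nbsp(sample))  # (3, 2)
```

If the character count is higher than the cell count, at least one cell holds more than one NBSP.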
The gremlins script I use checks for carriage returns, formfeeds and vertical tabs. I find them occasionally, scattered among various DwC items. I can’t remember ever finding them in taxon names.
Note: Discourse may not show it very well, but “gremlins script” above is hyperlinked.
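For readers who want to try a similar check without the linked script, here is a rough Python sketch. It is not the gremlins script itself, only an illustration of the idea: tally carriage returns, form feeds, vertical tabs and NBSPs per column. It assumes tab-separated text with a header row and LF line endings (with CRLF endings the last column would report a carriage return on every row), and the name `tally_gremlins` and the sample data are invented.

```python
from collections import Counter

# Characters to look for, with readable labels.
GREMLINS = {
    "\r": "carriage return",
    "\f": "form feed",
    "\v": "vertical tab",
    "\u00a0": "no-break space",
}

def tally_gremlins(tsv_text):
    """Count gremlin characters per (column name, character label)."""
    # Split on "\n" only: str.splitlines() would also break on \r, \f and \v.
    lines = [line for line in tsv_text.split("\n") if line]
    header = lines[0].split("\t")
    tally = Counter()
    for line in lines[1:]:
        for name, value in zip(header, line.split("\t")):
            for ch in value:
                if ch in GREMLINS:
                    tally[(name, GREMLINS[ch])] += 1
    return tally

# Invented sample with one of each gremlin.
sample = (
    "scientificName\tlocality\n"
    "Aus bus\u00a0Smith\tPerth\f\n"
    "Cus dus\tNear\vHo\rbart\n"
)
for (column, label), n in sorted(tally_gremlins(sample).items()):
    print(column, label, n)
```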
Correct. Why are they there? This is from a real-world dataset.
Understood @datafixer. First, I was trying to “see what you see” and look at what I can see in Notepad++ compared to Excel… so you could see how they look for an Excel user on a PC. And show others who might not have quite yet followed … what they might expect to see if they try.
@Debbie, thanks for that. I often recommend Notepad++ (and Geany) to Windows users, for both checking and editing. EmEditor is another good choice for the Windows-using community. Unfortunately the Excel habit seems to be hard to break.
Note @datafixer that while Excel tells me which rows have an NBSP (maybe more than one), it does not show me “where” they are in the row. Very sad. I would have to do some sort of FIND/REPLACE strategy such as @pentcheff suggests to “see” them in Excel. I do note that they often (not always) appear to be in between the scientific name and authority. (As I think @Barkworth hinted at? I wonder if it’s from copy-n-paste from something in addition to the concat).
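Outside Excel, the same find/replace idea can be sketched in a few lines of Python: swap each NBSP for a visible marker and report the row and column where it was found. This assumes comma-separated text; `locate_nbsp`, the `[NBSP]` marker and the sample rows are all made up for illustration.

```python
import csv
import io

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

def locate_nbsp(csv_text, marker="[NBSP]"):
    """Return (row number, column number, cell with NBSPs made visible)."""
    hits = []
    for row_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        for col_no, cell in enumerate(row, start=1):
            if NBSP in cell:
                hits.append((row_no, col_no, cell.replace(NBSP, marker)))
    return hits

# Invented sample: an NBSP sits between the name and the authority.
sample = "scientificName,locality\nAus bus\u00a0Smith,Perth\n"
for hit in locate_nbsp(sample):
    print(hit)  # (2, 1, 'Aus bus[NBSP]Smith')
```

Row and column numbers are 1-based and the header counts as row 1, so they should line up with what a spreadsheet shows.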
@Debbie Good bet — if I were typesetting with species names + authorities, I’d be tempted to put NBSPs between the species epithet and the first token of the authority — a line break between those could leave a somewhat confusing couple of lines of text.
@pentcheff, if that’s what’s happening, I have to wonder why biodiversity data compilers (not their sources) would think they need to typeset. They might also be italicising genus and species names, which is also typesetting, but unlike such formatting the NBSPs won’t disappear when the data are simplified to plain text.
I think someone along the way may be copy/pasting from a source that was intended for (or the product of) a published output. An example could be an institution’s type catalogue. Something like that might have been created in a pre-digitization era. Then, when an institutional specimen database is instantiated, staff might create the specimen entries “by hand”, one by one, and quite reasonably would copy/paste from the legacy catalogue (or the “final copy” used to create it).
That thought caused me to wonder: what common formats, when copy/pasted, lead to including an NBSP? I ran the following tests (context: macOS Sonoma, MS Word for Mac 16.78.3, macOS Preview 11.0, Apple Pages 13.2, Affinity Designer 2.4.1). In all cases, the original text contained mixed spaces and NBSPs.
– Pages original: copy/paste elsewhere yields NBSPs & spaces.
– Word original: copy/paste elsewhere yields all spaces.
– PDF exported from Pages open in Preview: copy/paste yields all spaces.
– PDF “saved as” from Word open in Preview: copy/paste yields all spaces.
– PDF exported from Pages opened in Affinity Designer: NBSPs & spaces.
– PDF exported from Word opened in Affinity Designer: all spaces.
I don’t think I learned much except “It depends…”.
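For anyone repeating these tests on another platform, one way to tell whether pasted text kept its NBSPs is to tally the whitespace characters by their Unicode names. A small Python sketch, assuming the pasted text has been saved or assigned to a string (the function name `space_report` and the sample string are invented):

```python
import unicodedata
from collections import Counter

def space_report(text):
    """Tally whitespace characters in text by Unicode name."""
    report = Counter()
    for ch in text:
        if ch.isspace():
            # Control characters such as tab have no Unicode name,
            # so fall back to the code point.
            label = unicodedata.name(ch, "U+%04X" % ord(ch))
            report[label] += 1
    return report

# Invented sample: one ordinary space, one NBSP.
pasted = "Aus bus\u00a0Smith"
print(space_report(pasted))
```

A result with only `SPACE` corresponds to "all spaces" in the list above; a result that also lists `NO-BREAK SPACE` corresponds to "NBSPs & spaces".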
@pentcheff, thanks for checking on your Mac. I had no luck at all retaining NBSPs when copy-pasting from a range of desktop and online apps into plain-text apps, and as I say in my post, I couldn’t find NBSPs either in “the usual sources” (online) or in the database and publication sources that authors said they used for their entries.
And yet NBSPs are very common in datasets shared with GBIF.