One way to find non-Latin characters when you don’t know any programming

mgrosjean · September 30, 2021, 9:19am

This is just a short post to share one simple way to identify non-Latin characters before you publish your data out into the world.

To our eyes, “а” and “a” are the same letter, but they aren’t to a machine. The difference between them is the way they are encoded (how they are stored in a file and how the computer reads them). Character encoding matters when it comes to automated data processing: a name might not be recognized because one of its letters is encoded differently than expected.
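For readers who are curious, here is a minimal Python sketch of what that difference looks like to a machine. It assumes the look-alike letter is the Cyrillic "а" (U+0430); the helper name `describe` is made up for this illustration.

```python
import unicodedata


def describe(ch):
    """Return the code point, the Unicode name and the UTF-8 bytes of one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8")


cyrillic_a = "\u0430"  # assumption: the look-alike is the Cyrillic small letter a
latin_a = "a"

print(describe(cyrillic_a))   # ('U+0430', 'CYRILLIC SMALL LETTER A', b'\xd0\xb0')
print(describe(latin_a))      # ('U+0061', 'LATIN SMALL LETTER A', b'a')
print(cyrillic_a == latin_a)  # False: the two letters only look identical
```

The two letters have different code points and are stored as different bytes, which is why a lookup for a name typed with one will not match the same name typed with the other.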

This is why it is good practice to check your data for character encoding issues. This is especially important for people who use keyboards with special characters or switch between different alphabets.

There are many ways to check for those potentially problematic characters (see, for example, this blogpost). Here is one simple recipe that doesn’t require any programming skills or understanding of regular expressions.

Recipe to find non-Latin characters when you don’t know any programming:

You will need software that can use regular expressions for searches. For this little tutorial, I used Sublime Text, which you can download here. Many other programs would also work, for example OpenRefine.
Make sure that your data is saved as a CSV UTF-8 file or a Unicode Text file (the screenshot below shows how an Excel file can be saved as Unicode Text on Windows).

[Screenshot: saving an Excel file as Unicode Text on Windows]
Open your data in your software. In Sublime Text, you can open the file by clicking File > Open and then choosing the file.
Open the search box. In Sublime Text, you can do so by clicking Find > Find…
Enable searching with regular expressions. In Sublime Text, click the lower-left button with the symbols .* (see the screenshot at the bottom).
Paste the following text into the search bar, including the square brackets: [^\x00-\x7F]

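At that point, every non-ASCII character should be highlighted. This covers non-Latin characters, but also accented Latin letters and typographic punctuation, because the expression matches anything outside the basic ASCII range. You can click on “Find” or “Find All” depending on how you prefer to work.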
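If you ever want to run the same check outside a text editor, the recipe can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function name `find_non_ascii` and the sample rows are made up, and the same regular expression as above is reused unchanged.

```python
import re
import unicodedata

# The same pattern as in the recipe: any character outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def find_non_ascii(lines):
    """Yield (line number, column, character, Unicode name) for every match."""
    for line_no, line in enumerate(lines, start=1):
        for match in NON_ASCII.finditer(line):
            ch = match.group()
            yield line_no, match.start() + 1, ch, unicodedata.name(ch, "UNKNOWN")


# Made-up sample rows; the last one hides a Cyrillic "а" inside "Puma".
sample = [
    "scientificName,recordedBy",
    "Puma concolor,M\u00fcller",
    "Pum\u0430 concolor,Smith",
]

for hit in find_non_ascii(sample):
    print(hit)
# (2, 16, 'ü', 'LATIN SMALL LETTER U WITH DIAERESIS')
# (3, 4, 'а', 'CYRILLIC SMALL LETTER A')
```

To scan a real file instead of the sample, you could pass an open file to the same function, for example `find_non_ascii(open("data.csv", encoding="utf-8"))`, where `data.csv` is a placeholder for your own UTF-8 file. Printing the Unicode name helps to tell a harmless accented letter from a look-alike in another alphabet.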

What is your favourite method to find those characters? Please share it with us!

markus · October 2, 2021, 9:59am

A note for publishers of checklists: the new COL ChecklistBank flags scientific names that contain unusual characters, which include all non-Latin characters. This makes it possible to quickly spot names with German umlauts and accented characters, but also unusual punctuation, as you can see in this example: ChecklistBank

administrador_sibm · October 6, 2021, 10:51pm

Terrific! This is excellent for all of us who are not programmers! Thank you.
