One way to find non-Latin characters when you don’t know any programming

This is just a short post to share one simple way to identify non-Latin characters before you publish your data out in the world.

To our eyes “а” and “a” are the same letter, but they aren’t to a machine. The difference between them is the way they are encoded (how they are stored in a file and how the computer reads them). Character encoding matters when it comes to automated data processing. A name might not be recognized because one of its letters is encoded differently than expected.
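For readers who do have Python at hand, here is a minimal sketch of what “encoded differently” means. It assumes, purely for illustration, that the two look-alike letters are the Latin a (U+0061) and the Cyrillic а (U+0430); the helper name `describe` is made up for this example.

```python
import unicodedata

# Two letters that look identical on screen (illustrative assumption:
# a Latin "a" and a Cyrillic "а", written as an escape to avoid ambiguity).
latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe(latin_a))
print(describe(cyrillic_a))

# The computer compares code points, not shapes, so these are not equal.
print(latin_a == cyrillic_a)

# In a UTF-8 file the Latin letter takes one byte, the Cyrillic one takes two.
print(latin_a.encode("utf-8"), cyrillic_a.encode("utf-8"))
```

This is why a name containing the Cyrillic letter will not match the same name typed with the Latin one, even though both look identical on screen.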

This is why it is good practice to check your data for character encoding issues. This is especially important for people who use keyboards with special characters or switch between different alphabets.

There are many ways to check for those potentially problematic characters (see, for example, this blogpost). Here is one simple recipe that you can try that doesn’t require any programming skills or understanding of regular expressions.

Recipe to find non-Latin characters when you don’t know any programming:

  1. You will need software that can use regular expressions for searches. For this little tutorial, I am using Sublime Text, which you can download here. But many other programs would do, for example OpenRefine.

  2. Make sure that your data is saved as a CSV UTF-8 file or a Unicode Text file (the screenshot below shows how an Excel file can be saved as Unicode Text on Windows).

  3. Open your data in your software. With Sublime Text, you can open the file by clicking File > Open and then choosing the file.

  4. Open the search box. On Sublime Text, you can do so by clicking > Find > Find…

  5. Enable regular expression search. In Sublime Text, click the button in the lower left with the symbols .* (see the screenshot at the bottom).

  6. Paste the following text in the search bar including the square brackets: [^\x00-\x7F]

At that point, every character outside the basic ASCII range should be highlighted. That includes non-Latin letters, but also accented Latin letters and special punctuation. You can click on “Find” or “Find All” depending on how you prefer to work.
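If you later want to automate the same check, the recipe above can be sketched in a few lines of Python using the exact same pattern. This is only an illustration, not part of the recipe: the function name `find_non_ascii` and the sample names are made up for the example.

```python
import re

# Same pattern as in step 6: any character outside the ASCII range 0x00-0x7F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def find_non_ascii(lines):
    """Return (line number, column, character) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for match in NON_ASCII.finditer(line):
            # Columns are reported starting at 1, like in a text editor.
            hits.append((line_no, match.start() + 1, match.group()))
    return hits

# Hypothetical sample data: line 2 hides a Cyrillic letter, line 3 has an umlaut.
sample = ["Puma concolor", "Pum\u0430 concolor", "M\u00fcller, 1776"]

for line_no, col, ch in find_non_ascii(sample):
    print(f"line {line_no}, column {col}: {ch!r} (U+{ord(ch):04X})")
```

To check a real file, you could pass an open file instead of the sample list, for example `find_non_ascii(open("my_data.csv", encoding="utf-8"))`, where `my_data.csv` stands in for your own file name.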

What is your favourite method to find those characters? Please share it with us!

A note for publishers of checklists: the new COL ChecklistBank flags scientific names that contain unusual characters, including all non-Latin characters. This makes it possible to quickly spot names with German umlauts and accented characters, but also unusual punctuation, as you can see in this example: ChecklistBank

Terrific! This is excellent for all of us who are not programmers! Thank you.
