Szathmáry László's homepage @ DEIK | Nim2 / Remove accents

Task: "Árvíztűrő tükörfúrógép" → "Arvizturo tukorfurogep"

Simple way (replace)

A simple solution is to build a dictionary that maps each accented character to its replacement. Example:

# Python
accents = {
  'á': 'a',
  'ü': 'u',
  ...
}

It works fine with Hungarian text (my case), but what if I want to work with French text? Or with Czech or Polish text? I would have to keep updating my dictionary, at the risk of missing something.
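The limitation can be seen in a short sketch, written in Python like the snippet above. The table here is only an illustrative assumption (the nine Hungarian accented vowels plus their uppercase forms), and `HUNGARIAN_ACCENTS` and `remove_accents` are hypothetical names, not part of any library.

```python
# Illustrative table (an assumption for this sketch): Hungarian accented
# vowels in lowercase; the uppercase pairs are derived from it below.
LOWER = {
    'á': 'a', 'é': 'e', 'í': 'i',
    'ó': 'o', 'ö': 'o', 'ő': 'o',
    'ú': 'u', 'ü': 'u', 'ű': 'u',
}
HUNGARIAN_ACCENTS = {**LOWER, **{k.upper(): v.upper() for k, v in LOWER.items()}}

# str.maketrans accepts a dict that maps single characters to replacement strings
TABLE = str.maketrans(HUNGARIAN_ACCENTS)


def remove_accents(text: str) -> str:
    """Replace every character found in the table; leave the rest untouched."""
    return text.translate(TABLE)


print(remove_accents("Árvíztűrő tükörfúrógép"))  # Arvizturo tukorfurogep
print(remove_accents("kůň"))                     # kůň (Czech letters are not in the table)
```

The Hungarian sentence comes out clean, while the Czech word passes through unchanged because none of its accented letters are in the table.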

Is there a more robust solution?

The NFD way (keep ASCII characters only)

A general solution is to NFD-normalize the string. This means that an accented character ('é') is decomposed into two code points: the base character 'e' followed by a combining acute accent (U+0301). Then, when you iterate over the NFD-normalized string, you can keep just the ASCII characters and drop everything else.
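To make the decomposition visible, here is a small Python sketch that uses the standard `unicodedata` module. It is only an illustration of the idea, not the Nim implementation; the function name `asciify` is chosen here for the sketch.

```python
import unicodedata

# "\u00e9" is the precomposed 'é' (one code point)
decomposed = unicodedata.normalize("NFD", "\u00e9")

# After NFD it becomes two code points: the base letter and a combining mark
print([f"U+{ord(c):04X} {unicodedata.name(c)}" for c in decomposed])
# ['U+0065 LATIN SMALL LETTER E', 'U+0301 COMBINING ACUTE ACCENT']


def asciify(text: str) -> str:
    """Keep only the code points below 128 after NFD normalization."""
    return "".join(c for c in unicodedata.normalize("NFD", text) if ord(c) < 128)


print(asciify("Árvíztűrő tükörfúrógép"))  # Arvizturo tukorfurogep
```

The combining marks all lie outside the ASCII range, so the filter removes them and leaves the base letters.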

The Nim stdlib does not implement NFD normalization, so we need a third-party package called normalize:

$ nimble install normalize

Example:

import pkg/normalize

func asciify(text: string): string =
  for c in toNFD(text):
    # keep only the code points in the ASCII range
    if ord(c) < 128:
      result.add(char(c))

let text = "Árvíztűrő tükörfúrógép"
echo text.asciify # Arvizturo tukorfurogep

See the package's documentation for details. toNFD() is an iterator; it iterates over the runes of the normalized string. ASCII characters fall in the closed interval [0, 127]. Since the type of `c` is Rune, we need to convert it back to char before appending it to the result.

Pros and cons

This solution is simple, works with many languages, and the compiled binary stays small despite the third-party library. However, it drops every non-ASCII character that has no decomposition:

echo "Außländer".asciify() # Aulander

It doesn't handle 'ß' → 'ss' style substitutions.
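One possible workaround, sketched in Python under the assumption that a small hand-written table is acceptable, is to substitute the characters that have no decomposition before filtering. The `FALLBACK` table and the name `asciify_with_fallback` are hypothetical, and the table is deliberately tiny and far from complete.

```python
import unicodedata

# Hypothetical, incomplete table for characters without a canonical
# decomposition (NFD leaves them as a single non-ASCII code point).
FALLBACK = str.maketrans({'ß': 'ss', 'ł': 'l', 'Ł': 'L', 'ø': 'o', 'Ø': 'O'})


def asciify_with_fallback(text: str) -> str:
    """Apply the fallback table first, then NFD-normalize and keep ASCII only."""
    substituted = text.translate(FALLBACK)
    decomposed = unicodedata.normalize("NFD", substituted)
    return "".join(c for c in decomposed if ord(c) < 128)


print(asciify_with_fallback("Außländer"))  # Ausslander
print(asciify_with_fallback("Łódź"))       # Lodz
```

This partly brings back the maintenance burden of the dictionary approach, since every character without a decomposition has to be listed by hand.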

The stdlib way

See std/unidecode

Actually, we can solve this problem by using the stdlib too. It also handles the 'ß' → 'ss' style substitutions.

The only downside is that the binary will be quite big. Why? This module needs the data file unidecode.dat to work, so this file is embedded into your application as a resource. Under Linux, a simple example in a debug build is about 3.8 MB. By making a small build (without UPX), I could reduce its size to 2.8 MB.

import std/unidecode

let
  text1 = "Árvíztűrő tükörfúrógép"
  text2 = "Außländer"

echo text1.unidecode # Arvizturo tukorfurogep

echo text2.unidecode # Ausslander