Remove accents

Task: "Árvíztűrő tükörfúrógép" → "Arvizturo tukorfurogep"

Simple way (replace)

A simple solution is to build a dictionary that maps each accented character to its replacement. Example:
# Python
accents = {
    'á': 'a',
    'ü': 'u',
    ...
}
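A minimal sketch of how such a dictionary might be applied, using Python's built-in str.maketrans and str.translate. The table below is a hypothetical mapping that covers only the Hungarian accented vowels, and the names build_table and remove_accents are made up for this illustration.

```python
# Hypothetical mapping: covers only the Hungarian accented vowels.
ACCENTS = {
    'á': 'a', 'é': 'e', 'í': 'i',
    'ó': 'o', 'ö': 'o', 'ő': 'o',
    'ú': 'u', 'ü': 'u', 'ű': 'u',
}


def build_table(mapping):
    """Add uppercase variants to a lowercase mapping and compile it."""
    full = dict(mapping)
    full.update({k.upper(): v.upper() for k, v in mapping.items()})
    return str.maketrans(full)


TABLE = build_table(ACCENTS)


def remove_accents(text):
    """Replace every character found in the table; leave the rest alone."""
    return text.translate(TABLE)


print(remove_accents("Árvíztűrő tükörfúrógép"))  # Arvizturo tukorfurogep
print(remove_accents("déjà vu"))                 # dejà vu ('à' is not in the table)
```

The second call shows the weak point of this approach: any character missing from the table passes through unchanged.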
It works fine with Hungarian text (my use case), but what if I want to process French text? Or Czech or Polish? I would have to keep extending the dictionary, always at the risk of missing a character. Is there a more robust solution?

The NFD way (keep ASCII characters only)

A general solution is to NFD-normalize the string. NFD normalization decomposes an accented character such as 'é' into two code points: the base character 'e' followed by a combining acute accent (U+0301). When you then iterate over the NFD-normalized string, you can keep just the ASCII characters and drop everything else.

NFD normalization is not implemented in the Nim stdlib, so we need a 3rd-party package called normalize:

$ nimble install normalize

Example:

# Nim
import pkg/normalize

func asciify(text: string): string =
  for c in toNFD(text):
    if ord(c) < 128:
      result.add(char(c))

let text = "Árvíztűrő tükörfúrógép"

echo text.asciify    # Arvizturo tukorfurogep

See the package's documentation for details.

Pros and cons

This solution is simple, works with a lot of languages, and the 3rd-party library doesn't noticeably increase the size of the compiled binary. However, it removes every non-ASCII character that has no decomposition:

echo "Außländer".asciify()    # Aulander

It doesn't handle 'ß' → 'ss' style substitutions.

The stdlib way

See std/unidecode

Actually, we can solve this problem with the stdlib too. It also handles the 'ß' → 'ss' style substitutions. The only downside is that the binary will be quite big. Why? This module needs a data file, and by default that file is embedded into the compiled binary.

# Nim
import std/unidecode

let
  text1 = "Árvíztűrő tükörfúrógép"
  text2 = "Außländer"

echo text1.unidecode    # Arvizturo tukorfurogep
echo text2.unidecode    # Ausslander
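For comparison with the Python dictionary example at the top, the NFD approach can also be sketched with Python's standard unicodedata module. The function name asciify simply mirrors the Nim example. This illustrates the technique and is not a port of the normalize package.

```python
import unicodedata


def asciify(text):
    """NFD-normalize the text, then keep only the ASCII code points."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if ord(c) < 128)


# 'é' decomposes into 'e' (U+0065) plus a combining acute accent (U+0301).
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "é")])  # ['0x65', '0x301']

print(asciify("Árvíztűrő tükörfúrógép"))  # Arvizturo tukorfurogep
print(asciify("Außländer"))               # Aulander ('ß' has no decomposition, so it is dropped)
```

As in the Nim version, 'ß' disappears instead of becoming 'ss', because filtering on ASCII only helps for characters that decompose into an ASCII base plus combining marks.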