Szathmáry László honlapja @ DEIK | Nim2 / Iterate over the characters of a Unicode string

Problem

import std/strformat # &"Hello {name}!"

let name = "László"

for idx, c in name:
  echo &"{idx}: {c}"

Output:

0: L
1: �
2: �
3: s
4: z
5: l
6: �
7: �

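Reason: the string is stored in UTF-8, where 'á' and 'ó' each occupy 2 bytes. When you iterate over a string, you actually iterate over its bytes, not its characters, because in Nim a char is a single unsigned byte. Each half of a 2-byte character is not valid UTF-8 on its own, so it is printed as �.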
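The byte/character mismatch can be checked directly. The following is a minimal sketch, assuming the source file is saved as UTF-8 and the accented letters are precomposed (a single code point each): len counts bytes, while runeLen from std/unicode counts Unicode code points.

```nim
import std/unicode  # runeLen

let name = "László"

# len counts bytes: L(1) + á(2) + s(1) + z(1) + l(1) + ó(2)
echo name.len      # 8

# runeLen counts Unicode code points
echo name.runeLen  # 6
```

The 8 bytes correspond to the indices 0 to 7 in the output above.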

Solution

In UTF-8, a Unicode character (code point) occupies 1, 2, 3 or 4 bytes. A Rune is a 32-bit integer, so it can hold any Unicode code point. The idea is to convert the string to a sequence of runes and iterate over the runes, where each rune represents one Unicode character.

import std/strformat # &"Hello {name}!"
import std/unicode # Rune

let name = "László"

for idx, c in name.toRunes():
  echo &"{idx}: {c}"

Output:

0: L
1: á
2: s
3: z
4: l
5: ó
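To illustrate the 1 to 4 byte widths mentioned in the Solution, here is a small sketch that encodes each rune back to UTF-8 with toUTF8 and measures the result. The sample string is only an illustration (one character per width) and again assumes a UTF-8 source file.

```nim
import std/unicode  # toRunes, toUTF8

# One sample character for each UTF-8 width: 1, 2, 3 and 4 bytes.
let samples = "aá€😀"

for r in samples.toRunes():
  # toUTF8 converts the rune back to a UTF-8 encoded string;
  # the length of that string is the rune's width in bytes.
  echo r, " takes ", r.toUTF8.len, " byte(s)"
```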

proc toRunes(s: string): seq[Rune]

It returns a sequence of Runes.

iterator runes(s: openArray[char]): Rune

This is an iterator: it yields the runes one at a time instead of building a whole sequence first. If you want to process every character of a huge text, it can be cheaper.
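A minimal sketch of the same loop written with the runes iterator. Since the iterator yields only the rune, this version keeps its own counter (idx is just an illustrative name) in place of the index that iterating over the sequence provided.

```nim
import std/unicode  # runes

let name = "László"

# runes decodes one Rune at a time, so no seq[Rune] is allocated.
var idx = 0
for r in name.runes:
  echo idx, ": ", r
  inc idx
```

It should print the same six lines as the toRunes() version.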