Little things that matter in language design
Posted Jun 9, 2013 20:09 UTC (Sun) by khim (subscriber, #9252)
In reply to: Little things that matter in language design by tialaramex
Parent article: Little things that matter in language design
For this purpose it provides UAX #15
Which nobody uses in programming languages, for performance reasons.
If you don't do this normalisation step you can end up with a confusing situation where when the programmer types a symbol (in their text editor which happens to emit pre-combined characters) the toolchain can't match it to a visually and lexicographically identical character mentioned in another file which happened to be written with separate combining characters. This would obviously be very frustrating.
It's not as frustrating as you think. They don't type ı followed by ˙; they just type i. The same goes for the other cases. Any other approach is crazy. Why? Well, because many programming languages will show ı combined with ˙ as "ı˙", not as "ı̇".
You may say that ı˙ is not a canonical representation of "i". OK. "и" plus " ̆" is a canonical representation of "й" (its canonical decomposition, U+0438 followed by U+0306). Try this for size:
$ cat test.c
#include <stdio.h>
int main(void) {
    /* D0 B8 CC 86 = "и" + combining breve; D0 B9 = precomposed "й" */
    printf("%c%c%c%c == %c%c\n", 0xD0, 0xB8, 0xCC, 0x86, 0xD0, 0xB9);
    return 0;
}
$ gcc test.c -o test
$ ./test | tee test.txt
й == й
Not sure about you, but on my system these two symbols only look similar when copy-pasted into a browser, and then only in the main window (if I copy-paste them into the "location" line they suddenly look different!). And of course these two symbols look different in GNOME Terminal, gedit, Emacs and other tools!
Thus, in the end you have two choices:
- Compare strings as sequences of bytes. Result: simple, clean, robust code, but the toolchain can't match [a symbol] to a visually and lexicographically identical character mentioned in another file
- Compare strings as UAX #15 says. Result: a huge pile of complicated code, and the toolchain can match a symbol to a visually and lexicographically different character mentioned in another file
Frankly, I don't see the second alternative as superior.