Little things that matter in language design

Posted Jun 9, 2013 20:09 UTC (Sun) by khim (subscriber, #9252)
In reply to: Little things that matter in language design by tialaramex
Parent article: Little things that matter in language design

> For this purpose it provides UAX #15

Which nobody uses in programming languages, for performance reasons.

> If you don't do this normalisation step you can end up with a confusing situation where when the programmer types a symbol (in their text editor which happens to emit pre-combined characters) the toolchain can't match it to a visually and lexicographically identical character mentioned in another file which happened to be written with separate combining characters. This would obviously be very frustrating.

It's not as frustrating as you think. They don't type ı followed by ˙, they just type i. And the same with other cases. Any other approach is crazy. Why? Well, because many programming languages will show ı combined with ˙ as "ı˙", not as "ı̇".

You may say that ı˙ is not a canonical representation of "i". OK. "и" plus " ̆" is the canonical decomposition of "й". Try this for size:

$ cat test.c
#include <stdio.h>

/* UTF-8 bytes: U+0438 U+0306 (decomposed) on the left, U+0439 (precomposed) on the right. */
int main(void) {
    printf("%c%c%c%c == %c%c\n", 0xD0, 0xB8, 0xCC, 0x86, 0xD0, 0xB9);
    return 0;
}
$ gcc test.c -o test
$ ./test | tee test.txt
й == й
Not sure about you, but on my system these two symbols only look similar when copy-pasted in browser - and then only in the main window (if I copy-paste them to the "location" line they suddenly look different!). And of course these two symbols are different in GNOME terminal, gEdit, Emacs and other tools!
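
Since rendering clearly varies between terminals, editors and browsers, one way to check what a file really contains is to list the code points rather than trust the glyphs. A minimal Python sketch, assuming the same six bytes that test.c prints (the helper name `describe` is made up for illustration):

```python
import unicodedata

# The same UTF-8 bytes that test.c prints on either side of "==".
decomposed = bytes([0xD0, 0xB8, 0xCC, 0x86]).decode("utf-8")
precomposed = bytes([0xD0, 0xB9]).decode("utf-8")

def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

print(describe(decomposed))   # two code points: base letter + combining mark
print(describe(precomposed))  # one code point
```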

Thus, in the end you have two choices:

1. Compare strings as sequences of bytes. Result: simple, clean, robust code, but the toolchain can't match [symbol] to a visually and lexicographically identical character mentioned in another file.
2. Compare strings as UAX #15 says. Result: a huge pile of complicated code, and the toolchain can match a symbol to a visually and lexicographically different character mentioned in another file.

Frankly, I don't see the second alternative as superior.
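
For concreteness, the two alternatives can be put side by side in a few lines. This is only a sketch using Python's standard `unicodedata` module, not what any particular compiler does; the function names `same_bytes` and `same_nfc` are hypothetical.

```python
import unicodedata

def same_bytes(a, b):
    """Alternative 1: compare the encoded byte sequences as-is."""
    return a.encode("utf-8") == b.encode("utf-8")

def same_nfc(a, b):
    """Alternative 2: normalise both sides to NFC (UAX #15) before comparing."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

decomposed = "\u0438\u0306"   # CYRILLIC SMALL LETTER I + COMBINING BREVE
precomposed = "\u0439"        # CYRILLIC SMALL LETTER SHORT I

print(same_bytes(decomposed, precomposed))  # False
print(same_nfc(decomposed, precomposed))    # True
```

The byte comparison needs no tables at all; the NFC comparison depends on the Unicode character database shipped with the runtime.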


Little things that matter in language design

Posted Jun 9, 2013 20:48 UTC (Sun) by hummassa (guest, #307) [Link] (7 responses)

> Which nobody uses in programming languages, for performance reasons.

(UAX-15). I use it. Perl offers NFC, NFD, NFKC, NFKD without a huge perceivable (to me) performance penalty. AFAICT MySQL uses it, too.

> It's not as frustrating as you think. They don't type ı followed by ˙, they just type i. And the same with other cases. Any other approach is crazy. Why? Well, because many programming languages will show ı combined with ˙ as "ı˙", not as "ı̇".

This silly example tells me you don't have diacritics in your name, do you? Sometimes the "ã" in my last name is on one of the Alt-Gr keys. Sometimes I have to enter it via vi digraphs, either as "a" + "~" or "~" + "a". Sometimes I click "a", "combine", "~" or "~", "combine", "a". Or "~" (it's combining on my current keyboard by default, so that if I want to type a tilde, I have to follow it with a space or type it twice) followed by "a".

> й == й
> Not sure about you, but on my system these two symbols only look similar when copy-pasted in browser - and then only in the main window (if I copy-paste them to the "location" line they suddenly look different!). And of course these two symbols are different in GNOME terminal, gEdit, Emacs and other tools!

It seems to me that your system is misconfigured. I could not see the difference between "й" and "й" on my computer, be it in Chrome's main window, location bar, gvim, or in yakuake's konsole window.

> Frankly, I don't see the second alternative as superior.

UAX15 is important. People sometimes type their names with or without diacritics (André versus Andre). Some names are in different databases with variant -- and database/time/platform dependent -- spellings. On some keyboards, a "ç" c-cedilla is a single character, on others, you punch first the cedilla dead key and then "c", and on others you type, for instance, the acute dead key followed by "c" (it's the case on the keyboard I'm typing on right now). Sometimes you have to say your name over the phone and the person on the other side of the call must be capable of searching the database by the perceived name. Someone could have entered "ﬁ" and another person is searching for "fi".
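
A rough sketch of that kind of loose search key, in Python with the standard `unicodedata` module. The function name `search_key` is hypothetical, and dropping diacritics goes beyond what UAX #15 itself specifies; only the NFKD step comes from the standard.

```python
import unicodedata

def search_key(name):
    """Fold a name to a loose key: NFKD, drop combining marks, casefold."""
    decomposed = unicodedata.normalize("NFKD", name)
    # combining() is non-zero for accents such as U+0301 and U+0327.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

print(search_key("Andr\u00e9") == search_key("Andre"))  # True
print(search_key("\ufb01") == search_key("fi"))         # True (ligature vs. two letters)
print(search_key("c\u0327") == search_key("\u00e7"))    # True (dead key vs. single key)
```

Such a key is deliberately lossy, so it suits searching a database by perceived name rather than deciding whether two strings are "the same".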

So, sometimes your "second alternative" is the only viable alternative. Anyway, the programming language should support "compare bytes" and "compare runes/characters" as two different use cases.

Little things that matter in language design

Posted Jun 9, 2013 21:14 UTC (Sun) by khim (subscriber, #9252) [Link] (2 responses)

> Anyway, the programming language should support "compare bytes" and "compare runes/characters" as two different use cases.

I may be mistaken, but it looks like you are discussing a completely different problem. Both tialaramex and I are talking about programming languages themselves.

> (UAX-15). I use it. Perl offers NFC, NFD, NFKC, NFKD without a huge perceivable (to me) performance penalty.

Really? Let me check:
$ cat test.pl
use utf8;

$й="This is test";

print "Combined version works: \"$й\"\n";
print "Decomposed version does not work: \"$й\"\n";
$ perl test.pl
Combined version works: "This is test"
Decomposed version does not work: ""

Am I missing something? What should I add to my program to make sure I can refer to $й as $й?
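
For comparison only, and not as an answer to the Perl question: Python 3 is one language that does normalise identifiers, converting them to NFKC while parsing (PEP 3131). The sketch below builds the source text from escapes so that the two spellings stay distinct whatever an editor or browser does to the listing.

```python
precomposed = "\u0439"        # й as a single code point
decomposed = "\u0438\u0306"   # и followed by COMBINING BREVE

namespace = {}
# Define a variable whose name is spelled with the precomposed form.
exec(f"{precomposed} = 'This is test'", namespace)

# Look it up using the decomposed spelling; the parser normalises the name.
value = eval(decomposed, namespace)
print(value)
```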

> It seems to me that your system is misconfigured. I could not see the difference between "й" and "й" on my computer, be it in Chrome's main window, location bar, gvim, or in yakuake's konsole window.

Of course not! You've replaced all occurrences of "й" with "й", so of course there will be no difference! Not sure why you did that (perhaps your browser did it for you?), but if you do a "view source" on my message you'll see a difference; if you do the same with your message, both cases are byte-for-byte identical. It would be a little strange to see different symbols in such a case.

> UAX15 is important.

Sure. In databases, search systems and so on (where fuzzy matching is better than no matching) it's important. In programming languages? Not so much. Most of the time, when a language tries to save the programmer from himself (or herself), it just makes him (or her) miserable in the long (and even medium) term.

Little things that matter in language design

Posted Jun 10, 2013 16:12 UTC (Mon) by jzbiciak (guest, #5246) [Link]

Wow... Abusing the difference between й and й (and other cases of such fun) would make for some great obfuscated code. Or better yet, subtly malicious code.

Little things that matter in language design

Posted Jun 10, 2013 17:27 UTC (Mon) by hummassa (guest, #307) [Link]

> I may be mistaken, but it looks like you are discussing a completely different problem. Both tialaramex and I are talking about programming languages themselves.

You are right about this and I apologize for any confusion.

Little things that matter in language design

Posted Jun 9, 2013 23:38 UTC (Sun) by wahern (subscriber, #37304) [Link] (3 responses)

Perl6 also has NFG, which is probably the best normalization form out of all of them, although non-standard. It's not really even just a normalization form, but addresses issues of representation and comparison at the implementation level.

Using NFG solves all the low-level problems, including identifiers in source code, by getting rid of combining sequences altogether. Frankly I don't understand why it hasn't become more common. Maybe because most people just don't care about Unicode. Every individual has come to terms with the little issues with their locale. It's only when you look at all of them from 10,000 feet that you can see the cluster f*ck of problems. But few people look at it from 10,000 feet.

Little things that matter in language design

Posted Jun 11, 2013 1:07 UTC (Tue) by dvdeug (subscriber, #10998) [Link] (2 responses)

NFG isn't a normalization form at all. It doesn't get rid of combining sequences at all; it just invents dummy characters to hide combining sequences from the user. It's not that hard to generate a billion different combining sequences and potentially DoS any system using NFG. Ultimately, it's a lot of complexity for most systems that doesn't gain you that much over NFC.
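
To make the "dummy characters" idea concrete, here is a toy sketch in Python. It is not how Perl 6 implements NFG: the class name `GraphemeTable` is hypothetical, and treating "a base character plus following marks with a non-zero combining class" as a cluster is a simplification of real grapheme segmentation.

```python
import unicodedata

class GraphemeTable:
    """Map multi-code-point clusters to synthetic negative ids."""

    def __init__(self):
        self._ids = {}        # cluster string -> synthetic id
        self._clusters = []   # index (-id - 1) -> cluster string

    def _intern(self, cluster):
        if len(cluster) == 1:
            return ord(cluster)           # ordinary code point, no table entry
        if cluster not in self._ids:
            self._clusters.append(cluster)
            self._ids[cluster] = -len(self._clusters)
        return self._ids[cluster]

    def encode(self, text):
        """NFC first, then one integer per (simplified) grapheme cluster."""
        clusters = []
        for ch in unicodedata.normalize("NFC", text):
            if clusters and unicodedata.combining(ch):
                clusters[-1] += ch        # attach the mark to its base
            else:
                clusters.append(ch)
        return [self._intern(c) for c in clusters]

    def decode(self, ids):
        return "".join(chr(i) if i >= 0 else self._clusters[-i - 1] for i in ids)

table = GraphemeTable()
ids = table.encode("q\u0303a")   # q + COMBINING TILDE has no precomposed form
print(ids, len(ids))
```

In this naive version the table gains an entry for every distinct cluster it ever sees, which is the growth the denial-of-service concern is about.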

Little things that matter in language design

Posted Jun 13, 2013 1:19 UTC (Thu) by wahern (subscriber, #37304) [Link] (1 responses)

You can DoS any system that doesn't use the correct algorithms. There are ways of implementing NFG that don't require storing every cluster ever encountered.

And it's not like existing systems don't have their own issues. The nice thing about NFG is that all the complexity is placed at the edges, in the I/O layers. All the other code, including the rapidly developed code that is usually poorly scrutinized for errors, is provided a much safer and more convenient interface for manipulation of I18N text. NFG isn't more complex to implement than any other system that provides absolute grapheme indexing. It's grapheme indexing that is the most intuitive, because it's the model everybody has been using for generations.

But most languages merely aim for half measures, and are content leaving applications to deal w/ all the corner cases. This is why UTF-8 is so popular. And it is the best solution when your goal is pushing all the complexity onto the application.

Little things that matter in language design

Posted Jun 14, 2013 0:22 UTC (Fri) by dvdeug (subscriber, #10998) [Link]

This is the 21st century; for the most part, I don't index anything. I have iterators to do that work for me, and arbitrary cursors when I need a location. If I want to work with graphemes, I can step between graphemes. If I want to work with words, I can step between words.

Grapheme indexing is not what everybody has been using for generations. In the 60 years of computing history, there have been a lot of cases where people working with scripts more complex than ASCII or Chinese have handled it in a number of ways, including character sets that explicitly encoded combining characters (like ISO/IEC 6937) and the use of BS with ASCII to overstrike characters like ^ with the previous character.

UTF-8 is so popular because for many purposes it's 1/4th the size of UTF-32, and for normal data never worse than 3/4 the size. And as long as you're messing with ASCII, you can generally ignore the differences. If people want UTF-32, it's easy to find.
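
The size arithmetic is easy to check. A small Python sketch (the helper name `sizes` is made up; the `utf-32-le` codec is used so that no byte-order mark is counted):

```python
def sizes(text):
    """Return (UTF-8 byte count, UTF-32 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))

print(sizes("int main"))       # (8, 32): ASCII is 1/4 the size
print(sizes("\u0439"))         # (2, 4):  Cyrillic is 1/2 the size
print(sizes("\u6f22\u5b57"))   # (6, 8):  BMP CJK is 3/4 the size
```

Only code points outside the Basic Multilingual Plane take four bytes in both encodings.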

Little things that matter in language design

Posted Jun 10, 2013 16:58 UTC (Mon) by tialaramex (subscriber, #21167) [Link] (2 responses)

The idea that programming languages don't use UAX #15 for symbol matching due to performance problems would be an easier sell if UAX #15 came anywhere near the difficulty of something like C++ symbol mangling.

You seem to be suffering some quite serious display problems with non-ASCII text on your system. I don't know what to suggest, other than that maybe you can find someone to help figure out what you did wrong, or upgrade to something a bit more modern. I've seen glitches like those you describe, but mostly quite some years ago. Your example program displays two visually identical characters on my system. I can believe your system doesn't do this, but I would point out that it's /a bug/.

Even allowing for that your last paragraph is hard to understand. Are you claiming that because on your system some symbols are rendered incorrectly depending on how they were encoded those symbols are _different_ lexicographically and everybody else (who can't see these erroneous display differences) should accept that?

Little things that matter in language design

Posted Jun 11, 2013 9:07 UTC (Tue) by etienne (guest, #25256) [Link] (1 responses)

Just a $0.02:
> You seem to be suffering some quite serious display problems with non-ASCII text on your system

It seems (some) people want to use a fixed-width font to write programs, mostly because some Quality Enhancement Program declared the TAB character obsolete, and the width of the SPACE character is not constant in editors using variable-width fonts.
Most programming languages need indentation...
With non-ASCII chars in a fixed-width font, if you even get the char shape in the font you are using, the only solution is probably to start drawing each char every N (constant) pixels and let the end of large chars overlap the beginning of the next char...
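
As a rough illustration of the fixed-cell model, here is a Python sketch that counts how many character cells a string would occupy, assuming combining marks take no cell and East Asian wide or fullwidth characters take two. The function name `columns` is hypothetical, and real terminals differ in the details.

```python
import unicodedata

def columns(text):
    """Estimate the number of fixed-width cells needed to display text."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # a combining mark is drawn over the previous cell
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        total += 2 if wide else 1
    return total

print(columns("abc"))            # 3
print(columns("\u6f22\u5b57"))   # 4: two wide characters
print(columns("\u0438\u0306"))   # 1: base letter plus combining breve
```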

Little things that matter in language design

Posted Jun 11, 2013 10:13 UTC (Tue) by mpr22 (subscriber, #60784) [Link]

I use a fixed-width font to write code chiefly out of pure inertia: most of my coding is done in text editors running in character-cell terminals. Code written in Inform 7 is an exception (the Inform 7 IDE's editor uses a proportional font by default, and the IDE is so well-adapted to the needs of typical Inform 7 programming that not using it is silly), but Inform 7 statements look like (somewhat stilted) English prose so I don't mind so much.