Little things that matter in language design

Posted Jun 13, 2013 1:19 UTC (Thu) by wahern (subscriber, #37304)
In reply to: Little things that matter in language design by dvdeug
Parent article: Little things that matter in language design

You can DoS any system that doesn't use the correct algorithms. There are ways of implementing NFG that don't require storing every cluster ever encountered.

And it's not like existing systems don't have their own issues. The nice thing about NFG is that all the complexity is placed at the edges, in the I/O layers. All the other code, including the rapidly developed code that is usually poorly scrutinized for errors, is provided a much safer and more convenient interface for manipulation of I18N text. NFG isn't more complex to implement than any other system that provides absolute grapheme indexing. It's grapheme indexing that is the most intuitive, because it's the model everybody has been using for generations.

But most languages merely aim for half measures, and are content leaving applications to deal w/ all the corner cases. This is why UTF-8 is so popular. And it is the best solution when your goal is pushing all the complexity onto the application.

to post comments

Little things that matter in language design

Posted Jun 14, 2013 0:22 UTC (Fri) by dvdeug (subscriber, #10998) [Link]

This is the 21st century; for the most part, I don't index anything. I have iterators to do that work for me, and arbitrary cursors when I need a location. If I want to work with graphemes, I can step between graphemes. If I want to work with words, I can step between words.

Grapheme indexing is not what everybody has been using for generations. In the 60 years of computing history, there have been a lot of cases where people working with scripts more complex then ASCII or Chinese have handled it a number of ways, including character sets that explicitly encoded combining characters (like ISO/IEC 6937) and the use of BS with ASCII to overstrike characters like ^ with the previous character.

UTF-8 is so popular because for many purposes it's 1/4th the size of UTF-32, and for normal data never worse then 3/4 the size. And as long as you're messing with ASCII, you can generally ignore the differences. If people want UTF-32, it's easy to find.