Little things that matter in language design

Posted Jun 15, 2013 22:52 UTC (Sat) by dvdeug (subscriber, #10998)
In reply to: Little things that matter in language design by khim
Parent article: Little things that matter in language design

"Unicode didn't happen in one step" is blaming Unicode for the entire history of computing.

If you don't care whether the Turks are going to use your character set, go ahead and tell them to use ASCII. If you choose to separate their alphabet from the Latin, you're going to have a problem: they consider their alphabet part of the extended Latin alphabet, and they're not going to find that an acceptable solution. If you choose to separate out the alphabets of thousands of languages (even though the English alphabet is a superset of the French and Latin), you might mollify the Turks, but nobody is going to use your character set.

In reality, Turkish support requires locale-sensitive casing functions because every other solution has serious technical and often political problems, as well as not being compatible with existing systems, including keyboards.
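As a rough illustration of what "locale-sensitive casing" means in practice, here is a minimal Python sketch. The helper names `turkish_upper` and `turkish_lower` are hypothetical, not part of any library; Python's built-in `str.upper()` and `str.lower()` apply the locale-independent default Unicode mappings, so the Turkish rules are layered on top by hand. The sketch assumes precomposed input and ignores decomposed sequences such as I followed by U+0307.

```python
# Hypothetical helpers, written for illustration only.
# Assumption: input uses precomposed characters (no "I" + combining dot above).

def turkish_upper(text: str) -> str:
    """Uppercase with Turkish rules: i -> İ (U+0130), ı (U+0131) -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish rules: İ (U+0130) -> i, I -> ı (U+0131)."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


if __name__ == "__main__":
    word = "istanbul"
    print(word.upper())          # default mapping, dotless capital I
    print(turkish_upper(word))   # Turkish mapping, dotted capital İ
    print(turkish_lower("ISPARTA"))
```

The same code points give different results depending on which function is called, which is exactly why the language has to be supplied from outside the string.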


Little things that matter in language design

Posted Jun 16, 2013 3:30 UTC (Sun) by hummassa (guest, #307)

...
> In reality, Turkish support requires locale-sensitive casing functions
...

Let's be plain: there are no "casing functions" that are not locale-sensitive. The Turkish dotted "i"s are one example, the German vs. Austrian "ß" is another, etc. And don't get me started on collation order. If one is going to try to facilitate computations by separating each locale to an alphabet, I wish good luck with its newnicode. The real Unicode thankfully does not work that way. Usually, at least. :-D
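One way to see why locale-independent casing falls short is to check whether a string survives an upper-then-lower round trip under Python's default mappings. This is only a sketch; `survives_case_round_trip` is a made-up helper name, and the sample words are chosen purely for illustration.

```python
def survives_case_round_trip(text: str) -> bool:
    """Return True if upper() followed by lower() gives back the input."""
    return text.upper().lower() == text


if __name__ == "__main__":
    # "straße" uppercases to "STRASSE", which lowercases to "strasse".
    # Dotless "ı" (U+0131) uppercases to "I", which lowercases to dotted "i".
    for sample in ["strasse", "stra\u00dfe", "\u0131"]:
        print(repr(sample), survives_case_round_trip(sample))
```

Both failures come from the default mapping having to pick one answer without knowing which language the text is in.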

Little things that matter in language design

Posted Jun 16, 2013 8:21 UTC (Sun) by khim (subscriber, #9252)

> If one is going to try to facilitate computations by separating each locale to an alphabet, I wish good luck with its newnicode. The real Unicode thankfully does not work that way. Usually, at least. :-D

Well, that's certainly a pity: Unicode was developed to fit in 16 bits and thus merged many scripts (it assumed that language would be tracked "on the side" and/or would be less important than the glyphs themselves). That attempt failed (today there are over 90,000 glyphs in Unicode), yet as a result we cannot properly work with English+Turkish (or even German+Austrian) texts, as you've correctly pointed out.

Today we are stuck: yes, it's not perfect and this decision certainly made life harder, not easier, but it'll be hard to replace it with anything else at this point. It's a similar story to QWERTY: the numerous problems that stem from that old decision are considered minor enough, and it'll be hard to switch. But note that the most popular OS does exactly that for CJK. It is slowly but surely being replaced by Unicode-based OSes (such as Android), so in the end Unicode is probably inevitable, but that does not mean you cannot have interoperability with Turkish people and working upcase/lowercase simultaneously. You can - Unicode prevents that, nothing else.

Little things that matter in language design

Posted Jun 16, 2013 10:33 UTC (Sun) by dvdeug (subscriber, #10998)

Let's note that you want every one of 5,000 different languages to have its own code page; your comment about German+Austrian implies that you want every subdialect to have its own code page. And that's not approaching the question of how you want to deal with sometimes wildly different orthographies for one language.

"this decision certainly made life harder, not easier"

There's no certainly about it. To type "mv Das_Boot_German.avi Boata_filmoj" in your system you'd have to change keyboards several times, from whatever language mv is in, to German, to English, possibly to whatever language you count avi as, then to Esperanto. Right now, you can type that from any keyboard that supports the ISO standard 26-letter alphabet. You can't search a document for Bremen without knowing whether someone considered that a German word or an English word, and e = mc², originally written by a German speaker but understood worldwide, would get an arbitrary language tag. While there are some Cyrillic and Greek look-alikes for Latin-script words, you would explode that; "go" could be encoded any number of ways, and any non-English speaker would have to switch their keyboard to go to lwn.net or google.com or any other English-named site.
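The look-alike problem that already exists between Latin, Cyrillic, and Greek can be shown in a few lines of Python. This is a sketch only: `script_prefixes` is a hypothetical helper that uses the first word of each character's Unicode name as a stand-in for its script, which is a simplification and not a real script-detection API.

```python
import unicodedata


def script_prefixes(word: str) -> list:
    """Return the first word of each character's Unicode name (rough script hint)."""
    return [unicodedata.name(ch).split()[0] for ch in word]


# Three strings that render almost identically as "go".
latin_go = "go"
cyrillic_o_go = "g\u043e"  # second letter is CYRILLIC SMALL LETTER O
greek_o_go = "g\u03bf"     # second letter is GREEK SMALL LETTER OMICRON

if __name__ == "__main__":
    for candidate in (latin_go, cyrillic_o_go, greek_o_go):
        print(candidate, candidate == latin_go, script_prefixes(candidate))
```

A plain string search for the Latin "go" misses the other two spellings; giving every language its own copy of the alphabet would multiply cases like this.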

"note that the most popular OS does exactly that for CJK."

Note that the article you link to does not say Tron is the most popular OS, and that it does not do exactly that for CJK, because Chinese is not one language; it's a rather messy collection of languages. Tron forces Cantonese to be written in the same script as Mandarin and Jinyu. Note also that Tron treats Turkish exactly the same way Unicode does, as it's a copy of Unicode everywhere but the Han characters.

"You can - Unicode prevents that, nothing else."

If by Unicode, you mean every character set ever used for Turkish (including Tron). I've never seen a fully worked out draft of a character set that fits your specifications. That's never really impressive, is it, when someone is claiming that something would be clearly easier yet it's never been tried.

Little things that matter in language design

Posted Jun 16, 2013 6:35 UTC (Sun) by micka (subscriber, #38720)

> even though the English alphabet is a superset of the French and Latin

I suppose you mean "subset"? As in, the English alphabet (without é, è, à, ...) is strictly included in the French alphabet and in the Latin alphabet (I see no difference)?

Little things that matter in language design

Posted Jun 16, 2013 9:17 UTC (Sun) by dvdeug (subscriber, #10998)

English uses a lot of diacritics on characters if you look hard enough. Façade and résumé are completely standard spellings; coöperate is still used, by the New Yorker, for example. I don't know that there are any words where ÿ is used, so superset might be too strong, but it's certainly not a subset.

(If we're strictly speaking of the alphabet, neither of them count accents, so both French and English have the same 26 letters for the alphabet.)

Little things that matter in language design

Posted Jun 16, 2013 10:49 UTC (Sun) by micka (subscriber, #38720)

Depending on your sources, the French alphabet has either 26 letters, taking out diacritics, or 42 letters, counting diacritics and ligatures (œ and æ) separately (I suppose, the same as ß). Even the French and English versions of "French alphabet" on Wikipedia give different counts (an error, or a cultural difference in specialist vocabulary? I know, for example, that "ring" in mathematics has related but different definitions for French and American mathematicians).

The Spanish alphabet is more consistently considered as having 27 letters, even though ñ could be considered an n with a diacritic. And in the past, even some combinations of letters (from the point of view of the Latin alphabet) were considered separate letters.

And I'm not even talking about http://en.wikipedia.org/wiki/Alphabet_%28computer_science%29 (each diacritic variant would be considered a different letter).
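The different letter counts above can be reproduced with Unicode normalization. The following Python sketch is an illustration under stated assumptions: the list of French accented letters and ligatures in `FRENCH_EXTRAS` is my own enumeration chosen to reach the 42 figure, and the helper names are hypothetical. Counting code points corresponds to the computer-science sense of "alphabet", while stripping combining marks after NFD decomposition approximates "taking out diacritics".

```python
import string
import unicodedata

# Assumed list of French accented letters and ligatures (14 + 2 symbols).
FRENCH_EXTRAS = unicodedata.normalize("NFC", "àâçéèêëîïôùûüÿœæ")


def distinct_symbols(letters: str) -> set:
    """Every precomposed character counts as its own symbol."""
    return set(unicodedata.normalize("NFC", letters))


def base_letters(letters: str) -> set:
    """Decompose, then drop combining marks so that é counts as e."""
    decomposed = unicodedata.normalize("NFD", letters)
    return {ch for ch in decomposed if not unicodedata.combining(ch)}


french = string.ascii_lowercase + FRENCH_EXTRAS
spanish = string.ascii_lowercase + "\u00f1"  # ñ

if __name__ == "__main__":
    print(len(distinct_symbols(french)), len(base_letters(french)))
    print(len(distinct_symbols(spanish)), len(base_letters(spanish)))
```

Note that œ and æ have no canonical decomposition, so they survive the diacritic stripping and the French base set comes out larger than 26 unless ligatures are handled separately.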