Little things that matter in language design

Posted Jun 12, 2013 19:01 UTC (Wed) by dvdeug (subscriber, #10998)
In reply to: Little things that matter in language design by khim
Parent article: Little things that matter in language design

I'm explaining that it's not a real option for anyone that doesn't control their own universe. It's not Unicode; it's every Turkish character set ever. It's what Turkish keyboards give you. Nobody ever picks that pair because it's not a real option.

Little things that matter in language design

Posted Jun 14, 2013 21:40 UTC (Fri) by khim (subscriber, #9252) [Link] (12 responses)

Yet this is what was used to solve the problem for Russian. Early computers in the USSR only had Russian letters, which were different from the Latin ones. And they, too, had this upcase problem (upcase for Russian "у" was "У" and for Latin "y" was "Y"). It's not clear why Turks cannot adopt the same solution. Well, "for historical reasons" probably, but that's still a "Unicode" choice.

Little things that matter in language design

Posted Jun 14, 2013 23:33 UTC (Fri) by dvdeug (subscriber, #10998) [Link] (11 responses)

I don't think you can call choosing any currently existing Turkish character set a "Unicode" choice. If we're going to dismiss history and how Turks currently use their computers, we could go further and change their whole writing system.

Russian is written in the Cyrillic alphabet, unlike Turkish which is written in the Latin alphabet. It's not written in the Latin alphabet by accident; it was changed from the Arabic alphabet in 1927 in an attempt to modernize the country and attach themselves politically and culturally to the successful West. Separating the Turkish alphabet from the Latin is not a neutral act, particularly when you don't do the same to the French or Romanian.

Little things that matter in language design

Posted Jun 15, 2013 14:19 UTC (Sat) by khim (subscriber, #9252) [Link] (10 responses)

> Separating the Turkish alphabet from the Latin is not a neutral act, particularly when you don't do the same to the French or Romanian.

Sure. But this is what Unicode is all about. Unicode didn't happen in one step. Early character encodings were... strange (from today's POV). Not just Russian computers, US-based computers, too (think EBCDIC and all these strange symbols used by APL). Eventually some groups of symbols were put together and some other symbols were separated. Not just Cyrillic, but Greek (a charset which is as closely related to Cyrillic as Turkish is related to Romanian), etc. Why are Telugu and Kannada separated but Chinese and Japanese Han characters merged? If we want to make upcase/lowercase functions locale-independent, we can do with Turkish (French, Romanian, etc.) what was done with Telugu and Kannada.

Little things that matter in language design

Posted Jun 15, 2013 14:52 UTC (Sat) by mpr22 (subscriber, #60784) [Link]

The relationship between the Turkish variant of the Latin alphabet and some other random European variant of the Latin alphabet more closely resembles the relationship between the Serbian and Russian variants of the Cyrillic alphabet than the relationship between the Cyrillic alphabet and the Greek alphabet.

Little things that matter in language design

Posted Jun 15, 2013 22:52 UTC (Sat) by dvdeug (subscriber, #10998) [Link] (6 responses)

"Unicode didn't happen in one step" is blaming Unicode for the entire history of computing.

If you don't care whether the Turks are going to use your character set, go ahead and tell them to use ASCII. If you choose to separate their alphabet from the Latin, you're going to have a problem: they consider their alphabet part of the extended Latin alphabet, and they're not going to find that an acceptable solution. If you choose to separate out the alphabets of thousands of languages (even though the English alphabet is a superset of the French and Latin), you might mollify the Turks, but nobody is going to use your character set.

In reality, Turkish support requires locale-sensitive casing functions because every other solution has serious technical and often political problems, as well as not being compatible with existing systems, including keyboards.
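To make the casing problem concrete, here is a minimal Python sketch. It assumes Python 3's built-in `str.upper()`/`str.lower()`, which apply the default, locale-independent Unicode mappings; `turkish_upper` and `turkish_lower` are hypothetical helpers written for this illustration, not part of any library.

```python
# Default (locale-independent) casing gets the Turkish dotted/dotless pair wrong.
default_upper = "istanbul".upper()      # plain "I", not the dotted capital U+0130
default_lower = "\u0130".lower()        # "i" followed by U+0307 COMBINING DOT ABOVE


def turkish_upper(text: str) -> str:
    """Hypothetical helper: uppercase using the Turkish i/I rules."""
    # Remap the two lowercase i's first, then let the default rules do the rest.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase using the Turkish i/I rules."""
    # Remap the two uppercase I's first, then let the default rules do the rest.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


if __name__ == "__main__":
    print(default_upper)
    print(turkish_upper("istanbul"))
    print(turkish_lower("ISPARTA"))
```

The caller has to know the text is Turkish before choosing which function to apply, which is exactly the locale sensitivity under discussion.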

Little things that matter in language design

Posted Jun 16, 2013 3:30 UTC (Sun) by hummassa (guest, #307) [Link] (2 responses)

...
> In reality, Turkish support requires locale-sensitive casing functions
...

Let's be plain: there are no "casing functions" that are not locale-sensitive. The Turkish dotted "i"s are one example, the German vs. Austrian "ß" is another, etc. And don't get me started on collation order. If one is going to try to facilitate computations by separating each locale to an alphabet, I wish good luck with its newnicode. The real Unicode thankfully does not work that way. Usually, at least. :-D
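As a small illustration of the "ß" case, the following Python sketch uses only the built-in string methods. It does not show any German/Austrian difference; it only shows that the default Unicode casing of "ß" is not a one-to-one, reversible character mapping.

```python
word = "Stra\u00dfe"            # "Straße"

upper = word.upper()            # "ß" expands to "SS" under the default rules
round_trip = upper.lower()      # the original "ß" cannot be recovered

# casefold() is the built-in meant for caseless comparison; it also maps "ß" to "ss".
folded = word.casefold()

if __name__ == "__main__":
    print(upper, round_trip, folded)
```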

Little things that matter in language design

Posted Jun 16, 2013 8:21 UTC (Sun) by khim (subscriber, #9252) [Link] (1 responses)

> If one is going to try to facilitate computations by separating each locale to an alphabet, I wish good luck with its newnicode. The real Unicode thankfully does not work that way. Usually, at least. :-D

Well, that's certainly a pity: Unicode was developed to fit in 16 bits and thus merged many scripts (it assumed language would be separated "on the side" and/or would be less important than the glyphs themselves). They have failed (today there are over 90,000 glyphs in Unicode), yet as a result we cannot properly work with English+Turkish (or even German+Austrian) texts, as you've correctly pointed out.

Today we are stuck: yes, it's not perfect and this decision certainly made life harder, not easier, but it'll be hard to replace it with anything else at this point. Similar story to QWERTY. The numerous problems which stem from that old decision are considered minor enough, and it'll be hard to switch. But note that the most popular OS does exactly that for CJK. It's slowly but surely being replaced by Unicode-based OSes (such as Android), thus in the end Unicode is probably inevitable, but that does not mean that you cannot achieve interoperability with Turkish people and working upcase/lowercase simultaneously. You can - Unicode prevents that, nothing else.

Little things that matter in language design

Posted Jun 16, 2013 10:33 UTC (Sun) by dvdeug (subscriber, #10998) [Link]

Let's note that you want every one of 5,000 different languages to have its own code page; your comment about German+Austrian implies that you want every subdialect to have its own code page. And that's not approaching the question of how you want to deal with sometimes wildly different orthographies for one language.

"this decision certainly made life harder, not easier"

There's no certainly about it. To type "mv Das_Boot_German.avi Boata_filmoj" in your system you'd have to change keyboards several times, from whatever language mv is in, to German, to English, possibly to whatever language you count avi as, then to Esperanto. Right now, you can type that from any keyboard that supports the ISO standard 26-letter alphabet. You can't search a document for Bremen without knowing whether someone considered that a German word or an English word, and e = mc², originally written by a German speaker but understood worldwide, would get an arbitrary language tag. While there are some Cyrillic and Greek look-alikes for Latin-script words, you would explode that; "go" could be encoded any number of ways, and any non-English speaker would have to switch their keyboard to go to lwn.net or google.com or any other English-named site.
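The look-alike problem that already exists across scripts can be sketched in Python with the standard `unicodedata` module. The strings and the `describe` helper below are made up for this illustration.

```python
import unicodedata

latin_go = "go"            # Latin g + Latin o
cyrillic_go = "g\u043e"    # Latin g + Cyrillic small letter o
greek_go = "g\u03bf"       # Latin g + Greek small letter omicron


def describe(text: str) -> list[str]:
    """Return the Unicode character name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]


if __name__ == "__main__":
    for candidate in (latin_go, cyrillic_go, greek_go):
        # The strings usually render alike but never compare equal.
        print(candidate, candidate == latin_go, describe(candidate))
```

With one code page per language, every Latin-script word would be in this position, not just the handful of cross-script look-alikes.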

"note that the most popular OS does exactly that for CJK."

Note that the article you link to does not say Tron is the most popular OS, and that it does not do exactly that for CJK, because Chinese is not one language; it's a rather messy collection of languages. Tron forces Cantonese to be written in the same script as Mandarin and Jinyu. Note also that Tron treats Turkish exactly the same way Unicode does, as it's a copy of Unicode everywhere but the Han characters.

"You can - Unicode prevents that, nothing else."

If by Unicode, you mean every character set ever used for Turkish (including Tron). I've never seen a fully worked out draft of a character set that fits your specifications. That's never really impressive, is it, when someone is claiming that something would be clearly easier yet it's never been tried.

Little things that matter in language design

Posted Jun 16, 2013 6:35 UTC (Sun) by micka (subscriber, #38720) [Link] (2 responses)

> even though the English alphabet is a superset of the French and Latin

I suppose you mean "subset"? As in, the English alphabet is strictly included in the French one (it lacks é, è, à, ...) and in the Latin alphabet (I see no difference)?

Little things that matter in language design

Posted Jun 16, 2013 9:17 UTC (Sun) by dvdeug (subscriber, #10998) [Link] (1 responses)

English uses a lot of diacritics on characters if you look hard enough. Façade and résumé are completely standard spellings; coöperate is still used, by the New Yorker, for example. I don't know that there are any words where ÿ is used, so superset might be too strong, but it's certainly not a subset.

(If we're strictly speaking of the alphabet, neither of them count accents, so both French and English have the same 26 letters for the alphabet.)
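One way to see the "same 26 letters plus diacritics" view in Unicode terms is canonical decomposition. The Python sketch below uses the standard `unicodedata` module; `base_letters` is a hypothetical helper for this illustration, and it only handles characters that have a canonical decomposition (ligatures such as œ and æ do not).

```python
import unicodedata


def base_letters(word: str) -> str:
    """Hypothetical helper: drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    # combining() returns a non-zero class for combining marks such as accents.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for word in ("fa\u00e7ade", "r\u00e9sum\u00e9", "co\u00f6perate"):
        print(word, "->", base_letters(word))
```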

Little things that matter in language design

Posted Jun 16, 2013 10:49 UTC (Sun) by micka (subscriber, #38720) [Link]

Depending on your sources, the French alphabet is either 26 letters, taking out diacritics, or 42 letters, counting diacritics and ligatures (œ and æ) separately (I suppose, the same as ß). Even the French and English versions of "French alphabet" on Wikipedia have different counts (error, or cultural difference in specialist language? I know, for example, that "ring" in mathematics has related but different definitions for French and American mathematicians).

The Spanish alphabet is more consistently considered as having 27 letters, even though ñ could be considered an n with a diacritic. And in the past, even some combinations of letters (from the point of view of the Latin alphabet) were considered separate letters.

And I'm not even talking about http://en.wikipedia.org/wiki/Alphabet_%28computer_science%29 (each diacritic variant would be considered a different letter).

Little things that matter in language design

Posted Jun 16, 2013 5:36 UTC (Sun) by viro (subscriber, #7872) [Link] (1 responses)

You can easily have a text in English with quoted sentences in French or in Turkish, using the same font. Try the same with e.g. Russian and Greek and see if you will be able to read the result[1]. Turkish and French alphabets are Latin with some diacritics added; current Cyrillic is much more distant from Greek than that, as you bloody well know.

[1] lowercase glyphs aside, (И, Н) and (Η, Ν) alone are enough to render the result unreadable (shift circa 16th century, IIRC; at some point both Eta and Nu counterparts got the slant of the middle strokes changed in the same way, turning 'Ν' into 'Н' and 'Η' into 'И')

Little things that matter in language design

Posted Jun 16, 2013 7:58 UTC (Sun) by khim (subscriber, #9252) [Link]

> You can easily have a text in English with quoted sentences in French or in Turkish, using the same font. Try the same with e.g. Russian and Greek and see if you will be able to read the result[1].

Of course you could. What's the problem? You'll probably be forced to read the Greek letter by letter, but an English-speaking person will mangle French or Turkish, too. It's not as if the mere resemblance of the letters is what matters in this case: English and French may use similar-looking characters, but they use them to encode radically different consonants, vowels and words.

> [1] lowercase glyphs aside, (И, Н) and (Η, Ν) alone are enough to render the result unreadable (shift circa 16th century, IIRC; at some point both Eta and Nu counterparts got the slant of the middle strokes changed in the same way, turning 'Ν' into 'Н' and 'Η' into 'И')

If you don't know which language is used, you cannot read your word, period. Identical-looking words in French and Turkish will have radically different pronunciations and will be, in fact, different words.