Little things that matter in language design
Posted Jun 9, 2013 12:15 UTC (Sun) by tialaramex (subscriber, #21167)
In reply to: Little things that matter in language design by mgedmin
Parent article: Little things that matter in language design
Unicode is fairly insistent that although it provides two separate ways to "spell" the e-acute in café (for compatibility reasons), these two spellings are equivalent and an equality test between them should pass. For this purpose it provides UAX #15, which specifies four distinct normalisation forms (NFC, NFD, NFKC and NFKD); applying any one of them makes equivalent strings codepoint-identical.
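To make that concrete, here is a minimal Python sketch using the standard library's unicodedata module. The helper name equivalent() is mine, purely for illustration; the two spellings of café compare unequal as raw code point sequences but equal once both sides have been through the same normalisation form.

```python
import unicodedata

# Two spellings of the same word.
precomposed = "caf\u00e9"   # e-acute as the single code point U+00E9
decomposed = "cafe\u0301"   # plain 'e' followed by U+0301 COMBINING ACUTE ACCENT

def equivalent(a, b, form="NFC"):
    """Compare two strings after applying the same UAX #15 normalisation form.

    Hypothetical helper for illustration; 'form' is one of
    "NFC", "NFD", "NFKC" or "NFKD".
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# A naive comparison sees different code point sequences.
print(precomposed == decomposed)

# After normalisation the two spellings are codepoint-identical.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, equivalent(precomposed, decomposed, form))
```

Which form you pick matters less than picking one and applying it consistently to both sides of the comparison.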
If you don't do this normalisation step, you can end up in a confusing situation: the programmer types a symbol in a text editor that happens to emit pre-combined characters, and the toolchain can't match it to a visually and lexicographically identical symbol in another file that happens to have been written with separate combining characters. This would obviously be very frustrating.
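A rough sketch of that failure, and of the obvious fix, in Python. The symbol table here is just a dict, and define() and lookup() are hypothetical helpers of mine rather than any real toolchain's API; the point is only that normalising names on the way in and on the way out makes the two spellings resolve to the same entry.

```python
import unicodedata

def define(table, name, value):
    """Store a symbol under its NFC-normalised name (illustrative helper)."""
    table[unicodedata.normalize("NFC", name)] = value

def lookup(table, name):
    """Look a symbol up by its NFC-normalised name (illustrative helper)."""
    return table[unicodedata.normalize("NFC", name)]

# One file declares the symbol using separate combining characters...
declared = "cafe\u0301"
# ...and another refers to it using the pre-combined character.
referenced = "caf\u00e9"

# Without normalisation the reference does not resolve.
raw_table = {declared: 1}
print(referenced in raw_table)

# With normalisation on both paths it does.
table = {}
define(table, declared, 1)
print(lookup(table, referenced))
```

Python itself takes roughly this approach: identifiers are converted to NFKC while the source is parsed, so the two spellings name the same variable.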
On the other hand, to completely fulfil Unicode's intentions either your language runtime or any binary you compile that does a string comparison needs to embed many kilobytes (perhaps megabytes) of Unicode tables in order to perform the normalisation steps correctly.