The kernel and character set encodings

[Posted February 18, 2004 by corbet]

It all started as a JFS bug report. The JFS filesystem, it seems, gets upset when user space passes it file names encoded in the UTF-8 format. Rather than create or open a file with the name as given, it gives up and returns EINVAL. Patches which fix the problem have been posted, but the resulting discussion has taken rather longer to be resolved.

JFS has an "iocharset" option which can be used to state explicitly, at mount time, which character encoding is being used. There were calls on linux-kernel for this option to be added to other filesystems as well. The idea was rather strongly shot down, however, for a few reasons. One of those is that multiple users could be simultaneously using different character encodings on the same filesystem; a global option for the whole filesystem clearly will not be able to address that case.

The real reason, however, is that performing character set conversion requires the kernel to interpret the file name strings being passed to it from user space. The kernel hackers are very resistant to the imposition of any such policy; it would go against decades of Unix tradition. Officially, the kernel has no policy regarding which character set is being used for file names, content, or anything else. In each case, the kernel sees nothing more than a stream of bytes.

That said, the kernel does have some policies regarding file names: they use "/" as a directory delimiter, and they are terminated by a NUL byte. This policy rules out the use of many encodings which are sometimes employed to represent non-ASCII characters; the fixed-width wide encodings all tend to use lots of bytes containing zero. In reality, the only practical choices for representing characters beyond the ASCII set are iso-8859-1 (which allows the representation of characters used in many continental European languages) and UTF-8, which can encode pretty much anything.
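The two byte-level rules above can be sketched as a small check. This is only an illustration: the function name component_bytes_ok() is made up for this example and is not a kernel interface. It shows why a wide encoding fails (its zero bytes look like terminators) while UTF-8 passes (every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never collide with NUL or "/").

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if the byte string could be stored as
 * a single path component, 0 otherwise. No character encoding is
 * assumed; only the two reserved byte values are examined.
 */
int component_bytes_ok(const unsigned char *buf, size_t len)
{
    size_t i;

    if (len == 0)
        return 0;                   /* an empty component is not a name */
    for (i = 0; i < len; i++) {
        if (buf[i] == 0x00 || buf[i] == '/')
            return 0;               /* terminator or directory delimiter */
    }
    return 1;
}
```

Under these rules the UTF-8 bytes C3 A9 ("é") are acceptable, while the UTF-16LE bytes 41 00 ("A") are not.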

UTF-8 is relatively easy to use; for US users it looks just like ASCII, but it can handle a far wider range of characters while not breaking (most) code which uses traditional C strings. Thus it is often said that UTF-8 is the encoding used by the Linux kernel. That statement is a mistake, however: Linux does not use any particular encoding. If user space uses UTF-8 to represent extended characters, everything will work. But nothing forces user space to work in that way.

This approach keeps policy out of the kernel, but some developers are not entirely happy with it. The lack of policy can lead to user-space confusion in a number of ways. For example, if a user creates a file called WéîrdÑàmë, that name could be represented in the filesystem in more than one way. Depending on how user space is configured, it could choose either iso-8859-1 or UTF-8; the encoding of that name will be quite different depending on that choice. A different user space could interpret the file name differently in the future, resulting in unreadable filenames and confused users. The kernel, lacking a character encoding policy of its own, will do nothing to help prevent this situation.
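To make the difference concrete, here is a minimal sketch of an iso-8859-1 to UTF-8 conversion; latin1_to_utf8() is a name invented for this example. It relies on the fact that each iso-8859-1 byte has the same numeric value as its Unicode code point, so bytes below 0x80 are copied and all others expand to two bytes.

```c
#include <stddef.h>

/*
 * Hypothetical helper: converts ISO-8859-1 bytes to UTF-8.
 * Returns the number of bytes written to out, or 0 if out is too
 * small (an empty input also yields 0).
 */
size_t latin1_to_utf8(const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t out_cap)
{
    size_t i, o = 0;

    for (i = 0; i < in_len; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {
            /* ASCII: identical in both encodings */
            if (o + 1 > out_cap)
                return 0;
            out[o++] = c;
        } else {
            /* U+0080..U+00FF: two-byte UTF-8 sequence */
            if (o + 2 > out_cap)
                return 0;
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

The name WéîrdÑàmë occupies 9 bytes in iso-8859-1 and 14 bytes in UTF-8; the kernel would treat the two byte strings as two unrelated file names.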

Confusion over character sets can also facilitate the creation of security holes; code which attempts to clean up file names can fail if evil characters are given in an unexpected encoding. Code which expects UTF-8 must also be careful when dealing with the Linux kernel because the kernel itself makes no effort to ensure that any string is, in fact, a legal UTF-8 encoding.
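Since the kernel will not do it, user-space code that expects UTF-8 has to validate names itself. The following is a rough sketch of such a check, following the rules in RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF). The function name utf8_is_valid() is an assumption of this example, not an existing library call.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8. */
int utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp, min;
        size_t need, k;

        if (c < 0x80) {                     /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {    /* lead byte of 2-byte form */
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {    /* lead byte of 3-byte form */
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {    /* lead byte of 4-byte form */
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation byte or invalid lead */
        }
        if (len - i - 1 < need)
            return 0;                       /* truncated sequence */
        for (k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];

            if ((cc & 0xC0) != 0x80)
                return 0;                   /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min)
            return 0;                       /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogate code point */
        if (cp > 0x10FFFF)
            return 0;                       /* beyond the Unicode range */
        i += need + 1;
    }
    return 1;
}
```

With this check, the overlong pair C0 AF (a disguised "/") and a lone iso-8859-1 byte such as E9 are both rejected.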

To complicate the situation even more, Andrew Tridgell posted another reason why, he thinks, the kernel will have to adopt a specific character encoding: case insensitivity. Says Tridge:

The reason is that I think that eventually the Linux kernel will need to efficiently support a userspace policy of case-insensitivity and the only way to do case-insensitive filename operations is to interpret those byte streams as a particular encoding.

Needless to say, the idea of implementing case-insensitive filesystem operations in the kernel was not particularly popular. Not too many kernel hackers want to complicate the filesystem code to implement what they see as being a broken Windows feature to begin with. There are other difficulties as well: case-insensitive matching must be done differently in different languages. The end result is that case-insensitive lookups are not very likely to make it into the kernel anytime soon.

Linus is not averse to trying to help out Samba and other applications which wish to implement case-insensitive behavior, however. He has proposed a new "magic_open()" interface which would make it easier for user space to perform case-insensitive lookups without actually doing that work in the kernel. This interface would likely require quite a bit of work before it would do what the Samba developers need, but something derived from it could just make an appearance in the 2.7 development series.

Meanwhile, the kernel does not seem likely to adopt any sort of official encoding anytime soon. The problems that result from the lack of an encoding policy are mostly seen as user space issues. Proper locale support is still relatively new in Linux, and many rough edges remain. Given the high level of interest in high-quality localization support in Linux, however, one might expect those edges to be smoothed down quickly.

(For those who would like to learn more about UTF-8, see this FAQ or RFC 3629).



How would a case-insensitive magic_open() call work?

Posted Feb 19, 2004 9:11 UTC (Thu) by brouhaha (subscriber, #1698) [Link] (1 responses)

Suppose I have two files named "Foobar" and "foobaR" in a particular directory. The user (possibly Samba) calls magic_open("foobar", ...). What can be expected to happen?

I think this proposed magic_open() call is almost as bad an idea as providing an option to allow normal open()s (or the filesystem code) to be case insensitive. The few applications that really need this sort of behavior should implement it in user space by reading the directory, and they can worry about how to handle ambiguous cases there.
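A user-space lookup of that kind might look roughly like the sketch below. To keep it self-contained it works on an array of names; a real program would fill that array from readdir(). The names ci_equal() and ci_lookup() are invented for this example, and the folding is ASCII-only, which sidesteps rather than solves the language-specific matching problem.

```c
#include <stddef.h>

/* Fold ASCII upper case to lower case; all other bytes are unchanged. */
static unsigned char fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Hypothetical helper: ASCII-only case-insensitive string equality. */
static int ci_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (fold((unsigned char)*a) != fold((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;    /* equal only if both strings ended together */
}

/*
 * Hypothetical helper: scans names[0..n) for entries that match want.
 * Returns the number of matches and stores the index of the first one
 * in *first. A result above 1 is the ambiguous case, which the caller
 * must resolve by its own policy.
 */
size_t ci_lookup(const char *const *names, size_t n, const char *want,
                 size_t *first)
{
    size_t i, matches = 0;

    for (i = 0; i < n; i++) {
        if (ci_equal(names[i], want)) {
            if (matches == 0)
                *first = i;
            matches++;
        }
    }
    return matches;
}
```

For a directory holding "Foobar" and "foobaR", a lookup of "foobar" reports two matches, so the ambiguity is visible to the application instead of being settled silently.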

How would a case-insensitive magic_open() call work?

Posted Feb 19, 2004 19:42 UTC (Thu) by chad.netzer (subscriber, #4257) [Link]

You get an arbitrary file. Tridge suggests that (so far) he hasn't gotten complaints about this kind of behavior (which already exists in Samba), and there are few good alternatives. One possible alternative, to try to keep track of which files are created by Posix systems, and which are created by Windows systems, and preferentially decide between the two, seems like too much work if no one really cares.

The whole case insensitivity issue of Windows is (apparently) a mess, and there appears to be no perfect policy about what to do when interoperating, other than try to do the thing which makes most practical sense.

The kernel and character set encodings

Posted Feb 19, 2004 9:24 UTC (Thu) by Cato (guest, #7643) [Link] (6 responses)

You say that the only practical choices for character encodings are ISO-8859-1 and UTF-8. In fact, there is a vast range of encodings that will work (basically any encoding that doesn't use NUL and '/' for some other purpose than ASCII semantics). For a start there is ISO-8859-*, KOI8-* (for Cyrillic), EUC-JP, Shift-JIS (both popular in Japan), and so on.

Getting the character encoding right is difficult, and with UTF-8 there is an additional complication, Unicode normalisation - the issue here is that in certain languages, you might have a symbol on the page being encoded as 3 Unicode characters: the letter with accent 1 then accent 2 in one string, and the letter with accent 2 then accent 1 in another string. These strings result in exactly the same visual appearance on screen, yet they can't be compared with a byte comparison. Unicode normalisation defines a specific order for all such 'combining character' strings, but unfortunately there is more than one normalisation form: Linux and the W3C use NFC, while Darwin and MacOS X use NFD, even on UFS filesystems.
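As a small illustration of the ordering problem, the two byte strings below both spell "e" with a circumflex (U+0302) and a dot below (U+0323), with the two combining marks in opposite order. The array and function names are made up for this sketch.

```c
#include <stddef.h>
#include <string.h>

/* "e" + U+0302 COMBINING CIRCUMFLEX ACCENT + U+0323 COMBINING DOT BELOW */
const unsigned char name_a[] = { 0x65, 0xCC, 0x82, 0xCC, 0xA3 };

/* "e" + U+0323 COMBINING DOT BELOW + U+0302 COMBINING CIRCUMFLEX ACCENT */
const unsigned char name_b[] = { 0x65, 0xCC, 0xA3, 0xCC, 0x82 };

/* Hypothetical helper: the only comparison a byte-oriented layer can do. */
int same_name_bytes(const unsigned char *a, size_t alen,
                    const unsigned char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}
```

Both strings have the same length and would normally render identically, yet the byte comparison says they are different names; only a normalisation step applied before the comparison would make them equal.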

Unicode makes life more complicated for everyone and it's likely some of this needs to be in the kernel, or at least glibc, for uniformity. For more links on Unicode, from a Perl/Wiki oriented perspective, see the plan for TWiki support of UTF-8 and this Unicode normalisation page.

The kernel and character set encodings

Posted Feb 19, 2004 9:54 UTC (Thu) by one2team (guest, #7316) [Link] (1 responses)

« You say that the only practical choices for character encodings are ISO-8859-1 and UTF-8. In fact, there is a vast range of encodings that will work (basically any encoding that doesn't use NUL and '/' for some other purpose than ASCII semantics). For a start there is ISO-8859-*, KOI8-* (for Cyrillic), EUC-JP, Shift-JIS (both popular in Japan), and so on. »

These encodings are mostly useless in a true multi-user system. Why? Because they are all incompatible. So there is no way for a user that uses encoding A to read stuff (including filenames) made by another user using encoding B. And this is true even for close stuff (KOI8-U and KOI8-R for example). Not to speak of the poor users that may want to quote another language (French + Russian, Welsh + Greek etc).

The only thing all those encodings are compatible with is English, which restricts the second language to English and English only.

One could argue userspace would just have to use Greek encoding for Greek filenames, Russian for Russian ones and so on. But the crux of the problem is that userspace has no way to request or guess what encoding was used to write a filename, since the kernel neither enforces any particular encoding nor provides encoding info to userspace.

One additional problem is that some byte strings can result in invalid UTF-8 and cause applications to barf if they try to decode them.

The kernel and character set encodings

Posted Feb 19, 2004 13:12 UTC (Thu) by Cato (guest, #7643) [Link]

These encodings are fine where the users agree on a single character set (e.g. KOI8-R in Russia) or where there is some external data (e.g. the directory name or file name including 'koi8-r') describing the character set of the file. I am very aware that there may be conversion problems, which is why Unicode is important, but not everyone is going to move to Unicode straight away - there are still gaps in the user level tools available, though they are improving.

What might be useful is to document the legacy non-Unicode character sets that are incompatible with ASCII and in particular *nix filesystems - so far, I believe that HZ-*, ISO-2022-* and Big5 are all incompatible, but it would be good to see a definitive list. Then at least Linux users would know which character sets to avoid for filenames.

The issue of invalid UTF-8 strings is no different to any other mis-encoded characters - it would be good if glibc or perhaps the kernel checked UTF-8 for overlong characters, as this is a well known security hole and it's not hard to do this.

The kernel and character set encodings

Posted Feb 19, 2004 11:18 UTC (Thu) by ibukanov (subscriber, #3942) [Link]

> These strings result in exactly the same visual appearance on screen, yet they can't be compared with a byte comparison.

You do not need even Unicode normalization for that. In most fonts the following two lines would have exactly the same visual presentation (you have to view the page with UTF-8 encoding as LWN does not allow to enter РОТ in HTML comments due to bugs in recognition of &code; escapes):
POT
РОТ
yet the first uses pure ASCII and the second uses only Cyrillic characters and means mouth in Russian.

IMHO such examples support the notion that the kernel should not impose any policy on file name encoding, as in practice there is always more than one way to encode the same visual presentation, and UTF-8 with Unicode does not help here.

The kernel and character set encodings

Posted Feb 19, 2004 12:14 UTC (Thu) by mwh (guest, #582) [Link]

> Unicode makes life more complicated for everyone

  If Unicode is a horde of zombies with flaming dung sticks, 
  the hideous intricacies of JIS, Chinese Big-5, Chinese 
  Traditional, KOI-8, et cetera are at least an army of ogres 
  with salt and flensing knives.        -- Eric S. Raymond, python-dev

Unicode isn't that hard to deal with, although I'd admit to not having any intuition for what the right answer is in this situation.

The kernel and character set encodings

Posted Feb 20, 2004 22:19 UTC (Fri) by spitzak (guest, #4593) [Link] (1 responses)

There is no problem with UTF-8 filenames. The bytes should be stored
unchanged, and unchanged bytes should be used to look up the file. It
does not matter if those bytes are a legal UTF-8 string or not, to say
nothing of what normalization form they are.

Unfortunately there are hordes of people out there who think dumb ideas
like case-insensitivity should be applied at low levels to stuff that
really is binary data. This kind of thinking is what causes complexity,
and complexity causes bugs and security holes.

Any program that takes a string it thinks is UTF-8 and does
*ANYTHING* other than pass the exact bytes unchanged to another
interface that wants UTF-8 is by definition broken. This simple rule will
completely eliminate all ambiguity about UTF-8.

The kernel and character set encodings

Posted Feb 21, 2004 7:49 UTC (Sat) by Cato (guest, #7643) [Link]

This problem needs to be addressed somewhere, though not necessarily in the kernel (perhaps in glibc or the GUI layer): two users create identical looking filenames using Vietnamese accented characters (letter + 2 accents in different order, 3 Unicode characters altogether). Then, there are two identical-looking filenames and you don't know how to type the 'right' one. Even if there is only one file involved, without Unicode normalisation you wouldn't be able to use bash filename completion, since you might type the accents in a different order to that used in the filename, though there would be no visual clue as to your mistake.

Given these issues, which affect command line tools as much as GUIs, it may be sensible to put NFC normalisation in glibc or the kernel, despite the complexity. Files created from another system on a Linux NFS filesystem would of course bypass glibc, so the alternatives are batch renormalisation (always an option, convmv may do this) or putting NFC in the kernel.

It's not good enough to say 'case-insensitivity should not be in the kernel' - you need to address these use cases and say how and where you would solve them.

Unicode bugs

Posted Feb 19, 2004 11:06 UTC (Thu) by simonl (guest, #13603) [Link] (3 responses)

Unicode in kernel means security holes.

Back in 2001 afair Unicode bugs were found in MS IIS. There are many ways to encode a ../../ path in Unicode, and IIS did not know all of them. However the kernel did, and thus circumvented any path sanitizing IIS did.

Linux should not repeat these mistakes.

We would have to fix every little script that deals with user-defined file names, which is impossible. Input validation is hard enough already.

Unicode bugs

Posted Feb 19, 2004 13:19 UTC (Thu) by Cato (guest, #7643) [Link]

Any new functionality can mean security holes, and this applies whether Unicode is implemented in libraries or the kernel. It's important to address Unicode's potential for such holes (overlong UTF-8 encodings etc), but mostly this is just good practice - e.g. you 'filter in' the characters you know are legal, rather than trying to 'filter out' characters that are illegal (it's very easy to miss just one).
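A 'filter in' check can be very short. The sketch below is only an example of the approach: the function name name_is_allowed(), the permitted character set, and the 255-byte limit are all assumptions chosen for illustration, and a real application would pick its own.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 only if name consists entirely of
 * characters from a known-good set. Anything not listed, including
 * every non-ASCII byte, is rejected without being interpreted.
 */
int name_is_allowed(const char *name)
{
    static const char allowed[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789._-";
    size_t len = strlen(name);

    if (len == 0 || len > 255)
        return 0;                   /* assumed length limit */
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return 0;                   /* directory self and parent entries */
    return strspn(name, allowed) == len;
}
```

Because the test is positive, an overlong or otherwise odd encoding of "/" never needs to be recognised: its bytes are simply not in the permitted set.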

I'm not sure Unicode needs to live in the kernel as long as there is good library support, but it's better for library or kernel maintainers to solve these problems once rather than have different buggy implementations in every application.

The specific IIS issues were related to Microsoft's non-standard %uNNNN encoding of 16-bit UCS-2 (Unicode) characters, so I don't think this is a reason to abandon Unicode.

Unicode cannot be secure---B. Schneier

Posted Feb 19, 2004 15:28 UTC (Thu) by Max.Hyre (subscriber, #1054) [Link]

Well, that should get their attention. :-) The exact wording was ``Unicode is just too complex to ever be secure.''

In his July 2000 Crypto-gram article on Unicode, Schneier points up the failures we've had dealing with ASCII control characters, escape sequences, different semantics at different levels of the application (think writing a bash command to grep for a particular grep regular expression), and concludes that with Unicode it's not merely hard, it's effectively impossible.

I don't know enough about Unicode to argue the details, but it certainly made me sit up and take notice.

Unicode bugs

Posted Feb 20, 2004 22:23 UTC (Fri) by spitzak (guest, #4593) [Link]

Avoiding those bugs is one of the primary reasons why UTF-8 is a good
idea.

"../" in a UTF-8 filename means the *BYTES* for '.', '.', and '/' appear
next to each other. It is entirely irrelevant if the UTF-8 string is
legal or if it contains a byte sequence that some broken software by
Microsoft will turn into a slash.

I don't know how many times this has to be stated. But if your program is
looking at a UTF-8 string and is doing anything other than drawing the
characters on the screen, YOU DO NOT NEED TO DECODE IT! Just look at the
bytes!
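A byte-level check for ".." components, in the spirit of the above, could be sketched as follows. The function name has_dotdot() is invented for this example; nothing in it decodes UTF-8.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if any '/'-separated component of
 * path[0..len) is exactly the two bytes "..", 0 otherwise.
 */
int has_dotdot(const unsigned char *path, size_t len)
{
    size_t start = 0, i;

    for (i = 0; i <= len; i++) {
        /* a component ends at a '/' byte or at the end of the string */
        if (i == len || path[i] == '/') {
            if (i - start == 2 &&
                path[start] == '.' && path[start + 1] == '.')
                return 1;
            start = i + 1;
        }
    }
    return 0;
}
```

The bytes 2E 2E C0 AF 78 contain no 0x2F byte, so this check sees one five-byte component rather than a parent-directory reference. That is only safe as long as no later layer decodes C0 AF into a slash.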

The kernel and character set encodings

Posted Feb 19, 2004 13:16 UTC (Thu) by danscox (subscriber, #4125) [Link] (2 responses)

It seems to me like this would be a perfect place for either FUSE, or a "settable" policy mechanism within the kernel. Even that can get hairy, of course, for many and varied reasons, but it would leave policy in userland, where it should be. This could possibly start up a whole set of 'cottage industries'; modules to support this or that file naming convention. I'm thinking of Firefox and its extensions, for example.

Danny

Method for (mostly) kernel-independent Unicode filenames?

Posted Feb 19, 2004 16:28 UTC (Thu) by Max.Hyre (subscriber, #1054) [Link] (1 responses)

[Strawman proposal---please point me toward discussions where it's all been hashed out, shot down, &c. Or, just flame direct.]

How about changing filename semantics (and, of course, every filesystem known to Linux): make filenames a three-element struct: a fixed-length specification of the name's character-set encoding, a fixed-length count of the bytes in the name, and a variable-length string holding said name:

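    struct filename {
        enum encoding enc;      /* character-set encoding of the name */
        size_t cb;              /* count of bytes in the name */
        unsigned char *rgb;     /* the name's bytes; not NUL-terminated */
    };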

Now, the kernel doesn't give a fig what the encoding is, or what it might mean---it's all bytes, with no chance (hah!) of filename buffer overflows and their attendant dangers to root. The libraries just use the struct for calls to fopen(), remove(), rename(), & friends, with the caller allowed to specify that

an exact match (on all elements of the struct) is needed for equality comparisons,
a bytewise match on the byte *s, regardless of the encoding, is sufficient, or
its own comparator function (supplied) be run on pairs of the structs.

The kernel code is encoding-agnostic, and the rest of the work (emphatically including sorting) is in userland.
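The three comparison modes listed above might be sketched like this. Everything beyond the struct's three fields is hypothetical: the enum values, the match_mode names, and the filename_equal() function are inventions of this example rather than part of the proposal.

```c
#include <stddef.h>
#include <string.h>

enum encoding { ENC_UNKNOWN, ENC_LATIN1, ENC_UTF8 };   /* assumed values */

struct filename {
    enum encoding enc;          /* character-set encoding of the name */
    size_t cb;                  /* count of bytes in the name */
    const unsigned char *rgb;   /* the name's bytes */
};

enum match_mode { MATCH_EXACT, MATCH_BYTES, MATCH_CUSTOM };

/* Caller-supplied comparator: returns nonzero if the names match. */
typedef int (*filename_cmp)(const struct filename *, const struct filename *);

/* Bytewise comparison that ignores the encoding tag. */
static int bytes_equal(const struct filename *a, const struct filename *b)
{
    return a->cb == b->cb &&
           (a->cb == 0 || memcmp(a->rgb, b->rgb, a->cb) == 0);
}

int filename_equal(const struct filename *a, const struct filename *b,
                   enum match_mode mode, filename_cmp custom)
{
    switch (mode) {
    case MATCH_EXACT:           /* all elements of the struct must agree */
        return a->enc == b->enc && bytes_equal(a, b);
    case MATCH_BYTES:           /* bytes only, regardless of encoding */
        return bytes_equal(a, b);
    case MATCH_CUSTOM:          /* defer to the caller's comparator */
        return custom != NULL && custom(a, b) != 0;
    }
    return 0;
}
```

Note that the first two modes never look inside the bytes, which is what keeps the comparison encoding-agnostic; only the caller's comparator would need to know what an encoding means.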

A few problems

Posted Feb 20, 2004 17:35 UTC (Fri) by Ross (guest, #4065) [Link]

1) What filesystems support per-file character set selection? Which ones
can handle embedded NUL characters? What about maximum filename length
considerations -- you are no longer measuring in characters because the
number of bytes they use depends on the encoding.

2) There are a whole lot of system calls receiving or returning filenames
(the libc routines like fopen() are a different layer): open(),
getdirentries(), readlink(), stat(), lstat(), rename(), unlink(), link(),
mknod(), chown(), chmod(), utime(), mount() etc. (not to mention Unix
domain sockets). These would all have to change. But POSIX defines them
as taking certain parameters and having certain return types. So you
either have to drop Unix compatibility or you have to add duplicate
versions of each one much like Microsoft did when converting to UCS2.

3) What about Unix applications and old Linux apps? They won't even
compile if you change the prototypes. If you don't and make the old
system calls default to UTF8 or something you still have to make them work
with filenames in other encodings.

4) It won't fix the policy problem without involving the kernel anyway.
What about case insensitivity, canonicalizing characters, path delimiters
etc.? You removed the need for the terminating NUL but what about the "/"
character? What about character sets with no slash, or with multiple
slashes? The kernel will need to know what these are and that will depend
on the character set.

The kernel and character set encodings

Posted Feb 20, 2004 2:24 UTC (Fri) by flewellyn (subscriber, #5047) [Link] (2 responses)

Why does the kernel even use "/" and NUL? Seriously, pathnames should be internally coded as structures, not strings. The only parsing of pathname strings should occur in the C library, including syscall wrappers. The kernel should not have any notion, internally, of pathname separators. It's just silly.

Instead, I propose something like this: stick each element of the pathname into an array element, innermost first (that is, the "root" directory would be the LAST element of the array), and use a special token to indicate ROOT. You could have the array live in a struct, with the other struct element being the length of the array, if you like. Something like this:

struct pathname {
    int length;          /* number of entries in elements[] */
    char *elements[];    /* elements[0] is the file name; the last entry is ROOT */
};

This way you could get at the file's name with a simple elements[0], and walk the directory tree from root to the file like this:

for (i = length - 1; i >= 0; i--) { /* blah blah blah whatever */ }

No worrying about parsing out the "/" separators.

The kernel and character set encodings

Posted Feb 20, 2004 17:37 UTC (Fri) by Ross (guest, #4065) [Link] (1 responses)

But you are using C strings to denote the elements which means they are
still NUL terminated. To fix it you need a second array for the path
component lengths. I think you are unlikely to convince any of the kernel
guys this isn't too ugly to live.

The kernel and character set encodings

Posted Feb 20, 2004 22:32 UTC (Fri) by spitzak (guest, #4593) [Link]

I agree that a length is needed, not just for encoding NUL, but to allow
a slash-separated name to be quickly converted to this form, without a
need to malloc and copy a block of memory for each piece.
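A copy-free conversion along those lines could look like the sketch below. The struct component type and the split_path() function are made up for this example; it treats runs of slashes as a single separator and does not record whether the path was absolute, both of which a real design would have to decide.

```c
#include <stddef.h>

/* One path component, described by position and length, not by a copy. */
struct component {
    const char *ptr;    /* points into the caller's path buffer */
    size_t len;         /* number of bytes in this component */
};

/*
 * Hypothetical helper: splits path[0..path_len) at '/' bytes and fills
 * out[0..cap) with the non-empty components, in root-to-leaf order.
 * Returns the number of components, or (size_t)-1 if cap is too small.
 */
size_t split_path(const char *path, size_t path_len,
                  struct component *out, size_t cap)
{
    size_t n = 0, start = 0, i;

    for (i = 0; i <= path_len; i++) {
        if (i == path_len || path[i] == '/') {
            if (i > start) {            /* skip empty components */
                if (n == cap)
                    return (size_t)-1;
                out[n].ptr = path + start;
                out[n].len = i - start;
                n++;
            }
            start = i + 1;
        }
    }
    return n;
}
```

Nothing is allocated or copied; the component array just indexes the original string, and since each entry carries its own length a component could in principle contain any byte value.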

One possibly less-ugly scheme is to use Plan9's "walk" style. You have
"file descriptors" that represent a filename, unopened as yet. These are
created by copying an existing one (a small set, such as one for "/", are
provided when the program starts up, like stdin/out). There is then a call
something like walk(fd,char* name,int length) which moves fd to the
subdirectory in name[0..length-1]. When you finally are at the desired
file you call open(fd,mode). Existing open() calls would be turned into a
bunch of walk calls followed by a new open.

With this, no arrays are passed to the kernel, and it does not have to
store these arrays.

The kernel and character set encodings

Posted Feb 20, 2004 22:37 UTC (Fri) by spitzak (guest, #4593) [Link] (3 responses)

Could somebody explain why the case-insensitivity is so important, even
for Samba? It seems to me there cannot be too many Windows programs that
take a filename provided to it by the system and change the case before
using it. My tests show that when you double-click files in Explorer or
from the file chooser or any other way I found to select the files,
Windows gave my program the filename with the exact same case as it was
reported in the file listing.

Yes users can type in the wrong case into a shell, but aren't
command-line interfaces supposed to be "unfriendly"? Why does anybody
care if user-unfriendly interfaces work for stupid users or not?

The kernel and character set encodings

Posted Feb 21, 2004 21:14 UTC (Sat) by fiberbit (subscriber, #693) [Link] (2 responses)

The problem lies in checking whether or not a file with a given name (case insensitive) exists. Say you do an 'fp = fopen("filename", "a")', and "filename" doesn't exist yet; then in the case-insensitive case, you have to check whether "Filename" or "fIlename" or any other variant *does* exist.
You'd either have to try all possible combinations, or (in practice) scan the whole directory to see if any name matches (and use the first). This is not only very time consuming, but also racy in a multi-process environment.
It could be solved by using case-insensitive hash functions in the dentry cache, but that would negatively impact normal filesystems, and is unacceptable to most, including the top penguin.
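For illustration, a case-insensitive hash only needs to fold each byte before mixing it in. The sketch below uses 32-bit FNV-1a with ASCII folding; it is not the hash the kernel's dentry cache uses, and name_hash_ci() is a name invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: 32-bit FNV-1a over the name with ASCII upper
 * case folded to lower case, so names that differ only in ASCII case
 * land in the same hash bucket.
 */
uint32_t name_hash_ci(const unsigned char *name, size_t len)
{
    uint32_t h = 2166136261u;           /* FNV-1a offset basis */
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char c = name[i];

        if (c >= 'A' && c <= 'Z')
            c = (unsigned char)(c + ('a' - 'A'));
        h ^= c;
        h *= 16777619u;                 /* FNV-1a prime */
    }
    return h;
}
```

The extra fold per byte is the cost every lookup would pay, including lookups on filesystems that never asked for case insensitivity, which is the objection described above.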

Re-mount Through Caseless VFS?

Posted Feb 23, 2004 0:37 UTC (Mon) by miallen (guest, #10195) [Link] (1 responses)

Why not just create a "casefs" VFS that just uses the existing ops for the target mounted fs but overloads lookup() to do the caseless pathwalk (and maybe save the last N paths with hashes in a separate cache)? Now you would just (re)mount an existing fs through this casefs VFS. It wouldn't be optimal but it would still be a lot faster for Samba, WINE, or whoever and it wouldn't barf all over any other kernel code. It's probably not a lot of code either.

Mike

Re-mount Through Caseless VFS?

Posted Feb 23, 2004 7:40 UTC (Mon) by massimiliano (subscriber, #3048) [Link]

I am definitely not a kernel developer, but this sounds like
the perfect solution: very general, and perfectly decoupled
from the code of existing filesystems... and moreover, you pay
the performance penalty only if you use the feature.

As an added benefit, it could be implemented entirely in user
space using FUSE, and only if/when it works very well (and the
added performance is needed) as a kernel module.

With such a layer, it would also be possible to handle all those
nasty Unicode normalizations...

Just my two cents, anyway.