The kernel's command-line commotion
When one looks at a running process, often the first item of interest is which program that process is running. The kernel makes that information available in two different places, both found within a process's /proc directory:
- comm holds the "command" that the process is running. When a program is launched with execve(), the base name of the executable file is placed in comm. So if the user runs /usr/bin/rogue, comm will contain "rogue".
- cmdline contains the entire command line passed to the running program via the argv array given to execve(); it traditionally holds the program name, which is passed as the first argument in the command line.
The two files contain similar but different information, and have different access characteristics. comm will contain the actual file name that was passed to execve() (but keep on reading); it is also stored in a kernel data structure and can be accessed quickly. Instead, the command name in cmdline is whatever was passed to execve() as argv[0], which may or may not be the actual name of the command. It is stored in the relevant process's address space, meaning that the process itself can change it, and accessing it from another process is a relatively expensive operation. For these reasons, programs like ps or top will use comm instead of cmdline when possible.
As it happens, execve() is not the only way to launch a new program within a process on Linux. There is a library function called fexecve() that takes an open file descriptor for the program to execute rather than its path name; under the hood, it is implemented with execveat(). There is interest in using fexecve() because it allows the target program to be opened and checked (looking for a signature, for example), then executed in a race-free way. Tools like systemd have support for running programs this way, and its developers would prefer to use that mode.
There is just one little problem. While execve() can initialize comm from the name of the file passed to it, fexecve() only has an open file descriptor that no longer has any path-name information associated with it. That file descriptor may be marked "close on exec", meaning that even any information that may have been found in /proc/PID/fd will be lost. The result, in current kernels, is that, when a program is run with fexecve(), the comm is simply set to the file-descriptor number of the program. If rogue is run with fexecve() from file descriptor five, comm will contain "5" rather than "rogue".
Users, being the irascible creatures that they are, have expressed the unreasonable opinion that replacing the command names of processes in their system-management tools with small integers is an unwelcome change. They have been spoiled by being able to see which program each process is running and feel entitled to that ability in the future. Kernel programming would be so much easier without users, but that is not the world we live in. So the search for a better way to set comm when fexecve() is used was begun.
In a mid-November pull request, Kees Cook included a patch from Tycho Andersen that tried to restore some useful information to /proc/PID/comm in the fexecve() case. In the absence of the file name, the kernel would simply use the information from argv[0] instead, causing the information from comm and cmdline to be essentially the same. That patch had been through a few iterations, and seemed like a good solution to everybody involved.
Torvalds, though, disagreed,
saying that "this horrible hack is too broken to live
". Over the
course of an extended and not always courteous discussion, he argued
that argv[0] is under user-space control and can contain any sort
of information; the kernel
uses comm for its own purposes, and letting user space control
it could help attackers to hide the actual executable being run. Copying
argv[0] into comm will slow program start, he said. The
right solution, according to Torvalds, is to use the file name stored in
the directory entry ("dentry") associated with the file to be executed.
That information is always present and is reliably under the kernel's
control.
The problem with the dentry-based approach, as explained by Eric Biederman, Zbigniew Jędrzejewski-Szmek, and Cook, is that it would change the name of the executable as seen by user space. As an example, a user may run rogue, but some helpful distributor may have long since turned /usr/bin/rogue into a symbolic link to /usr/bin/nethack. On current systems, a tool like ps will show this user, busily at work, running rogue, but a comm value derived from the dentry would use the actual file being executed, so it would show as nethack instead. On some systems, like Debian with its "alternatives" mechanism, the visible names of quite a few commands would change. That could break programs that are looking for specific command names. Setting comm from argv[0], instead, would preserve the original name.
Torvalds, though, was
unmoved by this argument. The dentry-based name, he said, is "THE
TRUTH
", and any program that wants to see argv[0] should just
be using cmdline instead. Anybody who wants the behavior of
execve() should just not use fexecve(), he added. The
patch, as written, would not be pulled into the mainline.
Cook tried one more
time to explain why using argv[0] was desirable. It is the
"friendlier
" choice, he said, but if Torvalds was adamant that the
dentry-based name must be used, that was going to be the result of the
discussion. Torvalds responded:
"no. THERE IS NO WAY I WILL ACCEPT THE GARBAGE THAT IS ARGV[0]
".
This response seemed to indeed be somewhat adamant, so Cook has
subsequently resent his
pull request without the controversial patch, saying that the
dentry-based approach would be implemented for the 6.14 merge window.
Jędrzejewski-Szmek said that this
approach could be worked with, "but we'll need to make an effort to warn
users and do it much more visibly
". There will, he said, inevitably be
complaints from users whose scripts have broken.
In the end, this disagreement comes down to a small piece of the kernel's
user-space interface that has existed almost since the beginning, but which
has never been precisely specified. As with any user-visible behavior,
programs have developed a reliance on the way things have traditionally
worked, making newer approaches (such as execve() from a file
descriptor) hard to implement without breaking things. There may be no
ideal solution in this case, but it would have been nice if a workable
solution could have been found with less shouting.
| Index entries for this article | |
|---|---|
| Kernel | /proc |
| Kernel | System calls/execveat() |