Toward race-free process signaling

Posted Dec 6, 2018 23:52 UTC (Thu) by dw (subscriber, #12017)
In reply to: Toward race-free process signaling by kjp
Parent article: Toward race-free process signaling

After reading your comment it's hard to take the other proposals seriously; they're ridiculously over-engineered. A kill2() that accepted an opaque, randomly generated per-process cookie stashed in the task structure that could be extracted somehow would be vastly simpler to implement and for users to understand.

Of course, having a 'process file descriptor' is overall much more generic, and perhaps those designs have aspirations for extending the functionality later, but that hardly seems worth what is probably yet another 8 KB of .text, possibly only in the name of being 'UNIXey'.

Toward race-free process signaling

Posted Dec 7, 2018 7:42 UTC (Fri) by epa (subscriber, #39769) [Link] (18 responses)

So you are saying that we supplement the current process id (limited to 32767) with a 64-bit or 128-bit value that is unique for the lifetime of the system (until a reboot)?

Then all the existing system calls taking a process id get a ‘2’ version taking the longpid instead. All other semantics stay the same.

That does seem a much better way to address the issue (and perhaps others besides, eg pid namespaces for containers would no longer be necessary once user space migrates to the new API).

Toward race-free process signaling

Posted Dec 7, 2018 8:20 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (17 responses)

This won't work. Or more precisely, it'll have all the same drawbacks.

Consider the current use-case:
- list processes
- get process pid
- kill process by pid

The new use-case will be:
- list processes
- get process pid
- get long pid by pid
- kill process by long pid

The race condition is still there. You'll need to fix all the APIs to use long pids in the first place.
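
A minimal sketch of the race in the second list, as a toy model rather than real kernel behaviour. The class and method names (ToyProcessTable, long_pid_of, kill_long) are hypothetical, and the "lowest free pid" recycling policy is an assumption chosen to make the reuse easy to trigger.

```python
import itertools


class ToyProcessTable:
    """Toy model: short pids are recycled, long pids are never reused."""

    def __init__(self, pid_max=4):
        self.pid_max = pid_max
        self.by_pid = {}  # short pid -> (long pid, name)
        self._long_ids = itertools.count(1 << 32)

    def spawn(self, name):
        # Hand out the lowest free short pid (an assumed recycling policy).
        pid = next(p for p in range(1, self.pid_max + 1) if p not in self.by_pid)
        self.by_pid[pid] = (next(self._long_ids), name)
        return pid

    def exit(self, pid):
        del self.by_pid[pid]

    def long_pid_of(self, pid):
        # The hypothetical "get long pid by pid" step.
        return self.by_pid[pid][0]

    def kill_long(self, long_pid):
        # Kill by long pid; return the victim's name, or None if it is gone.
        for pid, (lp, name) in list(self.by_pid.items()):
            if lp == long_pid:
                del self.by_pid[pid]
                return name
        return None


table = ToyProcessTable()
pid = table.spawn("daemon")              # "get process pid"
original_long = table.long_pid_of(pid)   # the identity we actually meant

table.exit(pid)                          # the daemon dies ...
table.spawn("innocent")                  # ... and its short pid is recycled

late_long = table.long_pid_of(pid)       # "get long pid by pid", too late
victim = table.kill_long(late_long)      # "kill process by long pid"
```

In this model victim ends up being the unrelated "innocent" process, while a long pid captured before the exit would simply have failed to match anything.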

Toward race-free process signaling

Posted Dec 7, 2018 9:02 UTC (Fri) by epa (subscriber, #39769) [Link] (16 responses)

Indeed, every system call will need a version that returns a long pid. So the new fork() will return the long pid directly, and so on. There is no need for a separate and race-prone lookup from short pid to long pid (which is not a 1-1 mapping anyway).

Toward race-free process signaling

Posted Dec 7, 2018 9:05 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (15 responses)

But at this point it makes sense to just use file descriptors instead of long pids. File descriptors are way better for many reasons - they can be securely sent over Unix sockets, they can be inherited by subprocesses and so on.

Toward race-free process signaling

Posted Dec 7, 2018 11:22 UTC (Fri) by epa (subscriber, #39769) [Link] (6 responses)

I guess the only thing you can't do with a file descriptor is type it in on the command line. So shell scripts, etc, would still be prone to race conditions.

(A numeric 64-bit pid can be sent over sockets and told to a subprocess, of course.)

Toward race-free process signaling

Posted Dec 7, 2018 14:01 UTC (Fri) by ebiederm (subscriber, #35028) [Link] (3 responses)

A 64bit pid is long enough that you can't reliably type it on the command line; even 32bits are a problem.

This is part of the reason why pids which have a 32bit type are limited to 16bits by default.

Toward race-free process signaling

Posted Dec 7, 2018 17:18 UTC (Fri) by smurf (subscriber, #17840) [Link] (1 responses)

These days, that's the only reason to do this. People not running 10-year-old code are unlikely to be affected by >16-bit PIDs.

I habitually set maxpid to 99999. Anything unlikely to run >1000 processes, like a Raspberry Pi, gets 9999.

Toward race-free process signaling

Posted Dec 7, 2018 17:53 UTC (Fri) by zdzichu (subscriber, #17118) [Link]

I think the bigger maxpid, the better – safer. Short pids encourage manual typing, which is error-prone. Big pids kinda force copy-pasting, which is safer (modulo pid reuse).

Toward race-free process signaling

Posted Dec 7, 2018 19:46 UTC (Fri) by epa (subscriber, #39769) [Link]

Sorry for being unclear. I didn’t mean literally typing in the number (I would cut and paste anyway). I was illustrating the general point that a process id is just a number, with no special magic, and can be handled by any programming language including shell scripts. It can be saved to a file, passed on the command line, even sent over TCP/IP if necessary.

Existing code which works with 15-bit process ids could normally work on 64-bit ones with no change, or at most a change of type from int to long in strongly typed languages. File descriptors are great, but they form their own closed world and need a new set of APIs. They cannot just be treated as an opaque number or a string of text.

Toward race-free process signaling

Posted May 6, 2019 3:02 UTC (Mon) by cyphar (subscriber, #110703) [Link] (1 responses)

In many cases, /proc/self/fd/... is a neat way to "type an fd on the command-line".

Toward race-free process signaling

Posted May 6, 2019 12:22 UTC (Mon) by smurf (subscriber, #17840) [Link]

Your favorite shell's autocomplete mechanism should be able to understand PIDs too.

It's still somewhat dangerous to actually use that, though. The probability that mistyping the first four digits and pressing TAB gives you an entirely unrelated process shouldn't be underestimated.

Toward race-free process signaling

Posted Dec 7, 2018 16:03 UTC (Fri) by dw (subscriber, #12017) [Link] (7 responses)

My understanding is that this is an attempt to fix an edge case in code that does not keep track of its own children correctly. The problem is one of:

1) Child exits, crap parent kills unrelated process because it wasn't paying attention
2) /etc/init.d/postfox stop, crap init script kills unrelated process due to stale PID file.

No solution presented thus far actually solves case 1), the old API will continue to exist in perpetuity, and any new API will always only see limited uptake, due to portability or simple lack of effort to port everything over. There is a limit to the value in any solution, because it is unlikely to see revolutionary uptake. A simple solution therefore seems preferable.

The file descriptor solution does not meaningfully solve case 2): there is still a race for the init script to open /proc/blah/pid and somehow check that the descriptor it received matches the daemon it is trying to kill, so some "is this really the process I want?" code is still necessary.

The FD solution creates a world of security pain that doesn't match the typical UNIX files model, because the kernel object in question can change its security identity over time.

The cookie-based solution does not entail updating every single API: the original problem is only about signal delivery, and thus only affects kill() and possibly clone().

A cookie-based solution allows the identifier to persist on disk easily. Consider two new system calls:

- pid_to_handle(pid_t pid, struct pid_handle *handle) -- accepts pid==0 or pid==child pid. In the 0 case, the handle for the current process is returned. Otherwise, it returns -1 if the PID is not a child of the current process.

- kill_by_handle(pid_t, struct pid_handle *); -- works identically to kill(), except handle must match. No other restriction placed on caller.

After calling clone(), pid_to_handle() is used by the parent prior to waitpid() to retrieve the handle. For daemonizing processes, it must be the child invoking it on itself as any handle the parent could receive would be for the intermediary daemonizing process that almost immediately died.
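
A rough sketch of how the two proposed calls might behave, written as a toy user-space model. Neither pid_to_handle() nor kill_by_handle() exists in any kernel; the ToyKernel class, the 16-byte random cookie, and the use of None to stand in for the -1 error return are all assumptions made for illustration.

```python
import secrets


class ToyKernel:
    """Toy model of per-process cookies; not a real kernel interface."""

    def __init__(self):
        # pid -> {"cookie": bytes, "parent": pid, "signals": [signal numbers]}
        self.tasks = {}

    def clone(self, pid, parent):
        # The caller picks the pid here only so the demo can force pid reuse.
        self.tasks[pid] = {
            "cookie": secrets.token_bytes(16),
            "parent": parent,
            "signals": [],
        }
        return pid

    def reap(self, pid):
        del self.tasks[pid]

    def pid_to_handle(self, caller, pid):
        # pid == 0 means "myself"; otherwise pid must be a child of the caller.
        if pid == 0:
            pid = caller
        elif pid not in self.tasks or self.tasks[pid]["parent"] != caller:
            return None  # stands in for the -1 error return
        return self.tasks[pid]["cookie"]

    def kill_by_handle(self, pid, handle, sig):
        # Like kill(), except the cookie must match the current owner of pid.
        task = self.tasks.get(pid)
        if task is None or not secrets.compare_digest(task["cookie"], handle):
            return -1
        task["signals"].append(sig)
        return 0


kernel = ToyKernel()
kernel.clone(1, parent=0)
kernel.clone(2, parent=0)
kernel.clone(100, parent=1)
handle = kernel.pid_to_handle(caller=1, pid=100)   # taken right after clone()

kernel.reap(100)                 # the child exits and is reaped ...
kernel.clone(100, parent=2)      # ... and pid 100 is reused by a stranger

stale = kernel.kill_by_handle(100, handle, 15)     # old cookie no longer matches
not_child = kernel.pid_to_handle(caller=1, pid=100)
fresh = kernel.pid_to_handle(caller=2, pid=100)
ok = kernel.kill_by_handle(100, fresh, 15)
```

Because the cookie is just bytes, it could be written to a PID file next to the pid, which is the on-disk persistence the comment has in mind.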

Toward race-free process signaling

Posted Dec 7, 2018 17:30 UTC (Fri) by smurf (subscriber, #17840) [Link] (6 responses)

> A cookie-based solution allows the identifier to persist on disk easily.

A pid-plus-verifiable-identifier approach solves this problem just as well.

> /etc/init.d/postfox stop, crap init script kills unrelated process due to stale PID file.

This is why sane init systems tend to not use PID files.

> After calling clone(), pid_to_handle() is used by the parent prior to waitpid() to retrieve the handle.

This entails a race. You should not be required to assume that your thread is the only one calling waitpid(-1). clone2() needs to return the handle atomically.

Toward race-free process signaling

Posted Dec 7, 2018 17:37 UTC (Fri) by dw (subscriber, #12017) [Link]

> This entails a race. You should not be required to assume that your thread is the only one calling waitpid(-1). clone2() needs to return the handle atomically.

That's a fair point, but multi-threaded software with competing threads calling waitpid(-1) is no less buggy IMHO than software with competing threads, say, closing random file descriptors, or creating new ones without CLOEXEC; the problem is simply moved. It's just one of many single-thread-centric interfaces an MT app must give up. In particular, it is a class of problem that is not fixed by modifying clone(): an MT app exhibiting this behaviour has bigger problems than race-free child signalling.

Toward race-free process signaling

Posted Dec 8, 2018 2:12 UTC (Sat) by wahern (subscriber, #37304) [Link] (4 responses)

> > /etc/init.d/postfox stop, crap init script kills unrelated process due to stale PID file.
>
> This is why sane init systems tend to not use PID files.

If the service takes a POSIX lock on the PID file (rather than writing it out), the PID can be queried atomically. You can't *use* it atomically, but that's because the only way to atomically send a signal to an individual process is if you're the parent and aren't using SA_NOCLDWAIT.

If the child disassociates from the service manager then you either need to rely on process groups or cgroups. While process groups are atomic (a beneficial inheritance from legacy TTY and batch job management), the cgroups approach still involves reading PIDs from a file, which has the same TOCTTOU race.

Basically, on Linux I think it's still impossible to write a service manager that isn't susceptible to the classic PID file race while also being able to accurately signal individual wayward processes. (And to be fair, I don't think it's possible on any other Unix-like system, at least not using published and supported interfaces.) You could use cgroups and PID namespaces to minimize collateral damage, but it's still fundamentally a hack. You could use a seccomp policy to prevent disassociation from the process group, but you still couldn't target *individual* processes in the group.

To safely signal individual processes there's really no substitute for process descriptors. A larger PID namespace that doesn't recycle PIDs isn't any better, even as an expediency. In both cases you still need to add a bevy of new syscalls and additional bookkeeping in the kernel. While PIDs may seem easier to use from the shell, the shell is perfectly capable of juggling and passing around descriptors (e.g. exec 8</proc/PID). The necessary bookkeeping in the kernel isn't less for wider PIDs because, like with the shell, all the infrastructure for descriptors exists and is easily applied. The benefit of descriptors, however, is that it gives processes a handle to query process state, like exit status, as well as a channel for reliable delivery of lifetime events (e.g. fork) so that a service manager could manage process trees in a straight-forward, race-free manner. That may not happen immediately, but if you're going to add new syscalls, why pick the dead-end solution?

Toward race-free process signaling

Posted Dec 8, 2018 2:41 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> Basically, on Linux I think it's still impossible to write a service manager that isn't susceptible to the classic PID file race while also being able to accurately signal individual wayward processes.
You can do that with cgroups, but it does require some trickery:
- Put a process in cgroup.
- SIGSTOP it.
- Inspect the cgroup to make sure the process is still the correct one.
- Send the signal.
- SIGCONT it.

Toward race-free process signaling

Posted Dec 8, 2018 2:43 UTC (Sat) by dw (subscriber, #12017) [Link]

If you're willing to risk sending SIGSTOP to a random process, as done here, there is no value to cgroups or indeed any API change whatsoever.

Toward race-free process signaling

Posted Dec 8, 2018 8:39 UTC (Sat) by nopsled (guest, #129072) [Link] (1 responses)

No need to SIGSTOP or anything else, just use the freezer (which is coming for v2, patches have already been posted).

Toward race-free process signaling

Posted Dec 8, 2018 9:02 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

The last time I tried that (3-4 years ago) it resulted in unrecoverable system lockups. So I kinda hesitate to recommend it.

Toward race-free process signaling

Posted Dec 8, 2018 1:59 UTC (Sat) by kmeyer (subscriber, #50720) [Link] (7 responses)

If you just want to track and kill your child processes, FreeBSD's pdfork() and pdkill() do exactly that. The handle is an opaque unique integer (i.e., a file descriptor).

Toward race-free process signaling

Posted Dec 9, 2018 19:22 UTC (Sun) by nybble41 (subscriber, #55106) [Link] (6 responses)

> The handle is an opaque unique integer (i.e., a file descriptor).

File descriptors are only integers within the context of a single process; the integer is meaningless without the process's descriptor table. For tracking one's own child processes that works OK, but it makes it difficult to save the identifier to a file or send it to an unrelated process, which are both desirable use cases.

Personally, I like the suggestion to switch to monotonically increasing 64-bit process IDs, but with the constraint that at any time the least significant 32 bits of any new process ID must be unique and range from 1 to pid_max. Just skip any PIDs which overlap or would be out of range. Make the 64-bit PIDs start at 2**32 so that they can be distinguished from traditional PIDs, and have system calls accept either the full 64-bit PID or just the least significant 32 bits. The effect would be that you can still refer to processes exactly as you do now, or in a race-free way using larger integer IDs, using the same system calls. (This is basically a cross between "don't reuse PIDs" and "tag processes with GUIDs as well as traditional PIDs").
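
One way the allocation rule above could look, as a toy Python sketch. The WidePidAllocator class and its method names are hypothetical, and pid_max is set to 3 only to make wraparound visible quickly.

```python
LOW_MASK = 0xFFFFFFFF


class WidePidAllocator:
    """Toy allocator: 64-bit ids only grow; low 32 bits stay unique and in range."""

    def __init__(self, pid_max=3):
        self.pid_max = pid_max
        self.next_id = 1 << 32   # wide PIDs start at 2**32
        self.live = {}           # low 32 bits -> wide PID of the live process

    def allocate(self):
        if len(self.live) >= self.pid_max:
            raise RuntimeError("all short PIDs are in use")
        while True:
            low = self.next_id & LOW_MASK
            if low == 0 or low > self.pid_max:
                # Out of range: jump to low bits == 1, moving to the next
                # 2**32 block if the current one is exhausted.
                block = (self.next_id >> 32) + (1 if low else 0)
                self.next_id = (block << 32) | 1
                continue
            wide = self.next_id
            self.next_id += 1
            if low not in self.live:   # skip ids whose low bits overlap
                self.live[low] = wide
                return wide

    def release(self, wide):
        del self.live[wide & LOW_MASK]

    def resolve(self, pid):
        # Accept either a traditional short PID or a full wide PID.
        if pid > LOW_MASK:
            return pid if self.live.get(pid & LOW_MASK) == pid else None
        return self.live.get(pid)


alloc = WidePidAllocator(pid_max=3)
a = alloc.allocate()
b = alloc.allocate()
c = alloc.allocate()
alloc.release(a)
d = alloc.allocate()     # reuses short PID 1 but gets a new wide PID
alloc.release(c)
e = alloc.allocate()     # short PID 2 is still live, so it is skipped
```

Here resolve(1) finds the new process d, while resolve(a) fails because the wide PID a is never handed out again. That is the race-free lookup the proposal is after.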

Toward race-free process signaling

Posted Dec 10, 2018 21:25 UTC (Mon) by kmeyer (subscriber, #50720) [Link] (5 responses)

> File descriptors are only integers within the context of a single process;

Maybe you meant "unique" rather than "integers?"

> it makes it difficult to save the identifier to a file or send it to an unrelated process

Huh? E.g., http://poincare.matf.bg.ac.rs/~ivana/courses/ps/sistemi_k...

> Personally, I like the suggestion to switch to monotonically increasing 64-bit process IDs, but with the constraint that at any time the least significant 32 bits of any new process ID must be unique and range from 1 to pid_max.

Yeah, that is a pretty good proposal given the constraints Linux faces around not breaking the kernel ABI at all for any binary that exists today.

Toward race-free process signaling

Posted Dec 12, 2018 4:30 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (4 responses)

>> File descriptors are only integers within the context of a single process;
> Maybe you meant "unique" rather than "integers?"

That would also work, but no, I meant "integers". I suppose some would say that the file descriptor *is* the integer, and representationally that's often true, but to me the term refers conceptually to an active entry in the file descriptor table (a particular file/pipe/socket/device, set of flags, position, etc.), and the integer is just the index into the table. (Consequently, all file descriptors have unique integer identifiers within a process, but not all integers are file descriptors.) A process only has one descriptor table, so there is an isomorphism between the process's file descriptor table entries and their integer indices. Outside the process, however, you need some other way to identify the state that a file descriptor stands for; the index alone, without context, isn't enough.

>> it makes it difficult to save the identifier to a file or send it to an unrelated process
> Huh? [link to "Passing File Descriptors over UNIX Domain Sockets"]

I'm aware of FD passing over UNIX domain sockets, but that's only a partial solution. How exactly does one pass a file descriptor to or from a shell script using Unix-domain sockets, for example? Even from C it's more complex than just sending an identifier over *any* available communications channel. It also doesn't address the use case of serializing the identifier to a file, where there may not be any process running to hold the file descriptor open.

Toward race-free process signaling

Posted Dec 12, 2018 5:43 UTC (Wed) by kmeyer (subscriber, #50720) [Link] (3 responses)

> That would also work, but no, I meant "integers". I suppose some would say that the file descriptor *is* the integer, and representationally that's often true, but to me the term refers conceptually to an active entry in the file descriptor table (a particular file/pipe/socket/device, set of flags, position, etc.), and the integer is just the index into the table.

Ok — this might just be a terminology difference between FreeBSD and Linux. In FreeBSD, file descriptors are definitely integers. They index a per-process table of 'struct filedescent's named the 'fdescenttbl.' Each filedescent has a 'struct file' pointer, 'struct filecaps', flags, and a sequence number associated with it.

The 'struct file' tracks things like 'f_type' (DTYPE_VNODE for regular files, DTYPE_SOCKET, DTYPE_PIPE, DTYPE_KQUEUE, etc); 'fileops' (file-level vtable of operations); file-associated credentials; associated 'struct vnode' (inode in Linux terminology), if any; the file offset (for operations like read(2) or write(2) that don't take an explicit offset); etc.

So you can understand my confusion — we would call what you're talking about a 'filedescent' or just 'file'.

> all file descriptors have unique integer identifiers within a process

Ok, definitely 'filedescent' in our terminology, rather than 'file' (which would be shared by a dup(2)ed fd).

> Outside the process, however, you need some other way to identify the state that a file descriptor stands for; the index alone, without context, isn't enough.

Sure. That state can be passed between processes using unix domain sockets and control messages, though. That's not as frictionless as copying and pasting some pid number between arbitrary processes, but it's not like the credential is locked to the original process exclusively.

> How exactly does one pass a file descriptor to or from a shell script using Unix-domain sockets, for example?

It could readily be done with an extension builtin, or a helper program. (Or something crazy like ctypes.sh.) Yeah, it's higher friction than just passing around some integer. On the flip side, it works without increasing pid space to 64 bits.

> It also doesn't address the use case of serializing the identifier to a file, where there may not be any process running to hold the file descriptor open.

I'm not sure it's reasonable to serialize a process handle of any kind to a file and expect it to be meaningful later. Say your file is on NFS, or backed up to a remote system. Or even a local filesystem, but the machine has been rebooted. I'm not overly concerned with not handling that case.

Thanks for the discussion, it's been interesting and given me some food for thought.

Toward race-free process signaling

Posted Dec 12, 2018 22:34 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (2 responses)

> In FreeBSD, file descriptors are definitely integers. They index a per-process table of 'struct filedescent's named the 'fdescenttbl.'

So to unpack the abbreviations a bit, you're saying that in FreeBSD, "file descriptors" are integers which identify "file descriptor entry" structures in a per-process "file descriptor entry table". I can see how that would be confusing.

I'm not sure it's a FreeBSD vs. Linux thing, but to me including "entry" in the name of the type for the elements of an array seems a bit awkward. I would just say "file descriptor table" for the array, and the elements of the array ("file descriptor table entries") would simply be referred to as "file descriptors". If the "file descriptor" is a member of the current process's "file descriptor table", as is usually the case, then it can be referred to simply by its index in the table (its "file descriptor number"). One might casually refer to this as "passing a file descriptor" when one is really just passing the index of the file descriptor, much like "passing an object" usually just means passing the address of the object.

(BTW, note the terminology in that document you linked to: "Passing File Descriptors over UNIX Domain Sockets". It is not the *integer* which is being passed over the socket—the receiving process will most likely get a different integer—but rather the object which the integer refers to.)
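
To make that concrete, here is a small sketch using Python's socket.send_fds() and socket.recv_fds() helpers (available on Unix from Python 3.9), which wrap the SCM_RIGHTS control message. Both ends of the socket live in one process purely to keep the example self-contained; in practice the receiver would be another process.

```python
import os
import socket

# A connected pair of Unix-domain sockets.
sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Something to pass: the read end of a pipe.
read_end, write_end = os.pipe()

# Ship the read end as ancillary data alongside a two-byte payload.
socket.send_fds(sender, [b"fd"], [read_end])
payload, received_fds, _flags, _addr = socket.recv_fds(receiver, 16, 1)
received = received_fds[0]

# The received descriptor is a different integer, yet it refers to the
# same pipe: data written to the pipe can be read through it.
os.write(write_end, b"hello")
data = os.read(received, 5)

for fd in (read_end, write_end, received):
    os.close(fd)
sender.close()
receiver.close()
```

The integer in received differs from read_end because the original descriptor is still open in the same table, but both refer to the same underlying pipe.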

>> How exactly does one pass a file descriptor to or from a shell script using Unix-domain sockets, for example?
> It could readily be done with an extension builtin, or a helper program.

It could be, but now we're talking about adding dependencies on external programs which (a) haven't been written yet and (b) won't necessarily be present on every system. In the end you could just rewrite any shell script in C, but I wouldn't consider that a realistic solution to the problem of "how to do X in a shell script". The object is to create a script which works with existing systems and commonly available tools.

> On the flip side, it works without increasing pid space to 64 bits.

Along those lines, the only change that's really needed is for open references to /proc/PID directories (or their contents) to block PID reuse. As others have pointed out, there are ways to tag processes via their environment variables—as long as they are willing to cooperate—so any process could (1) open the PID directory, (2) check for the matching environment variable, and (3) send the signal if the environment matches, knowing that the PID won't be reused since the directory is open. The kernel could improve on this somewhat by implementing pdfork() and assigning unforgeable identifiers when processes are created, but it's not really necessary just to prevent races for most use cases.

Toward race-free process signaling

Posted Dec 15, 2018 16:14 UTC (Sat) by nix (subscriber, #2304) [Link] (1 responses)

POSIX already has terms for these things. The things open(2) returns are 'file descriptions': the file description is where things like the open flags and file offset reside: the file descriptor is an integer pointer to a file description, which also has flags of its own (currently only O_CLOEXEC). dup() and fork() copy the descriptor, but leave it referring to the same description. close() closes a descriptor, and if no descriptors are left referencing its description, the description is also closed.
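
The distinction can be observed from user space. The sketch below assumes a Unix-like system and uses Python's os module; Python exposes the close-on-exec descriptor flag through os.get_inheritable() and os.set_inheritable(), where "inheritable" means the flag is clear.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo")
    with open(path, "wb") as f:
        f.write(b"0123456789")

    fd = os.open(path, os.O_RDONLY)         # new descriptor, new description
    dup_fd = os.dup(fd)                     # new descriptor, same description
    other_fd = os.open(path, os.O_RDONLY)   # same file, separate description

    # The file offset lives in the description, so the dup sees the seek
    # and the independently opened descriptor does not.
    os.lseek(fd, 4, os.SEEK_SET)
    dup_offset = os.lseek(dup_fd, 0, os.SEEK_CUR)
    other_offset = os.lseek(other_fd, 0, os.SEEK_CUR)

    # The close-on-exec flag lives in the descriptor, so changing it on the
    # dup leaves the original descriptor untouched.
    os.set_inheritable(dup_fd, True)
    fd_cloexec = not os.get_inheritable(fd)
    dup_cloexec = not os.get_inheritable(dup_fd)

    for d in (fd, dup_fd, other_fd):
        os.close(d)
```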

Toward race-free process signaling

Posted Dec 18, 2018 16:35 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

> the file descriptor is an integer pointer to a file description, which also has flags of its own (currently only O_CLOEXEC).

This is interesting but it really just adds another layer of indirection: file descriptor number (integer) -> file descriptor (object with flags) -> file description (object with flags & file offset) -> file. It's still true that a file descriptor is not simply an integer, since integers don't have O_CLOEXEC flags.