eclone()

By Jonathan Corbet
November 18, 2009

Developers working to implement a checkpoint/restart capability for Linux want the ability to create a new process with a specific process ID. In the absence of that feature, restarted processes will suddenly find themselves with different PIDs, which can only lead to confusion. To implement explicit PID selection, the checkpoint/restart developers have proposed various extensions to the clone() system call with names like clone_with_pids() and clone_extended(). No version has yet been merged, and the proposed API continues to evolve.

The latest proposal is called eclone(); it looks like this:

    int eclone(u32 flags_low, struct clone_args *args, int args_size,
               pid_t *pids);

The flags_low argument corresponds to the flags argument to the existing clone() call, which is running out of space for new flags. The pids argument is an optional list of PIDs to apply to the new child process, one for each namespace in which the process appears. Everything else goes into args:

    struct clone_args {
        u64 clone_flags_high;
        u64 child_stack_base;
        u64 child_stack_size;
        u64 parent_tid_ptr;
        u64 child_tid_ptr;
        u32 nr_pids;
        u32 reserved0;
        u64 reserved1;
    };

A number of these fields (child_stack_base, child_stack_size, parent_tid_ptr, child_tid_ptr) correspond to existing clone() arguments. clone_flags_high allows the addition of more flags; no new flags are defined in the eclone() proposal, though. The length of the pids array is given by nr_pids, and the reserved fields are there for future expansion.
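A minimal sketch of how a caller might fill in these arguments, assuming the structure layout shown above. Since the proposal has not been merged, the sketch stops short of invoking any system call; the helper name fill_clone_args() is hypothetical, and the <stdint.h> types stand in for the kernel's u32 and u64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Userspace mirror of the proposed structure; uint32_t/uint64_t stand in
 * for the kernel's u32/u64 (an assumption made for illustration). */
struct clone_args {
    uint64_t clone_flags_high;
    uint64_t child_stack_base;
    uint64_t child_stack_size;
    uint64_t parent_tid_ptr;
    uint64_t child_tid_ptr;
    uint32_t nr_pids;
    uint32_t reserved0;
    uint64_t reserved1;
};

/* Five u64 fields, two u32 fields and one more u64: 56 bytes, no padding. */
static_assert(sizeof(struct clone_args) == 56, "unexpected padding");

/* Hypothetical helper: prepare args for a child that should receive
 * nr_pids explicit PIDs. Returns the value to pass as args_size. */
static int fill_clone_args(struct clone_args *args,
                           void *stack, size_t stack_size,
                           pid_t *parent_tid, pid_t *child_tid,
                           uint32_t nr_pids)
{
    /* Zeroing first keeps clone_flags_high and the reserved fields at 0. */
    memset(args, 0, sizeof(*args));
    args->child_stack_base = (uint64_t)(uintptr_t)stack;
    args->child_stack_size = stack_size;
    args->parent_tid_ptr = (uint64_t)(uintptr_t)parent_tid;
    args->child_tid_ptr = (uint64_t)(uintptr_t)child_tid;
    args->nr_pids = nr_pids;
    return (int)sizeof(*args);
}
```

In the proposed interface, the returned size would be passed as args_size and the array of requested PIDs as pids, with nr_pids giving its length.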

Comments on the new proposal have been scarce; it may be that the development community has gotten a little tired of seeing these patches over and over. The silence could also mean that there are no objections to this proposal. One big obstacle could remain to the merging of this system call, though: it is there to support the checkpoint/restart facility, which is definitely not ready for merging into the mainline. Getting checkpoint/restart to a completed and maintainable state is likely to take some time; until then, there may be reluctance to add a new system call which does not, yet, have any real-world users.

eclone()

Posted Nov 19, 2009 17:44 UTC (Thu) by MarkWilliamson (guest, #30166)

I still don't entirely understand why the need here can't be satisfied in
some way using the containerisation mechanisms (and possibly extending them
somewhat) to allow the resumed processes to believe they have the same PIDs,
though these actually might have changed.

A clone_with_pids() or equivalent doesn't really seem like a solution to me
- if those PIDs are taken then you're stuck, right? That seems a bit
brittle.

eclone()

Posted Nov 19, 2009 18:39 UTC (Thu) by dmag (guest, #17775)

I'm with Linus on this one. He complained that "checkpoint/restart" will only work with a small subset of programs anyway. Everything from TCP connections to open files is always going to be problematic. So why not put a little bit of the burden on the process itself (to deal with new PIDs), instead of trying to complicate the kernel?

eclone() / containers

Posted Nov 20, 2009 9:45 UTC (Fri) by nicollet (subscriber, #37185)

Because we can't rewrite all the userland apps to be container aware. I think containers are the right way to go. Hypervisors are a way to say: "our OS can't use all the horsepower, let's put several hosts on the same physical machine".

Containers might also help to migrate virtual machines from one physical host to another very quickly, by tuning the VM subsystem. Today any page can be in RAM or swapped out. During a container migration, we can add a third state: "on this host". That way, if you want to move a 1 GB vserver/container, you wouldn't need to transfer all of the data before beginning to execute code. It would help to reduce the TCP latency problem IMHO.

eclone()

Posted Nov 20, 2009 1:00 UTC (Fri) by riddochc (guest, #43)

Okay, I'm confused. I have some memory of a topic of debate in kernel-land, long ago, about PID randomization. I don't recall exactly why randomizing PIDs was considered a good idea, but I think it was somehow security-related.

It seems that the proposal discussed here, "the ability to create a new process with a specific process ID," would be exactly what randomized PIDs attempt to prevent.

So, I'm not sure what I'm talking about, obviously, but does this give anyone else a suggestion of what I'm talking about?

eclone()

Posted Nov 20, 2009 9:32 UTC (Fri) by anselm (subscriber, #2796)

If PIDs are increased sequentially, unrelated programs can use the rate of process creation as a »covert channel« for (low-bandwidth) communication. Randomised PIDs prevent that.

eclone()

Posted Nov 20, 2009 12:28 UTC (Fri) by quotemstr (subscriber, #45331)

Couldn't you use a fluctuating number of processes as an even-lower-bandwidth covert channel?

eclone()

Posted Nov 20, 2009 12:43 UTC (Fri) by anselm (subscriber, #2796)

Maybe. Off the top of my head, the problems with that might be that (a) other processes will fork, too, so especially on a busy system the signal-to-noise ratio will probably be much worse, and (b) you may not be allowed to create as many simultaneous processes as you need to make yourself noticeable.

The main difference is that with sequentially numbered PIDs, the receiver of the covert channel only needs to fork(2) periodically and look at the returned child PID to find out how many processes have been created in the meantime; it does not need to be able to find out how many processes are running on the system, let alone be able to find out how many child processes another process has (when a suitably hardened system may prevent it from finding out any details about that process at all, which is why the covert channel is necessary to begin with).
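The receiver's side of such a channel could be sketched roughly as follows, assuming strictly sequential PID allocation. The helper names pids_allocated_between() and probe_next_pid() are hypothetical, and pid_max has to be supplied by the caller (32768 is used below purely as an example value).

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Number of PIDs handed out between two probes, assuming strictly
 * sequential allocation that wraps around at pid_max. This is a
 * simplification: a real allocator also skips PIDs still in use and may
 * not restart at zero, so the result is only an estimate. The probe
 * child itself is included in the count. */
static long pids_allocated_between(pid_t earlier, pid_t later, long pid_max)
{
    long delta = (long)later - (long)earlier;
    if (delta <= 0)
        delta += pid_max;  /* the counter wrapped since the last probe */
    return delta;
}

/* Fork a child that exits immediately, reap it, and return its PID
 * (or -1 if fork() failed). */
static pid_t probe_next_pid(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);
    if (pid > 0)
        waitpid(pid, NULL, 0);
    return pid;
}
```

A receiver would call probe_next_pid() at fixed intervals and feed consecutive results to pids_allocated_between(); with randomised PIDs the difference carries no such information.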

eclone()

Posted Nov 26, 2009 23:59 UTC (Thu) by efexis (guest, #26355)

Yeah, I get what you're saying, but it's all cool (at least in the proposals I looked at in more detail; I assume the same holds here). The process's 'real' PID will still be the same as it would have been if it had been fork()ed or clone()ed at that point without specifying a new PID for it. If you want to send it a kill signal or look at its memory usage in /proc/$pid, nothing will have changed there. But what you're doing with the clone() call is creating a new PID namespace; the process is born within this namespace and gets a new 'virtual' PID that addresses it from within there. Any other processes that also exist in that namespace will talk to the process using the new virtual PID, and cannot talk to processes that don't have a PID in that namespace. For example, each new namespace can be born with a process with a PID of 1 which acts as init (collecting child processes that outlive their parents, etc.), but to the outside world the process may have a PID of 16384 for all it matters.

Therefore any process can start up with a PID of its choosing, provided you create a namespace for it and don't fill it with other stuff first. The idea here is that you can suspend a whole container (or namespace) full of processes with virtual PIDs that may be talking to each other by PID, write them out to disc, and then at a later point recreate all the processes from the image in a new container with the same virtual PIDs (but new real PIDs), and they won't be any the wiser.
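The 'virtual' versus 'real' PID distinction can be seen with the existing clone() call and the CLONE_NEWPID flag. The following is only a sketch: the helper name child_sees_pid_one() is hypothetical, the stack handling assumes a downward-growing stack, and creating a PID namespace normally requires CAP_SYS_ADMIN, so the call is expected to fail for unprivileged users.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* for clone() and CLONE_NEWPID */
#endif
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stack for the child; clone() takes a pointer to its top on
 * architectures where the stack grows downward (assumed here). */
static char child_stack[64 * 1024] __attribute__((aligned(16)));

/* Runs inside the new namespace: exit status 0 means "I am PID 1". */
static int child_fn(void *arg)
{
    (void)arg;
    _exit(getpid() == 1 ? 0 : 1);
}

/* Returns 1 if a child created in a fresh PID namespace saw itself as
 * PID 1, 0 if it saw some other PID, and -1 if the child could not be
 * created (typically EPERM without CAP_SYS_ADMIN). */
static int child_sees_pid_one(void)
{
    int status;
    /* The value returned here is the child's PID as seen from the
     * parent's namespace, not the PID the child sees for itself. */
    pid_t outer_pid = clone(child_fn, child_stack + sizeof(child_stack),
                            CLONE_NEWPID | SIGCHLD, NULL);
    if (outer_pid < 0)
        return -1;
    if (waitpid(outer_pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

When the call succeeds, the parent can still signal or inspect the child through outer_pid, while the child itself only ever sees PID 1.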

:-)

eclone()

Posted Nov 30, 2009 9:01 UTC (Mon) by robbe (guest, #16131)

Why do they need these funny reservedN fields, when the structure is
already extensible via the args_size parameter? Am I missing something?