TCP connection hijacking and parasites - as a good thing
The traditional ptrace() API calls for a tracing program to attach to a target process with the PTRACE_ATTACH command; that command puts the target into a traced state and stops it in its tracks. PTRACE_ATTACH has never been perfect; it changes the target's signal handling and can never be entirely transparent to the target. So Tejun supplemented it with a new PTRACE_SEIZE command; PTRACE_SEIZE attaches to the target but does not stop it or change its signal handling in any way. Stopping a seized process is done with PTRACE_INTERRUPT which, again, does not send any signals or make any signal handling changes. The result is a mechanism which enables the manipulation of processes in a more transparent, less disruptive way.
All of this seems useful, but it does not necessarily seem like part of a checkpoint/restart implementation. But it can help in an important way. One of the problems associated with saving the state of a process is that not all of that state is visible from user space. Getting around this limitation has tended to involve doing checkpointing from within the kernel or the addition of new interfaces to expose the required information; neither approach is seen as ideal. But, in many cases, the required information can be had by running in the context of the targeted process; that is where an approach based on ptrace() can have a role to play.
Tejun took on the task of saving and restoring the state of an open TCP connection for his example implementation. The process starts by using ptrace() to seize and stop the target thread(s); then it's just a matter of running some code in that process's context to get the requisite information. To do so, Tejun's example program digs around in the target's address space for a nice bit of memory which has execute permission; the contents of that memory are saved and replaced by his "parasite" code. A bit of register manipulation allows the target process to be restarted in the injected code, which does the needed information gathering. Once that's done, the original code and registers are restored, and the target process is as it was before all this happened.
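The article does not say how the example program locates a suitable region of executable memory. As a minimal sketch, one plausible approach on Linux is to read the target's /proc/<pid>/maps file and pick a mapping with the "x" permission bit set. The names `Mapping`, `parse_maps()`, `executable_regions()`, and the sample data below are hypothetical and exist only for illustration.

```python
from typing import List, NamedTuple


class Mapping(NamedTuple):
    start: int   # first address of the mapping
    end: int     # one past the last address
    perms: str   # e.g. "r-xp"
    path: str    # backing file or pseudo-name, may be empty


def parse_maps(text: str) -> List[Mapping]:
    """Parse the text of a /proc/<pid>/maps file into Mapping records."""
    mappings = []
    for line in text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(int(start_s, 16), int(end_s, 16), fields[1], path))
    return mappings


def executable_regions(mappings: List[Mapping], min_size: int) -> List[Mapping]:
    """Keep mappings that are executable and at least min_size bytes long."""
    return [m for m in mappings if "x" in m.perms and m.end - m.start >= min_size]


# Made-up sample data in the /proc/<pid>/maps format (not from a real process).
SAMPLE_MAPS = """\
00400000-0040c000 r-xp 00000000 08:01 1234 /bin/cat
0060b000-0060c000 r--p 0000b000 08:01 1234 /bin/cat
0060c000-0060d000 rw-p 0000c000 08:01 1234 /bin/cat
7ffd1c000000-7ffd1c021000 rw-p 00000000 00:00 0 [stack]
"""

regions = executable_regions(parse_maps(SAMPLE_MAPS), min_size=4096)
for m in regions:
    print(f"{m.start:#x}-{m.end:#x} {m.perms} {m.path}")
```

For a live process, the same functions could be fed the contents of /proc/<pid>/maps. Saving the original bytes, writing the parasite, and restoring everything afterward would still have to go through ptrace(), which this sketch does not attempt.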
The "parasite" code starts by gathering the basic information about open connections: IP addresses, ports, etc. The state of the receive side of each connection is saved by (1) copying any buffered incoming data using the MSG_PEEK option to recvmsg(), and (2) getting the sequence number to be read next with a new SIOCGINSEQ ioctl() command. On the transmit side, the sequence number of each queued outgoing packet - along with the packet data itself - must be captured with another pair of new ioctl() commands. With that done, the checkpointing of the network connection is complete.
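The MSG_PEEK step relies only on existing socket behavior, so it can be shown in a few lines. The sketch below uses a local socket pair in place of a real TCP connection (an assumption made so that it runs anywhere) and shows that peeking copies the buffered data without consuming it. The new ioctl() commands come from the proposed patches and are not shown.

```python
import socket

# A connected pair of stream sockets stands in for a TCP connection here.
sender, receiver = socket.socketpair()
sender.sendall(b"queued but not yet read")

# Peek: copy the buffered bytes without removing them from the receive queue.
peeked, _ancdata, _msg_flags, _address = receiver.recvmsg(4096, 0, socket.MSG_PEEK)

# An ordinary read afterward still sees the same bytes.
consumed = receiver.recv(4096)

print(peeked == consumed)

sender.close()
receiver.close()
```

Because the peek leaves the queue untouched, the checkpointed process can keep running after the snapshot and will still read its data as if nothing had happened.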
Restarting the connection - possibly in a different process on a different machine entirely - is a bit tricky; the kernel's idea of the connection must be made to match the situation at checkpoint time without perturbing or confusing the other side. That requires the restart code to pretend to be the other side of the connection for as long as it takes to get things in sync. The kernel already provides most of the machinery needed for this task: outgoing packets can be intercepted with the "nf_queue" mechanism, and a raw socket can be used to inject new packets that appear to be coming from the remote side.
So, at restart time, things start by simply opening a new socket to the remote end. Another new ioctl() command (SIOCSOUTSEQ) is used to set the sequence number before connecting to make it match the number found at checkpoint time. Once the connection process starts, the outgoing SYN packet will be intercepted - the remote side will certainly not be prepared to deal with it - and a SYN/ACK reply will be injected locally. The outgoing ACK must also be intercepted and dropped on the floor, of course. Once that is done, the kernel thinks it has an open connection, with sequence numbers matching the pre-checkpoint connection, to the remote side.
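The sequence-number bookkeeping behind this faked handshake can be modeled without touching the network. The sketch below rests on two assumptions: that the checkpoint recorded the next sequence number to be sent (`out_seq`) and the next one to be read (`in_seq`), and that the standard TCP rule applies, under which a SYN consumes one sequence number. The function and field names are hypothetical. The code only computes the values each packet would carry and neither intercepts nor injects anything.

```python
from typing import NamedTuple, Tuple

SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32 bits and wrap around


class Segment(NamedTuple):
    flags: str
    seq: int
    ack: int


def plan_fake_handshake(out_seq: int, in_seq: int) -> Tuple[Segment, Segment, Segment]:
    """Return the (SYN, SYN/ACK, ACK) values for a locally faked handshake.

    out_seq: sequence number the local side should send next after restart.
    in_seq:  sequence number the local side should expect to read next.
    """
    # A SYN occupies one sequence number, so each initial sequence number
    # must sit one below the value wanted once the handshake completes.
    isn_local = (out_seq - 1) % SEQ_MOD
    isn_remote = (in_seq - 1) % SEQ_MOD

    syn = Segment("SYN", isn_local, 0)                 # intercepted, never leaves
    syn_ack = Segment("SYN/ACK", isn_remote, out_seq)  # injected locally
    ack = Segment("ACK", out_seq, in_seq)              # intercepted and dropped
    return syn, syn_ack, ack


# in_seq of 0 exercises the 32-bit wraparound case.
syn, syn_ack, ack = plan_fake_handshake(out_seq=1000, in_seq=0)
for segment in (syn, syn_ack, ack):
    print(segment)
```

After these three packets, the local kernel's view of the connection starts at exactly `out_seq` and `in_seq`, which is the state the restart code needs before it replays the queued data.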
After that, it's a matter of restoring the incoming data that had been found queued in the kernel at checkpoint time; that is done by injecting new packets containing that data and intercepting the resulting ACKs from the network stack. Outgoing data, instead, can be replaced with a series of simple send() calls, but there is one little twist. Packets in the outgoing queue may have already been transmitted and received by the remote side. Retransmitting those packets is not a problem, as long as the size of those packets remains the same. If, instead, the system uses different offsets as it divides the outgoing data into packets, it can create confusion at the remote end. To keep that from happening, Tejun added one more ioctl() (SIOCFORCEOUTBD) to force the packets to match those created before the checkpoint operation began.
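A small model makes the packet-boundary problem concrete. The sketch below treats each packet as a (sequence number, payload) pair and compares three cases: the queue as checkpointed, a naive resend that lets the stack pick its own packet sizes, and a resend that reuses the recorded sizes. The helper names, payload, and sizes are all made up for illustration, and nothing here calls the proposed ioctl().

```python
from typing import List, Tuple

SEQ_MOD = 2 ** 32
Packet = Tuple[int, bytes]  # (sequence number of first byte, payload)


def split_by_sizes(data: bytes, first_seq: int, sizes: List[int]) -> List[Packet]:
    """Cut data into packets of exactly the given sizes."""
    if sum(sizes) != len(data):
        raise ValueError("sizes must cover the data exactly")
    packets, offset = [], 0
    for size in sizes:
        packets.append(((first_seq + offset) % SEQ_MOD, data[offset:offset + size]))
        offset += size
    return packets


def split_by_mss(data: bytes, first_seq: int, mss: int) -> List[Packet]:
    """Cut data into packets of at most mss bytes, as a stack might by default."""
    sizes = [min(mss, len(data) - off) for off in range(0, len(data), mss)]
    return split_by_sizes(data, first_seq, sizes)


# The transmit queue as it looked at checkpoint time (hypothetical values).
checkpointed = split_by_sizes(b"hello, remote peer", 5000, [5, 2, 11])
payload = b"".join(chunk for _, chunk in checkpointed)

# Naive restore: same bytes, but the boundaries fall in different places.
naive = split_by_mss(payload, 5000, mss=8)

# Forced restore: reuse the recorded sizes so every boundary matches.
forced = split_by_sizes(payload, 5000, [len(chunk) for _, chunk in checkpointed])

print([seq for seq, _ in checkpointed])
print([seq for seq, _ in naive])
print(forced == checkpointed)
```

The byte stream is identical in all three cases. Only the forced variant, though, reproduces the sequence number at which each packet starts, which is the property the text says must be preserved.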
Once the transmit queue is restored, the connection is back to its original state. At this point, the interception of outgoing packets can stop.
All of this seems somewhat complex and fragile, but Tejun states that it "actually works rather reliably". That said, there are a lot of details that have been ignored; it is, after all, a proof-of-concept implementation. It's not meant to be a complete solution to the problem of checkpointing and restarting network connections; the idea is to show that the problem can, indeed, be solved. If the user-space checkpoint/restart work proceeds, it may well adopt some variant of this approach at some point. In the meantime, though, what we have is a fun hack showing what can be done with the new ptrace() commands. Those wanting more details on how it works can find them in the README file found in the example code repository.