Running code within another process's address space

By Jonathan Corbet
April 16, 2021

One of the key resources that defines a process is its address space — the set of mappings that determines what any specific memory address means within that process. An address space is normally private to the process it belongs to, but there are situations where one process needs to make changes to another process's memory; an interactive debugger would be one case in point. The ptrace() system call makes such changes possible, but it is slow and not always easy to use, so there has been a longstanding quest for better alternatives. One possibility, process_vm_exec() from Andrei Vagin, was recently posted for review.

In truth, alternatives to ptrace() already exist for some tasks. The cross-memory attach system calls were merged for 3.2 in 2011 as process_vm_readv() and process_vm_writev(). As their names would suggest, they allow one process to read from and write to another process's memory. Those system calls satisfy many needs, but fall short when even more invasive access is needed to another process's address space. Sometimes, it seems, there is no alternative to running code within the target address space.
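As a rough illustration of what cross-memory attach looks like from user space, the sketch below reads a range of bytes out of another process. The helper name read_remote() is made up for this example, and the call goes through syscall(2) only to keep the snippet independent of feature-test macros; glibc also provides a process_vm_readv() wrapper in <sys/uio.h>. The caller needs ptrace-style permission over the target, so the call can fail with EPERM.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy len bytes starting at remote_addr in
 * process pid into buf. Returns the number of bytes copied, or -1
 * with errno set (EPERM if the caller may not access the target).
 */
static ssize_t read_remote(pid_t pid, const void *remote_addr,
                           void *buf, size_t len)
{
    struct iovec local = { .iov_base = buf, .iov_len = len };
    struct iovec remote = { .iov_base = (void *)remote_addr, .iov_len = len };

    /* One local segment, one remote segment; flags must be zero. */
    return (ssize_t)syscall(SYS_process_vm_readv, (long)pid,
                            &local, 1UL, &remote, 1UL, 0UL);
}
```

Note that the data only moves between the two address spaces; nothing here lets the caller execute code or system calls on the other side, which is the gap that the proposal discussed below tries to fill.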

Vagin's patch set gives a couple of examples of where this access would be useful. User-mode kernels, such as User-mode Linux and gVisor, have to be able to intercept system calls made by a sandboxed process and, possibly, run them in the address space of that process. The Checkpoint/Restore in User space project needs to reach deeply within a process to extract all of the information needed to checkpoint it. Both use cases are currently handled with ptrace() but, once again, better and faster alternatives are wanted.

The alternative proposed by Vagin is a new system call:

    int process_vm_exec(pid_t pid, struct sigcontext uctx, unsigned long flags,
                        siginfo_t siginfo, sigset_t *sigmask, size_t sizemask);

A successful call will cause the calling process's address space to be changed to that of the process identified by pid. The cover letter notes that using a pidfd might be preferable; that would make this system call inconsistent with process_vm_readv() and process_vm_writev(), though. The values in uctx are used to load the processor registers (including the instruction pointer) before resuming execution in the new address space — an important step, since using the previous instruction pointer from the old address space is unlikely to yield satisfactory results in the new address space.

If flags is zero, process_vm_exec() will change the address space, then resume execution as indicated by uctx; that execution will continue until the process either makes a system call or receives a signal. Either way, the old address space will be restored and process_vm_exec() will return to the caller. The siginfo structure will describe the event that interrupted execution in the other address space; if it's a system call, siginfo will be made to look as if a SIGSYS signal had been received.
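A supervisor built on this interface would presumably inspect siginfo after each return to decide what to do next. The sketch below shows only that decision; it does not call process_vm_exec(), which exists only in the posted patches. The names stop_reason, classify_stop() and stop_syscall_nr() are invented for this example, and reading si_syscall assumes that the kernel fills that field the way it does for a seccomp-generated SIGSYS.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>

/* Hypothetical classification of why execution in the other
 * address space stopped. */
enum stop_reason { STOP_SYSCALL, STOP_SIGNAL };

/* A system call is reported as if SIGSYS had been received;
 * anything else is treated as an ordinary signal. */
static enum stop_reason classify_stop(const siginfo_t *info)
{
    return info->si_signo == SIGSYS ? STOP_SYSCALL : STOP_SIGNAL;
}

/* Assumption: si_syscall carries the system call number, as it does
 * for SIGSYS raised by seccomp. Returns -1 for any other signal. */
static int stop_syscall_nr(const siginfo_t *info)
{
    return info->si_signo == SIGSYS ? info->si_syscall : -1;
}
```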

If, instead, flags contains PROCESS_VM_EXEC_SYSCALL, the purpose of the call is to invoke a system call within the target process's address space. In this case, uctx should contain the system call number and arguments in the appropriate registers, as would be the case for a real system call. The address space will be switched for the duration of the system call, then restored before returning to the caller.
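To make "the appropriate registers" concrete: on x86-64, the system call number goes in rax and the six arguments in rdi, rsi, rdx, r10, r8 and r9. The sketch below fills in those registers; struct sketch_regs is a stand-in invented for this example so that it compiles on any architecture, whereas a real caller would populate the architecture's own struct sigcontext. The helper name load_syscall() is likewise hypothetical.

```c
#include <stdint.h>

/* Stand-in for the relevant general-purpose registers of the x86-64
 * struct sigcontext; not the real structure. */
struct sketch_regs {
    uint64_t rax;                       /* system call number */
    uint64_t rdi, rsi, rdx, r10, r8, r9; /* arguments 1 through 6 */
};

/* Lay out a system call the way the x86-64 kernel entry path
 * expects to find it. */
static void load_syscall(struct sketch_regs *r, long nr,
                         const uint64_t args[6])
{
    r->rax = (uint64_t)nr;
    r->rdi = args[0];
    r->rsi = args[1];
    r->rdx = args[2];
    r->r10 = args[3];
    r->r8  = args[4];
    r->r9  = args[5];
}
```

Any pointer arguments placed in these registers would be interpreted in the target's address space, not the caller's, since the switch happens before the system call runs.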

This patch series was posted as a proof of concept with the idea of getting comments on the proposed API. Jann Horn was quick to respond that the proposed system call does not appear to fit the stated use cases well; it is too much for one and not enough for the other. For the case of running code within a different address space (as systems like User-mode Linux do), he suggested, creating a whole new process is overkill; it might be better to have a system call that allows the construction of new address spaces separately. For the checkpoint/restore case, instead, there may still be a need to access resources within a process beyond its address space, though he didn't say which resources those might be. Vagin responded that a relatively generic system call seemed better than a whole set of specialized ones, even if the generic alternative is not a perfect fit to all use cases.

Florian Weimer did have another resource in mind, though, that would be useful for the GNU C library. There is a difference between how Linux implements setuid() and what POSIX requires: Linux only changes the credentials for the calling thread, while POSIX specifies that it must change the credentials for all of the threads running in a process. Currently, glibc implements POSIX semantics on Linux by sending signals to all threads so that they can all call setuid() together, which is less than ideal. It would be much nicer to just be able to call setuid() within the context of each thread without actually interrupting the threads. Such a feature could also be useful for implementing memory barriers, he said.
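The per-thread kernel behavior can be reached from user space by bypassing the glibc wrapper. The sketch below does so; the helper name setuid_this_thread() is made up for this example, and on some 32-bit architectures SYS_setuid refers to the legacy 16-bit variant, so this is meant for 64-bit systems.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: invoke the raw system call, skipping glibc's
 * setuid() wrapper. On Linux this changes the credentials of the
 * calling thread only; other threads in the process keep theirs.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int setuid_this_thread(uid_t uid)
{
    return (int)syscall(SYS_setuid, (long)uid);
}
```

Calling glibc's setuid() instead triggers the signal-based broadcast described above so that every thread ends up making this same system call.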

There is clearly some tension here between creating a feature that would be useful in some contexts and trying to solve a larger and more complex problem. In such cases, developers must pick their path carefully; trying to do too much is a good way to ensure that nothing actually gets far enough to land in the mainline kernel. So what will happen with process_vm_exec() is far from clear at this point; it may eventually find its way to acceptance, but it could change form considerably before that happens.

Index entries for this article
Kernel	System calls/process_vm_exec()


Running code within another process's address space

Posted Apr 16, 2021 15:58 UTC (Fri) by rvolgers (guest, #63218) [Link] (3 responses)

Just as io_uring has finally settled on a sane context management mechanism for its worker threads, a new fundamentally broken way to mix-and-match process contexts is being invented. Marvelous.

Running code within another process's address space

Posted Apr 16, 2021 16:46 UTC (Fri) by josh (subscriber, #17465) [Link] (2 responses)

I do wonder if problems like the setuid issue could be solved by io_uring and a "run this on behalf of" mechanism that allows one thread to submit items into an io_uring that runs in the context of another thread.

Running code within another process's address space

Posted Apr 16, 2021 20:13 UTC (Fri) by luto (subscriber, #39314) [Link] (1 responses)

setuid could be solved by a new syscall. The error handling might be nontrivial, but it’s doable.

Running code within another process's address space

Posted Apr 17, 2021 0:47 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

Can we just get a file-descriptor based process API? Then the need to inject stuff into other processes would be severely reduced.

Running code within another process's address space

Posted Apr 16, 2021 17:50 UTC (Fri) by klbrun (subscriber, #45083) [Link] (1 responses)

Whatever happened to the Sun Microsystems Solaris door? Per Wikipedia, a port was made for Linux 2.4.18 in 2003.

Solaris-style Doors

Posted Apr 17, 2021 1:26 UTC (Sat) by CChittleborough (subscriber, #60775) [Link]

The Linux port was only “alpha quality”, and sadly work on it seems to have stopped in 2003.

The Doors facility needs both (1) kernel support and (2) a non-trivial user-space library. Door servers need that library to manage a thread pool in ways that almost require tight integration with the threads library. This is much easier when one organization produces both the kernel and the core libraries than in the Linux community.

(Full disclosure: I created that Wikipedia article.)

Running code within another process's address space

Posted Apr 17, 2021 0:02 UTC (Sat) by roc (subscriber, #30627) [Link] (1 responses)

Remote system calls often need parameters (both in and out) in memory. If the target task is running, how do you guarantee with PROCESS_VM_EXEC_SYSCALL that any memory changes you make aren't overwritten by the running task? You won't be able to use the stack, for example. Would you have to remote-syscall an mmap/munmap to allocate dedicated memory for your parameters?

Running code within another process's address space

Posted Apr 22, 2021 16:50 UTC (Thu) by avagin (subscriber, #63724) [Link]

Yes, we will need to create memory mappings. You can look at how CRIU injects parasite code into a process.

Some questions on Running code within another process's address space

Posted Apr 17, 2021 5:57 UTC (Sat) by dongmk (subscriber, #131668) [Link] (3 responses)

Thanks for your nice article! The `process_vm_exec` system call is interesting, and I have some questions after reading the article.

1. If I understand correctly, the current `process_vm_exec` can only execute code that is already in the target process's address space, which could be inconvenient for tasks such as inspecting the address space content.
In other words, the calling process needs to inject the code to the target (perhaps by `process_vm_writev`) before it invokes `process_vm_exec` to execute the injected code.

Or does `process_vm_exec` provide some mechanisms to "bring" its own code to the switched address space?

2. > that execution will continue until the **process** either makes a system call or receives a signal.

Is this **process** the calling process or the target process? Or both?

Will the target process (and threads in the target process) pause its execution upon the system call?
If so, how could the calling process resume the target's execution?
If not, the calling process could miss some system calls made by the target, right?

3. I think things will be complicated when there are multiple threads in the calling process.
For example, when one thread calls `process_vm_exec` to switch the address space, other threads will run in the "wrong address space" and fail.

Sorry for asking so many questions.

Some questions on Running code within another process's address space

Posted Apr 17, 2021 9:23 UTC (Sat) by smurf (subscriber, #17840) [Link] (2 responses)

> > that execution will continue until the **process** either makes a system call or receives a signal.

> Is this **process** the calling process or the target process? Or both?

Umm, the target of course. When the target does [ make a syscall / gets a signal ], the system call returns, as described in the text. It obviously cannot do that if the caller continues execution in any way.

Some questions on Running code within another process's address space

Posted Apr 17, 2021 10:57 UTC (Sat) by dongmk (subscriber, #131668) [Link] (1 responses)

Thanks for your reply. :) But I find that it's more complex than I thought earlier, and I am getting more confused.

It may be helpful to confirm what the caller will do during `process_vm_exec`; here are two possibilities:

1) the caller blocks until the target [makes a syscall / gets a signal];
2) the caller resumes execution by jumping to the IP specified in `uctx` in the target's address space; and the caller somehow gets back to the original context in the original address space when the target [makes a syscall / gets a signal].

For 1), emmm, there seems to be no `exec` effect since the original address space is restored when `process_vm_exec` returns; the caller simply blocks to wait for the target's syscall/signal.

For 2), a new control flow seems to have emerged after `process_vm_exec`, and the new control flow is terminated (at any time) upon the target's syscall/signal.

Some questions on Running code within another process's address space

Posted Apr 23, 2021 5:18 UTC (Fri) by avagin (subscriber, #63724) [Link]

> 2) the caller resumes execution by jumping to the IP specified in `uctx` in the target's address space; and the caller somehow gets back to the original context in the original address space when the target [makes a syscall / gets a signal].

I would rephrase this:

2) the caller resumes execution by jumping to the IP specified in `uctx` in the target's address space; and the caller somehow gets back to the original context in the original address space when it makes a syscall or gets a signal.

The target process is used only to grab its address space. process_vm_exec doesn't stop it and doesn't change its state (registers, signals, fpu, etc).

There are a few examples, I think they can help to understand how this works:
https://lwn.net/ml/linux-kernel/20210414055217.543246-5-a...

Running code within another process's address space

Posted Apr 17, 2021 9:15 UTC (Sat) by flussence (guest, #85566) [Link]

Surely it should be called process_vm_exec*v*? :)

Running code within another process's address space

Posted Apr 17, 2021 15:09 UTC (Sat) by pm215 (subscriber, #98099) [Link]

I would have thought that the correct way to solve "implementing POSIX semantics for setuid() purely in userspace is somewhere between tricky and impossible" (the glibc machinery for it is pretty gnarly) is "implement a new kernel syscall that provides the whole-process semantics". The kernel is in a much better position to be able to do that than userspace ever will be. I'm mildly surprised that nobody's ever had a go at it.

Running code within another process's address space

Posted Apr 19, 2021 16:54 UTC (Mon) by jnewsome (guest, #151740) [Link]

As someone working on a process sandbox/simulator/emulator <http://shadow.github.io/>, this looks super-exciting! We're in the process now of moving from LD_PRELOAD-based interposition to ptrace. This simplifies a lot of things and is more robust, but is a significant performance hit. AFAICT we'd be able to switch over from ptrace to this new API without much work.

Running code within another process's address space

Posted Apr 22, 2021 2:01 UTC (Thu) by alkbyby (subscriber, #61687) [Link]

It is indeed nifty to have such an "immediate signal" facility within the same address space. And assuming it will be used with sigmask set to block everything, it looks robust. Even delivering another such "signal" when one is already running (assuming the caller provides the stack in both cases, which seems reasonable anyway) would work. I assume things like backtraces would work too, as this facility could use the same sigreturn trampoline stuff as real signals.

One particularly nice feature of this is that it is entirely compatible with all kinds of libraries. That is, today libraries are essentially unable to use signals because it is ~impossible to arbitrate between libraries over who is using which signal number. When there are no signal numbers or any state (e.g. altstack, sigaction) involved in the first place, it looks like there is no compatibility trouble of any kind.

Another thing that would be nifty is being able to somehow specify not saving/restoring huge SIMD states, to enable cheaper or more perf-sensitive usages of this facility (e.g. garbage collectors might choose to use it too) and to save stack. It seems inevitable that safe usage of this facility requires the caller to allocate a stack for each thread's "remote call", and not having to bother with SIMD registers would save a nontrivial number of bytes.