[go: up one dir, main page]

|
|
Log in / Subscribe / Register

Preventing atomic-context violations in Rust code with klint

By Jonathan Corbet
November 17, 2023

LPC
One of the core constraints when programming in the kernel is the need to avoid sleeping when running in atomic context. For the most part, the responsibility for adherence to this rule is placed on the developer's shoulders; Rust developers, though, want the compiler to ensure that code is safe whenever possible. At the 2023 Linux Plumbers Conference, Gary Guo presented (via a remote link) the klint tool, which can find and flag many atomic-context violations before they turn into user-affecting bugs.

Rust is built on the idea that safe Rust code, as verified by the compiler, cannot cause undefined behavior. This behavior comes in a lot of forms, including dereferencing dangling or null pointers, buffer overruns, data races, or violations of the aliasing rules; code that is "safe" will not do those things. The Rust-for-Linux project is trying to create an environment where much kernel functionality can be implemented with safe code. On the other hand, some surprising behavior, including memory leaks, deadlocks, panics, and aborts, is considered "safe". This behavior is defined, thus "safe" (though still, obviously, bad).

Atomic context in the kernel raises some interesting safety questions. If code, for example, executes a sequence like:

    spin_lock(&lock);
    /* ... */
    mutex_lock(&mutex);   /* can schedule */
    /* ... */
    spin_unlock(&lock);

the result could be a deadlock if another thread attempts to take the same spinlock on the same CPU. That is "safe" (but "bad") code. But what about code like the following?

    rcu_read_lock();
    schedule();
    rcu_read_unlock();

In this case, the safety of this code, even in the Rust sense, is not so clear. RCU assumes that there will be no context switches in code that is running within an RCU critical section; calling into the scheduler breaks that assumption. In this case, the atomic-context violation can indeed be a safety issue, creating use-after-free bugs, data races and worse. This is "fine" for C code, where the distinction between "safety" and "correctness" is not so well defined. Rust developers, though, try to live by different rules; consequently, they cannot design a safe API that allows sleeping in atomic context.

[Gary Guo] Avoiding that situation is not easy, though. One possible solution would be to make all blocking operations unsafe. That, Guo acknowledged, is likely to be widely seen as a bad idea. Another approach is token types, which are commonly used in Rust to represent capabilities; functions that might sleep can require a token asserting the right to do so. That leads to complex and unwieldy APIs, though. It is possible to do runtime checking, using the preemption count maintained in some kernel configurations now. That adds runtime overhead, though, and the preempt count is not available in all kernel configurations.

The last option would be to simply ignore the problem and trust developers to get things right, perhaps using the kernel's lockdep locking checker to find some problems on development systems. That approach, though, is unsound and not the Rust way of doing things.

The root of the problem, Guo said, is the need to optimize three objectives (soundness, an ergonomic API, and minimal run-time overhead) in a "choose any two" situation. Token types optimize soundness and overhead at the expense of an ergonomic API, for example, while run-time checking improves the API but sacrifices the goal of avoiding run-time overhead. Solutions that optimize all three quantities are hard to come by; the kernel's needs simply do not fit nicely into the Rust safety model.

The answer, Guo said, is to adapt the Rust compiler to this use case; that has been done in the form of a tool called "klint", which will verify at compile time the absence of atomic-context violations to the maximum extent possible. For the cases that cannot be verified, an escape hatch, in the form of a run-time check or use of unsafe, will be provided to developers so that their code can be built.

This tool was built with a number of goals in mind. It should be easy to explain and understand, of course, and provide useful diagnostics. There needs to be an escape hatch so that it does not get in the way of getting real work done. Its defaults, he said, should be sane, and there should be little need for additional annotations in the kernel. Finally, the tool needs to be fast so that it can be run every time the code is built.

Klint gives every function two properties, the first of which is an "adjustment" describing the change it makes (if any) to the preemption count (which, when non-zero, indicates that the current thread cannot be preempted). The second is the expected value of the preemption count when the call is made; this value can be a range. The klint tool tracks the possible state of the preemption count at each location, looking for situations where a function's expected preemption count is violated.

Thus, for example, rcu_read_lock() increments the preemption count by one, and can be called with any value. In Rust code, that would be annotated as:

    #[klint::preempt_count(adjust = 1, expect = 0.., unchecked)]
    pub fn rcu_read_lock() -> RcuReadGuard { /* ... */ }

As klint passes over the code, it tracks the possible values of the preemption count and flags an error if an expected condition is not met. For example, schedule() would be annotated as expecting the preemption count to be zero; if klint sees a call to schedule() after a call to rcu_read_lock() it will complain — unless, of course, there is a call to rcu_read_unlock() that happens first.

The compiler's type inference makes explicit annotation unnecessary much of the time. There are exceptions, naturally, including at the foreign-function interface boundary, with recursive functions, and with indirect function calls. Other limitations exist as well; there is, for example, no way to annotate functions like spin_trylock(), where the effect on the preemption count is not known in advance. Perhaps, in the future, that shortcoming could be addressed by adding some sort of match expression to the annotations, he said.

Data-dependent acquisition, where, for example, a function only takes a lock if a boolean parameter instructs it to, is also not handled by klint at this point. Finally, there are cases where the compiler injects code into a function that confuses klint, leading to incorrect reports. This problem is currently blocking the wider use of klint, and is thus urgent to solve. Meanwhile, he said, klint imposes a negligible compile-time overhead.

Guo concluded by saying that klint is available on GitHub for folks who want to play with it. More information can also be found in the slides from the talk.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting our travel to this event.]

Index entries for this article
KernelDevelopment tools/Rust
ConferenceLinux Plumbers Conference/2023


to post comments

Preventing atomic-context violations in Rust code with klint

Posted Nov 17, 2023 18:39 UTC (Fri) by IanKelling (subscriber, #89418) [Link] (10 responses)

ran cloc on klint:
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Rust                            30            880            532           6265
Markdown                         3             25              0             73
YAML                             2              6              3             31
TOML                             1              3              1             12

Preventing atomic-context violations in Rust code with klint

Posted Nov 17, 2023 21:45 UTC (Fri) by cytochrome (subscriber, #58718) [Link] (7 responses)

Pardon my ignorance, but what should I make of those numbers?
Is that an unusually small (or large) loc for this type of tool?

Preventing atomic-context violations in Rust code with klint

Posted Nov 18, 2023 8:16 UTC (Sat) by IanKelling (subscriber, #89418) [Link] (2 responses)

Up to you. I suggest looking around. On debian, apt install cloc, then

<pre>
mkdir -p /tmp/s
cd /tmp/s
apt source SOME_PACKAGE
cloc .
</pre>

Run ncdu to find the big files. Take a stroll.

Preventing atomic-context violations in Rust code with klint

Posted Nov 18, 2023 8:17 UTC (Sat) by IanKelling (subscriber, #89418) [Link]

Forgot to hit html for my comment, but it doesn't matter.

Preventing atomic-context violations in Rust code with klint

Posted Nov 18, 2023 9:28 UTC (Sat) by burki99 (subscriber, #17149) [Link]

I still find your comment rather odd. You could also tell us how many times the letter u is used in this linter. But that wouldn't add much to my or other readers' understanding of this tool.

Preventing atomic-context violations in Rust code with klint

Posted Nov 20, 2023 1:48 UTC (Mon) by dfc (subscriber, #87081) [Link] (3 responses)

I was catching up on LWN tonight and saw another comment that used cloc. Same commenter equally strange to read without any context.

https://lwn.net/Articles/951716/

Preventing atomic-context violations in Rust code with klint

Posted Nov 20, 2023 10:00 UTC (Mon) by atnot (guest, #124910) [Link] (2 responses)

There's a few commenters on this website known for making strange comments and nearly constantly getting into the same protracted arguments with each other. Call them say, "bold about rust", "signature conjector" and "realname with RMS".

I'm afraid your current best option is just to ignore them.

Preventing atomic-context violations in Rust code with klint

Posted Nov 20, 2023 23:26 UTC (Mon) by cytochrome (subscriber, #58718) [Link] (1 responses)

I'm not sure that "commenters... making strange comments..." applies to the individual who made the first comment. It just would have been nice to have their interpretation of the LOC information they provided. However, it may have simply been their enthusiasm for applying a cool tool (cloc) and sharing the results.

Preventing atomic-context violations in Rust code with klint

Posted Nov 21, 2023 0:27 UTC (Tue) by WolfWings (subscriber, #56790) [Link]

...that's kinda the definition of 'commenter making strange comments' though.

They're providing zero context, just pasting the output from some random tool they throw at every piece of software they can, and literally expecting others to interpret it for themselves and explicitly refusing to explain.

I don't know how many more boxes on the bingo-card of "boy that's a random comment" you can hit?

Preventing atomic-context violations in Rust code with klint

Posted Nov 18, 2023 10:35 UTC (Sat) by mpr22 (subscriber, #60784) [Link]

So it's a fairly small program, apparently broken out into manageably sized chunks.

That's good to know.

Did you have any particular reason for posting these statistics without commentary?

Preventing atomic-context violations in Rust code with klint

Posted Nov 18, 2023 21:45 UTC (Sat) by ttuttle (subscriber, #51118) [Link]

I’m likewise curious what conclusions you intend us to draw from this.

Preventing atomic-context violations in Rust code with klint

Posted Nov 19, 2023 19:56 UTC (Sun) by tialaramex (subscriber, #21167) [Link] (8 responses)

Memory leaks in particular seem to me to be a major Red Herring.

If I intentionally don't track the underlying memory allocation for some modestly sized but long-lived information I know I won't replace without terminating the software, we're not losing anything by that. In userspace Rust's Box::leak explicitly says I know this is a Box (thus it owns a heap allocation) but I'm OK with it just never being freed, give me a *reference* to the thing inside the box, the reference now lives forever, that's fine because it's never being freed.

In contrast, if I have some memory allocations which I'm definitely going to release properly... but it turns out I actually only ever release them in the very moment before exit even though I haven't needed them for an extended period, that's a real problem, and yet *that* isn't formally a leak.

Preventing atomic-context violations in Rust code with klint

Posted Nov 20, 2023 21:03 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (6 responses)

Well, there's two sides to it:

* Rust's soundness guarantees do not cover leaks at all. As far as Rust is concerned, leaking memory (and never running its destructor, if any) is 100% sound in all cases. This is why mem::forget() and Box::leak() are safe functions. The soundness rules only require that *if* a destructor is run, then the object must not be used once the destructor returns. This can have some unintuitive consequences. For example, leaking a MutexGuard is sound, which means it is sound to lock a Mutex and never unlock it (and deadlock/starve all of the other threads that want to use that lock). The same applies to most of the other things in std::sync.
* If you're writing a kernel, you really really really do not want to leak memory (unless you are e.g. constructing some static data structure at boot time). Leaking memory in userspace is bad, but the kernel will clean up after you when your process terminates. Memory leaked in kernelspace will remain leaked until the system reboots. OTOH, as described in the previous bullet, there are a lot of other things you can do that Rust considers sound, but are highly undesirable in practice.

Soundness was never meant to be the same thing as correctness. It was meant to prevent UB. Leaking memory does not risk UB, so it is sound. Deadlock is not UB either, so it is sound. Etc.

Preventing atomic-context violations in Rust code with klint

Posted Nov 20, 2023 23:22 UTC (Mon) by dvrabel (subscriber, #9500) [Link] (5 responses)

Leaking memory is only non-UB behaviour if your program has access to unlimited memory, otherwise the program will panic at some undefined point in time when it runs out of memory, although I suppose rust claims this behaviour as well-defined since the outcome when it happens is deterministic.

Preventing atomic-context violations in Rust code with klint

Posted Nov 21, 2023 0:54 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

But this is exactly why I think "leaks" are a red herring. You've gone from "leak" to "unbounded memory allocation" and now there's a big problem, but the two are only related colloquially.

The sort of leaks that leakdice is written to investigate count as both and are surely a bug which must be fixed. But when you try to write down formally what isn't OK, you discover that the leak wasn't the problem, it was the unbounded thirst for memory.

"We can optimise this movie player by loading all the quarter million frames into RAM and then flipping through them" blows up just as badly on a 1990s PC with 4MB of RAM regardless of whether you wrote code to free all those images at the end, we're never getting that far.

And yes, it's not UB to drop dead once you're out of other options. For an OS kernel rebooting might be a reasonable option, and so might power off, neither of which is Undefined Behaviour.

Preventing atomic-context violations in Rust code with klint

Posted Nov 21, 2023 1:08 UTC (Tue) by excors (subscriber, #95769) [Link] (1 responses)

I don't think panicking is considered non-UB just because of determinism. Rust's goal is that safe code cannot trigger the set of behaviours it considers UB (assuming there are no soundness bugs in `unsafe` code) - that's fundamental to the relationship between safety, soundness, and UB - but safe code can trigger panics, so panics cannot be considered UB.

And I think safe code is allowed to panic for ergonomic reasons: in most applications and libraries, it would be very annoying if every object allocation and every array access and every unwrapping of an Option etc had to explicitly handle errors or propagate the errors through all the APIs in its call stack. Those operations are frequently done where it's obvious they can't fail - e.g. allocation will always succeed in default Linux userspace (it just may attract the ire of the OOM killer some time later), and you might have already done some manual bounds checks before accessing the array, etc - so it's nice to be able to skip the useless error handling code there. In the rare cases where the thing you thought was 'obvious' was actually wrong, it will panic and cleanly terminate the process, and the system can recover by e.g. restarting the application, which isn't perfect but it seems a worthwhile tradeoff.

That's a worse tradeoff in kernel code, where it's much harder to recover from termination and you'd be willing to put a lot more effort into writing panic-free code, and maybe it could make sense to declare panic as UB so the compiler guarantees safe code won't panic; but much of Rust's standard library was designed to panic, so changing the definition of UB would be too much divergence from regular Rust, and it's probably easier to just treat panics like any other non-statically-checked bug that you want to avoid.

Preventing atomic-context violations in Rust code with klint

Posted Nov 21, 2023 9:58 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

The Rust standard library isn't available in Rust-for-Linux, only core (some of which is necessary to have Rust even work as a language e.g. you can't have for loops without core::iter::IntoIterator) is provided, plus their own take on alloc, but not std.

Even for writing very low level code, "This won't happen, don't bother me about it" is ergonomically necessary. It's fine if the reaction when you're wrong can be very dramatic, right down in the machine code if we have a fault but the CPU has for example not been told how to deal with any faults (or that the handlers for all faults are gone) it will just reboot.

Preventing atomic-context violations in Rust code with klint

Posted Nov 21, 2023 15:58 UTC (Tue) by MarcB (subscriber, #101804) [Link]

Something potentially happening at an undefined (better: unpredictable) point in time does not make it undefined behaviour.

Undefined behaviour is generally limited to things that are undefined within a given language specification. Here the specification will most likely say that memory allocations can fail and what happens if they do.

UB does not include things that are influenced by external factors. In this example, it might as well be another program that consumed all memory and causes you program's allocation to fail. So if your program panics or not does not even depend on its own quality.

Preventing atomic-context violations in Rust code with klint

Posted Nov 22, 2023 11:08 UTC (Wed) by farnz (subscriber, #17727) [Link]

The behaviour is well-defined, not UB, because the behaviour of Rust code in an OOM situation is well-defined; either a panic if you use the infallible allocation APIs (which assume that you can't have OOM), or a failure if you use a fallible allocation API (such as Vec::try_reserve.

This means that I can read Rust code, and I know exactly what will happen on OOM, and where OOM can happen (assuming I'm running on an underlying allocator that reports OOM, and not an overcommitting allocator that kills the process on OOM). In contrast, if OOM is UB, then once I hit an OOM situation, the program can behave in any fashion - including "time travel" back to before OOM happened followed by weird behaviour. In practice, UB is a huge problem when the compiler is doing any form of optimization, because the compiler can assume that UB does not happen, and reason from there.

Preventing atomic-context violations in Rust code with klint

Posted Nov 21, 2023 16:43 UTC (Tue) by tbelaire (subscriber, #141140) [Link]

Mem::forget started out as unsafe, and there were even scoped threading APIs which relied on the destructors being run on join guards to guarantee safety. But then people discovered that you could put those join guards into an reference counted cycle and leak the entire cycle using only safe code.

Clearly something was wrong, either RC or the scoped threading APIs were incorrect.

It was easier to declare leaks safe than get ride of reference counted pointers.

https://rust.cologne/2022/08/24/avoiding-leakpocalypse.html

Preventing atomic-context violations in Rust code with klint

Posted Nov 20, 2023 1:42 UTC (Mon) by Paf (subscriber, #91811) [Link]

I’m not - yet - a rust programmer, but this:
“ The last option would be to simply ignore the problem and trust developers to get things right, perhaps using the kernel's lockdep locking checker to find some problems on development systems. That approach, though, is unsound and not the Rust way of doing things.”
Is perfect. Why should people ever check for something a tool can check for them? And yes, I know there are exceptions, but it’s the right place to *start*. We forget stuff all the time or are never taught it. Every bug the tool can find or the design can prevent is time saved for other things.

Preventing atomic-context violations in Rust code with klint

Posted Nov 20, 2023 1:54 UTC (Mon) by ringerc (subscriber, #3071) [Link] (1 responses)

It'd be nice to call it something a bit more descriptive than 'klint'.

That one character doesn't say a lot about what it's linting. Characters aren't expensive.

I get that it's 'rust-for-linux/klint' on GitHub. But a more searchable, less collision prone name would not hurt anyway.

Preventing atomic-context violations in Rust code with klint

Posted Nov 22, 2023 19:14 UTC (Wed) by hmh (subscriber, #3838) [Link]

Submit a PR with suggestions, such as krslint, maybe ;-)

In fact "krs" as a prefix for kernel-rust things in general, if there is not some other small prefix already in use, seems useful...

Bikeshed to taste :-)

Preventing atomic-context violations in Rust code with klint

Posted Nov 28, 2023 13:07 UTC (Tue) by error27 (subscriber, #8346) [Link]

Smatch has a similar check for C. Although Smatch doesn't care about concurrency issues here like Rust would. Only about sleeping in atomic bugs.

check_sleep_info.c tracks which functions sleep. It only tracks functions which sleep on every path. This doesn't mean that we miss bugs, but it does mean that the warning messages are not printed as close to lock as you would want. (You want the warning as close as possible to the lock).

check_preempt_info.c tracks which functions adjust the preempt count. It only tracks simple functions which enable or disable preemption one time. If we first disable preemption and then enable it again that's ignored. There are a bunch of functions hard coded as disabling or enabling preemption. I don't know if they're all necessary, but I had the list from my locking code so it was easy to cut and paste it.

check_preempt.c tracks whether preemption is currently disabled.

check_scheduling_in_atomic.c actually prints the warning.

When Smatch prints a warning then you generally need `smdb.py preempt <function>` to find out which call tree is holding the lock.

The code is at https://github.com/error27/smatch

The main advantage of the Smatch approach is that you don't need to annotate anything. Annotations are not going to be complete and they're seldom as flexible as you would like. For example, kmalloc() will sleep unless you pass ___GFP_DIRECT_RECLAIM (or GFP_ATOMIC) so how would you annotate that? There must be a way, because that's the main source of sleeping in atomic bugs but it's not described in the article.

The main cause of missed bugs in Smatch is that this code doesn't handle IRQ context (only preempt count).

The main cause of false positives is places which set a flag or a variable can_sleep in one function and then read it again in another function five steps away. For example, in "struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };", the MSG_DONTWAIT flag means that it won't sleep. Smatch has some work arounds for some of these but still has false positives.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds