Best practices for error handling in kernel Rust
Dirk Behme led a session discussing the use of Rust's question-mark operator in the kernel at Kangrejos 2024. He was particularly concerned with the concept of "silent" errors that don't print any messages to the console. Other attendees were less convinced that this was a problem, but his presentation sparked a lot of discussion about whether the Rust-for-Linux project could improve error handling in kernel Rust code.
He opened the session by giving a simplified example, based on some actual debugging he had done. In short, he had a Rust function that returns an error. In Rust, errors are just normal values that are returned to the caller of a function like any other value. The caller can either do something explicit with the error or return early and pass the error on to its own caller using the question-mark operator. Code like foo()? calls foo() and then, if it returns an Err value, immediately returns that error from the enclosing function to its caller.
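As a minimal user-space sketch of that behavior (the function names parse_port() and parse_port_explicit() are invented for illustration and were not part of the session), the two functions below behave the same way; the second spells out roughly what the question-mark operator does:

```rust
use std::num::ParseIntError;

// Uses `?`: if parse() returns Err, return that error to our caller at once.
fn parse_port(text: &str) -> Result<u16, ParseIntError> {
    let port = text.trim().parse::<u16>()?;
    Ok(port)
}

// Roughly what `?` expands to, written out by hand.
fn parse_port_explicit(text: &str) -> Result<u16, ParseIntError> {
    let port = match text.trim().parse::<u16>() {
        Ok(value) => value,
        // `?` also passes the error through From::from(); here that is a no-op.
        Err(err) => return Err(From::from(err)),
    };
    Ok(port)
}
```

Neither version prints anything on failure; the error simply travels up to whoever called the function.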
This chain has to end at some point. In user space, the ultimate endpoint is the return value from the program's main function. If main() returns an error, the default behavior is to print it to stderr and exit with a nonzero exit code:
    $ ./failing-rust-program
    Error: ParseIntError { kind: InvalidDigit }
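That default behavior can be approximated in ordinary code. In this user-space sketch, run() and report() are invented names: run() stands in for a program's real work, and report() imitates what the standard library does with a Result returned from main(), assuming the error type implements Debug:

```rust
use std::num::ParseIntError;

// Stand-in for a program's real work; fails on non-numeric input.
fn run(input: &str) -> Result<i32, ParseIntError> {
    let value = input.parse::<i32>()?;
    Ok(value * 2)
}

// Imitates the handling of main()'s return value: print the error's
// Debug form to stderr and choose a nonzero exit status.
fn report(result: Result<i32, ParseIntError>) -> i32 {
    match result {
        Ok(_) => 0,
        Err(err) => {
            eprintln!("Error: {err:?}");
            1
        }
    }
}
```

Kernel code has no equivalent of report() sitting at the top of every call chain, which is the gap Behme was pointing at.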
In the kernel, there's no such facility. Therefore, an unhandled error might not result in anything being printed to the kernel log, which is not ideal. Behme looked at a few alternatives. One suggestion was that the coding standards for kernel Rust could encourage people to use .expect() instead of the question-mark operator; unlike the question-mark operator, .expect() does show a message in the log. It also currently crashes the kernel, however, which is never the right solution.
    let value = fallible_operation()
        .expect("the operation not to fail in this case. If it does, BUG_ON()");
An alternative solution, he mentioned, could be adapted from ScopeGuard. He asked Andreas Hindborg to explain in more detail. Unlike the question-mark operator or .expect(), ScopeGuard is not built into the Rust language, Hindborg explained. Rather, it's part of the kernel crate. A ScopeGuard is a wrapper around a function (usually a closure) that calls the function when the ScopeGuard is dropped, typically by going out of scope. It's useful for handling cleanup when a function has more complicated cleanup than just freeing memory or unlocking locks (which can be handled by Rust's Drop mechanism). The user can also "disarm" the ScopeGuard to prevent it from firing on code paths that handle cleanup themselves (or don't need to do it for some reason):
    // Create an anonymous function to perform cleanup
    let guard = ScopeGuard::new(|| {
        ... // Cleanup code
    });
    ...
    if some_condition() {
        return; // guard goes out of scope, runs cleanup code
    }
    ...
    // When the function exits normally, the guard can be disarmed;
    // the kernel crate's method for this is dismiss()
    guard.dismiss();
    return;
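The mechanism is small enough to sketch in user-space Rust. The Guard type below is a simplified stand-in written for this example, not the kernel crate's actual ScopeGuard implementation, and do_work() is a hypothetical caller:

```rust
use std::cell::Cell;

// Simplified stand-in for a scope guard: runs a closure when dropped.
struct Guard<F: FnOnce()> {
    cleanup: Option<F>,
}

impl<F: FnOnce()> Guard<F> {
    fn new(cleanup: F) -> Self {
        Guard { cleanup: Some(cleanup) }
    }

    // Disarm the guard: discard the closure so Drop has nothing to run.
    fn dismiss(mut self) {
        self.cleanup = None;
    }
}

impl<F: FnOnce()> Drop for Guard<F> {
    fn drop(&mut self) {
        if let Some(cleanup) = self.cleanup.take() {
            cleanup();
        }
    }
}

// Returns early on failure (the guard fires) or dismisses it on success.
fn do_work(fail: bool, cleanups: &Cell<u32>) -> Result<(), &'static str> {
    let guard = Guard::new(|| cleanups.set(cleanups.get() + 1));
    if fail {
        return Err("operation failed"); // guard dropped here, cleanup runs
    }
    guard.dismiss(); // success path: no cleanup needed
    Ok(())
}
```

Because the cleanup lives in Drop, it also runs on early returns introduced by the question-mark operator, with no extra code at each return site.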
Behme suggested that they might want to adapt this to create an error type that prints a log message to the console when an error of that type is dropped. So if a programmer calls a fallible function, they can either handle the error themselves (preventing it from being dropped), pass it on to their caller (likewise preventing it from being dropped), or forget to handle it, causing the value to be dropped and an error to be printed to the console. He thought that a downside of this approach was that he wasn't certain how to make it work with line information. After all, the place where the error is dropped is likely to be far away from where it was created, so the error type would need to capture and store the original source location.
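One way such a type might look in user-space Rust is sketched below. Everything here is hypothetical (the LoudError type, its handle() method, and the UNHANDLED counter, which exists only to make the behavior observable); it assumes that the standard #[track_caller] attribute is an acceptable way to record where the error was created:

```rust
use std::panic::Location;
use std::sync::atomic::{AtomicUsize, Ordering};

// Demonstration only: counts errors that were dropped without being handled.
static UNHANDLED: AtomicUsize = AtomicUsize::new(0);

// Hypothetical error type that complains when dropped unhandled.
struct LoudError {
    message: &'static str,
    origin: &'static Location<'static>, // where the error was created
    handled: bool,
}

impl LoudError {
    // #[track_caller] makes Location::caller() report our caller's position.
    #[track_caller]
    fn new(message: &'static str) -> Self {
        LoudError { message, origin: Location::caller(), handled: false }
    }

    // Explicitly handling the error silences the drop-time message.
    fn handle(mut self) -> &'static str {
        self.handled = true;
        self.message
    }
}

impl Drop for LoudError {
    fn drop(&mut self) {
        if !self.handled {
            UNHANDLED.fetch_add(1, Ordering::Relaxed);
            // Kernel code would write to the kernel log here instead.
            eprintln!("unhandled error '{}' created at {}", self.message, self.origin);
        }
    }
}
```

Returning a LoudError to the caller moves the value rather than dropping it, so only an error that is discarded somewhere along the chain produces the message, and the message names the creation site rather than the drop site.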
For guidance, Behme suggested looking at what C does. In C, not all errors print log messages, but not all errors are silent either. It can be decided on a case-by-case basis. Kernel Rust code could potentially do the same, but then contributors lose some of the convenience of having a generic solution. He asked what the attendees would prefer.
Miguel Ojeda asked when they actually want errors to be printed — does it make sense to have a special debugging mode that prints more errors? Or to select individual functions for tracing? In either case, it would be better to enable that without requiring source-level changes for tracing.
Behme was skeptical about the idea of having a special debugging mode; he felt that developers often don't actually run with modes like that turned on, and that customers with a production image don't either. So adding a special mode would not really help.
Benno Lossin asked what information Behme wanted for debugging — did he want just the file and line number, or would he like to show a whole backtrace? Behme answered that obviously the best approach would be "an AI that tells me where the bug is". Failing that, more information is better — and just one line would be better than being completely silent. "I'll take what I can get".
Lossin mentioned that in one of his user-space projects he embeds a backtrace in his error type by using one of the internal details of the question-mark operator: when a function returns one type of error, but uses the question-mark operator on another, the operator tries to convert between the types using the From trait. If no such conversion is possible, it's a compile-time error.
So in Lossin's project, he has one error type for all of his code, and when some external library returns an error, the From implementation captures a backtrace and stores it. Behme thought that sounded potentially useful, and that having full backtraces available would be helpful.
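A rough user-space illustration of that technique follows. It is not Lossin's code: AppError and parse_count() are made-up names, and std::backtrace is a standard-library facility that kernel code does not have:

```rust
use std::backtrace::Backtrace;
use std::num::ParseIntError;

// Hypothetical project-wide error type that remembers where it was created.
#[derive(Debug)]
struct AppError {
    source: ParseIntError,
    backtrace: Backtrace,
}

// `?` calls From::from() when the error types differ, so this conversion
// is the hook where the backtrace gets captured.
impl From<ParseIntError> for AppError {
    fn from(source: ParseIntError) -> Self {
        // force_capture() records a trace regardless of RUST_BACKTRACE.
        AppError { source, backtrace: Backtrace::force_capture() }
    }
}

fn parse_count(text: &str) -> Result<u32, AppError> {
    let count = text.parse::<u32>()?; // ParseIntError -> AppError via From
    Ok(count)
}
```

The call site stays a plain question mark; all of the bookkeeping lives in the From implementation.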
Alice Ryhl wasn't sure that they should want errors in drivers to print to the console, however. She pointed out that some "error" conditions are actually fairly normal, such as temporary allocation failures. Logging about these whenever they happen could spam the kernel log. Behme responded that Ryhl's driver has about 200 places that use the question-mark operator — did she really want all of those to be silent?
Ryhl explained that some of them are not silent; she has explicit logging code where needed. But also, she said, the utility of the question-mark operator is that it makes it easy to bubble up errors to higher levels that can handle them with more context. So not every invocation of the question-mark operator really needs to log.
Paul McKenney agreed, saying that it was possible to have too much output, especially in embedded systems. He noted that kernel developers have spent 20 years adding and removing debug info from various places in the kernel in order to try to get the balance correct.
Behme noted that he would even be happy with the conclusion that Rust should just handle each case manually, like C, as long as they started forbidding the question-mark operator so that people actually have to think about it.
Boqun Feng thought that dynamic BPF tracepoints could help with debugging problems like this, so there would be no need to log anything until the user attaches a function for debugging.
Greg Kroah-Hartman disagreed, saying that whatever logging they decided on should be something that runs all of the time. It's often not possible to tell the customer to rerun something with debugging turned on. He had some thoughts of his own on the question, however: traditionally in the kernel, it's the function that creates the error that's responsible for logging it, he said. If there's a memory error, you don't print it at every level bubbling up.
Behme suggested that this would mean that the question-mark operator should only be used after checking that the function that creates the error does log it.
Carlos Bilbao was confused about how Behme was seeing errors not being reported, anyway — in Rust, the Result type, which represents either an error or a normal return value, has the #[must_use] attribute. So the compiler will emit a warning if the code does not do something with an error.
Lossin thought that the C approach might not make sense for Rust, because idiomatic Rust often has many small helper functions, and it could make sense to centralize error reporting more than that. He went on to suggest adding an extension to Result that can attach log messages to errors — like the popular user-space anyhow crate does:
    some_fallible_operation().log("if it fails, this is printed")?;
    // and the error is still bubbled up with ?
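Such an extension can be written as a trait implemented for Result. The sketch below is a user-space guess at the shape of the idea; the LogExt trait and read_setting() function are hypothetical, and no such API was shown in the session or is claimed to exist in the kernel crate:

```rust
use std::fmt::Debug;

// Hypothetical extension trait adding .log() to every Result.
trait LogExt: Sized {
    fn log(self, context: &str) -> Self;
}

impl<T, E: Debug> LogExt for Result<T, E> {
    fn log(self, context: &str) -> Self {
        if let Err(err) = &self {
            // Kernel code would write to the kernel log here instead.
            eprintln!("{context}: {err:?}");
        }
        self // hand the Result back unchanged so `?` still works
    }
}

fn read_setting(text: &str) -> Result<i64, std::num::ParseIntError> {
    let value = text.parse::<i64>().log("could not parse setting")?;
    Ok(value)
}
```

With this shape, logging is opt-in at each call site, so callers that expect routine failures can simply leave .log() out.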
Gary Guo pointed out that you actually want more information than just "there was an error", so you can't really get away from the need for custom error-printing code. Hindborg noted that different use cases warrant different amounts of logging.
The eventual conclusion was that the question-mark operator was not the problem per se, but rather a lack of standardized error handling and logging in kernel Rust. Additional libraries or simple documentation that helps address the issue could be useful. The balance between performance, ease of debugging, and error handling remains one that requires human judgement.
[ Thanks to the Linux Foundation, LWN's travel sponsor, for supporting our coverage of Kangrejos. ]