Warning about WARN_ON()

Posted Apr 18, 2024 21:11 UTC (Thu) by roc (subscriber, #30627)
Parent article: Warning about WARN_ON()

My experiences at Mozilla taught me that you want two kinds of assertions:
A) "Something is wrong and it is dangerous to proceed. Emit debugging information and terminate." (e.g. when you suspect memory corruption)
B) "Something is wrong but it's not important enough to terminate. Emit debugging information and continue." (e.g. when something is at the wrong position on the screen)

People often complained that severity-B warnings were too easy to ignore (e.g. while testing) and wanted to make them severity-A instead. They were wrong. You instantly realize that's a bad idea when you're trying to debug a severity-A problem and you can't because you keep hitting a different severity-B problem that is incorrectly being treated as severity-A. You do need telemetry and management discipline to make sure that severity-B problems are tabulated and not ignored.
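
To make the A/B distinction concrete, here is a minimal userspace sketch in C. It is not Mozilla's or the kernel's actual macros. The names `ASSERT_FATAL`, `ASSERT_SOFT`, `soft_assert_failures` and `place_widget` are all made up for illustration, and the counter is only a stand-in for real telemetry.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical counter standing in for real telemetry of severity-B failures. */
static unsigned long soft_assert_failures;

/* Severity A: emit debugging information and terminate. */
#define ASSERT_FATAL(cond)                                        \
    do {                                                          \
        if (!(cond)) {                                            \
            fprintf(stderr, "FATAL %s:%d: %s\n",                  \
                    __FILE__, __LINE__, #cond);                   \
            abort();                                              \
        }                                                         \
    } while (0)

/* Severity B helper: emit debugging information, count it, and continue.
 * Returns true when the check failed so the caller can recover. */
static bool soft_assert_failed(bool ok, const char *expr,
                               const char *file, int line)
{
    if (ok)
        return false;
    soft_assert_failures++;
    fprintf(stderr, "WARN %s:%d: %s\n", file, line, expr);
    return true;
}

#define ASSERT_SOFT(cond) \
    soft_assert_failed((cond), #cond, __FILE__, __LINE__)

/* Example caller: a bad screen width is treated as severity A,
 * a widget at the wrong position as severity B. */
static int place_widget(int x, int screen_width)
{
    ASSERT_FATAL(screen_width > 0);
    if (ASSERT_SOFT(x >= 0 && x < screen_width))
        x = 0; /* fall back to a safe position and keep running */
    return x;
}
```

Calling `place_widget(-3, 100)` logs a warning, bumps the counter and returns 0, so a debugging session chasing a severity-A failure is not interrupted. The counter is the part that needs the management discipline: somebody has to look at it.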

Warning about WARN_ON()

Posted Apr 20, 2024 4:31 UTC (Sat) by NYKevin (subscriber, #129325)

The problem is that the definition of "important" varies with the use case. I would say we need to consider at least three different situations:

1. A "pet" where all of the code running in userspace (regardless of UID) is either trusted or (adequately) sandboxed.
2. A "cattle" where all of the code running in userspace is either trusted or (adequately) sandboxed.
3. A "cattle" where some of the code running in userspace is not trusted and not (adequately) sandboxed.

In case (1), panicking is a big deal. It could cause serious data loss. We do not want to do it. In cases (2) and (3), there is other infrastructure that will deal with a panic. We will not lose (much) data, and whatever data loss does happen is "priced in" (i.e. we have accepted that we may lose it and planned accordingly). In cases (1) and (2), continuing in an invalid or unexpected case is somewhat likely to be safe (though obviously not certain). In case (3), we know that userspace is actively trying to take over the system at least some of the time, and continuing in an invalid or unexpected case would potentially make it easier for them to do that.

There is also:

4. A "pet" where some of the code running in userspace is not trusted and not (adequately) sandboxed.

This describes the average user's home computer. Unfortunately, in this case, we're screwed, because we have to choose between possibly losing data, and possibly letting an attacker take over the system.

In this thread, I think most people are taking the position that (4) is a lost cause (or at least, it requires finer-grained security than panic_on_warn can provide - see for example seccomp, SELinux, etc.), and therefore panic_on_warn should be calibrated to the expectations of (3) and maybe (2). That entails using WARN_ON liberally in contexts where a genuinely unexpected or invalid state arises, but conservatively or not at all in more mundane situations that may simply reflect a userspace-level misconfiguration. (4) can then be configured with panic_on_warn disabled, and other security measures can be applied to the extent practicable.
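
As a rough illustration of that calibration, here is a small userspace model in C. It is only a sketch of the reasoning, not kernel code: the enums and the functions `suggested_panic_on_warn` and `model_warn` are hypothetical, and treating case (2) as "enabled" is an assumption, since the position described only says "maybe".

```c
#include <stdbool.h>

/* The four situations, numbered as in the text. */
enum deployment {
    PET_TRUSTED = 1,      /* (1) pet, userspace trusted or sandboxed */
    CATTLE_TRUSTED = 2,   /* (2) cattle, userspace trusted or sandboxed */
    CATTLE_UNTRUSTED = 3, /* (3) cattle, some untrusted userspace */
    PET_UNTRUSTED = 4     /* (4) pet, some untrusted userspace */
};

enum warn_outcome { WARN_QUIET, WARN_CONTINUE, WARN_PANIC };

/* Hypothetical policy table: which deployments would enable panic_on_warn. */
static bool suggested_panic_on_warn(enum deployment d)
{
    switch (d) {
    case CATTLE_UNTRUSTED:
        return true;  /* a panic is priced in, continuing helps an attacker */
    case CATTLE_TRUSTED:
        return true;  /* assumption: the text only says "maybe" here */
    case PET_TRUSTED:
    case PET_UNTRUSTED:
        return false; /* a panic risks serious data loss */
    }
    return false;
}

/* Model of a warning site: nothing happens unless the unexpected
 * condition holds; then the flag decides between continuing and panicking. */
static enum warn_outcome model_warn(bool condition, bool panic_on_warn)
{
    if (!condition)
        return WARN_QUIET;
    return panic_on_warn ? WARN_PANIC : WARN_CONTINUE;
}
```

With this model, the same unexpected condition yields `WARN_PANIC` on a case (3) machine and `WARN_CONTINUE` on a case (1) or (4) machine. That is why it matters that warnings are reserved for genuinely invalid states and not for userspace misconfiguration.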