Warning about WARN_ON()

Posted Apr 18, 2024 21:58 UTC (Thu) by nevets (subscriber, #11875)
Parent article: Warning about WARN_ON()

I have a strict requirement in my code that all WARN_ON() calls are used only to show that a bug happened in the kernel. If it is hit, it's a bug (I agree that this should never be used for incorrect user space input).

Panic on WARN should never trigger, and if it does, then a fix to the kernel is urgently needed.

I truly believe that this use should be encouraged.

Warning about WARN_ON()

Posted Apr 18, 2024 22:46 UTC (Thu) by kees (subscriber, #27264) [Link] (5 responses)

This is exactly the intent of WARN*(). It should be used for expected-to-be-totally-unreachable conditions. Killing the entire system (e.g. BUG) reduces the likelihood that the problem will get caught/debugged/etc. But if userspace can reach it, something is very wrong. (And this is why many deployments set panic_on_warn: something impossible happened, so we can no longer trust the integrity of the system.)
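
As a rough illustration of the semantics described above, here is a user-space sketch in C. It is not kernel code: MODEL_WARN_ON, model_panic_on_warn, model_set_level and the counters are all invented names, and the "panic" is only counted rather than performed so the model can be exercised safely.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the panic_on_warn setting (chosen by the admin). */
static bool model_panic_on_warn = false;

/* Counters used instead of really halting, so the model stays testable. */
static int model_warn_count = 0;
static int model_panic_count = 0;

/* Report a "should never happen" condition; returns the condition itself. */
static bool model_warn(bool cond, const char *expr, const char *file, int line)
{
    if (cond) {
        model_warn_count++;
        fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
        if (model_panic_on_warn) {
            /* A real kernel would stop here; the model only records it. */
            model_panic_count++;
            fprintf(stderr, "model panic: panic_on_warn is set\n");
        }
    }
    return cond;
}

#define MODEL_WARN_ON(cond) model_warn(!!(cond), #cond, __FILE__, __LINE__)

/* Hypothetical handler: user_level comes from user space, the table does not. */
static int model_set_level(int user_level, const int *internal_table)
{
    /* Bad user input is an ordinary error, not a bug: no warning is emitted. */
    if (user_level < 0 || user_level > 3)
        return -EINVAL;

    /* Internal invariant: callers inside the "kernel" always pass a table. */
    if (MODEL_WARN_ON(internal_table == NULL))
        return -EIO;

    return internal_table[user_level];
}
```

In this model an out-of-range user_level never touches the warning path, while a NULL table (an internal bug) always warns and additionally counts a panic when model_panic_on_warn is set.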

Warning about WARN_ON()

Posted Apr 21, 2024 16:08 UTC (Sun) by marcH (subscriber, #57642) [Link] (4 responses)

> The recommendation to avoid panic_on_warn has been ignored

Of course it has been "ignored", because:

> we can no longer trust the integrity of the system

That's the simple reason why this advice has been ignored: because it's really bad!

"Something that should never happen, happened. Let's pretend nothing important happened and keep running this monolithic kernel, written in the programming language of memory corruptions, anyway". Unbelievable.

If some crazy people or unusual use cases want panic_on_warn to be OFF for $reasons then let them - but don't let them force that bad choice on everybody else by changing the semantics of existing APIs. That's even more unbelievable.

Warning about WARN_ON()

Posted Apr 21, 2024 16:21 UTC (Sun) by mb (subscriber, #50428) [Link] (3 responses)

Yes, I also have panic-on-{oops, warn, ...} and panic-reboot enabled on my server systems.

It makes a whole lot of sense to reboot when the system got into a state that was thought to be impossible.
Going on as if nothing happened is not an option for me.

I do understand why this behavior would not be ideal on desktop systems.
There I'd like to be notified and make the decision manually.

Suggesting pr_warn for "impossible" states is taking away this decision from administrators. Which is wrong. Kernel developers cannot decide what to do in these cases, because there's no single correct answer.

Warning about WARN_ON()

Posted Apr 21, 2024 18:01 UTC (Sun) by marcH (subscriber, #57642) [Link] (2 responses)

> Suggesting pr_warn for "impossible" states is taking away this decision from administrators. Which is wrong. Kernel developers cannot decide what to do in these cases, because there's no single correct answer.

This. The job of kernel _developers_ is to 1) provide APIs with a "scale" of error levels defined as best as possible (as simple a scale as possible, but not simpler) and 2) thoroughly enforce it through code reviews. The latter is not easy because error handling is hard to test and rarely ever tested, formally defining "levels" is also hard, and many developers don't care about errors. So it'll never be perfect. But it's not optional and must be done as best as possible.

Then, what to do with these error levels is a POLICY decision that belongs to the _administrator_ and specific use cases. BTW the _same_ administrator can be dealing with errors differently depending on the use cases.

"Mechanism, not policy". Default values matter and sometimes they must be adjusted (slowly and carefully) but developers should never remove some knobs or change their semantics only because of the perception that "some" administrators don't understand them. First, developers can't possibly understand all use cases. Then what about all the administrators who have been using the knobs as intended the whole time? Why should they suffer?
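
To make the split concrete, here is a small user-space C sketch. Everything in it is hypothetical (the level names, the actions, the default table); it only shows the shape of the idea: developers report a level, and a separate table that the administrator can change decides the reaction.

```c
#include <stdio.h>

/* Hypothetical severity scale, defined by developers (the mechanism). */
enum model_level { MODEL_INFO, MODEL_WARN, MODEL_BUG, MODEL_LEVEL_COUNT };

/* Possible reactions, chosen by the administrator (the policy). */
enum model_action { MODEL_LOG_ONLY, MODEL_LOG_AND_REBOOT };

/* Assumed default policy for this sketch: reboot only on the top level. */
static enum model_action model_policy[MODEL_LEVEL_COUNT] = {
    [MODEL_INFO] = MODEL_LOG_ONLY,
    [MODEL_WARN] = MODEL_LOG_ONLY,
    [MODEL_BUG]  = MODEL_LOG_AND_REBOOT,
};

/* The administrator's knob: change the reaction for a single level. */
static void model_set_policy(enum model_level level, enum model_action action)
{
    model_policy[level] = action;
}

/* Called by developers: logs the level and looks up, but never picks, the reaction. */
static enum model_action model_report(enum model_level level, const char *msg)
{
    fprintf(stderr, "level %d: %s\n", (int)level, msg);
    return model_policy[level];
}
```

A server administrator might call model_set_policy(MODEL_WARN, MODEL_LOG_AND_REBOOT) while a desktop user leaves the default alone; the reporting code is identical in both cases.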

Warning about WARN_ON()

Posted Apr 24, 2024 2:02 UTC (Wed) by jwarnica (subscriber, #27492) [Link] (1 responses)

> "Mechanism, not policy". Default values matter and sometimes they must be adjusted (slowly and carefully) but developers should never remove some knobs or change their semantics only because of the perception that "some" administrators don't understand them. First, developers can't possibly understand all use cases. Then what about all the administrators who have been using the knobs as intended the whole time? Why should they suffer?

Open source allows people to get into things deeper than credible levels of safety.

It is absolutely true that "developers can't possibly understand all use cases", which suggests their offer of software that works should be credible. Not in any legal sense of "warranty for a particular purpose" but some level of credible thinking, analysis, testing. If you've tested 13 scenarios, say you've tested 13 scenarios, not 255 scenarios since that is the next smallest data type.

If the downstream user wants to open up the code and test more, fine. They can do that. But that wasn't what the upstream provided and implied they tested; the user then is on the hook for what happens.

Warning about WARN_ON()

Posted Apr 24, 2024 18:23 UTC (Wed) by marcH (subscriber, #57642) [Link]

> But that wasn't what the upstream provided and implied they tested; the user then is on the hook for what happens.

The number of clones, branches, commits/versions and the Kconfig combinatorial explosion are so huge that "tested upstream" does not really mean anything. At best you could have a metric measuring some distance from a git tag from Linus and the .config he uses, but would that be very useful?

The Linux kernel is not a product, it's a very large set of legos.

"Upstream" is not a location, it's a direction. It's in the name.