Suppressing SIGBUS signals

Posted Jun 28, 2021 8:35 UTC (Mon) by dullfire (guest, #111432)
In reply to: Suppressing SIGBUS signals by matthias
Parent article: Suppressing SIGBUS signals

> This was always the case. If the client changes the contents of the buffer to garbage, the compositor will not notice and the window will display garbage. Nothing new here. The change is that the client truncating the buffer now will look exactly the same as the client clearing the buffer at an inappropriate time.

You appear to be conflating two very different issues here. It's true the compositor has never really been able to tell if the client has been displaying garbage. However, the source of SIGBUS is essentially a protocol error. The client has promised the compositor that there was a buffer here, and instead there was a hand grenade. SIGBUS always allowed the compositor to detect and handle such protocol violations. The proposed changes remove that ability (for the case of compositors that use it, so I guess it being "opt-in" is better than being "opt-out").

> Probably the program whose window displays garbage.

Except the compositor doesn't know there's even a problem, so it can't say "hey, it's connection XXX", or "it registered with string YYY". Not all programs have server-side decorations (and IIRC the Wayland standard is client-side decoration). So if the user wasn't looking at that window, or doesn't remember which one it is (or never knew, in the case of dialogs and a few other things), then it becomes difficult to figure out. If the window is on a "task list"-like WM UI element, that might help, but not all windows are placed there.

It's probably not impossible, but it surely wouldn't be easy.

To sum it up (again): The approach here appears to be "make it work 'safely' by sweeping all the problems under the rug". I don't think that's a good long term solution.


Suppressing SIGBUS signals

Posted Jun 28, 2021 13:18 UTC (Mon) by kleptog (subscriber, #1183) [Link] (4 responses)

Isn't the real problem here the use of a signal? The compositor is just doing a memory lookup; at assembly level there is no such thing as a "failed read". The kernel can generate a signal but it can't fail the original instruction, the MOV has to return something. The compositor would have to do a siglongjmp() because otherwise it's going to generate a SIGBUS for every single instruction accessing the missing pages. And siglongjmp() is pretty tricky at the best of times.

I can understand that developers would prefer it if the kernel just returned zeros and the compositor rechecked the size of the file after the copy completes. It's just fewer moving parts that way.

Someone else here noted userfaultfd() is doing something similar, so maybe there's an answer there. The compositor thread would be blocked and the handler could remap the pages to avoid the SIGBUS retriggering.

Suppressing SIGBUS signals

Posted Jun 28, 2021 14:52 UTC (Mon) by dullfire (guest, #111432) [Link] (2 responses)

> Isn't the real problem here the use of a signal? The compositor is just doing a memory lookup; at assembly level there is no such thing as a "failed read".

As I said in my original post: I'm pretty sure the compositor's SIGBUS handler could mmap over the faulting region, and then set a flag that the given client is misbehaving.

When the compositor returns from the signal handler, it would finish its turn function/call stack (using the newly mmapped region... possibly zero filled, possibly full of kitten pictures), eventually get high enough to notice that that connection had been marked as bad, then either discard the processing it had done, or perhaps show one frame with a garbage window.

That way you resume from the error AND get notification of the buggy/malicious client. And you can have meaningful error messages.
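A minimal sketch of the handler described above, under several assumptions: it is Linux-specific, it relies on mmap() being usable from a signal handler (which POSIX does not promise), and it uses one global flag where a real compositor would look up which client's buffer contains the faulting address. All function and variable names are hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical state: a real compositor would map si_addr back to the
 * owning client instead of setting one global flag. */
static volatile sig_atomic_t client_misbehaved;
static long page_size;

static void patch_fault(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t page;
    void *p;

    (void)sig;
    (void)ctx;
    /* Round the faulting address down to its page boundary. */
    page = (uintptr_t)info->si_addr & ~((uintptr_t)page_size - 1);

    /* Replace the dead page with a zero-filled anonymous one.
     * ASSUMPTION: the fault lies inside a client buffer; blindly using
     * MAP_FIXED on any other address would be unsafe. */
    p = mmap((void *)page, (size_t)page_size, PROT_READ,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED)
        _exit(1);

    client_misbehaved = 1;   /* examined later, outside the handler */
}

/* Installs the handler; returns 0 on success, -1 on failure. */
int install_sigbus_patcher(void)
{
    struct sigaction sa;

    page_size = sysconf(_SC_PAGESIZE);
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = patch_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

int client_was_flagged(void)
{
    return client_misbehaved != 0;
}
```

When the handler returns, the faulting load is retried against the fresh page and reads zeros, and the code further up the call stack can consult client_was_flagged() to decide whether to drop the frame or disconnect the client.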

Suppressing SIGBUS signals

Posted Jun 28, 2021 17:29 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (1 responses)

> As I said in my original post: I'm pretty sure the compositor's SIGBUS handler could mmap over the faulting region, and then set a flag that the given client is misbehaving.

signal-safety(7) does not list mmap as async-signal-safe, at least on my system. But that just means that POSIX doesn't require it to be safe. I suppose it's possible that Linux does implement an async-signal-safe mmap as an extension?

(Unfortunately, we can't just use the usual trick of "set a flag and return, then do the real work from the main event loop" because we need to fix the memory problem *before* we return from the SIGBUS handler, or else we'd need to pause the offending thread, call mmap from a different thread, etc.)

Suppressing SIGBUS signals

Posted Jun 28, 2021 18:54 UTC (Mon) by dullfire (guest, #111432) [Link]

I would not imagine "mmap" could ever be fully async-signal-safe (after all, you are screwing with the page table). However, for the very narrow case of replacing a no-longer-valid mapping, it should be fine.

Seeing as mmap is typically a thin wrapper around a syscall (AFAIK there is no currently known way for this to not be a kernel task) most of this must naturally be done in kernel, which negates most of the issues.

However, since you brought it up, I'm assuming you are implying that being portable/standards-conformant is important. So for that we would need a change of standards. But that's probably more of a 'paper' change (while it's possible that some POSIX.1-2008-compliant implementations implement mmap(2) in a way that would be unsafe for this proposed usage, I kind of doubt it).

A Linux-only change that just papers over the issue seems like a poor solution.

Suppressing SIGBUS signals

Posted Jun 28, 2021 14:53 UTC (Mon) by excors (subscriber, #95769) [Link]

> The compositor is just doing a memory lookup, at assembly level there is no such thing as a "failed read". The kernel can generate a signal but it can't fail the original instruction, the MOV has to return something.

I don't think that's really true. At the assembly level, the outcome of the MOV instruction is that it either loads a value into the register *or* triggers an exception. E.g. if you look in the ARMv8-A Architecture Reference Manual, it explicitly defines the LDR instruction in terms of the "AArch64.MemSingle" operation which can call the "AArch64.TakeException" operation, which sets up the exception state then calls "EndOfInstruction" to stop any further processing of the LDR instruction, so it won't write anything to the destination register.

Once the exception is triggered, the kernel is free to do whatever it wants - update the page tables then jump back to the MOV instruction to retry it, manually update the register state then jump back to the instruction after the MOV, call a signal handler, etc. Anyone writing user-space assembly code has to be aware of that, e.g. there are often ABI rules about stack pointers that are specifically there to allow the kernel to interrupt your thread at any point and run a signal handler on its current stack. So that's not something you can safely ignore when working at assembly level.

Suppressing SIGBUS signals

Posted Jun 29, 2021 10:34 UTC (Tue) by matthias (subscriber, #94967) [Link]

> You appear to be conflating two very different issues here. It's true the compositor has never really been able to tell if the client has been displaying garbage. However, the source of SIGBUS is essentially a protocol error. The client has promised the compositor that there was a buffer here, and instead there was a hand grenade. SIGBUS always allowed the compositor to detect and handle such protocol violations.

The client has promised that there is a buffer that does not change while the compositor is using it. Any change would be a protocol error. Only in the very special case that the change is truncate, this protocol error could be detected by the compositor. With the proposed change of semantics, this special case just looks like other cases where the client changes the buffer at the wrong time.

> [Finding out which program displays garbage] is probably not impossible, but it surely wouldn't be easy.

For debugging purposes there should be a possibility to ask the compositor to which program some window belongs. After all, you also want to debug cases where a program just opens a window with garbage in it. Probably a much more common case than a program calling truncate on the buffer.