Julia

Posted Jul 27, 2024 15:36 UTC (Sat) by khim (subscriber, #9252)
In reply to: Julia by malmedal
Parent article: May the FOLL_FORCE not be with you

> We certainly used to be able to do that. It gave a nice speedup on the 386 when you could save a register by having a mov #immediate, register and modify the immediate as needed.

The 386 doesn't even have an “executable” bit in its page tables. Were you using segments? I guess this may work with segments, since they cache permissions in CPU registers. It still sounds very tricky and fragile to me.

Are you even talking about changing it from writable to executable using mprotect (and in an SMP environment) when you say “we used to be able to do that”?

> The old rule was you needed a jump or taken branch instruction to be sure the pipeline was clear after you modified executable code, later you could do any "serialising" instruction, like cpuid, instead.

I think we are talking past each other, again. I'm talking about the code that you are planning to patch being executed by some other CPU core. Were these 386 systems that you are talking about even SMP ones? On a UP system things are much, much, MUCH simpler. But on SMP systems, when you are patching code that another CPU may be executing at this precise moment, you need atomicity guarantees or a complicated and convoluted scheme that ensures that the code you are planning to patch is not currently executing.

> Reference for this?

Look for “Asynchronous modification” under “Cross-Modifying Code” in the AMD manual. Intel provides more or less the same guarantees, but I don't remember which section of its manual describes that.


Self-modifying code

Posted Jul 27, 2024 17:03 UTC (Sat) by malmedal (subscriber, #56172)

Thank you.

So, reading from page 206, you are talking about asynchronous modification. The 8 bytes is not because of the 486; it is because 8 bytes is 64 bits, the size that gets atomically updated. Also, the 64 bits must be aligned.

It is basically warning about the situation where the instruction pointer is in the middle of a 64-bit quadword when that quadword gets updated: if the instruction boundary changes so that the IP is no longer at the start of an intended instruction, you have a problem.
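As a rough illustration of the alignment condition, here is a sketch of a check that a patch of `len` bytes at `addr` stays inside one naturally aligned 8-byte quadword. The helper name is made up for this example; it is not taken from the manual.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if the byte range [addr, addr + len) lies
 * entirely inside one naturally aligned 8-byte quadword. */
static bool fits_in_aligned_qword(uintptr_t addr, size_t len)
{
    if (len == 0 || len > 8)
        return false;
    /* The first and the last byte must share the same 8-byte-aligned base. */
    return (addr & ~(uintptr_t)7) == ((addr + len - 1) & ~(uintptr_t)7);
}
```

A 5-byte patch at 0x1003 passes this check (it ends at 0x1007), while the same patch at 0x1004 does not, because its last byte lands in the next quadword.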

They are recommending a sort of RCU-like approach to avoid this:

> Note that since stores to the instruction stream are observed by the instruction fetcher in program order, one can do multiple modifications to an area of the target thread's code that is beyond reach of the thread's current control flow, followed by a final asynchronous update that alters the control flow to expose the modified code to fetching and execution.

Reading a bit further, on synchronous modification, where the target thread is waiting while the other thread is writing, the rules are the same as before: you can make whatever changes you want, but the target thread must execute a serialising instruction.

I don't remember if the 386 had an executable-only mode, but we certainly had writable memory that could be executed.

It's not a fragile thing to do; it is even supported by ld, see the -N option. It seems that this interferes with shared libraries, so if you want those you need to use mprotect.

I believe it fell out of favour because the performance advantage became much smaller when the 486 came with on-chip cache.

Self-modifying code

Posted Jul 27, 2024 20:21 UTC (Sat) by khim (subscriber, #9252)

> I don't remember if the 386 had an executable-only mode, but we certainly had writable memory that could be executed.

The main issue that we are discussing here revolves around the NX bit, which allows one to mark memory as non-executable!

On the 386 the only way to make code non-executable was to play with segments and their limits. On a Unix-like OS the best you can do is split the 4GB of virtual address space in two: a non-executable area and an executable area.

That means that the approach that skissane talks about is simply not possible on the 386! Except if you use an extremely weird OS which doesn't use paging, but uses segments for virtual memory.

Such OSes may exist, in theory, but I certainly know of none that actually did this in practice; that's why I became so excited when you said you did that on the 386.

But it looks more and more likely that you haven't done what we are talking about here at all and are talking about an entirely different situation.

> They are recommending a sort of RCU like approach to avoid this:

> Note that since stores to the instruction stream are observed by the instruction fetcher in program order, one can do multiple modifications to an area of the target thread's code that is beyond reach of the thread's current control flow, followed by a final asynchronous update that alters the control flow to expose the modified code to fetching and execution.

That just happens with JITs automatically: once you have created an optimized version of a routine there is rarely a need to go back to the interpreter. But yeah, usually only one call/jmp instruction is patched.

> It's not a fragile thing to do, it is even supported by ld, see the -N option. Seems that interferes with shared libraries, so if you want that you need to use mprotect.

Again: that's different. Keeping something in write+execute mode is dangerous WRT exploits, but not fragile. Playing with permissions and flipping from read+write to read+execute and back, however, is pretty fragile, because you need to ensure that the code you want to patch is not being executed on another core!
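For reference, the permission flip itself can be sketched as below, assuming Linux and an anonymous mapping. The helper name `wx_flip_demo` is hypothetical, and the copied bytes are never executed; the sketch only shows the read+write to read+execute transition.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo helper: copy `len` bytes of generated code into a fresh
 * page while it is read+write, then flip the page to read+execute.
 * Returns 0 on success, -1 on any failure. The code is never executed. */
static int wx_flip_demo(const unsigned char *code, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || len > (size_t)page)
        return -1;

    /* Phase 1: writable, not executable. */
    unsigned char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    memcpy(p, code, len);

    /* Phase 2: executable, no longer writable. */
    int rc = mprotect(p, (size_t)page, PROT_READ | PROT_EXEC);
    if (rc == 0 && memcmp(p, code, len) != 0)
        rc = -1;

    munmap(p, (size_t)page);
    return rc;
}
```

Patching that page later would mean flipping it back to read+write first, and on hardware that enforces NX any other thread still executing in the page during that window would fault.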

> I believe it fell out of favour because the performance advantage became much less when the 486 came with on-chip cache.

No, it fell out of favor much later, when people started caring about security and started enforcing the W^X property.

First with segment limit tricks and then, later, with hardware NX bit.

Only Apple, and only on iOS, enforces it so radically as to make JITs simply impossible; other OSes provide ways for JITs to work, which is what we are discussing here.

But all this discussion is happening in a W^X world!

Why do you keep bringing up W+X examples and keep saying that you can do everything easily if only you remove that restriction… of course it's possible to do, what could be simpler?

That's simply not what we are discussing here! The idea is to ensure that W^X is strictly enforced, maybe even teach kernel not to ever provide W+X mappings at all — and yet still keep JITs working, somehow.

Self-modifying code

Posted Jul 27, 2024 21:52 UTC (Sat) by malmedal (subscriber, #56172)

I was responding to your earlier statement:

> On x86 CPUs there are special guarantee that it can be done safely if modified part fits fully into 8bytes segment (it probably goes back to 80486 CPUs because I have no idea how to explain that 8bytes limitation)

Anyway, for your current statement, just have the kernel do the tricky bit.

Have the app ask for a writable alloc, fill it with code.

Then the app tells the kernel make this executable and the safe interrupt point is xxx.

Then when the time comes to replace the running code the app tells the kernel to write a processor-specific safe stop sequence to the interrupt point.

On x86 this would likely be a sequence of eight int 3 instructions.

This will stop the thread and the app can allocate a new writable alloc and tell the kernel to make it executable and have the thread continue there.

Self-modifying code

Posted Jul 27, 2024 22:56 UTC (Sat) by khim (subscriber, #9252)

> I was responding to your earlier statement:

Said statement was just a side comment to explain what JITs are doing, why, and how what they are doing is guaranteed to work. It was meant to show that the need of JITs to alter code while it's running was acute enough and understood well enough that even hardware makers have already created special JIT-tailored guarantees for it.

> Have the app ask for a writable alloc, fill it with code.

That's possible.

> Then the app tells the kernel make this executable and the safe interrupt point is xxx.

How exactly do you plan to do that? Would the kernel take 10-20 bytes of generated code and allocate a whole 4KB (or, worse, 16KB if we are talking about modern ARM) page to make it executable?

This sounds pretty wasteful.

> Then when the time comes to replace the running code the app tells the kernel to write a processor-specific safe stop sequence to the interrupt point.

So now you want to introduce something like a stop-the-world interrupt in a place where previously everything was completely lock-free?

Also: a JIT doesn't actually replace running code, it replaces a branch target in the running code, using a well-documented and guaranteed approach explicitly described in the CPU manual.

Why do you want to change that?

> on x86 this would likely be a sequence of eight int 3 instructions.

Why eight and what would it give us? Except more complications and more places to have bugs?

> This will stop the thread and the app can allocate a new writable alloc and tell the kernel to make it executable and have the thread continue there.

But the app doesn't need that! The app simply wants to replace the target of a jump, without all that complicated and useless machinery! Previously there was call COMPILE_ME_FOO and now there would be call FOO_JIT_COMPILED_AND_READY_TO_USE. That's all!
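Concretely, retargeting such a call comes down to rewriting the 4-byte displacement of a 5-byte near call (opcode 0xE8), which is relative to the address of the next instruction. A minimal sketch of that arithmetic follows; the function name `call_rel32` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: displacement to store in a 5-byte x86 near call
 * (0xE8 followed by rel32) located at insn_addr so that it reaches target.
 * The displacement is measured from the end of the call instruction. */
static int32_t call_rel32(uint64_t insn_addr, uint64_t target)
{
    int64_t disp = (int64_t)(target - (insn_addr + 5));
    /* A near call can only reach targets within a signed 32-bit range. */
    assert(disp >= INT32_MIN && disp <= INT32_MAX);
    return (int32_t)disp;
}
```

For the asynchronous-modification guarantee to apply, the four displacement bytes at `insn_addr + 1` would presumably also have to sit inside one aligned 8-byte quadword.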

It's useless because eight int 3 instructions don't guarantee anything (x86 includes instructions longer than eight bytes, and with redundant prefixes you can make almost any instruction longer), and it's useless because a write via /proc/self/mem (or the use of two mappings) already does everything that's needed!

Why add an API that would be more convoluted and slower, yet not actually safer than the existing API?

It doesn't really make much sense! You gave us an elaborate solution to some unknown problem, but neglected to say what problem that solution is supposed to solve!

It's very hard to understand whether a proposal is good or bad if we have no idea what that proposal is even supposed to achieve.

As in: what was that dance with eight int 3 instructions and an additional syscall supposed to accomplish? What would it do better than a write to /proc/self/mem or two separate mappings (one writable, one executable) for the same chunk of memory?
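The two-mappings variant can be sketched roughly as follows, assuming Linux with memfd_create available (glibc 2.27 or later). The struct and function names are invented for this example, and nothing is executed; it only shows that a store through the writable view becomes visible through the executable view of the same memory.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical pair of views onto the same physical pages. */
struct dual_map {
    void  *w;   /* read+write view, never executable */
    void  *x;   /* read+execute view, never writable */
    size_t len;
};

/* Create an anonymous in-memory file and map it twice with different
 * permissions. Returns 0 on success, -1 on failure. */
static int dual_map_create(struct dual_map *m, size_t len)
{
    int fd = memfd_create("jit-demo", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    m->w = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    m->x = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
    close(fd); /* the mappings keep the memory alive */

    if (m->w == MAP_FAILED || m->x == MAP_FAILED) {
        if (m->w != MAP_FAILED) munmap(m->w, len);
        if (m->x != MAP_FAILED) munmap(m->x, len);
        return -1;
    }
    m->len = len;
    return 0;
}
```

With this layout no single mapping is ever both writable and executable, and the executable view never has to change its permissions while other threads run code from it.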

Self-modifying code

Posted Jul 28, 2024 0:38 UTC (Sun) by malmedal (subscriber, #56172)

It's an answer to your previous request:

> maybe even teach kernel not to ever provide W+X mappings at all

If you don't want the kernel to provide userspace with a W+X mapping you can have the kernel keep the mapping to itself, but if userspace then can ask the kernel to do arbitrary changes, or have separate W and X mappings, you haven't gained that much, so you want to limit what the kernel will do to the simplest thing that will work. The point is to stop the old code at a safe location, on some CPUs you can set address-based breakpoints instead.

Self-modifying code

Posted Jul 28, 2024 8:40 UTC (Sun) by khim (subscriber, #9252)

> If you don't want the kernel to provide userspace with a W+X mapping you can have the kernel keep the mapping to itself, but if userspace then can ask the kernel to do arbitrary changes, or have separate W and X mappings, you haven't gained that much

Yes, you have. You have made life harder for attackers, as explained in the Wikipedia article. And that article even includes a section about JITs, too!

Security and usability are always at odds. There exists a 100% bullet-proof way to stop any attack, local or remote: just turn the computer off and all kinds of attacks are prevented! But this “protection” is not very usable, thus we need something else.

> you want to limit what the kernel will do to the simplest thing that will work.

Yes, but now we need to determine what that “work” even is!

> The point is to stop the old code at a safe location

Do you actually read what I write? Just where have I written that a JIT wants or needs that? It has no such need. On the contrary, what a JIT needs is described precisely under the title “Asynchronous modification” under “Cross-Modifying Code” in the AMD manual: “the nature of the code being executed by the target thread is such that it is insensitive to the exact timing of the update.”

A JIT (or, heck, a dynamic loader that resolves symbols lazily) initially inserts a jump to the “slow path” because the “fast path” doesn't exist; later, when the “fast path” does exist, the jump is replaced. That's all. The JIT doesn't care if the “slow path” is used a few times after the “fast path” is created; it just wants “eventual consistency”, where programs stop using the “slow path” after a few milliseconds.

And, as I have shown you with references to the AMD manual, CPUs actually offer enough relevant guarantees at the hardware level; the description is so precisely tailored to the needs of JITs that they could just as well call it “JIT-friendly code modification” rather than “asynchronous modification”.

And yet you repeatedly invent complications that make things more problematic and, AFAICS, don't achieve anything security-wise. Why? What's the point?
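The same “eventual consistency” shape can be shown in portable C, with an atomically updated function pointer standing in for the patched jump. This is only an analogy: a real JIT rewrites instruction bytes rather than a pointer, and all names here are invented for the sketch.

```c
#include <stdatomic.h>

typedef int (*foo_fn)(int);

static atomic_int slow_calls;   /* counts how often the slow path ran */

/* Stand-in for the interpreter or the "compile me" stub. */
static int foo_slow(int x)
{
    atomic_fetch_add(&slow_calls, 1);
    return x * 2;
}

/* Stand-in for the JIT-compiled routine; same result, no bookkeeping. */
static int foo_fast(int x)
{
    return x * 2;
}

/* The "jump target": starts out pointing at the slow path. */
static _Atomic(foo_fn) foo_entry = foo_slow;

/* Callers never block; they use whichever target they happen to observe. */
static int call_foo(int x)
{
    foo_fn f = atomic_load_explicit(&foo_entry, memory_order_acquire);
    return f(x);
}

/* One atomic store publishes the fast path. Threads that loaded the old
 * target just before the store finish on the slow path, which is fine. */
static void publish_fast_path(void)
{
    atomic_store_explicit(&foo_entry, foo_fast, memory_order_release);
}
```

Nothing is stopped or interrupted: callers that race with `publish_fast_path()` may run the slow path one more time, and every later call goes through the fast path.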

Self-modifying code

Posted Jul 28, 2024 9:31 UTC (Sun) by malmedal (subscriber, #56172)

> Why? What's the point?

You asked for this earlier:

>> maybe even teach kernel not to ever provide W+X mappings at all

This is an answer, it has a cost, an expensive one, but it is an answer.

Self-modifying code

Posted Jul 28, 2024 10:00 UTC (Sun) by khim (subscriber, #9252)

> This is an answer, it has a cost, an expensive one, but it is an answer.

Sure, but as was mentioned at the very beginning, seven years ago there were already two other methods (three if you include self-ptrace).

You are proposing third (or fourth) one without explaining why it's better than what we already have.

This looks like “we have to do something”, “this is something”, “let's do it!” logic.

Such logic rarely produces good designs.

Self-modifying code

Posted Jul 28, 2024 10:57 UTC (Sun) by malmedal (subscriber, #56172)

> You are proposing third (or fourth) one without explaining why it's better than what we already have.

In discussions like this I believe it is useful to go through many possible options and evaluate their pros and cons without being particularly wedded to any one of them. Since it's an option that meets the specific constraint that you, yourself, chose to highlight, I felt that it should be included in the discussion. This approach appears to be incompatible with your style of arguing, so I'll just be ignoring you from now on.

Self-modifying code

Posted Jul 28, 2024 12:05 UTC (Sun) by khim (subscriber, #9252)

> In discussions like this I believe it is useful to go through many possible options and evaluate their pros and cons without being particularly wedded to any one of them.

Why? What has that approach brought you? What have you achieved by doing it?

I find that very strange. Things that we already have, things that exist, by definition have priority. They are already here, they are done, and that's enough. But any change from the status quo needs a justification.

Sure, I like to go “down memory lane” and see why the things that we have are the way they are, because the situation of today is different from the situation of yesterday.

But no matter what, even if the thing that made us pick the original decision is no longer valid, or even if the original decision was made on a whim without any rational justification… things that we have are very, very different from things that we don't have.

> Since it's an option that meets the specific constraint that you, yourself, chose to highlight, I felt that it should be included in the discussion.

One may invent a bazillion crazy schemes if not constrained by anything. Talking about them would take forever unless we limit these discussions somehow.

“Anything new should come with an extra justification that explains why we should do it if some other solution already exists” is a very good rule if we are talking about something that we plan to implement. I don't know anyone who achieved anything significant while violating it (but note the subtle difference: if we don't yet have a solution at all, then someone who doesn't “know” that “X is simply impossible” may achieve something really cool… but when said X is not just possible in theory but we already know how to do X in practice, then the situation changes).

Well, maybe fiction writers would be an exception, but even they, when they construct their strange imaginary worlds, still play on that contrast between what “we” have and what “they” have. “Does it exist?” is still very much a central question that governs their decisions, even if they imagine a world where something that we already have doesn't exist and where the evolution of civilization, as a consequence, goes in a different direction.

> This approach appears to be incompatible with your style of arguing, so I'll just be ignoring you from now on.

Fine by me. I don't like to waste time on pointless discussions without any practical consequences (even if the consequence is minor, like “now that I have written that, I may just refer people here instead of repeating my arguments again and again”), while you seem to regard these as the only ones worth pursuing.