Protecting control dependencies with volatile_if()
Protecting control dependencies with volatile_if()
Posted Jun 21, 2021 16:42 UTC (Mon) by farnz (subscriber, #17727)In reply to: Protecting control dependencies with volatile_if() by itsmycpu
Parent article: Protecting control dependencies with volatile_if()
Because, on the very weakest implementations of (e.g.) AArch64, load acquire is more expensive than load followed by a dependent branch, but the load followed by a dependent branch is a sufficient barrier for the purposes at hand.
Consider in terms of the following pseudo-code; thread 0 is writing to two memory locations, and before the two threads run on separate CPU cores, the memory at adr1 and adr2 is set to 0.
# Thread 0 MOV R0, #42 MOV R1, #99 STR [addr3], R1 STR [addr2], R1 STR_REL [addr1], R0 # Thread 1 MOV R5, 1 MOV R6, 1 LDR R4, [addr1] CMP R4, #42 BEQ load JMP end .load LDR R5, [addr2] .end LDR R6, [addr3]
The question we're trying to answer is "what values are in R5 and R6 of thread 1 at the label end?".
On an Alpha, there are 3 possible answers: if R4 is 42, then R5 and R6 must both be either 0 or 99, depending on whether or not the LDR R5 and LDR R6 observed the store to addr2 or not. If R4 is not 42, then R5 and R6 must be 1. Alpha is allowed to re-order the loads in thread 1 freely because they are independent of each other, and thus all of (R5 = 0, R6 = 0), (R5 = 0, R6 = 99), (R5 = 99, R6 = 0) and (R5 = 99, R6 = 99) are possible.
On x86, the answer is either R5 = R6 = 1 or R5 = R6 = 99; once the load of R4 has taken place, thread 1 is guaranteed to see the stores to both addr1 and addr2, and thus the loads to R5 and R6 must see the store to addr2. Thus, if R4 is 42, R5 is guaranteed to be 99, while if R4 is not 42, R5 is guaranteed to be 1. Same reasoning applies to R6 as to R5.
ARM's memory ordering is mostly weak, like Alpha's, but it adds in one more wrinkle; dependency chains act as ordering. Because of this, R5 is again either 1 or 99; however, this happens for slightly different reasons to x86. On ARM, the STR_REL guarantees that a load that observes the store to addr1 will also be able to observe the store to addr2; because the load of R5 has a dependency on the value of R4, the ARM memory ordering model guarantees that the load of R5 will take place after the load of R4, and hence will also observe the store to addr1. However, there's no such dependency in R6's control flow, and thus R6 is allowed to be either 0 or 99 if R4 is 42, because the CPU can reorder the load to happen before the conditional.
If the load to R4 was a load-acquire, then all three processors would observe the same memory contents. However, both ARM and Alpha would be denied reordering opportunities by the load-acquire that they have without it, and hence could be slower.