Using LOCK CMPXCHG or even plain CMPXCHG does not make sense unless it is done i...

amluto · 2026-04-06T13:58:51 1775483931

You are misunderstanding me, which is perhaps understandable, since I’m talking about the minutiae of x86, not locking in general.

When unlocking a futex-backed mutex, one needs to do two things. First, one needs to actually unlock it: this is a store-release in modern lingo, and on x86 almost any store instruction has the correct ordering semantics. Second, one needs to determine whether to call futex_wake, which is conceptually just reading a flag “is someone waiting” and then branching on the result. The problem is that the load needs to be ordered after (or at least not before) the store.

x86 provides two main ways to do this, MFENCE and LOCK. For whatever reason, at least Intel has tried pretty hard to optimize LOCK, and it’s often the case that LOCKed operations on a hot cache line is faster than MFENCE. (I have benchmarked this, and Linux uses this trick.)

My point is that the specific algorithm of unlocking a futex-backed mutex does not require the full ordering semantics of MFENCE or LOCK. And my secondary observation is that x86 has some non-LOCKed RMW instructions, one of which is plain CMPXCHG. Unlocked CMPXCHG is much faster than LOCK anything or MFENCE — I’ve benchmarked it. There are also the flag outputs from operations like ADD. And I’m speculating that maybe some of these instructions are secretly actually ordered strongly enough for futex unlock.