The structs proposal will be the first shared memory feature that's convenient for wide use. As such, we should make sure we don't shoot ourselves in the foot, particularly with performance and complexity.
Building APIs on top of structs
Consider an example API built on top of structs, written by users, a third party, or us. Let's say we build an abstraction for a queue of elements shared across threads/agents. Any agent can insert elements and retrieve elements inserted by itself or by other agents. The queue q provides two methods:
q.Push(x) inserts an element x into the queue. x must be a primitive or a shared struct.
q.Pop() removes an element from the queue and returns it, or returns null if the queue is empty. For the purposes of this example, which element it retrieves is not relevant.
If thread A does q.Push(42), thread B does x = q.Pop(), and x turns out to be 42, then things are simple.
But consider what happens if thread A wants to pass a shared struct with some data through the queue.
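Concretely, say thread A builds a shared Point and pushes it. The sketch below uses the structs proposal's construction syntax and the field values 4 and 8 referenced later in this issue:

Thread A:

let p = new Point(); // Point is a shared struct with fields x and y
p.x = 4;
p.y = 8;
q.Push(p);

Thread B: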
let a = q.Pop();
if (a != null) {
  let x = a.x;
  let y = a.y;
  …
}
Any reasonable queue API implementation needs to provide enough synchronization so that thread B gets to see the x and y values 4 and 8 that thread A wrote. We don't want B to read x and get undefined. The synchronization must be provided by the queue — there's nothing that B could do on its own to synchronize. This is standard practice in almost all such APIs.
Synchronization models
The question becomes what kind of synchronization the queue API should provide. In practice the simplest (and most efficient) answer is ACQUIRE/RELEASE: it guarantees that when thread B reads from the Point it popped, it sees all the writes thread A made to that Point before pushing it.
Unfortunately the ACQUIRE/RELEASE model is not supported for structs, so users cannot provide it themselves; only the far slower SEQ_CST reads and writes are available. This means the queue will provide SEQ_CST semantics and its users will come to rely on that, making it a breaking change to switch to ACQUIRE/RELEASE later.
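To make the difference concrete, here is a sketch of the handoff inside the queue, reduced to a single slot. It assumes the proposal's plan to let Atomics.load and Atomics.store operate on shared-struct fields (the object-plus-field-name form is illustrative), and the commented-out ordering argument is purely hypothetical: it is the capability this issue is asking for.

// slot is assumed to be a shared struct with one field, value, initialized to null.

// Producer side of q.Push(p): publish the element.
Atomics.store(slot, "value", p);                      // today: SEQ_CST is the only choice
// Atomics.store(slot, "value", p, "release");        // hypothetical RELEASE store

// Consumer side of q.Pop(): consume the element.
let elem = Atomics.load(slot, "value");               // today: SEQ_CST
// let elem = Atomics.load(slot, "value", "acquire"); // hypothetical ACQUIRE load
if (elem != null) {
  // With an ACQUIRE/RELEASE pairing, elem.x and elem.y are guaranteed to be
  // the values the producer wrote before its store to slot.value.
  let x = elem.x;
  let y = elem.y;
}

A real queue has more than one slot, of course, but the ordering question is the same for every published element.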
Performance and complexity
The problem is that uncontended SEQ_CST reads and writes are somewhere around 100x slower than ACQUIRE/RELEASE ones, which have about the same performance as unordered reads and writes (and, in fact, on x86 use the exact same machine instructions as unordered reads and writes). The performance difference is real: applications and operating systems have resorted to massive cross-cutting heroics just to get around a single SEQ_CST operation; see restartable sequences (rseq) for one example. This has caused enormous complexity, pain, and bugs.
Outside of mutexes, SEQ_CST reads and writes do have a use case, but it's tiny. Only a few extremely obscure algorithms require those. In more than a decade of working with atomics I've never run across a single case where I needed SEQ_CST reads or writes, whereas I've used ACQUIRE/RELEASE ones on a daily basis.
Can this wait for later?
No. SEQ_CST will be baked into every user API built on top of shared structs. It's already a problem that SEQ_CST is unnecessarily baked into various existing spec methods for which ACQUIRE/RELEASE would suffice. The more these get used, the more we'll be needlessly stuck with poor performance and higher complexity forever, which is exactly the thing we want to avoid.
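For a concrete (illustrative, not normative) example of existing spec methods where ACQUIRE/RELEASE would suffice, consider a minimal lock built on today's Atomics over a SharedArrayBuffer. Every atomic below is SEQ_CST per the current spec, even though in the usual C++ formulation the lock path needs only an ACQUIRE compare-exchange and the unlock path only a RELEASE store:

// 0 = unlocked, 1 = locked. The Int32Array is backed by a SharedArrayBuffer.
const sab = new SharedArrayBuffer(4);
const state = new Int32Array(sab);

function lock() {
  while (Atomics.compareExchange(state, 0, 0, 1) !== 0) {
    Atomics.wait(state, 0, 1);   // sleep while the lock still looks held
  }
}

function unlock() {
  Atomics.store(state, 0, 0);    // a RELEASE store would suffice here
  Atomics.notify(state, 0, 1);   // wake one waiter
}

(Atomics.wait only works off the main thread in browsers; the point is simply that every atomic in lock/unlock is SEQ_CST by spec even though ACQUIRE/RELEASE ordering is all the algorithm needs.)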
I might present on this topic at a future meeting.
I'm certainly happy to include acq/rel sooner rather than later -- in fact it's probably better to do so as a proposal that can land earlier and independently of shared structs, since it will affect SABs as well.
My one constraint here is to do so in lockstep with Wasm, just as we did with the original memory model, because, well, we have the same memory model. cc @conrad-watt
Seems sensible! The immediate observations I would make:
We should take care to use the "modernised" definition of acq/rel that incorporates relatively recent fixes in the C++ model (see the distinction between simply-happens-before and strongly-happens-before introduced here). Our current SC-only model is still "correct", since these subtleties are only relevant once acq/rel is introduced.
This would also be a natural opportunity to introduce acq/rel/acq-rel and (especially) SC fence operations, to mirror Wasm.