Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Packing pointers into double-word width atomics #517

Open
eggyal opened this issue Jul 17, 2024 · 48 comments
Open

Packing pointers into double-word width atomics #517

eggyal opened this issue Jul 17, 2024 · 48 comments

Comments

@eggyal
Copy link

eggyal commented Jul 17, 2024

Under Using Strict Provenance, the standard library documents:

(Yes, if you’ve been using AtomicUsize for pointers in concurrent datastructures, you should be using AtomicPtr instead. If that messes up the way you atomically manipulate pointers, we would like to know why, and what needs to be done to fix it.)

(I've not been able to locate a discussion similar to the following, and assume that this is the best place to reply to that invitation—but please do point me elsewhere if not!).

Some concurrent algorithms (I am particularly thinking of some epoch-based memory reclamation schemes) solve the ABA problem by packing both a pointer and an integer (e.g. an epoch) into a single (double-word width) atomic variable (e.g. a 128-bit atomic, manipulated using cmpxchg16b, on x86_64 architectures).

To do this with the existing atomic types requires using the appropriate double-word width atomic integer (e.g. AtomicU128 on 64-bit architectures), and converting the relevant single-word width subpart to/from a pointer.

If such pointer may be to different allocated objects over the course of the algorithm, it clearly is not possible to derive provenance in those conversions via strict provenance's .with_addr() method (because any "base" pointer from which provenance is derived would also have to be updated atomically with any update to this atomic).

I am not sure what the best solution to this might be. Perhaps something like an AtomicDoublePtr<T, U> type or similar?

Somewhat related to #286 and #480.

@RalfJung
Copy link
Member

Yeah this is unfortunately not supported currently. A similar situation arises with types like (u64, u32) that would fit into 16 bytes but are not safe to transmute to u128 due to padding.

IMO the best solution is to have some way to do atomic operations on MaybeUninit<u128> -- or else on any T that is 16 bytes large (or 8 bytes, or some other size supported for atomics). I am not sure what the best way is to expose that from the standard library, though. Maybe a generic Atomic<T> with something like const { assert!(valid_atomic_size(size_of::<T>())) }? Or maybe we can have a magic trait that is implemented only for types that have the right size? @lcnr @BoxyUwU how terrible of an idea would that be?

@lcnr
Copy link

lcnr commented Jul 18, 2024

Or maybe we can have a magic trait that is implemented only for types that have the right size?

That's possible. It's also a stability hazard for user-defined types so idk if that is desirable 🤔 you could have an explicit AtomicSized derive which checks that the type has a valid size but is opt-in?

@RalfJung
Copy link
Member

RalfJung commented Jul 18, 2024

And then a (T, U) tuple would only be AtomicSized if it has the right size and both T and U are AtomicSized? That could work. It would also be a limitation since one can form a tuple of the right size from types that don't have the right size, like ([u8; 3], u8).

OTOH it's not a new stability hazard, it is exactly the same hazard as transmute::<LibraryType, u128>.

@eggyal
Copy link
Author

eggyal commented Jul 18, 2024

Whilst I agree that there is a broader problem of atomically manipulating types that can't transmute to the containing atomic due to padding etc, my point was more about maintaining pointer provenance: something that AtomicPtr does but AtomicUsize does not.

When packing a pointer into a larger atomic, there is no equivalent to AtomicPtr, and I don't think this would be solved by having some more general means of packing smaller types into larger atomics? Do we not need a specific type for pointers here?

@Diggsey
Copy link

Diggsey commented Jul 18, 2024

@eggyal There is a hierarchy:

usize -> *mut -> MaybeUninit
[nothing] -> [provenance] -> [uninitialized bytes]

Where stuff to the right can store everything that stuff to the left can. @RalfJung is proposing tackling the atomic manipulation of generic types, which would include MaybeUninit, and so would automatically cover all the other use-cases, whereas an AtomicDoublePtr type would solve the provenance issue, but wouldn't allow you to atomically manipulate eg. types with padding, since padding may be uninitialized.

@eggyal
Copy link
Author

eggyal commented Jul 18, 2024

Ah, my misunderstanding. I hadn't appreciated that provenance is propagated through uninitialized bytes!

@lcnr
Copy link

lcnr commented Jul 18, 2024

And then a (T, U) tuple would only be AtomicSized if it has the right size and both T and U are AtomicSized? That could work. It would also be a limitation since one can form a tuple of the right size from types that don't have the right size, like ([u8; 3], u8).

hmm, would that work? Given that the tuple is larger than T and U I feel like it's not obvious that copying (T, U) works if T and U can be individually handled by atomic operations.

@RalfJung
Copy link
Member

Ah, my misunderstanding. I hadn't appreciated that provenance is propagated through uninitialized bytes!

Uninitialized bytes do not have provenance. However, MaybeUninit can hold any kind of byte -- uninit, init, and init with provenance. That's why it works as a universal container that can store any data without UB and without losing anything in the roundtrip.

hmm, would that work? Given that the tuple is larger than T and U I feel like it's not obvious that copying (T, U) works if T and U can be individually handled by atomic operations.

That's why I said "if it has the right size and both fields are AtomicSized".

I don't know which rules you had in mind, since this is not a structural property. But clearly if the concern is future compatibility, then to use (T, U) in an atomic type, the authors of T and U must opt-in somehow.

@RalfJung
Copy link
Member

RalfJung commented Jul 18, 2024

There is however a classic issue with atomic operations on types with padding -- how should compare_exchange be specified? Comparing uninitialized bytes is UB. So we'd have to some very specific stunts I think... like, only allow types where "which bytes are padding" is statically known (no unions, no enums), and then specify compare_exchange to only compare the non-padding bytes but also say that it can always spuriously fail if there are any padding bytes.

I am sure we already discussed this at some point, but I am not sure if we have an issue for that... @chorman0773 do you remember?

So in that sense the situation of having two pointers is simpler, since there are no padding bytes. OTOH, by virtue of this being pointers, the issue in #480 would come up but now there'd be two provenances to worry about...

@Diggsey
Copy link

Diggsey commented Jul 18, 2024

how should compare_exchange be specified? Comparing uninitialized bytes is UB.

IMO there's only two sane options, either it should be UB to do the comparison, or it should be specified that input values are frozen before being compared, with spurious failures being the logical consequence of that.

@bjorn3
Copy link
Member

bjorn3 commented Jul 18, 2024

or it should be specified that input values are frozen before being compared, with spurious failures being the logical consequence of that.

That wouldn't have a forward progress guarantee as the compiler can always freeze it to the wrong value if the freeze happens as part of the compare_exchange. Freeze only requires the output of the freeze to be consistent between uses of the same freeze, not for it to be consistent between freezes.

https://llvm.org/docs/LangRef.html#freeze-instruction

All uses of a value returned by the same ‘freeze’ instruction are guaranteed to always observe the same value, while different ‘freeze’ instructions may yield different values.

@Diggsey
Copy link

Diggsey commented Jul 18, 2024

That wouldn't have a forward progress guarantee as the compiler can always freeze it to the wrong value if the freeze happens as part of the compare_exchange

I think that's unavoidable if the compare/exchange loop happens in user code, since I don't think there's any guarantee that padding bytes are preserved?

Plausibly fetch_update could be made to work, since the implementation could make sure to use MaybeUninit to store the old value in that case.

@bjorn3
Copy link
Member

bjorn3 commented Jul 18, 2024

I think that's unavoidable if the compare/exchange loop happens in user code, since I don't think there's any guarantee that padding bytes are preserved?

If the user freezes themself once before the compare exchange loop rather than delegate it to compare_exchange there is no forward progress problem, but if compare_exchange does it, the compiler is allowed to produce an infinite loop.

@chorman0773
Copy link
Contributor

chorman0773 commented Jul 18, 2024 via email

@bjorn3
Copy link
Member

bjorn3 commented Jul 18, 2024

If the current argument has padding, the compiler can always freeze the padding such that it doesn't match the actual value in the atomic, and thus force the compare_exchange to fail.

@chorman0773
Copy link
Contributor

chorman0773 commented Jul 18, 2024 via email

@bjorn3
Copy link
Member

bjorn3 commented Jul 18, 2024

Right, if a MaybeUninit is returned, that would indeed work.

@chorman0773
Copy link
Contributor

Or [u8;N]. I use this in my libatomic impl.

@chorman0773
Copy link
Contributor

FTR, the way xlang defines compare_exchange is that any atomic operation will freeze the padding bytes (and only the padding bytes) in any value it stores, then compare_exchange and any compound assignment is UB when the backing memory contains uninit bytes.

@VorpalBlade
Copy link

Could this operation instead be implemented only for types that lack padding? The user could always add an explicit "padding field" to fill out any padding (and for example initialise it to zero). Of course in that case you would need something like a compiler implemented marker trait for "contains no padding" so perhaps that is more work.

@RalfJung
Copy link
Member

RalfJung commented Jul 19, 2024

FTR, the way xlang defines compare_exchange is that any atomic operation will freeze the padding bytes (and only the padding bytes) in any value it stores, then compare_exchange and any compound assignment is UB when the backing memory contains uninit bytes.

That makes it unsound to turn an &mut TypeWithPadding into &Atomic<TypeWithPadding>... or at least, that operation can't be a NOP, it has to freeze.

But indeed, then we can have the invariant that an &Atomic<T> always points to fully initialized data, which alleviates the issue with liveness/progress around a freezing version of compare_exchange. (It does need a somewhat non-trivial compare_exchange though as was noted above, where if we get a "not equal" result, we compare what we actually saw with the intended current state via PartialEq::eq, and if those are equal we know that the "not equal" was caused by padding differences and we replace our intended result with the thing we just read. IIRC that's what corssbeam's AtomicCell does.)

@chorman0773
Copy link
Contributor

That makes it unsound to turn an &mut TypeWithPadding into &Atomic... or at least, that operation can't be a NOP, it has to freeze.

It would, yes. That's the downside here.

@pitaj
Copy link

pitaj commented Jul 20, 2024

What about something like this?

static FOO: AtomicSet<(AtomicUsize, AtomicPtr)> = AtomicSet::new((0, null()));

FOO.compare_exchange(...)

AtomicSet would only be implemented for permutations of std atomics that are guaranteed to fill a supported atomic (no padding).

@VorpalBlade
Copy link

What about something like this?

static FOO: AtomicSet<(AtomicUsize, AtomicPtr)> = AtomicSet::new((0, null()));

FOO.compare_exchange(...)

AtomicSet would only be implemented for permutations of std atomics that are guaranteed to fill a supported atomic (no padding).

This could in the future be extended neatly to support accessing the inner atomics as well, depending on how mixed size atomics in #345 gets decided (but either case is obviously not blocked on the other, just a nice potential future synergy).

@Ddystopia
Copy link

Ddystopia commented Jul 22, 2024

Hello, what's the reason compiler cannot just mask out (on assembly level) padding bytes/bits before atomic store? Applying mask on a word or two introduces a performance penalty? Or has problems with orderings/llvm?

@RalfJung
Copy link
Member

RalfJung commented Jul 22, 2024 via email

@comex
Copy link

comex commented Jul 22, 2024

Plausibly fetch_update could be made to work, since the implementation could make sure to use MaybeUninit to store the old value in that case.

In my experience, when working with atomic structs, fetch_update is almost always what you want anyway. It would be nice to have it available without the additional padding restrictions.

@eggyal
Copy link
Author

eggyal commented Jul 24, 2024

(It does need a somewhat non-trivial compare_exchange though as was noted above, where if we get a "not equal" result, we compare what we actually saw with the intended current state via PartialEq::eq, and if those are equal we know that the "not equal" was caused by padding differences and we replace our intended result with the thing we just read. IIRC that's what corssbeam's AtomicCell does.)

But how can we "replace our intended result with the thing we just read"? crossbeam's AtomicCell uses simple assignment:

https://github.com/crossbeam-rs/crossbeam/blob/2a82b619bef638f328776714ec7ccf022859dda2/crossbeam-utils/src/atomic/atomic_cell.rs#L1126-L1134

However, can we be assured that current = previous; won't be optimised away when it can be deduced that current == previous from the immediately preceding test/branch? Can we ever guarantee that we overwrite current with an equal value but different padding?

@RalfJung
Copy link
Member

current and previous should have type MaybeUninit<T> to make sure all padding etc is preserved.

@chorman0773
Copy link
Contributor

Respectfully, MaybeUninit does not preserve padding any better than T does'

@RalfJung
Copy link
Member

Ah of course... 🤦
Yeah the type for that would be MaybeUninit<[u8; size_of::<T>()]>.

Crossbeam anyway does something else -- the relevant part is current_raw = previous_raw;, and that is also what the compare-exchange works on.

For a primitive operation supporting compare-exchange on any T however, what this means is that returning the old value as a T is not good enough since padding would be lost. It'd be better to make the operation take expected_val: &mut T; then the value at *expected_val can be updated in-place using bytewise operations, without losing padding.

@Diggsey
Copy link

Diggsey commented Jul 24, 2024

Yeah the type for that would be MaybeUninit<[u8; size_of::()]>.

If the values have to be frozen for compare/exchange to work, then shouldn't the type be just [u8; size_of::<T>()]?

edit:
I guess MaybeUninit does have a second purpose if "freeze" is a thing - to prevent safe code from accessing frozen bytes.

@RalfJung
Copy link
Member

If the bytes can carry provenance, an array of u8 won't work.

@Diggsey
Copy link

Diggsey commented Jul 26, 2024

If the bytes can carry provenance, an array of u8 won't work.

Of course 🤦

I think it's very confusing and inconsistent that MaybeUninit<StructWithPadding> does not preserve padding, but MaybeUninit<u8> does preserve provenance, despite u8 not having provenance.

Why is padding treated differently than provenance? If MaybeUninit is supposed to be exactly like T except that it allows uninitialized bytes, then it should not preserve provenance. If on the other hand MaybeUninit is supposed to act like a "bag of bytes" then it should also preserve padding.

@chorman0773
Copy link
Contributor

Padding is a property of the type - provenance is a property of a byte.
MaybeUninit<T> probably shouldn't have padding bytes, but we promised that it has the same ABI as T, which only is possible if it has the same padding bytes as T (due to certain ABI requirements).

@RalfJung
Copy link
Member

RalfJung commented Jul 26, 2024

It's definitely confusing, and the underlying reason is that computers are terrible. :(

When we promised that MaybeUninit<T> is ABI-compatible with T, the people involved didn't know all the weird ABIs out there, so this it was only found out quite a bit later that this implies that not all bytes in the MaybeUninit can always be preserved. Maybe we should try to fix this by going back on this ABI promise, restricting it to types that don't have padding...

That's a separate discussion, though: #518

@eggyal
Copy link
Author

eggyal commented Jul 27, 2024

I think I'm still not grasping something here.

If, for compare-exchange of an Atomic<T>, it is required that T: Eq (which seems to me to be the only sane interpretation of the operation), then is it not the case that:

  • since structs have static padding, successful byte-wise comparison can only occur when the non-padding bytes are equal;
  • since an enum's discriminant is never undef, successful byte-wise comparison can only occur if the discriminant byte(s) are equal—and then field bytes follow as for structs above: ie successful byte-wise comparison can only occur when the non-padding bytes are equal; and
  • for a union to implement Eq, PartialEq::cmp must be capable of determining the variant—and therefore any bytes required for that determination cannot be undef (at least when the compare-exchange is invoked, although because PartialEq::cmp is safe to call I should think this is always required?); consequently, successful byte-wise comparison can only occur if the variants are equal, and then non-padding bytes again follow as above.

It therefore seems to be the case that T: Eq implies that successful byte-wise comparison can only occur if non-padding bytes are equal, irrespective of any padding bytes?

Obviously byte-wise comparison may fail due to undef/padding bytes, but that is why one must then (as crossbeam does) call PartialEq::cmp to determine whether the failure is genuine (return to user) or spurious (reattempt with newly discovered current value, type-equal but byte-unequal to the original).

I suppose I'm struggling to understand why it would ever be necessary to freeze, or why this couldn't be implemented generically by the compiler (either with a compile-time size check, or requiring T: AtomicSized as discussed early in this issue).

@Ddystopia
Copy link

If, for compare-exchange of an Atomic, it is required that T: Eq (which seems to me to be the only sane interpretation of the operation)

for a union to implement Eq, PartialEq::cmp must be capable of determining the variant

I'm not sure how can we require T: Eq, because compare-exchange is a hardware-defined operation that requires a bitwise comparation and you can't run arbitrary PartialEq::cmp code to perform a comparation.
For example, this obviously cannot work:

struct Foo(u8);
impl Eq for Foo;
impl PartialEq for Foo {
    fn eq(&self, other: &Foo) -> bool {
        std::thread::sleep_ms(1000);
        self.0.eq(&other.0)
    }
}

fn foo(foo: &Atomic<Foo>) {
    foo.compare_exchange(Foo(3), Foo(4), Relaxed, Relaxed);
}

@eggyal
Copy link
Author

eggyal commented Jul 27, 2024

The library API Atomic::<T>::compare_exchange would not itself be an atomic operation, but would be guaranteed to use atomic operations on the underlying memory. PartialEq::cmp is called on (a copy of) the value loaded from that memory, not the memory itself.

@comex
Copy link

comex commented Jul 27, 2024

Obviously byte-wise comparison may fail due to undef/padding bytes, but that is why one must then (as crossbeam does) call PartialEq::cmp to determine whether the failure is genuine (return to user) or spurious (reattempt with newly discovered current value, type-equal but byte-unequal to the original).

The problem is that "reattempt with newly discovered current value" may never succeed if the current value in memory contains uninit bytes. Or in @chorman0773's formulation, it would be undefined behavior.

However, after re-reading the thread, I personally find the proposed solutions unsatisfying. Atomic*::from_mut is useful, so adding extra validity requirements on Atomic<T> compared to T would be a unfortunate loss of functionality.

And in this case, the problem being solved is purely an artifact of compiler optimizations. In general, uninit bytes are not just an artifact of optimizations because of MADV_DONTNEED/MADV_FREE, the special case where memory truly can 'change on its own' in the assembly-level abstract machine. But when it comes to compare-exchange loops specifically, it's okay if the memory changes on its own sometimes, as long as it doesn't do so every single time you try to read from it. You will still make forward progress eventually. And MADV_DONTNEED/MADV_FREE more than satisfies that criterion, since AFAIK it can cause any given byte to change at most once.

But I don't see any straightforward way to 'fix' this either, not without adding an ugly special case that LLVM and GCC may or may not end up respecting – that is, if they ever get around to adding optimizations for atomics. Right now they basically don't optimize them at all.

@VorpalBlade
Copy link

Wouldn't the "min" variation of this be to only support the no-padding case, and leave the door open to relax that in the future?

Would it be useful to move forward with such a proposal, or do the motivating use cases need padding? If so, why can't they just add manual "padding fields"? This seems to me to be entirely about ergonomics: yes it is nicer to support padding, but there is no use case that can't be rewritten to avoid needing padding.

@pitaj
Copy link

pitaj commented Jul 27, 2024

It's difficult if not impossible to support explicit padding with generics.

@RalfJung
Copy link
Member

@eggyal it's a bit hard to reply since you haven't given concrete code. Let me try:

Obviously byte-wise comparison may fail due to undef/padding bytes, but that is why one must then (as crossbeam does) call PartialEq::cmp to determine whether the failure is genuine (return to user) or spurious (reattempt with newly
discovered current value, type-equal but byte-unequal to the original).

The issue is that unini bytes behave a bit like bytes that take different values each time you read them. If the memory backing a (u8, u16) is 0x00UU0000 (using U to indicate "uninit"), and you do a compare-exchange with 0x00000000, you would usually get UB (can't compare uninit bytes) but if we say that atomics freeze both operands before comparing, let's say you get false and it says the actual value stored there is 0x00FF0000 (a valid result of freezing the in-memory value). So now you try again with 0x00FF0000, comparing that to the (unchanged) in-memory value 0x00UU0000, and again you get false and it says the actual return value is now 0x00AA0000 (another valid result of freezing the in-memory value). This can go on forever.

It's a somewhat theoretical concern because of course real uninit bytes are not unstable in this way, but if programmers want to reason about things like "does this concurrent program ever make progress", then we have to design a spec that doesn't have such "implicit" infinite loops, not even theoretically.

This can either be fixed by somehow saying that freeze "stabilizes" when it is tried sufficiently often) (but I have no idea how to do that), or by picking a different scheme for specifying these atomic operations.

@RalfJung
Copy link
Member

RalfJung commented Jul 29, 2024 via email

@nonnull-ca
Copy link

I'm probably missing something straightforward.

What about defining compare_exchange / etc to succeed if equal ignoring padding bits?

This obviously doesn't quite match actual hardware, and in some cases may result in non-ideal code (on a type with padding bits the compiler may have to lower compare_exchange into a CAS loop - but this should only be for the case for a type with padding bits).

@bjorn3
Copy link
Member

bjorn3 commented Oct 27, 2024

That wouldn't be able to guarantee forward progress, right? The CAS loop may cause compare_exchange itself to never return if another thread constantly changes the padding while the CAS loop is running.

@nonnull-ca
Copy link

nonnull-ca commented Oct 28, 2024

That wouldn't be able to guarantee forward progress, right? The CAS loop may cause compare_exchange itself to never return if another thread constantly changes the padding while the CAS loop is running.

Some implementations may have forward progress guarantee issues with this approach - but it at least allows the memory model itself to be sane.

That being said, I was under the assumption that underlying implementations could already punt forward progress guarantees for individual atomics ops, and we tended to mostly gloss over this as implementation details.

Consider ARM for a second (the architecture I'm most familiar with personally). ARM without LSE uses a LDREX/STREX (or LDX/STX in A64) (load-linked / store-conditional is the more common term) loop for many basic atomic operations, which does not guarantee forward progress in... well, among many many many other cases, precisely the case you indicate. Thread B doing repeated writes to a memory location (even if it doesn't change the value!) can cause thread A doing pretty much any atomic implemented as a LDX/STX loop to said memory location to stall indefinitely.

ARM tried to have more forward progress guarantees for LDX/STX, but has walked them back over time (I suspect due to the bewildering variety of errata on older ARM processors of the form "if thread A does XXX then thread B may not make forward progress" - ARM's response has tended to be mainly 'well don't do that then' and a spec update to further restrict the cases in which LDX/STX are guaranteed to make forward progress).

If this isn't an issue, then defining compare_exchange / etc to succeed if equal ignoring padding bits shouldn't be an issue either.

If this is an issue... what do other architectures using a load-linked store-conditional (RISCV, notably) guarantee & do?

@RalfJung
Copy link
Member

RalfJung commented Oct 28, 2024

Yeah the LL/SC thing was discussed recently somewhere but I can't find the discussion right now -- arguably that makes ARM a non-conforming implementation in terms of liveness (or "forward progress") properties. I don't think it is a good idea to add more such issues to the language -- in particular, I don't think it is a good idea to add operations that are impossible to implement in a liveness-preserving way even on hardware with stronger guarantees than ARM. "It's theoretically broken in certain conditions on some hardware" is not a good argument for "well then let's give up on it entirely".

Regarding ignoring padding bytes, this has been discussed above already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests