
ArcCell not lock-free #160

Open
jethrogb opened this issue Dec 11, 2017 · 17 comments

Comments

@jethrogb

The current implementation of ArcCell is not lock-free. It uses a spin lock. I very much want to use this functionality, but I don't think this implementation should be included in crossbeam right now (and frankly, the fact that it's included right now is quite misleading).

@ghost

ghost commented Dec 11, 2017

You're right. We should probably implement ArcCell<T> using differential reference counting, although that would also require implementing a custom Arc<T> type (the one from the standard library wouldn't be compatible).

Do you perhaps have an example of how you want to use ArcCell<T>? I'm asking to see what exactly you need so that we can accommodate your use case as well as possible.

@jethrogb
Author

jethrogb commented Dec 11, 2017

I've needed something like this over and over again to store program-global configuration: gets are frequent and sets are extremely rare. For example: https://docs.rs/log4rs/0.7.0/src/log4rs/lib.rs.html#446-449

Alternatively, I think you could implement this fairly easily using 128-bit atomics on platforms that support them, but I don't think those are available on stable Rust.
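For reference, the read-mostly configuration pattern described above can already be approximated with nothing but the standard library. This is only a sketch of the use case, not a lock-free design; `Config` and `ConfigCell` are hypothetical names:

```rust
use std::sync::{Arc, RwLock};

// Hypothetical program-global configuration: reads are frequent,
// writes are rare, so readers just clone the Arc under a read lock.
struct Config {
    level: u32,
}

struct ConfigCell {
    inner: RwLock<Arc<Config>>,
}

impl ConfigCell {
    fn new(c: Config) -> Self {
        Self { inner: RwLock::new(Arc::new(c)) }
    }

    // Cheap for readers: one short read-lock plus an Arc clone.
    fn get(&self) -> Arc<Config> {
        self.inner.read().unwrap().clone()
    }

    // Rare: swap in a freshly built configuration.
    fn set(&self, c: Config) {
        *self.inner.write().unwrap() = Arc::new(c);
    }
}

fn main() {
    let cell = ConfigCell::new(Config { level: 1 });
    let before = cell.get();
    cell.set(Config { level: 2 });
    assert_eq!(before.level, 1); // old readers keep their old snapshot
    assert_eq!(cell.get().level, 2); // new readers see the update
}
```

The drawback, of course, is that every `get` takes the lock, which is exactly what this issue is about avoiding.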

@Vtec234
Member

Vtec234 commented Dec 11, 2017

I've attempted a partially wait-free version of ArcCell in #81, but note that it is currently slightly incorrect. Changing the faulty ordering (see PR thread) should fix it, so you could try it out.

That said, crossbeam is currently undergoing a major reorganisation, and new types with similar functionality might appear in the future (and will, if people need them).

@ghost

ghost commented Dec 11, 2017

@Vtec234 I'm slightly worried that wait-free get() operations might starve set() operations forever. Wouldn't it be better to simply use parking_lot::RwLock<Arc<T>> instead?

@jethrogb Actually, I believe a primitive tentatively called HazardBox<T> might be a better fit for your use case. But it doesn't exist yet. :) The idea is to protect the shared object by thread-local hazard pointers to avoid cache invalidation (instead of updating a globally shared value on every read).
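A heavily simplified sketch of the hazard-pointer idea: a reader advertises the pointer it is about to dereference, re-checks that it is still current, and only then reads; a writer waits until no reader still advertises the old pointer before freeing it. This toy uses a single global hazard slot instead of per-thread ones and a plain `i32` instead of a generic payload; `HazardBox` and its API are hypothetical:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// A single global hazard slot; a real implementation keeps one per thread.
static HAZARD: AtomicPtr<i32> = AtomicPtr::new(ptr::null_mut());

struct HazardBox {
    ptr: AtomicPtr<i32>,
}

impl HazardBox {
    fn new(value: i32) -> Self {
        Self { ptr: AtomicPtr::new(Box::into_raw(Box::new(value))) }
    }

    // Reader: announce the pointer in the hazard slot, then re-check that
    // it is still current; retry if a writer swapped it in the meantime.
    fn get(&self) -> i32 {
        loop {
            let p = self.ptr.load(Ordering::Acquire);
            HAZARD.store(p, Ordering::SeqCst);
            if self.ptr.load(Ordering::SeqCst) == p {
                let value = unsafe { *p };
                HAZARD.store(ptr::null_mut(), Ordering::Release);
                return value;
            }
        }
    }

    // Writer: publish the new value, then wait until no reader still
    // advertises the old pointer before freeing it.
    fn set(&self, value: i32) {
        let new = Box::into_raw(Box::new(value));
        let old = self.ptr.swap(new, Ordering::SeqCst);
        while HAZARD.load(Ordering::SeqCst) == old {
            std::hint::spin_loop();
        }
        unsafe { drop(Box::from_raw(old)) };
    }
}

fn main() {
    // (the final value is leaked when HazardBox goes away; a real type needs Drop)
    let cell = HazardBox::new(1);
    assert_eq!(cell.get(), 1);
    cell.set(2);
    assert_eq!(cell.get(), 2);
}
```

Note how a fast read never touches a shared reference count, which is the cache-invalidation point made above.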

@Vtec234
Member

Vtec234 commented Dec 11, 2017

@stjepang That should work, yes. My implementation is essentially a very simple rw-lock, and I'm sure much better implementations are possible (maybe based on hazard pointers, like you said).

@vorner
Contributor

vorner commented Apr 1, 2018

Hello

I played with this a bit and I have an implementation that is lock-free (wait-free for readers), and it shouldn't be possible to starve the writers. It also works with the standard Arc. I could release it as a separate crate, but I think the code would be better cared for somewhere in crossbeam (I'm not sure where exactly, or whether there's a specific sub-crate where it fits).

So, if you'd like to accept it, I can create a pull request and place it somewhere (and rename it to ArcCell, to keep the consistency with the current implementation).

What do you think about it?

http://github.com/vorner/arc-swap

@ghost

ghost commented Apr 3, 2018

@vorner Hey, that looks neat! Designing an atomic shared pointer (like AtomicReference in Java) in a language without GC is a surprisingly difficult problem, and your approach is quite interesting.

I think the biggest drawback of ArcSwap (and of atomic_shared_ptr in C++, for that matter) is that readers update the same atomic counters and thus suffer from contention. In my benchmarks, loading from an uncontended ArcSwap takes 28 ns, but that climbs to 180 ns when another thread is loading from the same ArcSwap at the same time. The numbers would probably be similar with atomic_shared_ptr.

Consider the fact that loading from AtomicReference in Java takes less than 10 ns (it's just a typical load instruction on x86 machines), and our numbers quickly pale in comparison.

The solution is to not update any shared counters when loading from ArcSwap. In other words, loading shouldn't return an Arc<T>, but something like ArcSwapGuard<'a, T> instead, which can be dereferenced into a &T for reading or "upgraded" into a real Arc<T> if the user wishes. I'm prototyping something similar here, if you're interested. Loading takes around 25 ns and doesn't degrade at all in the presence of concurrent readers. And there is no locking whatsoever - reading and updating are always lock-free.
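Ignoring the atomics entirely, the API surface being described would look roughly like the following. The `Guard` body here is only an illustration of the Deref/upgrade shape, not the actual prototype:

```rust
use std::ops::Deref;
use std::sync::Arc;

// Illustrative guard: dereferences to &T without touching any shared
// counter; cloning the Arc happens only when the caller asks for it.
struct Guard<'a, T> {
    inner: &'a Arc<T>,
}

impl<'a, T> Deref for Guard<'a, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.inner // &Arc<T> coerces to &T
    }
}

impl<'a, T> Guard<'a, T> {
    // "Upgrade" into a real Arc<T>, paying the refcount bump only here.
    fn upgrade(&self) -> Arc<T> {
        Arc::clone(self.inner)
    }
}

fn main() {
    let shared = Arc::new(42);
    let guard = Guard { inner: &shared };
    assert_eq!(*guard, 42); // cheap read through the guard
    let owned: Arc<i32> = guard.upgrade();
    assert_eq!(*owned, 42); // full Arc only when requested
}
```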

@vorner
Contributor

vorner commented Apr 7, 2018

Reading through the hazard-pointer approach, it does look interesting. But I'm a bit nervous about two things ‒ the fact that an update may force a reader to loop (the slow-path operation hurting the hot-path one, possibly leaving it stuck for a long time, or forever, given a steady inflow of changes), and the fact that the chains of registries and thread entries can grow long and, as far as I can tell, never shrink back in the code. Anyway, it is well possible this will be faster than my approach.

Is there any estimate on when your approach could be ready? I ask because I want to know if it makes sense for me to clean it up, add all the license files, and publish it, or (in case hazard-cell will be ready soon in crossbeam) if it's not worth it.

@ghost

ghost commented Apr 8, 2018

the fact that an update may force a reader to loop (therefore the slow-path operation hurting the hot-path one, possibly making it stuck for a long time/forever with a steady inflow of changes)

If updates are so frequent that they manage to force reads to spin in a loop, then updates kind of are the hot-path. :) But I see what you're saying, it's a fair concern that readers aren't wait-free.

that the chains of registries and thread entries can grow to long ones and I don't think they ever shrink back in the code.

Yes, the current prototype isn't tackling this problem. It's in the plan to shrink back the linked lists.

Is there any estimate on when your approach could be ready?

There are other things on my schedule at the moment, but I'd say in a few weeks.

I ask because I want to know if it makes sense for me to clean it up, add all the license files, and publish it, or (in case hazard-cell will be ready soon in crossbeam) if it's not worth it.

If you have a use case for ArcSwap, go ahead! In any case, we'll definitely want to benchmark our approaches and see how they compare. I don't expect either one to be a win in all scenarios (for example, ArcSwap will almost surely have faster uncontended updates).

@acsearle

I'll throw my hat in the ring too. I have a reasonably complete AtomicArc now using differential reference counting, inspired by Folly::AtomicSharedPtr. I suspect it is close to what @stjepang was envisioning in the second comment. It has the same problem of reads degrading under contention, has to use its own reimplementation of Arc, and will only work on x86_64 and (hopefully) AArch64 using their spare pointer bits. On the brighter side, it also provides AtomicWeak, has no thread_local or global structures, and no requirements for the user to call anything outside the std::sync::atomic interface.

https://github.com/acsearle/barc

Might be an interesting comparison point if nothing else.
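The spare-bit trick mentioned above relies on x86-64 currently using only 48 bits of virtual address, so a small counter fits in the top 16 bits of a single pointer-sized word that one CAS can update. A minimal sketch of the packing arithmetic; `pack`/`unpack` are illustrative stand-ins, not barc's actual `CountedPtr` code:

```rust
// x86-64 currently uses 48-bit virtual addresses, leaving 16 spare
// high bits per pointer-sized word to hold a local reference count.
const COUNT_SHIFT: u32 = 48;
const PTR_MASK: u64 = (1u64 << COUNT_SHIFT) - 1;

// Combine a pointer value and a count into one word.
fn pack(ptr: u64, count: u16) -> u64 {
    debug_assert_eq!(ptr & !PTR_MASK, 0, "pointer must fit in 48 bits");
    ((count as u64) << COUNT_SHIFT) | ptr
}

// Split the word back into (pointer, count).
fn unpack(word: u64) -> (u64, u16) {
    (word & PTR_MASK, (word >> COUNT_SHIFT) as u16)
}

fn main() {
    let word = pack(0xdead_beef, 3);
    assert_eq!(unpack(word), (0xdead_beef, 3));
}
```

This is exactly the layout that breaks once hardware starts handing out addresses above 2^48, which is the portability concern raised in the next comment.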

@ghost

ghost commented Sep 13, 2018

@acsearle Thanks for sharing -- that looks really good, I'm impressed!

So I've just whipped up a simple set of benchmarks to compare the performance of the load operation with barc, with a hazard pointer-based Arc, and with a mutex-protected Arc. Results:

running 6 tests
test conc_load_barc   ... bench:         141 ns/iter (+/- 28)
test conc_load_hazard ... bench:          19 ns/iter (+/- 1)
test conc_load_mutex  ... bench:          39 ns/iter (+/- 25)
test load_barc        ... bench:          17 ns/iter (+/- 1)
test load_hazard      ... bench:          19 ns/iter (+/- 2)
test load_mutex       ... bench:          13 ns/iter (+/- 2)

I wonder what you think about using spare pointer bits in light of near-future migrations to 56-bit address spaces on x86-64 (and maybe even 64-bit). That seems like an unfortunate portability nightmare. :( There was an interesting discussion about this topic in the LuaJIT repo.

@jethrogb
Author

@acsearle have you considered an implementation using double-word atomics? You should be able to test with 64-bit atomics on 32-bit platforms while we wait for rust-lang/rust#53514 to land.
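A rough sketch of what a double-word differential-counting load could look like, scaled down so it runs anywhere: an `AtomicU64` stands in for the double-word atomic (as it would on a 32-bit target), and a 32-bit index stands in for the real pointer. Everything here is hypothetical, not a real ArcCell:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Pack a pointer-sized half and an external reference count into one
// double-word atomic; on 32-bit targets AtomicU64 plays that role.
fn pack(ptr: u32, count: u32) -> u64 {
    ((count as u64) << 32) | ptr as u64
}

fn unpack(word: u64) -> (u32, u32) {
    (word as u32, (word >> 32) as u32)
}

// Differential-counting load: a single CAS bumps the external count,
// handing the loader a reference without touching the Arc's own counter.
fn load(cell: &AtomicU64) -> u32 {
    loop {
        let cur = cell.load(Ordering::Acquire);
        let (ptr, count) = unpack(cur);
        if cell
            .compare_exchange_weak(cur, pack(ptr, count + 1), Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return ptr;
        }
    }
}

fn main() {
    let cell = AtomicU64::new(pack(7, 0));
    assert_eq!(load(&cell), 7);
    let (_, count) = unpack(cell.load(Ordering::Relaxed));
    assert_eq!(count, 1); // the external count absorbed the load
}
```

A store would swap in a new packed word and fold the accumulated external count back into the old object's real counter, which is the part that needs a custom Arc.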

@acsearle

Wow, the performance of the parking_lot Mutex is phenomenal! Is it enough to just .lock() the mutex, or would the better comparison be .lock().clone(), so we get and destroy a usable object per loop, as returned by AtomicWeightedArc::load() and (I think) hazard's .get()? The disheartening result on contested load is consistent with what I saw in my own jury-rigged benchmarks: it's inherently slow, since each load has to win a compare_exchange.

I wasn't aware that losing those bits was imminent (there's a relatively recent talk by Chandler Carruth somewhere in which he discusses how LLVM uses them). That's unfortunate. Regarding other architectures, I have thought about it a little. For both of these questions, it should be fairly easy to experiment, since each requires only a platform-specific reimplementation of CountedPtr and CountedNonNull. Another option is using alignment bits, perhaps over-aligning the ArcInner allocation to get more of them (if the Rust allocator supports that yet). But yes, as we lose bits, we have to replenish more frequently, and the number of threads that can remain lock-free degrades until we are effectively a spinlock.

Anyway, my concern (having learned a lot from implementing it) is that this style of implementation, independent of whether the count bits are in the pointer or a second word, can never be competitive on contested loads with implementations that manage to use an atomic read for that purpose, and so is a bit of a dead end :(

bors bot added a commit that referenced this issue Jan 25, 2019
301: Remove ArcCell, MsQueue, and TreiberStack r=stjepang a=stjepang

These three pieces are being removed, each for its own reason:

1. `ArcCell<T>` uses a spinlock and provides little value over just `Mutex<Arc<T>>`. There are much better alternatives like [arc-swap](https://docs.rs/arc-swap). We should really implement a better version of this primitive, but it's not trivial. See [here](#160) for a discussion.

2. `MsQueue<T>` has no reason to exist in presence of `SegQueue<T>`, which is better in almost every way. The implementation has gotten really old and could use some cleanup. A selling point of this queue is blocking pop operation, but channels are much better at that job anyway.

3. `TreiberStack<T>` is the least used piece of Crossbeam, if used at all. The implementation has poor performance and is only interesting for its educational value - it's the *hello world* of lock-free programming. For that reason, I'm moving it into `crossbeam-epoch/examples`.

All of these are practically useless and only confuse users. People assume there's something smarter in `ArcCell` than a spinlock, or struggle with choosing `MsQueue` vs `SegQueue`. As such, these three pieces are a net *negative* for the library so it's best to just remove them and come up with better replacements in the future.

Co-authored-by: Stjepan Glavina <stjepang@gmail.com>
zeeshanlakhani pushed a commit to williamofockham/NetBricks that referenced this issue Apr 30, 2019
huitseeker added a commit to huitseeker/diem that referenced this issue Jul 25, 2019
* This modifies the following behavior:
  - updates `crossbeam` in `proptest_helpers` to 0.7.2 (which was [just released](crossbeam-rs/crossbeam#401 (comment)))
  - removes usage of `crossbeam` in `common::logger` as:
    + `ArcCell` was [removed](crossbeam-rs/crossbeam#301) in more recent versions of crossbeam,
    + it was [spin-locking](crossbeam-rs/crossbeam#160) anyway

* Why this is better
  - closes #279
  - [`arc-swap`](https://docs.rs/arc-swap/) is (mostly) lock-free

* Why this is worse
  - [`arc-swap`](https://github.com/vorner/arc-swap) is young

* Tests

`cargo audit` + CI
huitseeker added a commit to diem/diem that referenced this issue Jul 25, 2019
@jonhoo
Contributor

jonhoo commented Feb 17, 2020

ArcCell was removed in #301, so I assume this should be closed?

@jethrogb
Author

From #301:

We should really implement a better version of this primitive

Alternatively, this issue could become the feature request?

@acsearle

acsearle commented Sep 9, 2020

I've put in a draft pull request of an AtomicArc using deferred coalesced reference counting, for discussion, but I seem to have screwed up linking it to this issue. It should be considerably faster than my 2018 attempt: atomic loads are pure reads; it uses guards and scoping to prevent temporaries from needing to manipulate the counts at all; and it uses a thread-local ledger to coalesce (and often cancel out) the pending increments and decrements before finally applying the increments at the end of the guard's life and deferring the decrements to the collector.

@jethrogb
Author

jethrogb commented Sep 9, 2020

#564

exrook pushed a commit to exrook/crossbeam that referenced this issue Oct 7, 2020
I was looking through old issues, and crossbeam-rs#160 mentioned that crc32
checksumming for zip files was showing up in profiles.  zip-rs 0.4 uses
a byte-at-a-time version of crc32; zip-rs 0.5 relies on the `crc32fast`
crate, which features faster implementations.