Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Switch.on_release thread safe #684

Merged
merged 1 commit into from
Feb 9, 2024

Conversation

talex5
Copy link
Collaborator

@talex5 talex5 commented Feb 6, 2024

This is needed to allow resource pools to be shared between domains. Fixes #669.

It works by putting a mutex around the on_release list. I tried a lock-free version too, but it was quite messy.

There is one change to the existing behaviour: it used to be possible to attach new resources to a switch in a release handler, but now that is rejected. I don't think the old behaviour was useful, but it could be brought back if needed (we'd just need to take the mutex again at the end).

I was thinking of deprecating remove_hook in favour of try_remove_hook, but all existing uses were safe so I decided not to do that for now.

Note: There are some places where we do Switch.check before creating a resource. This isn't thread-safe, and might pass even after the switch has ended if called from a different domain. That shouldn't be a problem (the switch could always have ended after the check anyway), but might annoy TSan.

@talex5 talex5 added the enhancement New feature or request label Feb 6, 2024
@talex5 talex5 force-pushed the threadsafe-resources branch 2 times, most recently from d695632 to ce6001f Compare February 7, 2024 10:43
@polytypic
Copy link
Contributor

polytypic commented Feb 7, 2024

It works by putting a mutex around the on_release list. I tried a lock-free version too, but it was quite messy.

I took a very brief look and, IIUC, then what you basically need is a lock-free set. on_release_cancellable adds a hook to the set and you can later decide to remove_hook it. Also, when the switch finishes, all the hooks in the set are run.

I developed a lock-free starvation avoiding data structure to deal with a similar problem in Picos. See here. Both try_attach and detach are lock-free and work in amortized constant time.

The elements in the set are mutable (think _ option Atomic.t) and detach clears the element so that the payload is not leaked (i.e. Trigger.signal effectively sets to None). Nodes are removed from the set only lazily. This is achieved by maintaining a balance. Once the balance crosses a threshold, both attach and detach operations will perform a gc to drop nodes whose elements have been detached. This ensures that the set will not use more than a constant factor more space than strictly necessary.

Because both attach and detach perform the gc at the same point (i.e. the same threshold) the gc is effectively wait-free (i.e. if you e.g. attach and that requires gc, then the only reason to require a retry is that another operation performed the gc for you).

The attach and detach operations are not wait-free, because an operation could repeatedly need to retry as other parties finish their operations more quickly, but the amounts of work performed by attach and detach operations are very similar (both make an allocation to update the Continue state record) and so it should be highly unlikely that a particular operation would indefinitely retry, which is why I call it "starvation avoiding".

@talex5
Copy link
Collaborator Author

talex5 commented Feb 8, 2024

Thanks - that does sound like what we want. I'm not sure making it lock-free is necessary yet - adding the mutex didn't seem to have any noticeable effect on the benchmarks and there isn't likely to be much contention for the Irmin use (opening a file takes much longer than taking the mutex and appending to the list). But if it becomes a problem I'll use that code.

This is needed to allow resource pools to be shared between domains.
@talex5 talex5 merged commit e9864b9 into ocaml-multicore:main Feb 9, 2024
5 checks passed
@talex5 talex5 deleted the threadsafe-resources branch February 9, 2024 11:33
talex5 added a commit to talex5/opam-repository that referenced this pull request Feb 22, 2024
CHANGES:

New features:

- eio_posix: use directory FDs instead of realpath (@talex5 ocaml-multicore/eio#694 ocaml-multicore/eio#696, reviewed by @SGrondin).
  Using realpath was an old hack from the libuv days and subject to races. It was also slow.

- Keep pool of systhreads for blocking operations (@SGrondin @talex5 ocaml-multicore/eio#681).
  This is much faster than creating a new thread for each operation.
  It mainly benefits the eio_posix backend, as that uses lots of systhreads.

- Make `Switch.on_release` thread-safe (@talex5 ocaml-multicore/eio#684, requested by @art-w and @clecat).
  This allows resource pools to be shared between domains easily.

- Add `Eio.Path.read_link` (@talex5 ocaml-multicore/eio#686).

- Add `Eio_unix.Fd.is_open` (@talex5 ocaml-multicore/eio#690).

- Include backtrace in systhread errors (@talex5 ocaml-multicore/eio#688, reviewed by @SGrondin).
  Also, add `Eio.Exn.empty_backtrace` as a convenience.

- eio.mock: add tracing support to mock backend (@talex5 ocaml-multicore/eio#687).

- Improve tracing (@talex5 ocaml-multicore/eio#675 ocaml-multicore/eio#683 ocaml-multicore/eio#676, reviewed by @SGrondin).
  Update tracing section of README and trace more things
  (`run_in_systhread`, `close`, `submit`, `traceln`, cancellation and domain spawning).

Documentation:

- Link to verification work in docs (@talex5 ocaml-multicore/eio#682).

- Add more trace diagrams to README (@talex5 ocaml-multicore/eio#698).

- Adjust COC contacts (@polytypic ocaml-multicore/eio#685, reviewed by @Sudha247).

Bug fixes:

- eio_linux: retry `openat2` on `EAGAIN` (@talex5 ocaml-multicore/eio#693, reviewed by @SGrondin).

- eio_posix and eio_windows: check for IO periodically (@talex5 ocaml-multicore/eio#674).

- Handle EPERM when trying to initialise uring (@talex5 ocaml-multicore/eio#691).
  This can happen when using a Docker container.

Build and tests:

- Benchmark `Eio_unix.run_in_systhread` (@talex5 ocaml-multicore/eio#678, reviewed by @SGrondin).

- Enable lintcstubs for `Eio_unix.Private` too (@talex5 ocaml-multicore/eio#689).

- Stat benchmark: report cleanup time and optimise (@talex5 ocaml-multicore/eio#692).

- Make benchmarks start faster (@talex5 ocaml-multicore/eio#673).

- Update build for new eio-trace CLI (@talex5 ocaml-multicore/eio#699).

- Expect opam-repo-ci tests to fail on macos (@talex5 ocaml-multicore/eio#672).
talex5 added a commit to talex5/opam-repository that referenced this pull request Feb 22, 2024
CHANGES:

New features:

- eio_posix: use directory FDs instead of realpath (@talex5 ocaml-multicore/eio#694 ocaml-multicore/eio#696, reviewed by @SGrondin).
  Using realpath was an old hack from the libuv days and subject to races. It was also slow.

- Keep pool of systhreads for blocking operations (@SGrondin @talex5 ocaml-multicore/eio#681).
  This is much faster than creating a new thread for each operation.
  It mainly benefits the eio_posix backend, as that uses lots of systhreads.

- Make `Switch.on_release` thread-safe (@talex5 ocaml-multicore/eio#684, requested by @art-w and @clecat).
  This allows resource pools to be shared between domains easily.

- Add `Eio.Path.read_link` (@talex5 ocaml-multicore/eio#686).

- Add `Eio_unix.Fd.is_open` (@talex5 ocaml-multicore/eio#690).

- Include backtrace in systhread errors (@talex5 ocaml-multicore/eio#688, reviewed by @SGrondin).
  Also, add `Eio.Exn.empty_backtrace` as a convenience.

- eio.mock: add tracing support to mock backend (@talex5 ocaml-multicore/eio#687).

- Improve tracing (@talex5 ocaml-multicore/eio#675 ocaml-multicore/eio#683 ocaml-multicore/eio#676, reviewed by @SGrondin).
  Update tracing section of README and trace more things
  (`run_in_systhread`, `close`, `submit`, `traceln`, cancellation and domain spawning).

Documentation:

- Link to verification work in docs (@talex5 ocaml-multicore/eio#682).

- Add more trace diagrams to README (@talex5 ocaml-multicore/eio#698).

- Adjust COC contacts (@polytypic ocaml-multicore/eio#685, reviewed by @Sudha247).

Bug fixes:

- eio_linux: retry `openat2` on `EAGAIN` (@talex5 ocaml-multicore/eio#693, reviewed by @SGrondin).

- eio_posix and eio_windows: check for IO periodically (@talex5 ocaml-multicore/eio#674).

- Handle EPERM when trying to initialise uring (@talex5 ocaml-multicore/eio#691).
  This can happen when using a Docker container.

Build and tests:

- Benchmark `Eio_unix.run_in_systhread` (@talex5 ocaml-multicore/eio#678, reviewed by @SGrondin).

- Enable lintcstubs for `Eio_unix.Private` too (@talex5 ocaml-multicore/eio#689).

- Stat benchmark: report cleanup time and optimise (@talex5 ocaml-multicore/eio#692).

- Make benchmarks start faster (@talex5 ocaml-multicore/eio#673).

- Update build for new eio-trace CLI (@talex5 ocaml-multicore/eio#699).

- Expect opam-repo-ci tests to fail on macos (@talex5 ocaml-multicore/eio#672).
talex5 added a commit to talex5/eio_js that referenced this pull request Aug 23, 2024
This matches the change to the same test in
ocaml-multicore/eio#684.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Switches should be domain-safe
2 participants