Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RELEASE] ucxx v0.41 #331

Merged
merged 29 commits into from
Dec 11, 2024
Merged

[RELEASE] ucxx v0.41 #331

merged 29 commits into from
Dec 11, 2024

Conversation

raydouglass
Copy link
Member

❄️ Code freeze for branch-24.12 and v24.12 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-24.12 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-24.12 into main for the release

raydouglass and others added 28 commits September 19, 2024 12:03
We were recently inquired by a user whether relying only on the `ucs_status_t` passed to the user callback and discarding the `ucxx::RequestTag` return from `ucxx::Endpoint::tagSend` and `ucxx::Endpoint::tagRecv` is valid.  This turns out to be a valid usage, although I'm unsure whether encouraging this is a good idea, since requests such as `ucxx::Endpoint::amRecv` require that the user retrieve the resulting buffer from the returned `ucxx::RequestAm`, and if this is discarded the buffer is lost. This PR adds a test for the aforementioned use case but makes no changes to documentation to prevent encouraging this pattern until we can decide whether we should support it or not.

For requests such as `ucxx::Endpoint::amRecv`, it might be worth studying whether we could pass the resulting buffer and any other attributes associated with it to the callback, in that case we may be able to provide a safe pattern to always use the callback if the user doesn't want to keep a reference to the returned `ucxx::Request` object.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #267
Prevent calling `set_result()`/`set_exception()` on Python futures if they are already `cancelled()`/`done()`. Calling `set_result()`/`set_exception()` when the future has already completed raises an exception that may be uncatched as it may happen on a thread different from the application, which seems to have been one of the reasons why `distributed-ucxx` tests have been hanging.

Although not all test failures have been resolved by this, hangs have seemingly also been less frequent in local tests.

Additionally set more sensible log levels for other errors coming from CPython implementation and remove the `PyErr_Print()` which will just print a message to stdout instead of a log message.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #281
Forward-merge branch-0.40 into branch-0.41
Fix bug with `create_endpoint`, where a `UCXMessageTruncatedError` may not be properly raised, as well as its use in `distributed-ucxx` not raising the exception expected by `distributed`.

Additionally, fix wireup issues in tests, where only sending (or receiving) a wireup message does not seem to suffice, but sending _and_ receiving suffices to prevent errors in tests.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #293
Rewrite tracking of distributed-ucxx communicator resources providing more robust lifetime tracking, allowing better control over termination of notifier and progress threads.

This seems to have resolved deadlock issues in Distributed for good this time and if not, it definitely improved the status substantially. Another benefit is the removal of `distributed_patches.py`, meaning there's no need anymore to monkey-patch Distributed resources to do any tracking, and instead do that on the communication resources alone which are entirely controlled by `distributed-ucxx`, in the long-run it will also cause less breakage since we were previously bound to start/stop mechanisms in Distributed to remain unchanged.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #291
Instead of creating a per-UCXX context lock that may result in race conditions during the creation of the locks, use a module-level lock that is guaranteed to be thread-safe as it occurs at import time.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #295
Pulls in a few changes from KvikIO's `Array`. Namely...

* `asarray` for easy conversion to `Array` (no-op if already an `Array`)
* Include types for a few more properties

Authors:
  - https://github.com/jakirkham

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #290
This PR updates the RMM import to use `pylibrmm` now that `rmm._lib` is deprecated . It should be merged after rapidsai/rmm#1676.

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #283
Contributes to rapidsai/build-planning#106

Proposes specifying the RAPIDS version in `conda install` calls in CI that install CI artifacts, to reduce the risk of CI jobs picking up artifacts from other releases.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #298
These are all pure C functions implemented in Cython. They do not raise. Cython 3 adds checks for exceptions in these functions, which is unnecessary given none of these set exceptions. So add `noexcept` to turn off these Cython checks.

xref: rapidsai/kvikio#502

Authors:
  - https://github.com/jakirkham

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #306
In `Array`, `Py_ssize_t[::1]` objects are currently backed by [CPython `array`'s]( https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#cpython-array-module ) with some internal bits expressed in Cython. However these are not compatible with [Python's Limited API and Stable ABI]( https://docs.python.org/3/c-api/stable.html#c-api-stability ). To address that, switch to [Cython's own `array` type]( https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#cython-arrays ). As this is baked into Cython and doesn't use anything special, it is compatible with Python's Limited API and Stable ABI.

xref: rapidsai/build-planning#42

Authors:
  - https://github.com/jakirkham

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #307
Implements the blocking progress mode (UCX-Py default), which was still not implemented in UCXX.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Ray Douglass (https://github.com/raydouglass)

URL: #116
Ensure all Python multiprocessing tests timeout during `join` and get terminated properly, appropriately raising an error if the subprocess failed to terminate cleanly.

To simplify joining of multiple processes, a new testing function `join_processes` was added, where all processes in the list will be joined with a common timeout, if the total time elapsed is longer than the timeout the process will still be joined but wait will be non-positive, meaning `join` returns immediately and the process is later terminated with `terminate_process`.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #300
One of the goals with UCXX is to, as much as possible, prevent the user from having to manually cleanup resources, which allows for a simpler programming model and prevents memory leaks. However, there are some resources that are notably difficult to ensure they _will_ be cleaned up appropriately, threads for example.

With this PR the user will now be warned about running threads when a worker is being destroyed, but will nevertheless attempt to stop them. The expectation is that with this change the user can be told how to resolve such problems so that it can be done manually to guarantee there's no leakage of resources in the running thread. What currently happens is, when a certain codeblock causes a `ucxx::Worker` to go out-of-scope, if care is not taken to ensure, for example, the progress thread has already completed processing all pending tasks (e.g., `ucxx::Request`s), the surviving thread will end up destroying the `ucxx::Worker`, and subsequently itself, from the running progress thread which is an invalid pattern and will cause deadlocks. This is currently observed intermittently in some C++ tests, where `std::system_error` may be raised:

```
terminate called after throwing an instance of 'std::system_error'
  what():  Resource deadlock avoided
timeout: the monitored command dumped core
```

The tests are also modified to apply this pattern of calling `ucxx::Worker::stopProgressThread()` at teardown, and thus prevent errors like above from occurring.

It may still be possible to handle this issue more gracefully, but for the moment it's best to ensure the user takes care of it while a more resilient solution can be worked on.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #304
Generic callbacks should not be canceled once they start running. The guarantee provided by `ucxx::Worker` is that once the function returns it should be safe to destroy the callback and all its associated resources, which becomes invalid if the callback is scheduled for cancellation but it is already running, therefore, it's a requirement to check whether the callback is already executing and block it until it's finished. If the callback never completes this may cause an irrecoverable hang which cannot be dealt with from UCXX since it's impossible to stop a callback from executing once it has started, it's the user's responsibility to guarantee the callback must return. A warning is raised after multiples of 10 attempts have been tried to cancel a callback that is being executed and canceling did not succeed, so that the user is informed of what is happening.

The most notable issue is somewhat frequently observable in CI, where the Python async test `test_from_worker_address_multinode` would segfault, in particular with larger amount of endpoints. This was observable in those tests more frequently because there's a large amount of endpoints being created simultaneously by multiple processes, putting more pressure in the resources and causing endpoint creation to take several seconds to complete. In those cases the generic callback executing `ucp_ep_create` would take longer than the default timeout of 3 seconds and in some cases that would be interpreted as the callback timed out, since `ucp_ep_create` itself took longer than 3 seconds, causing the worker to attempt to cancel the callback while it was still executing. With this change, the callback will still timeout but only if it didn't start executing yet, if `ucp_ep_create` ends up never returning, this will cause a deadlock in the application but there's no way for UCXX to recover on its own and warnings are raised, although those hypothetical deadlocks have not been observed in local tests so far. Segfaults should not occur in this situation anymore.

Additionally, unit tests for generic callbacks are now included, which previously were a gap in the testing suite.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #303
Contributes to rapidsai/build-planning#108
Contributes to rapidsai/build-planning#111

Proposes some small packaging/CI changes, matching similar changes being made across RAPIDS.

* building `libucxx` wheels with `--no-build-isolation` (for better `sccache` hit rate)
* printing `sccache` stats to CI logs
* updating to the latest `rapids-dependency-file-generator` (v1.16.0)
* always explicitly specifying `cpp` / `python` in calls to `rapids-upload-wheels-to-s3`
* moving more one-wheel-specific logic into `build_wheel_{project}.sh` scripts, mimicking how `cudf` has structured its scripts

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #301
Extend Python benchmarks with `socket` and `asyncio.Stream{Reader,Writer}` APIs. This should help us better understand the overhead of UCXX compared to Python's internal implementations, and thus help us improve potential suboptimal code in our implementation.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #309
Fix AM receive message cancelation in C++ implementation that requires a custom `cancel()` implementation, as well as log when allocator raised an exception and return `UCXNoMemoryError` to requests where allocation failed. Reimplement Python AM RMM allocator with a pure C++ function to prevent Cython from introducing Python exception handlers that should not occur in the allocators as the C++ backend should not require the GIL.

Enable AM tests that were previously skipped and update `test_send_recv_two_workers` to match AM implementation.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - James Lamb (https://github.com/jameslamb)

URL: #315
Contributes to rapidsai/build-planning#110

Proposes adding 2 types of validation on wheels in CI, to ensure we continue to produce wheels that are suitable for PyPI.

* checks on wheel size (compressed),
  - *to be sure they're under PyPI limits*
  - *and to prompt discussion on PRs that significantly increase wheel sizes*
* checks on README formatting
  - *to ensure they'll render properly as the PyPI project homepages*
  - *e.g. like how https://github.com/scikit-learn/scikit-learn/blob/main/README.rst becomes https://pypi.org/project/scikit-learn/*

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #319
Modern Python should use `weakref.finalize` instead of `__del__`. This change removes all legacy uses of `__del__` in favor of `weakref.finalize`.

Additionally rename the `id` attribute from `ActiveClients` as it's a reserved keyword in Python, thus it's best if we use another name such as `ident`.

Closes #209

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #317
The error raised may manifest in different ways depending on the ordering and timing actions occur, as well as transports used. The approach implemented in this PR seems to prevent timing issues by waiting for the remote process to terminate entirely, and not only depend on the `ucxx.reset()` call.

In a local environment the issue was reproducible in under 5 minutes rerunning the test, with this change it didn't reproduce after 2 hours of consecutive rerunning.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #324
The Python futures pool retains a reference to `_notifier`, which prevents further cleanup of the `ucxx::{python::,}Worker`. Thus a new `clearFuturesPool` method is introduced to allow clearing the pool so that these references are dropped, which is called before the Python `_notifierThread` exits.

Additionally, move `ApplicationContext.progress_tasks` to a global `ProgressTasks` dictionary. Having the progress tasks as a member attribute of `ApplicationContext` forces us to keep a reference to it that prevents from properly cleaning up in the `weakref.finalize`. Using an external object allows us to prevent the self-reference and clean it up properly when no further references to `ApplicationContext` exist or `ucxx.reset()` is called.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #318
Contributes to rapidsai/build-planning#118

Modifies `libucxx.load_library()` in the following ways:

* prefer wheel-provided `libucxx.so` to system installation
* expose environment variable `RAPIDS_LIBUCXX_PREFER_SYSTEM_LIBRARY` for switching that preference
* load `libucxx.so` with `RTLD_LOCAL`, to prevent adding symbols to the global namespace ([dlopen docs](https://linux.die.net/man/3/dlopen))

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #322
The plan is for `pynvml` to become a metapackage in version 12, where it not anymore ship NVML bindings but instead install `nvidia-ml-py` to serve the official NVML bindings. To provide a better future support, setting an upper pin for `pynvml` is a safer choice.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - https://github.com/jakirkham

URL: #323
@raydouglass raydouglass requested review from a team as code owners November 21, 2024 20:50
@raydouglass raydouglass requested review from a team as code owners November 21, 2024 20:50
@raydouglass raydouglass requested review from bdice and removed request for a team November 21, 2024 20:50
The timeout for Python core tests seems now too tight, possibly due to high CI load at times. Double timeout from 2 minutes to 4 minutes.
@raydouglass raydouglass merged commit 7d7c2c7 into main Dec 11, 2024
1445 of 1463 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants