[RELEASE] ucxx v0.41 #331

raydouglass · 2024-11-21T20:50:49Z

❄️ Code freeze for `branch-24.12` and v24.12 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-24.12 until release (merging of this PR).

What is the purpose of this PR?

Update documentation
Allow testing for the new release
Enable a means to merge branch-24.12 into main for the release

Branch 0.41 merge branch 0.40

We were recently inquired by a user whether relying only on the `ucs_status_t` passed to the user callback and discarding the `ucxx::RequestTag` return from `ucxx::Endpoint::tagSend` and `ucxx::Endpoint::tagRecv` is valid. This turns out to be a valid usage, although I'm unsure whether encouraging this is a good idea, since requests such as `ucxx::Endpoint::amRecv` require that the user retrieve the resulting buffer from the returned `ucxx::RequestAm`, and if this is discarded the buffer is lost. This PR adds a test for the aforementioned use case but makes no changes to documentation to prevent encouraging this pattern until we can decide whether we should support it or not. For requests such as `ucxx::Endpoint::amRecv`, it might be worth studying whether we could pass the resulting buffer and any other attributes associated with it to the callback, in that case we may be able to provide a safe pattern to always use the callback if the user doesn't want to keep a reference to the returned `ucxx::Request` object. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #267

Prevent calling `set_result()`/`set_exception()` on Python futures if they are already `cancelled()`/`done()`. Calling `set_result()`/`set_exception()` when the future has already completed raises an exception that may be uncatched as it may happen on a thread different from the application, which seems to have been one of the reasons why `distributed-ucxx` tests have been hanging. Although not all test failures have been resolved by this, hangs have seemingly also been less frequent in local tests. Additionally set more sensible log levels for other errors coming from CPython implementation and remove the `PyErr_Print()` which will just print a message to stdout instead of a log message. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #281

Forward-merge branch-0.40 into branch-0.41

Fix bug with `create_endpoint`, where a `UCXMessageTruncatedError` may not be properly raised, as well as its use in `distributed-ucxx` not raising the exception expected by `distributed`. Additionally, fix wireup issues in tests, where only sending (or receiving) a wireup message does not seem to suffice, but sending _and_ receiving suffices to prevent errors in tests. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #293

Rewrite tracking of distributed-ucxx communicator resources providing more robust lifetime tracking, allowing better control over termination of notifier and progress threads. This seems to have resolved deadlock issues in Distributed for good this time and if not, it definitely improved the status substantially. Another benefit is the removal of `distributed_patches.py`, meaning there's no need anymore to monkey-patch Distributed resources to do any tracking, and instead do that on the communication resources alone which are entirely controlled by `distributed-ucxx`, in the long-run it will also cause less breakage since we were previously bound to start/stop mechanisms in Distributed to remain unchanged. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #291

Contributes to rapidsai/build-planning#94 Authors: - Kyle Edwards (https://github.com/KyleFromNVIDIA) Approvers: - James Lamb (https://github.com/jameslamb) URL: #296

Instead of creating a per-UCXX context lock that may result in race conditions during the creation of the locks, use a module-level lock that is guaranteed to be thread-safe as it occurs at import time. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #295

Pulls in a few changes from KvikIO's `Array`. Namely... * `asarray` for easy conversion to `Array` (no-op if already an `Array`) * Include types for a few more properties Authors: - https://github.com/jakirkham Approvers: - Peter Andreas Entschev (https://github.com/pentschev) - Lawrence Mitchell (https://github.com/wence-) URL: #290

This PR updates the RMM import to use `pylibrmm` now that `rmm._lib` is deprecated . It should be merged after rapidsai/rmm#1676. Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #283

Contributes to rapidsai/build-planning#106 Proposes specifying the RAPIDS version in `conda install` calls in CI that install CI artifacts, to reduce the risk of CI jobs picking up artifacts from other releases. Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Bradley Dice (https://github.com/bdice) URL: #298

These are all pure C functions implemented in Cython. They do not raise. Cython 3 adds checks for exceptions in these functions, which is unnecessary given none of these set exceptions. So add `noexcept` to turn off these Cython checks. xref: rapidsai/kvikio#502 Authors: - https://github.com/jakirkham Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #306

In `Array`, `Py_ssize_t[::1]` objects are currently backed by [CPython `array`'s]( https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#cpython-array-module ) with some internal bits expressed in Cython. However these are not compatible with [Python's Limited API and Stable ABI]( https://docs.python.org/3/c-api/stable.html#c-api-stability ). To address that, switch to [Cython's own `array` type]( https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#cython-arrays ). As this is baked into Cython and doesn't use anything special, it is compatible with Python's Limited API and Stable ABI. xref: rapidsai/build-planning#42 Authors: - https://github.com/jakirkham Approvers: - Mads R. B. Kristensen (https://github.com/madsbk) - Peter Andreas Entschev (https://github.com/pentschev) URL: #307

Implements the blocking progress mode (UCX-Py default), which was still not implemented in UCXX. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) - AJ Schmidt (https://github.com/ajschmidt8) - Ray Douglass (https://github.com/raydouglass) URL: #116

Ensure all Python multiprocessing tests timeout during `join` and get terminated properly, appropriately raising an error if the subprocess failed to terminate cleanly. To simplify joining of multiple processes, a new testing function `join_processes` was added, where all processes in the list will be joined with a common timeout, if the total time elapsed is longer than the timeout the process will still be joined but wait will be non-positive, meaning `join` returns immediately and the process is later terminated with `terminate_process`. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #300

One of the goals with UCXX is to, as much as possible, prevent the user from having to manually cleanup resources, which allows for a simpler programming model and prevents memory leaks. However, there are some resources that are notably difficult to ensure they _will_ be cleaned up appropriately, threads for example. With this PR the user will now be warned about running threads when a worker is being destroyed, but will nevertheless attempt to stop them. The expectation is that with this change the user can be told how to resolve such problems so that it can be done manually to guarantee there's no leakage of resources in the running thread. What currently happens is, when a certain codeblock causes a `ucxx::Worker` to go out-of-scope, if care is not taken to ensure, for example, the progress thread has already completed processing all pending tasks (e.g., `ucxx::Request`s), the surviving thread will end up destroying the `ucxx::Worker`, and subsequently itself, from the running progress thread which is an invalid pattern and will cause deadlocks. This is currently observed intermittently in some C++ tests, where `std::system_error` may be raised: ``` terminate called after throwing an instance of 'std::system_error' what(): Resource deadlock avoided timeout: the monitored command dumped core ``` The tests are also modified to apply this pattern of calling `ucxx::Worker::stopProgressThread()` at teardown, and thus prevent errors like above from occurring. It may still be possible to handle this issue more gracefully, but for the moment it's best to ensure the user takes care of it while a more resilient solution can be worked on. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #304

Generic callbacks should not be canceled once they start running. The guarantee provided by `ucxx::Worker` is that once the function returns it should be safe to destroy the callback and all its associated resources, which becomes invalid if the callback is scheduled for cancellation but it is already running, therefore, it's a requirement to check whether the callback is already executing and block it until it's finished. If the callback never completes this may cause an irrecoverable hang which cannot be dealt with from UCXX since it's impossible to stop a callback from executing once it has started, it's the user's responsibility to guarantee the callback must return. A warning is raised after multiples of 10 attempts have been tried to cancel a callback that is being executed and canceling did not succeed, so that the user is informed of what is happening. The most notable issue is somewhat frequently observable in CI, where the Python async test `test_from_worker_address_multinode` would segfault, in particular with larger amount of endpoints. This was observable in those tests more frequently because there's a large amount of endpoints being created simultaneously by multiple processes, putting more pressure in the resources and causing endpoint creation to take several seconds to complete. In those cases the generic callback executing `ucp_ep_create` would take longer than the default timeout of 3 seconds and in some cases that would be interpreted as the callback timed out, since `ucp_ep_create` itself took longer than 3 seconds, causing the worker to attempt to cancel the callback while it was still executing. With this change, the callback will still timeout but only if it didn't start executing yet, if `ucp_ep_create` ends up never returning, this will cause a deadlock in the application but there's no way for UCXX to recover on its own and warnings are raised, although those hypothetical deadlocks have not been observed in local tests so far. Segfaults should not occur in this situation anymore. Additionally, unit tests for generic callbacks are now included, which previously were a gap in the testing suite. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #303

Contributes to rapidsai/build-planning#108 Contributes to rapidsai/build-planning#111 Proposes some small packaging/CI changes, matching similar changes being made across RAPIDS. * building `libucxx` wheels with `--no-build-isolation` (for better `sccache` hit rate) * printing `sccache` stats to CI logs * updating to the latest `rapids-dependency-file-generator` (v1.16.0) * always explicitly specifying `cpp` / `python` in calls to `rapids-upload-wheels-to-s3` * moving more one-wheel-specific logic into `build_wheel_{project}.sh` scripts, mimicking how `cudf` has structured its scripts Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Bradley Dice (https://github.com/bdice) URL: #301

Extend Python benchmarks with `socket` and `asyncio.Stream{Reader,Writer}` APIs. This should help us better understand the overhead of UCXX compared to Python's internal implementations, and thus help us improve potential suboptimal code in our implementation. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #309

Fix AM receive message cancelation in C++ implementation that requires a custom `cancel()` implementation, as well as log when allocator raised an exception and return `UCXNoMemoryError` to requests where allocation failed. Reimplement Python AM RMM allocator with a pure C++ function to prevent Cython from introducing Python exception handlers that should not occur in the allocators as the C++ backend should not require the GIL. Enable AM tests that were previously skipped and update `test_send_recv_two_workers` to match AM implementation. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) - James Lamb (https://github.com/jameslamb) URL: #315

Contributes to rapidsai/build-planning#110 Proposes adding 2 types of validation on wheels in CI, to ensure we continue to produce wheels that are suitable for PyPI. * checks on wheel size (compressed), - *to be sure they're under PyPI limits* - *and to prompt discussion on PRs that significantly increase wheel sizes* * checks on README formatting - *to ensure they'll render properly as the PyPI project homepages* - *e.g. like how https://github.com/scikit-learn/scikit-learn/blob/main/README.rst becomes https://pypi.org/project/scikit-learn/* Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Bradley Dice (https://github.com/bdice) URL: #319

Modern Python should use `weakref.finalize` instead of `__del__`. This change removes all legacy uses of `__del__` in favor of `weakref.finalize`. Additionally rename the `id` attribute from `ActiveClients` as it's a reserved keyword in Python, thus it's best if we use another name such as `ident`. Closes #209 Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Kyle Edwards (https://github.com/KyleFromNVIDIA) URL: #317

The error raised may manifest in different ways depending on the ordering and timing actions occur, as well as transports used. The approach implemented in this PR seems to prevent timing issues by waiting for the remote process to terminate entirely, and not only depend on the `ucxx.reset()` call. In a local environment the issue was reproducible in under 5 minutes rerunning the test, with this change it didn't reproduce after 2 hours of consecutive rerunning. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #324

The Python futures pool retains a reference to `_notifier`, which prevents further cleanup of the `ucxx::{python::,}Worker`. Thus a new `clearFuturesPool` method is introduced to allow clearing the pool so that these references are dropped, which is called before the Python `_notifierThread` exits. Additionally, move `ApplicationContext.progress_tasks` to a global `ProgressTasks` dictionary. Having the progress tasks as a member attribute of `ApplicationContext` forces us to keep a reference to it that prevents from properly cleaning up in the `weakref.finalize`. Using an external object allows us to prevent the self-reference and clean it up properly when no further references to `ApplicationContext` exist or `ucxx.reset()` is called. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #318

Contributes to rapidsai/build-planning#118 Modifies `libucxx.load_library()` in the following ways: * prefer wheel-provided `libucxx.so` to system installation * expose environment variable `RAPIDS_LIBUCXX_PREFER_SYSTEM_LIBRARY` for switching that preference * load `libucxx.so` with `RTLD_LOCAL`, to prevent adding symbols to the global namespace ([dlopen docs](https://linux.die.net/man/3/dlopen)) Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Peter Andreas Entschev (https://github.com/pentschev) URL: #322

The plan is for `pynvml` to become a metapackage in version 12, where it not anymore ship NVML bindings but instead install `nvidia-ml-py` to serve the official NVML bindings. To provide a better future support, setting an upper pin for `pynvml` is a safer choice. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - https://github.com/jakirkham URL: #323

The timeout for Python core tests seems now too tight, possibly due to high CI load at times. Double timeout from 2 minutes to 4 minutes.

raydouglass and others added 28 commits September 19, 2024 12:03

DOC v0.41 Updates [skip ci]

5c74510

Merge branch 'branch-0.40' into branch-0.41-merge-branch-0.40

74073bb

Merge pull request #288 from vyasr/branch-0.41-merge-branch-0.40

aa56d1b

Branch 0.41 merge branch 0.40

Merge pull request #294 from rapidsai/branch-0.40

d8d5ca7

Forward-merge branch-0.40 into branch-0.41

Prune workflows based on changed files (#296)

49bac3e

Contributes to rapidsai/build-planning#94 Authors: - Kyle Edwards (https://github.com/KyleFromNVIDIA) Approvers: - James Lamb (https://github.com/jameslamb) URL: #296

raydouglass requested review from a team as code owners November 21, 2024 20:50

raydouglass requested review from bdice and removed request for a team November 21, 2024 20:50

Increase Python core tests timeout (#332)

243d143

The timeout for Python core tests seems now too tight, possibly due to high CI load at times. Double timeout from 2 minutes to 4 minutes.

raydouglass merged commit 7d7c2c7 into main Dec 11, 2024
1445 of 1463 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RELEASE] ucxx v0.41 #331

[RELEASE] ucxx v0.41 #331

raydouglass commented Nov 21, 2024

[RELEASE] ucxx v0.41 #331

[RELEASE] ucxx v0.41 #331

Conversation

raydouglass commented Nov 21, 2024

❄️ Code freeze for branch-24.12 and v24.12 release

What does this mean?

What is the purpose of this PR?

❄️ Code freeze for `branch-24.12` and v24.12 release