-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RELEASE] ucxx v0.41 #331
Merged
Merged
[RELEASE] ucxx v0.41 #331
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Branch 0.41 merge branch 0.40
We were recently inquired by a user whether relying only on the `ucs_status_t` passed to the user callback and discarding the `ucxx::RequestTag` return from `ucxx::Endpoint::tagSend` and `ucxx::Endpoint::tagRecv` is valid. This turns out to be a valid usage, although I'm unsure whether encouraging this is a good idea, since requests such as `ucxx::Endpoint::amRecv` require that the user retrieve the resulting buffer from the returned `ucxx::RequestAm`, and if this is discarded the buffer is lost. This PR adds a test for the aforementioned use case but makes no changes to documentation to prevent encouraging this pattern until we can decide whether we should support it or not. For requests such as `ucxx::Endpoint::amRecv`, it might be worth studying whether we could pass the resulting buffer and any other attributes associated with it to the callback, in that case we may be able to provide a safe pattern to always use the callback if the user doesn't want to keep a reference to the returned `ucxx::Request` object. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #267
Prevent calling `set_result()`/`set_exception()` on Python futures if they are already `cancelled()`/`done()`. Calling `set_result()`/`set_exception()` when the future has already completed raises an exception that may be uncatched as it may happen on a thread different from the application, which seems to have been one of the reasons why `distributed-ucxx` tests have been hanging. Although not all test failures have been resolved by this, hangs have seemingly also been less frequent in local tests. Additionally set more sensible log levels for other errors coming from CPython implementation and remove the `PyErr_Print()` which will just print a message to stdout instead of a log message. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #281
Forward-merge branch-0.40 into branch-0.41
Fix bug with `create_endpoint`, where a `UCXMessageTruncatedError` may not be properly raised, as well as its use in `distributed-ucxx` not raising the exception expected by `distributed`. Additionally, fix wireup issues in tests, where only sending (or receiving) a wireup message does not seem to suffice, but sending _and_ receiving suffices to prevent errors in tests. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #293
Rewrite tracking of distributed-ucxx communicator resources providing more robust lifetime tracking, allowing better control over termination of notifier and progress threads. This seems to have resolved deadlock issues in Distributed for good this time and if not, it definitely improved the status substantially. Another benefit is the removal of `distributed_patches.py`, meaning there's no need anymore to monkey-patch Distributed resources to do any tracking, and instead do that on the communication resources alone which are entirely controlled by `distributed-ucxx`, in the long-run it will also cause less breakage since we were previously bound to start/stop mechanisms in Distributed to remain unchanged. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #291
Contributes to rapidsai/build-planning#94 Authors: - Kyle Edwards (https://github.com/KyleFromNVIDIA) Approvers: - James Lamb (https://github.com/jameslamb) URL: #296
Instead of creating a per-UCXX context lock that may result in race conditions during the creation of the locks, use a module-level lock that is guaranteed to be thread-safe as it occurs at import time. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #295
Pulls in a few changes from KvikIO's `Array`. Namely... * `asarray` for easy conversion to `Array` (no-op if already an `Array`) * Include types for a few more properties Authors: - https://github.com/jakirkham Approvers: - Peter Andreas Entschev (https://github.com/pentschev) - Lawrence Mitchell (https://github.com/wence-) URL: #290
This PR updates the RMM import to use `pylibrmm` now that `rmm._lib` is deprecated . It should be merged after rapidsai/rmm#1676. Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #283
Contributes to rapidsai/build-planning#106 Proposes specifying the RAPIDS version in `conda install` calls in CI that install CI artifacts, to reduce the risk of CI jobs picking up artifacts from other releases. Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Bradley Dice (https://github.com/bdice) URL: #298
These are all pure C functions implemented in Cython. They do not raise. Cython 3 adds checks for exceptions in these functions, which is unnecessary given none of these set exceptions. So add `noexcept` to turn off these Cython checks. xref: rapidsai/kvikio#502 Authors: - https://github.com/jakirkham Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #306
In `Array`, `Py_ssize_t[::1]` objects are currently backed by [CPython `array`'s]( https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#cpython-array-module ) with some internal bits expressed in Cython. However these are not compatible with [Python's Limited API and Stable ABI]( https://docs.python.org/3/c-api/stable.html#c-api-stability ). To address that, switch to [Cython's own `array` type]( https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#cython-arrays ). As this is baked into Cython and doesn't use anything special, it is compatible with Python's Limited API and Stable ABI. xref: rapidsai/build-planning#42 Authors: - https://github.com/jakirkham Approvers: - Mads R. B. Kristensen (https://github.com/madsbk) - Peter Andreas Entschev (https://github.com/pentschev) URL: #307
Implements the blocking progress mode (UCX-Py default), which was still not implemented in UCXX. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) - AJ Schmidt (https://github.com/ajschmidt8) - Ray Douglass (https://github.com/raydouglass) URL: #116
Ensure all Python multiprocessing tests timeout during `join` and get terminated properly, appropriately raising an error if the subprocess failed to terminate cleanly. To simplify joining of multiple processes, a new testing function `join_processes` was added, where all processes in the list will be joined with a common timeout, if the total time elapsed is longer than the timeout the process will still be joined but wait will be non-positive, meaning `join` returns immediately and the process is later terminated with `terminate_process`. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #300
One of the goals with UCXX is to, as much as possible, prevent the user from having to manually cleanup resources, which allows for a simpler programming model and prevents memory leaks. However, there are some resources that are notably difficult to ensure they _will_ be cleaned up appropriately, threads for example. With this PR the user will now be warned about running threads when a worker is being destroyed, but will nevertheless attempt to stop them. The expectation is that with this change the user can be told how to resolve such problems so that it can be done manually to guarantee there's no leakage of resources in the running thread. What currently happens is, when a certain codeblock causes a `ucxx::Worker` to go out-of-scope, if care is not taken to ensure, for example, the progress thread has already completed processing all pending tasks (e.g., `ucxx::Request`s), the surviving thread will end up destroying the `ucxx::Worker`, and subsequently itself, from the running progress thread which is an invalid pattern and will cause deadlocks. This is currently observed intermittently in some C++ tests, where `std::system_error` may be raised: ``` terminate called after throwing an instance of 'std::system_error' what(): Resource deadlock avoided timeout: the monitored command dumped core ``` The tests are also modified to apply this pattern of calling `ucxx::Worker::stopProgressThread()` at teardown, and thus prevent errors like above from occurring. It may still be possible to handle this issue more gracefully, but for the moment it's best to ensure the user takes care of it while a more resilient solution can be worked on. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #304
Generic callbacks should not be canceled once they start running. The guarantee provided by `ucxx::Worker` is that once the function returns it should be safe to destroy the callback and all its associated resources, which becomes invalid if the callback is scheduled for cancellation but it is already running, therefore, it's a requirement to check whether the callback is already executing and block it until it's finished. If the callback never completes this may cause an irrecoverable hang which cannot be dealt with from UCXX since it's impossible to stop a callback from executing once it has started, it's the user's responsibility to guarantee the callback must return. A warning is raised after multiples of 10 attempts have been tried to cancel a callback that is being executed and canceling did not succeed, so that the user is informed of what is happening. The most notable issue is somewhat frequently observable in CI, where the Python async test `test_from_worker_address_multinode` would segfault, in particular with larger amount of endpoints. This was observable in those tests more frequently because there's a large amount of endpoints being created simultaneously by multiple processes, putting more pressure in the resources and causing endpoint creation to take several seconds to complete. In those cases the generic callback executing `ucp_ep_create` would take longer than the default timeout of 3 seconds and in some cases that would be interpreted as the callback timed out, since `ucp_ep_create` itself took longer than 3 seconds, causing the worker to attempt to cancel the callback while it was still executing. With this change, the callback will still timeout but only if it didn't start executing yet, if `ucp_ep_create` ends up never returning, this will cause a deadlock in the application but there's no way for UCXX to recover on its own and warnings are raised, although those hypothetical deadlocks have not been observed in local tests so far. Segfaults should not occur in this situation anymore. Additionally, unit tests for generic callbacks are now included, which previously were a gap in the testing suite. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #303
Contributes to rapidsai/build-planning#108 Contributes to rapidsai/build-planning#111 Proposes some small packaging/CI changes, matching similar changes being made across RAPIDS. * building `libucxx` wheels with `--no-build-isolation` (for better `sccache` hit rate) * printing `sccache` stats to CI logs * updating to the latest `rapids-dependency-file-generator` (v1.16.0) * always explicitly specifying `cpp` / `python` in calls to `rapids-upload-wheels-to-s3` * moving more one-wheel-specific logic into `build_wheel_{project}.sh` scripts, mimicking how `cudf` has structured its scripts Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Bradley Dice (https://github.com/bdice) URL: #301
Extend Python benchmarks with `socket` and `asyncio.Stream{Reader,Writer}` APIs. This should help us better understand the overhead of UCXX compared to Python's internal implementations, and thus help us improve potential suboptimal code in our implementation. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #309
Fix AM receive message cancelation in C++ implementation that requires a custom `cancel()` implementation, as well as log when allocator raised an exception and return `UCXNoMemoryError` to requests where allocation failed. Reimplement Python AM RMM allocator with a pure C++ function to prevent Cython from introducing Python exception handlers that should not occur in the allocators as the C++ backend should not require the GIL. Enable AM tests that were previously skipped and update `test_send_recv_two_workers` to match AM implementation. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) - James Lamb (https://github.com/jameslamb) URL: #315
Contributes to rapidsai/build-planning#110 Proposes adding 2 types of validation on wheels in CI, to ensure we continue to produce wheels that are suitable for PyPI. * checks on wheel size (compressed), - *to be sure they're under PyPI limits* - *and to prompt discussion on PRs that significantly increase wheel sizes* * checks on README formatting - *to ensure they'll render properly as the PyPI project homepages* - *e.g. like how https://github.com/scikit-learn/scikit-learn/blob/main/README.rst becomes https://pypi.org/project/scikit-learn/* Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Bradley Dice (https://github.com/bdice) URL: #319
Modern Python should use `weakref.finalize` instead of `__del__`. This change removes all legacy uses of `__del__` in favor of `weakref.finalize`. Additionally rename the `id` attribute from `ActiveClients` as it's a reserved keyword in Python, thus it's best if we use another name such as `ident`. Closes #209 Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Kyle Edwards (https://github.com/KyleFromNVIDIA) URL: #317
The error raised may manifest in different ways depending on the ordering and timing actions occur, as well as transports used. The approach implemented in this PR seems to prevent timing issues by waiting for the remote process to terminate entirely, and not only depend on the `ucxx.reset()` call. In a local environment the issue was reproducible in under 5 minutes rerunning the test, with this change it didn't reproduce after 2 hours of consecutive rerunning. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #324
The Python futures pool retains a reference to `_notifier`, which prevents further cleanup of the `ucxx::{python::,}Worker`. Thus a new `clearFuturesPool` method is introduced to allow clearing the pool so that these references are dropped, which is called before the Python `_notifierThread` exits. Additionally, move `ApplicationContext.progress_tasks` to a global `ProgressTasks` dictionary. Having the progress tasks as a member attribute of `ApplicationContext` forces us to keep a reference to it that prevents from properly cleaning up in the `weakref.finalize`. Using an external object allows us to prevent the self-reference and clean it up properly when no further references to `ApplicationContext` exist or `ucxx.reset()` is called. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #318
Contributes to rapidsai/build-planning#118 Modifies `libucxx.load_library()` in the following ways: * prefer wheel-provided `libucxx.so` to system installation * expose environment variable `RAPIDS_LIBUCXX_PREFER_SYSTEM_LIBRARY` for switching that preference * load `libucxx.so` with `RTLD_LOCAL`, to prevent adding symbols to the global namespace ([dlopen docs](https://linux.die.net/man/3/dlopen)) Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Peter Andreas Entschev (https://github.com/pentschev) URL: #322
The plan is for `pynvml` to become a metapackage in version 12, where it not anymore ship NVML bindings but instead install `nvidia-ml-py` to serve the official NVML bindings. To provide a better future support, setting an upper pin for `pynvml` is a safer choice. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - https://github.com/jakirkham URL: #323
The timeout for Python core tests seems now too tight, possibly due to high CI load at times. Double timeout from 2 minutes to 4 minutes.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
❄️ Code freeze for
branch-24.12
and v24.12 releaseWhat does this mean?
Only critical/hotfix level issues should be merged into
branch-24.12
until release (merging of this PR).What is the purpose of this PR?
branch-24.12
intomain
for the release