
BlockingMode._fd_reader_callback asyncio task does not end #1072

Closed
luweizheng opened this issue Sep 22, 2024 · 17 comments

Comments

@luweizheng

Hi there,

I am the maintainer of xoscar and xorbits. xoscar is a lightweight actor programming framework that enables inter-process and inter-node communication. We use ucx-py to accelerate communication. There were no issues before, but recently ucx-py has been consistently reporting the following error.

It seems that some asyncio tasks never end?

Exception in callback <bound method BlockingMode._fd_reader_callback of <ucp.continuous_ucx_progress.BlockingMode object at 0x71df9c35c910>>
handle: <Handle BlockingMode._fd_reader_callback>
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 61, in uvloop.loop.Handle._run
  File "/home/xor/.conda/envs/xor/lib/python3.11/site-packages/ucp/continuous_ucx_progress.py", line 85, in _fd_reader_callback
    assert self.asyncio_task is None or self.asyncio_task.done()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

As this is only an assert statement, I deleted the line. After commenting out the assert, the entire program can run but reports another error.

Task was destroyed but it is pending!
task: <Task pending name='Task-102' coro=<BlockingMode._arm_worker() running at /fs/fast/share/pingtai_cc/envs/cudf/lib/python3.11/site-packages/ucp/continuous_ucx_progress.py:110> wait_for=<_SyncSocketReaderFuture pending cb=[Task.task_wakeup()]>>

In terms of performance for communication and computation across compute nodes, ucx-py is now slightly slower than UNIX sockets. Previously, when this error did not occur, ucx-py was faster than UNIX sockets.

This part feels difficult to debug. Are there any clues to help with debugging?

@pentschev
Member

Hi @luweizheng, thanks for the report. I wasn't familiar with xorbits/xoscar, and it's nice to see you've been using UCX-Py for your projects!

Those errors are generally nothing to worry about. I agree it is not nice to have them; the reason they occur is that UCX-Py attempts to do as much as possible for the user, which in this particular case means we launch an asynchronous task to keep progressing the worker without passing that responsibility to the user. I'd also like to point out that they've always been there; some proof is in this 1.5-year-old PR where I've attempted to resolve this but failed so far. The problem is that there's no good way to stop that task when UCX-Py doesn't control the event loop but the application (xoscar in this case) does, so when the event loop closes, UCX-Py doesn't know about it and can't stop the task, which can no longer be progressed. On the application end you should be able to fix it, though, by running ucp.reset() before the event loop is closed. Could you try that?

As for performance, I don't think that is in any way related to this. It is possible, though, that either your application or the socket library in Python became more efficient over previous releases. Would you mind sharing the previous and the current performance difference between sockets and UCX-Py?

Another change that may have had an impact on performance is in UCX itself: UCX v1.16, which just recently became supported by UCX-Py, switches to protov2 as the default (UCX_PROTO_ENABLE=y, previously n). Although UCX-Py still uses protov1 as the default, it's possible that something has fallen behind in UCX, since protov1 is now legacy and no longer updated. To verify that, you may try downgrading UCX to 1.15.x and also setting UCX_PROTO_ENABLE=y explicitly to see whether you observe any difference in performance. Finally, depending on whether anything on your system has changed, UCX may now select a different set of transports with different performance characteristics. One case I know can have an impact is the use of shared memory, which does NOT support endpoint error handling; endpoint error handling is enabled by default in UCX-Py here and here since at least #829.

Finally, I'd like to point out that UCX-Py is going to be archived some time in the future in favor of UCXX. It has an almost identical API, so hopefully simply changing the install requirements and moving import ucp to import ucxx will suffice for you; I'd therefore recommend switching soon.
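For illustration, a minimal sketch of what that swap might look like in application code, assuming the async API (create_endpoint, send, close) really does carry over unchanged as described above; the address and port are just placeholders taken from this thread:

# Before: import ucp
import ucxx as ucp  # after: alias UCXX so the rest of the code stays untouched

import asyncio
import numpy as np

async def main():
    # Connect to a listener started elsewhere (address/port are placeholders).
    ep = await ucp.create_endpoint("192.168.1.64", 13777)
    msg = np.arange(100, dtype="u1")
    await ep.send(msg)   # same tag-based send call as with UCX-Py
    await ep.close()

asyncio.run(main())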

@luweizheng
Author

Thank you for your detailed response. I have tried all the methods mentioned above, except for ucp.reset().

The good news is that after changing import ucp to import ucxx, the assert errors are finally gone.

The bad news is that using UCX version 1.15.x (which I installed with conda; ucx_info -v also shows 1.15.0) and setting export UCX_PROTO_ENABLE=y is still slower than sockets. endpoint_error_handling=True is only used within ucx-py, and it seems to cause an error with ucxx.

Actually, I installed ucxx with pip install ucxx-cu12 and run it on a CPU-only compute node. Do I need to compile a CPU version myself? I find there are no CPU-only packages on either PyPI or conda, and I am not sure the GPU version works well on CPU.

@luweizheng
Author

You mentioned that "UCX may now select a different set of transports that may have different performance characteristics". For InfiniBand on a CPU node, my UCX_TLS environment variable is currently ib,tcp and UCX_SOCKADDR_TLS_PRIORITY is tcp. Is this correct?

@pentschev
Member

Thank you for your detailed response. I have tried all the methods mentioned above, except for ucp.reset().

And does ucp.reset() ultimately fix the issue in xoscar?

The good news is that after changing import ucp to import ucxx, the assert errors are finally gone.

UCXX also has a different default progress mode than UCX-Py: instead of running as an asynchronous task, it is actually a separate C++ thread that notifies Python futures. It may also have different performance characteristics, which will depend on the workload; hopefully it will perform slightly better in the majority of cases. The blocking progress mode, which is the current default in UCX-Py, is being worked on in rapidsai/ucxx#116 .

The bad news is that using UCX version 1.15.x (which I installed with conda; ucx_info -v also shows 1.15.0) and setting export UCX_PROTO_ENABLE=y is still slower than sockets. endpoint_error_handling=True is only used within ucx-py, and it seems to cause an error with ucxx.

Can you provide details on how much slower it is today versus previously? Also, what error are you seeing? endpoint_error_handling is still used in UCXX.

Actually, I installed ucxx with pip install ucxx-cu12 and run it on a CPU-only compute node. Do I need to compile a CPU version myself? I find there are no CPU-only packages on either PyPI or conda, and I am not sure the GPU version works well on CPU.

I think we currently don't provide a CPU-only UCXX wheel package; I can't recall whether this is just because nobody was using it or whether there were other technical limitations. Could you please file an issue in https://github.com/rapidsai/ucxx/issues about that, and add the same details about it being used by xorbits/xoscar? I'll then make sure the relevant people can respond on what can be done.

You mentioned that "UCX may now select a different set of transports that may have different performance characteristics". For InfiniBand on a CPU node, my UCX_TLS environment variable is currently ib,tcp and UCX_SOCKADDR_TLS_PRIORITY is tcp. Is this correct?

In the past we used to suggest setting UCX_TLS/UCX_SOCKADDR_TLS_PRIORITY because of several issues that existed and affected UCX-Py significantly. This is no longer the case in recent versions of UCX, and I would therefore suggest you don't set any of these variables. They may still be useful for debugging, but generally speaking it's best practice to let UCX determine the appropriate transports.
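If those variables are currently being exported by the application itself, a minimal sketch of dropping them before initialization could look like this (assuming they are set in the process environment and nothing else depends on them):

import os

# Remove any hard-coded transport selection so UCX can decide on its own.
for var in ("UCX_TLS", "UCX_SOCKADDR_TLS_PRIORITY"):
    os.environ.pop(var, None)

import ucp
ucp.init()  # UCX now determines the appropriate transports itself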

@luweizheng
Author

@pentschev

I have not attempted ucp.reset() because I'm actually unsure where it would be appropriate to add it.

I tried recreating a conda environment, installing UCX version 1.15, and setting the environment variable export UCX_PROTO_ENABLE=y, but these efforts did not have much effect. By "effect," I mean that I expected UCX, as the communication backend on a 2-node setup, to be faster than sockets.

The error encountered with endpoint_error_handling=True seems like it should be reported as an issue in the ucxx repository.

@pentschev
Member

I have not attempted ucp.reset() because I'm actually unsure where it would be appropriate to add it.

That would be just before your event loop closes, which is often either before the end of run_until_complete() or just before the application terminates.
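As a rough sketch of what that placement could look like (assuming the application owns the loop and does nothing UCX-related afterwards):

import asyncio
import ucp

async def main():
    # ... application work using UCX-Py endpoints ...
    # Tear down UCX-Py state while the loop is still running, so the
    # internal progress task can be stopped cleanly.
    ucp.reset()

loop = asyncio.new_event_loop()
try:
    loop.run_until_complete(main())
finally:
    loop.close()  # by this point UCX-Py should have no pending progress task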

I tried recreating a conda environment, installing UCX version 1.15, and setting the environment variable export UCX_PROTO_ENABLE=y, but these efforts did not have much effect. By "effect," I mean that I expected UCX, as the communication backend on a 2-node setup, to be faster than sockets.

You still haven't provided a reference for how much slower we're talking about. Do you have numbers you can share?

The error encountered with endpoint_error_handling=True seems like it should be reported as an issue in the ucxx repository.

Yes, please do so.

@luweizheng
Author

Hi @pentschev

Since I moved from ucx-py to ucxx, the assert self.asyncio_task is None or self.asyncio_task.done() asyncio task error no longer occurs. Then, following your suggestion, I did not set the UCX_TLS environment variable during ucxx.init(), letting UCX choose transports on its own.

To show some numbers, I ran TPC-H queries, a data analysis benchmark. For Query 3, the UCX backend takes 53 seconds, while UNIX sockets take only 35 seconds.

I've noticed that when using UCX in our software, CPU usage is high when I check htop. Within xorbits/xoscar, even without running data analysis tasks such as TPC-H, the CPU usage of two processes exceeds 100%. However, when using sockets, no process's CPU usage surpasses 100%. This seems unexpected; ideally the UCX backend should result in lower CPU usage.

So maybe we need to optimize how our code uses UCX?

@pentschev
Member

To show some numbers, I ran TPC-H queries, a data analysis benchmark. For Query 3, the UCX backend takes 53 seconds, while UNIX sockets take only 35 seconds.

Is there an easy way to reproduce that?

Note that establishing UCX endpoints can be more costly than regular sockets, depending on what transports are used. Therefore, depending on what exactly you're timing, which transports UCX uses, how many times endpoints get created during the workflow, and whether the transfers are "large enough" or just plenty of small transfers, I wouldn't be surprised if sockets can outperform UCX. With that said, it would be useful to have more information about those details, such as what your system looks like w.r.t. network interfaces, how many endpoints get created, sizes and amounts of messages, etc.

I've noticed that when using UCX in our software, CPU usage is high when I check htop. Within xorbits/xoscar, even without running data analysis tasks such as TPC-H, the CPU usage of two processes exceeds 100%. However, when using sockets, no process's CPU usage surpasses 100%. This seems unexpected; ideally the UCX backend should result in lower CPU usage.

So maybe we need to optimize how our code uses UCX?

It seems like you're running in polling mode. Do you happen to specify progress_mode to ucxx.init() or set the UCXPY_PROGRESS_MODE environment variable?
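For reference, a minimal sketch of both ways of selecting the progress mode; the mode names "thread" and "polling" are the ones that appear in this thread, and whether other values are accepted is an assumption:

import os

# Option 1: via environment variable, set before UCXX is initialized.
os.environ["UCXPY_PROGRESS_MODE"] = "thread"   # e.g. "thread" or "polling"

# Option 2: explicitly at initialization time.
import ucxx
ucxx.init(progress_mode="thread")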

@luweizheng
Author

luweizheng commented Oct 13, 2024

Hi @pentschev

As I moved from ucx-py to ucxx, I conducted some benchmark tests based on the ucxx README page. Can you help me analyze these results?

I installed ucxx with pip install ucxx-cu12. Each of my nodes has 8 NVIDIA A800 NVLink GPUs (NVLink bandwidth 400GB/s) and 4 200G InfiniBand NICs. This cluster has been used to train an LLM and can achieve FLOPS on par with other NVIDIA clusters. The intra-node GPU send/recv bandwidth meets expectations, but the inter-node send/recv does not quite reach the bandwidth these hardware components can provide. Specifically:

  1. intra-node GPU send/recv:
python -m ucxx.benchmarks.send_recv  \
  --backend ucxx-async  \
  --object_type rmm  \
  --server-dev 0 \
  --client-dev 3 \
  --n-iter 10 \
  --n-bytes 1Gb \
  --n-buffers 2

Result:

Device(s)                 | 0, 3
================================================================================
Bandwidth (average)       | 170.02 GiB/s
Bandwidth (median)        | 170.00 GiB/s
Latency (average)         | 10955243 ns
Latency (median)          | 10957014 ns
  2. inter-node

Results with object_type rmm or cupy are the same.

One compute node acts as the server:

python -m ucxx.benchmarks.send_recv \
  --backend ucxx-async  \
  --object_type rmm \
  --n-iter 3  \
  --n-bytes 1Gb  \
  --server-only  \
  --server-dev 0

Another compute node acts as the client:

python -m ucxx.benchmarks.send_recv \
  --backend ucxx-async \
  --object_type rmm \
  --n-iter 3 \
  --n-bytes 1Gb \
  --client-only \
  --server-address 192.168.1.64 \
  --port 40295 \
  --client-dev 0

I checked with ip a that the server address is the InfiniBand network interface.

Result:

Bandwidth (average)       | 482.24 MiB/s
Bandwidth (median)        | 482.61 MiB/s
Latency (average)         | 1977605542 ns
Latency (median)          | 1976077659 ns

And I used ucx_perftest to test the performance of IB + GPU:

Server:

CUDA_VISIBLE_DEVICES=0 UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc,cuda_copy ucx_perftest -t tag_bw -m cuda -s 100000000 -n 10 -p 9999

Client:

CUDA_VISIBLE_DEVICES=0 UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc,cuda_copy ucx_perftest 192.168.1.64 -t tag_bw -m cuda -s 1000000000 -n 10 -p 9999

I get an average bandwidth of 19905.54 MB/s.

  3. intra-node CPU-only

Server:

python -m ucxx.benchmarks.send_recv \
  --n-bytes 1Gb \
  --server-only \
  --object_type numpy \
  --n-iter 3

Client:

python -m ucxx.benchmarks.send_recv     \
  --backend ucxx-async     \
  --object_type numpy     \
  --n-iter 3 \
  --n-bytes 1Gb \
  --client-only \
  --server-address 192.168.1.64 \
  --port 58129

Result:

Bandwidth (average)       | 1.41 GiB/s
Bandwidth (median)        | 1.41 GiB/s
Latency (average)         | 659478115 ns
Latency (median)          | 661842484 ns

Shouldn't the bandwidth be 10+ GiB/s?

And I used ucx_perftest to test the performance of IB.

On the server side:

ucx_perftest -t tag_bw -s 1000000000 -n 20 -p 9999

On the client side:

ucx_perftest 192.168.1.65 -t tag_bw -s 1000000000 -n 20 -p 9999

I get:

bandwidth (MB/s) average 46830.10, overall 46830.10.

That seems better than the Python benchmark code?

@pentschev
Member

Unfortunately I can't see anything obviously wrong, nor can I reproduce what you're observing. This is what I get on a DGX-1 with ConnectX-4:

ucx_perftest
$ ucx_perftest -t tag_bw -s 1000000000 -n 20 -p 9999
[1728905866.183627] [dgx14:3888014:0]        perftest.c:793  UCX  WARN  CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
Waiting for connection...
^C
(rn-240924) pentschev@dgx14:~$ ucx_perftest -t tag_bw -s 1000000000 -n 20 -p 9999 10.33.227.163
[1728905870.345216] [dgx14:3888057:0]        perftest.c:793  UCX  WARN  CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
Final:                    20      0.473 59230.042 59230.042    16101.19   16101.19          17          17
send_recv --backend ucxx-async
$ python -m ucxx.benchmarks.send_recv   --backend ucxx-async   --object_type rmm   --n-iter 3   --n-bytes 1Gb   --client-only   --server-address 10.33.227.163 --client-dev 0 --port 44604
Client connecting to server at 10.33.227.163:44604
Roundtrip benchmark
================================================================================
Iterations                | 3
Bytes                     | 0.93 GiB
Number of buffers         | 1
Object type               | rmm
Reuse allocation          | False
Transfer API              | TAG
Progress mode             | thread
UCX_TLS                   | all
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | 0, 0
================================================================================
Bandwidth (average)       | 9.96 GiB/s
Bandwidth (median)        | 9.97 GiB/s
Latency (average)         | 93474441 ns
Latency (median)          | 93390427 ns
================================================================================
Iterations                | Bandwidth, Latency
--------------------------------------------------------------------------------
0                         | 9.99 GiB/s, 93230532ns
1                         | 9.97 GiB/s, 93390427ns
2                         | 9.93 GiB/s, 93802365ns
send_recv --backend ucxx-core
$ python -m ucxx.benchmarks.send_recv   --backend ucxx-core   --object_type rmm   --n-iter 3   --n-bytes 1Gb   --client-only   --server-address 10.33.227.163 --client-dev 0 --port 35792 --reuse-alloc
Client connecting to server at 10.33.227.163:35792
Roundtrip benchmark
================================================================================
Iterations                | 3
Bytes                     | 0.93 GiB
Number of buffers         | 1
Object type               | rmm
Reuse allocation          | True
Transfer API              | TAG
Progress mode             | thread
Asyncio wait              | False
Delay progress            | False
UCX_TLS                   | all
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | 0, 0
================================================================================
Bandwidth (average)       | 17.51 GiB/s
Bandwidth (median)        | 17.71 GiB/s
Latency (average)         | 53178034 ns
Latency (median)          | 52588915 ns
================================================================================
Iterations                | Bandwidth, Latency
--------------------------------------------------------------------------------
0                         | 18.42 GiB/s, 50551416ns
1                         | 17.71 GiB/s, 52588915ns
2                         | 16.51 GiB/s, 56393771ns

As you can see, the async backend indeed is a bit slower, but ucxx-core is consistent with ucx_perftest. What versions of UCX and UCXX are you running?

Could you also try with ucxx-core as above, and also separately with polling progress mode by setting UCXPY_PROGRESS_MODE=polling?

@luweizheng
Author

Hi @pentschev

The issue I've found is that the intra-node communication bandwidth is much lower than expected. My hardware should be a bit more powerful than what you listed; I'm using ConnectX-6 200Gbps network cards. The result I got with ucxx is only 482.24 MiB/s, while your result is about 9.9 GiB/s. My question is: is there a problem with the way I've installed it, or am I missing some necessary packages?

I installed it with pip install ucxx-cu12. Are there any other packages I need to install?

@luweizheng
Author

UPDATE:

I created a new conda environment, installed the precompiled ucxx using conda/mamba, and tested it. The bandwidth is much higher than what I got with the pip install. Maybe the conda/mamba version shows the expected performance for my hardware?

conda/mamba version
python -m ucxx.benchmarks.send_recv   --backend ucxx-core   --object_type rmm   --n-iter 3   --n-bytes 1000000000   --client-only   --server-address 192.168.1.64   --port 13777 
Client connecting to server at 192.168.1.64:13777

Roundtrip benchmark
================================================================================
Iterations                | 3
Bytes                     | 953.67 MiB
Number of buffers         | 1
Object type               | rmm
Reuse allocation          | False
Transfer API              | TAG
Progress mode             | thread
Asyncio wait              | False
Delay progress            | False
UCX_TLS                   | all
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | 0, 0
================================================================================
Bandwidth (average)       | 33.23 GiB/s
Bandwidth (median)        | 33.20 GiB/s
Latency (average)         | 28026924 ns
Latency (median)          | 28050057 ns
================================================================================
Iterations                | Bandwidth, Latency
--------------------------------------------------------------------------------
0                         | 33.30 GiB/s, 27966320ns
1                         | 33.19 GiB/s, 28064393ns
2                         | 33.20 GiB/s, 28050057ns
pip version
python -m ucxx.benchmarks.send_recv   --backend ucxx-core   --object_type rmm   --n-iter 3   --n-bytes 1000000000   --client-only   --server-address 192.168.1.64   --port 13777 
Client connecting to server at 192.168.1.64:13777

Roundtrip benchmark
================================================================================
Iterations                | 3
Bytes                     | 0.93 GiB
Number of buffers         | 1
Object type               | rmm
Reuse allocation          | False
Transfer API              | TAG
Progress mode             | thread
Asyncio wait              | False
Delay progress            | False
UCX_TLS                   | all
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | 0, 0
================================================================================
Bandwidth (average)       | 481.51 MiB/s
Bandwidth (median)        | 480.77 MiB/s
Latency (average)         | 1980607852 ns
Latency (median)          | 1983620332 ns
================================================================================
Iterations                | Bandwidth, Latency
--------------------------------------------------------------------------------
0                         | 480.77 MiB/s, 1983620332ns
1                         | 473.06 MiB/s, 2015982331ns
2                         | 491.02 MiB/s, 1942220891ns

@pentschev
Member

Can you post, with your pip environment active:

  1. the output of the benchmark run with UCX_LOG_LEVEL=INFO;
  2. the output of which ucx_perftest;
  3. the path to your pip environment.

I think there may be some issue with the UCX pip package. I normally use conda, so I don't often see it, and I think the user base of the UCX pip package is slim to none, so there may be issues with it we need to fix; what I've asked for above will help us understand more.

@luweizheng
Author

  1. UCX_LOG_LEVEL
Client connecting to server at 192.168.1.2:13777
Roundtrip benchmark
================================================================================
Iterations                | 3
Bytes                     | 0.93 GiB
Number of buffers         | 1
Object type               | rmm
Reuse allocation          | False
Transfer API              | TAG
Progress mode             | thread
Asyncio wait              | False
Delay progress            | False
UCX_TLS                   | all
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | 0, 0
================================================================================
Bandwidth (average)       | 479.68 MiB/s
Bandwidth (median)        | 479.67 MiB/s
Latency (average)         | 1988160465 ns
Latency (median)          | 1988188649 ns
================================================================================
Iterations                | Bandwidth, Latency
--------------------------------------------------------------------------------
0                         | 479.67 MiB/s, 1988188649ns
1                         | 479.23 MiB/s, 1990000957ns
2                         | 480.13 MiB/s, 1986291788ns
[1729003823.126436] [a800-5:27214:0]     ucp_context.c:1969 UCX  INFO  Version 1.14.1 (loaded from /fs/fast/u20200002/envs/xor/lib/python3.11/site-packages/ucxx/_lib/../../ucxx_cu12.libs/libucp-47506503.so.0.0.0)
[1729003823.158192] [a800-5:27214:0]          parser.c:2000 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=INFO UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1
[1729003823.713420] [a800-5:27214:1]      ucp_worker.c:1783 UCX  INFO  ep_cfg[2]: tag(tcp/ib0.8066)
[1729003823.730701] [a800-5:27214:1]      ucp_worker.c:1783 UCX  INFO      ep_cfg[3]: tag(tcp/ib0.8066)
  2. ucx_perftest
which ucx_perftest
/fs/fast/u20200002/envs/xor/bin/ucx_perftest
  3. pip environment
which pip
/fs/fast/u20200002/envs/xor/bin/pip

@pentschev
Member

Thanks for reporting back. After discussing internally, I now realize the performance difference is indeed expected: pip packages are NOT built with verbs/rdmacm support, because there's no rdma-core package available for pip, whereas rdma-core is available for conda.

Unfortunately, you won't be able to get better performance with the default pip install. However, the UCX pip package is intentionally built so that it picks up a system UCX install if one is available, and only falls back to its own binaries otherwise. With that in mind, you may still be able to resolve your problem if you have UCX installed on your system built with either rdma-core or MOFED support, but again, you'll need to provide a build with that support yourself.
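One way to double-check which UCX binaries a given environment actually loads is to inspect the shared objects mapped into the process after importing the library; this is just a diagnostic sketch (Linux-only, not part of either package):

# Print every libucp shared object mapped into this process, to see whether
# the pip-bundled build or a system UCX install was picked up.
import ucxx  # or `import ucp` for UCX-Py

with open("/proc/self/maps") as f:
    libs = {line.split()[-1] for line in f if "libucp" in line}

for path in sorted(libs):
    print(path)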

@luweizheng
Author

Thanks for all your replies.
The initial AssertionError issue no longer happens, so I will close this issue.

But the end-to-end frameworks (xorbits for distributed dataframes, and xoscar for actors and communication) still face performance issues, even with the conda package that has rdma-core. I will investigate, and I may open new issues in the ucx-py or ucxx repos.

@pentschev
Member

I'm glad you're now unblocked. If you're still seeing performance issues, I think the easiest way to determine whether this is due to some regression in UCXX would be to run xorbits/xoscar with both UCX-Py and UCXX. From this thread I had the impression that there's no record of the performance you previously obtained, and as we've seen, some of the new packages may be lacking features you were previously relying on. With that in mind, we want to know whether you see any performance regression due to the move from UCX-Py to UCXX and/or to the latest UCX packages. Keep in mind that UCX pip packages have only been available since June, so if you were previously using UCX-Py with pip, you must have provided your own UCX build with all the capabilities for the system where you built it.
