Failing to run cudf_merge benchmark on a node with 4 H100 #1088

Open
orliac opened this issue Oct 23, 2024 · 3 comments
orliac commented Oct 23, 2024

Hi there,
I'm facing an issue when trying to run the cudf_merge benchmark locally on a node that hosts 4 H100s:

        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV6     NV6     NV6     SYS     SYS     SYS     24-31   3               N/A
GPU1    NV6      X      NV6     NV6     SYS     SYS     SYS     24-31   3               N/A
GPU2    NV6     NV6      X      NV6     SYS     SYS     SYS     40-47   5               N/A
GPU3    NV6     NV6     NV6      X      SYS     SYS     SYS     40-47   5               N/A
NIC0    SYS     SYS     SYS     SYS      X      PIX     SYS
NIC1    SYS     SYS     SYS     SYS     PIX      X      SYS
NIC2    SYS     SYS     SYS     SYS     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_2
  NIC1: mlx5_3
  NIC2: mlx5_bond_0

I can run the benchmark on any pair of GPUs without issue:

python -m ucp.benchmarks.cudf_merge --devs 0,1 --chunk-size 200_000_000 --iter 10

ucx-py-cu12            0.40.0

[1729696581.284731] [kh013:1596654:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696584.749090] [kh013:1596654:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696586.301065] [kh013:1596677:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696586.311799] [kh013:1596678:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696586.928897] [kh013:1596677:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696586.928901] [kh013:1596678:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696586.949084] [kh013:1596677:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696586.949104] [kh013:1596678:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696586.971682] [kh013:1596654:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#2 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696586.988770] [kh013:1596677:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696586.992192] [kh013:1596678:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696586.994004] [kh013:1596678:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696586.994007] [kh013:1596677:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696587.000277] [kh013:1596654:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696587.123809] [kh013:1596678:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696587.135094] [kh013:1596677:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696587.136850] [kh013:1596677:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696587.137598] [kh013:1596678:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
cuDF merge benchmark
--------------------------------------------------------------------------------------------------------------
Device(s)                 | [0, 1]
Chunks per device         | 1
Rows per chunk            | 200000000
Total data processed      | 119.21 GiB
Data processed per iter   | 11.92 GiB
Row matching fraction     | 0.3
==============================================================================================================
Wall-clock                | 3.00 s
Bandwidth                 | 24.88 GiB/s
Throughput                | 39.68 GiB/s
==============================================================================================================
Run                       | Wall-clock                | Bandwidth                 | Throughput
0                         | 161.36 ms                 | 108.39 GiB/s              | 73.88 GiB/s
1                         | 360.91 ms                 | 18.61 GiB/s               | 33.03 GiB/s
2                         | 455.87 ms                 | 13.33 GiB/s               | 26.15 GiB/s
3                         | 383.55 ms                 | 16.98 GiB/s               | 31.08 GiB/s
4                         | 169.31 ms                 | 90.74 GiB/s               | 70.41 GiB/s
5                         | 474.04 ms                 | 12.65 GiB/s               | 25.15 GiB/s
6                         | 293.38 ms                 | 25.85 GiB/s               | 40.63 GiB/s
7                         | 370.52 ms                 | 17.85 GiB/s               | 32.17 GiB/s
8                         | 161.21 ms                 | 108.41 GiB/s              | 73.95 GiB/s
9                         | 169.63 ms                 | 90.10 GiB/s               | 70.28 GiB/s
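As a sanity check, the reported "Data processed per iter" of 11.92 GiB is consistent with the chunk size if one assumes roughly 32 bytes per row across the benchmark's frames (the bytes-per-row figure is my assumption here, not something read from the cudf_merge source):

```python
# Check "Data processed per iter" against the chunk size.
# BYTES_PER_ROW is an assumed factor, not taken from the benchmark code.
BYTES_PER_ROW = 32
devices = 2
rows_per_chunk = 200_000_000

data_per_iter_gib = devices * rows_per_chunk * BYTES_PER_ROW / 2**30
print(f"{data_per_iter_gib:.2f} GiB")  # -> 11.92 GiB, matching the table
```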

But it fails when run over all 4 devices:

python -m ucp.benchmarks.cudf_merge --devs 0,1,2,3 --chunk-size 200_000_000 --iter 10

[1729696635.679592] [kh013:1596934:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696639.178149] [kh013:1596934:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696642.222481] [kh013:1596952:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696642.243167] [kh013:1596955:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696642.245405] [kh013:1596953:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696642.247080] [kh013:1596954:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729696642.977422] [kh013:1596952:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696642.980930] [kh013:1596955:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696642.997896] [kh013:1596952:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.000134] [kh013:1596955:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.012295] [kh013:1596954:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696643.014948] [kh013:1596953:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729696643.020409] [kh013:1596934:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#2 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.035409] [kh013:1596953:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.035662] [kh013:1596954:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.039847] [kh013:1596952:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.041863] [kh013:1596955:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.043738] [kh013:1596952:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.043739] [kh013:1596955:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.046726] [kh013:1596954:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.048626] [kh013:1596953:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.049389] [kh013:1596954:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.050116] [kh013:1596934:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.050260] [kh013:1596953:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.076486] [kh013:1596954:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.087261] [kh013:1596953:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.089062] [kh013:1596953:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.089778] [kh013:1596954:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.103213] [kh013:1596955:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.114753] [kh013:1596955:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.176847] [kh013:1596953:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#8 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.184527] [kh013:1596952:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.185616] [kh013:1596954:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#8 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729696643.186623] [kh013:1596952:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.187346] [kh013:1596954:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#9 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729696643.187424] [kh013:1596953:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#9 tag(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1 cuda_ipc/cuda)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
Task exception was never retrieved
future: <Task finished name='Task-7' coro=<_listener_handler_coroutine() done, defined at /work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py:140> exception=UCXCanceled("<[Recv #002] ep: 0x7f5af660f140, tag: 0xedf8353cc3df7250, nbytes: 8, type: <class 'array.array'>>: ")>
Traceback (most recent call last):
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 190, in _listener_handler_coroutine
    await func(ep)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 106, in server_handler
    worker_results = await recv_pickled_msg(ep)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 75, in recv_pickled_msg
    msg = await ep.recv_obj()
          ^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 863, in recv_obj
    await self.recv(nbytes, tag=tag)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 737, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ucp._libs.exceptions.UCXCanceled: <[Recv #002] ep: 0x7f5af660f140, tag: 0xedf8353cc3df7250, nbytes: 8, type: <class 'array.array'>>: 
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<_listener_handler_coroutine() done, defined at /work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py:140> exception=UCXCanceled("<[Recv #002] ep: 0x7f5af660f080, tag: 0xd49e6a08b8eeaedd, nbytes: 8, type: <class 'array.array'>>: ")>
Traceback (most recent call last):
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 190, in _listener_handler_coroutine
    await func(ep)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 106, in server_handler
    worker_results = await recv_pickled_msg(ep)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 75, in recv_pickled_msg
    msg = await ep.recv_obj()
          ^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 863, in recv_obj
    await self.recv(nbytes, tag=tag)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 737, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ucp._libs.exceptions.UCXCanceled: <[Recv #002] ep: 0x7f5af660f080, tag: 0xd49e6a08b8eeaedd, nbytes: 8, type: <class 'array.array'>>: 
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<_listener_handler_coroutine() done, defined at /work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py:140> exception=UCXCanceled("<[Recv #002] ep: 0x7f5af660f0c0, tag: 0xbe680e4915f49a08, nbytes: 8, type: <class 'array.array'>>: ")>
Traceback (most recent call last):
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 190, in _listener_handler_coroutine
    await func(ep)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 106, in server_handler
    worker_results = await recv_pickled_msg(ep)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 75, in recv_pickled_msg
    msg = await ep.recv_obj()
          ^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 863, in recv_obj
    await self.recv(nbytes, tag=tag)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 737, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ucp._libs.exceptions.UCXCanceled: <[Recv #002] ep: 0x7f5af660f0c0, tag: 0xbe680e4915f49a08, nbytes: 8, type: <class 'array.array'>>: 
Task exception was never retrieved
future: <Task finished name='Task-6' coro=<_listener_handler_coroutine() done, defined at /work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py:140> exception=UCXCanceled("<[Recv #002] ep: 0x7f5af660f100, tag: 0xf1c3abccc2be7d01, nbytes: 8, type: <class 'array.array'>>: ")>
Traceback (most recent call last):
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 190, in _listener_handler_coroutine
    await func(ep)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 106, in server_handler
    worker_results = await recv_pickled_msg(ep)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 75, in recv_pickled_msg
    msg = await ep.recv_obj()
          ^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 863, in recv_obj
    await self.recv(nbytes, tag=tag)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/core.py", line 737, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ucp._libs.exceptions.UCXCanceled: <[Recv #002] ep: 0x7f5af660f100, tag: 0xf1c3abccc2be7d01, nbytes: 8, type: <class 'array.array'>>: 
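For what it's worth, the "Task exception was never retrieved" messages themselves are standard asyncio behavior, not UCX-specific: a background task raised (here, UCXCanceled) and nothing ever awaited it or called .exception() on it. A minimal stdlib illustration, unrelated to the benchmark's own code:

```python
import asyncio

# A background task that fails, standing in for the cancelled recv.
async def boom():
    raise RuntimeError("cancelled transfer")

async def main():
    task = asyncio.create_task(boom())
    await asyncio.sleep(0)  # let the task run and fail
    # Awaiting the task (or calling task.exception()) retrieves the
    # exception, which is what silences the "never retrieved" warning.
    try:
        await task
    except RuntimeError as e:
        return str(e)

result = asyncio.run(main())
print(result)  # -> cancelled transfer
```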
^CProcess Process-1:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
Process Process-3:
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/cudf_merge.py", line 633, in <module>
Process Process-2:
    main()
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/cudf_merge.py", line 590, in main
    stats = [server_queue.get() for i in range(args.n_chunks)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/cudf_merge.py", line 590, in <listcomp>
    stats = [server_queue.get() for i in range(args.n_chunks)]
             ^^^^^^^^^^^^^^^^^^
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/multiprocessing/queues.py", line 103, in get
    res = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/multiprocessing/connection.py", line 216, in recv_bytes
Traceback (most recent call last):
    buf = self._recv_bytes(maxlength)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/multiprocessing/connection.py", line 430, in _recv_bytes
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
    buf = self._recv(4)
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/work/scitas-ge/orliac/KUMA_VENVS/UCX-PY-BENCH/lib/python3.11/site-packages/ucp/benchmarks/utils.py", line 125, in _server_process
    ret = loop.run_until_complete(run())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          ^^^^^^^^^^^^^
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/asyncio/base_events.py", line 640, in run_until_complete
    self.run_forever()
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/multiprocessing/connection.py", line 395, in _recv
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/asyncio/base_events.py", line 607, in run_forever
    self._run_once()
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/asyncio/base_events.py", line 1884, in _run_once
    event_list = self._selector.select(timeout)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

My environment:

Package                Version
---------------------- -----------
cachetools             5.5.0
click                  8.1.7
cloudpickle            3.1.0
cuda-python            12.6.0
cudf-cu12              24.10.1
cupy-cuda12x           13.3.0
dask                   2024.9.0
dask-cudf-cu12         24.10.1
dask-expr              1.1.14
distributed            2024.9.0
fastrlock              0.8.2
fsspec                 2024.10.0
importlib_metadata     8.5.0
Jinja2                 3.1.4
libcudf-cu12           24.10.1
llvmlite               0.43.0
locket                 1.0.0
markdown-it-py         3.0.0
MarkupSafe             3.0.2
mdurl                  0.1.2
msgpack                1.1.0
numba                  0.60.0
numpy                  2.0.2
nvtx                   0.2.10
packaging              24.1
pandas                 2.2.2
partd                  1.4.2
pip                    23.2.1
psutil                 6.1.0
pyarrow                17.0.0
Pygments               2.18.0
pylibcudf-cu12         24.10.1
pynvjitlink-cu12       0.3.0
python-dateutil        2.9.0.post0
pytz                   2024.2
PyYAML                 6.0.2
rapids-dask-dependency 24.10.0
rich                   13.9.2
rmm-cu12               24.10.0
setuptools             65.5.0
six                    1.16.0
sortedcontainers       2.4.0
tblib                  3.0.0
toolz                  1.0.0
tornado                6.4.1
typing_extensions      4.12.2
tzdata                 2024.2
ucx-py-cu12            0.40.0
urllib3                2.2.3
zict                   3.0.0
zipp                   3.20.2

Any ideas?

Also, I'm surprised by the variability of the benchmark across the 10 successive iterations.
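To put a number on that variability, the wall-clock times of the 10 runs reported above for --devs 0,1 have a coefficient of variation of roughly 40%:

```python
import statistics

# Wall-clock times (ms) of the 10 runs reported above for --devs 0,1.
runs_ms = [161.36, 360.91, 455.87, 383.55, 169.31,
           474.04, 293.38, 370.52, 161.21, 169.63]

mean = statistics.mean(runs_ms)
cv = statistics.pstdev(runs_ms) / mean  # relative spread of the runs
print(f"mean={mean:.1f} ms, CV={cv:.0%}")  # -> mean=300.0 ms, CV=40%
```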

And finally, is the benchmark expected to saturate the available bandwidth between the GPUs?
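For a rough ceiling to compare against: the topology reports NV6 between every GPU pair, and assuming 4th-generation NVLink at about 25 GB/s per link per direction (my assumption for these H100s; the exact part's specs should be checked), the per-direction peak for a pair would be:

```python
# Rough theoretical ceiling for an NV6 GPU pair. The per-link rate is
# an assumption (4th-gen NVLink, ~25 GB/s per direction), not measured.
links = 6            # "NV6" in the nvidia-smi topology above
gbps_per_link = 25   # GB/s per direction, assumed
peak = links * gbps_per_link
print(peak)  # -> 150 (GB/s per direction)
```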

pentschev (Member) commented:

Indeed, that doesn't look right. I don't have immediate access to a system with H100s, but I can try to get one tomorrow. In the meantime, I was able to run the benchmark on a DGX-1, and the results I see are much more in line with what we would expect, although I had to reduce the chunk size to 100M due to the amount of memory on the V100s:

2 GPUs
$ python -m ucp.benchmarks.cudf_merge --devs 0,1 --chunk-size 100_000_000 --iter 10
cuDF merge benchmark
--------------------------------------------------------------------------------------------------------------
Device(s)                 | [0, 1]
Chunks per device         | 1
Rows per chunk            | 100000000
Total data processed      | 59.60 GiB
Data processed per iter   | 5.96 GiB
Row matching fraction     | 0.3
==============================================================================================================
Wall-clock                | 3.11 s
Bandwidth                 | 19.62 GiB/s
Throughput                | 19.16 GiB/s
==============================================================================================================
Run                       | Wall-clock                | Bandwidth                 | Throughput
0                         | 310.42 ms                 | 19.59 GiB/s               | 19.20 GiB/s
1                         | 309.82 ms                 | 19.70 GiB/s               | 19.24 GiB/s
2                         | 309.94 ms                 | 19.64 GiB/s               | 19.23 GiB/s
3                         | 309.69 ms                 | 19.66 GiB/s               | 19.25 GiB/s
4                         | 310.31 ms                 | 19.51 GiB/s               | 19.21 GiB/s
5                         | 310.90 ms                 | 19.64 GiB/s               | 19.17 GiB/s
6                         | 310.08 ms                 | 19.69 GiB/s               | 19.22 GiB/s
7                         | 312.31 ms                 | 19.68 GiB/s               | 19.09 GiB/s
8                         | 310.86 ms                 | 19.58 GiB/s               | 19.17 GiB/s
9                         | 310.12 ms                 | 19.49 GiB/s               | 19.22 GiB/s
4 GPUs
$ python -m ucp.benchmarks.cudf_merge --devs 0,1,2,3 --chunk-size 100_000_000 --iter 10
cuDF merge benchmark
--------------------------------------------------------------------------------------------------------------
Device(s)                 | [0, 1, 2, 3]
Chunks per device         | 1
Rows per chunk            | 100000000
Total data processed      | 119.21 GiB
Data processed per iter   | 11.92 GiB
Row matching fraction     | 0.3
==============================================================================================================
Wall-clock                | 2.59 s
Bandwidth                 | 54.69 GiB/s
Throughput                | 45.95 GiB/s
==============================================================================================================
Run                       | Wall-clock                | Bandwidth                 | Throughput
0                         | 259.75 ms                 | 52.47 GiB/s               | 45.89 GiB/s
1                         | 258.23 ms                 | 54.83 GiB/s               | 46.16 GiB/s
2                         | 257.99 ms                 | 55.27 GiB/s               | 46.21 GiB/s
3                         | 259.56 ms                 | 54.79 GiB/s               | 45.93 GiB/s
4                         | 260.32 ms                 | 53.85 GiB/s               | 45.79 GiB/s
5                         | 258.22 ms                 | 55.15 GiB/s               | 46.16 GiB/s
6                         | 258.11 ms                 | 55.54 GiB/s               | 46.19 GiB/s
7                         | 258.34 ms                 | 54.97 GiB/s               | 46.14 GiB/s
8                         | 258.46 ms                 | 55.07 GiB/s               | 46.12 GiB/s
9                         | 258.89 ms                 | 55.12 GiB/s               | 46.05 GiB/s
8 GPUs
$ python -m ucp.benchmarks.cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --iter 10
cuDF merge benchmark
--------------------------------------------------------------------------------------------------------------
Device(s)                 | [0, 1, 2, 3, 4, 5, 6, 7]
Chunks per device         | 1
Rows per chunk            | 100000000
Total data processed      | 238.42 GiB
Data processed per iter   | 23.84 GiB
Row matching fraction     | 0.3
==============================================================================================================
Wall-clock                | 5.18 s
Bandwidth                 | 12.45 GiB/s
Throughput                | 45.99 GiB/s
==============================================================================================================
Run                       | Wall-clock                | Bandwidth                 | Throughput
0                         | 519.35 ms                 | 12.41 GiB/s               | 45.91 GiB/s
1                         | 521.78 ms                 | 12.41 GiB/s               | 45.69 GiB/s
2                         | 519.69 ms                 | 12.42 GiB/s               | 45.88 GiB/s
3                         | 514.90 ms                 | 12.45 GiB/s               | 46.30 GiB/s
4                         | 517.55 ms                 | 12.46 GiB/s               | 46.07 GiB/s
5                         | 515.59 ms                 | 12.51 GiB/s               | 46.24 GiB/s
6                         | 521.08 ms                 | 12.43 GiB/s               | 45.75 GiB/s
7                         | 514.37 ms                 | 12.41 GiB/s               | 46.35 GiB/s
8                         | 516.32 ms                 | 12.48 GiB/s               | 46.18 GiB/s
9                         | 516.35 ms                 | 12.47 GiB/s               | 46.17 GiB/s

Based on the affinity reported by your system in the output of nvidia-smi topo -m I suspect this is only a partition of a node, is that right? Are you able to get a full node allocation to test as well? Could you also try disabling InfiniBand (UCX_TLS=^rc) and then NVLink (UCX_TLS=^cuda_ipc) to see whether the errors and the variability go away? Could you also report what cat /proc/cpuinfo shows? I want to confirm that only the cores with matching affinity are available to your node partition, although I'm not entirely sure /proc/cpuinfo alone will provide that information.
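
To make the suggested A/B runs concrete, the transport toggles could look like this (a hedged sketch; the `--devs`/`--chunk-size` values are just the ones used elsewhere in this thread):

```shell
# Baseline: let UCX pick its default transports
python -m ucp.benchmarks.cudf_merge --devs 0,1,2,3 --chunk-size 100_000_000 --iter 10

# Disable InfiniBand (exclude the rc transport)
UCX_TLS=^rc python -m ucp.benchmarks.cudf_merge --devs 0,1,2,3 --chunk-size 100_000_000 --iter 10

# Disable NVLink peer copies (exclude cuda_ipc)
UCX_TLS=^cuda_ipc python -m ucp.benchmarks.cudf_merge --devs 0,1,2,3 --chunk-size 100_000_000 --iter 10
```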

And finally, is it expected that the benchmark saturates the available bandwidth between the GPUs?

That is a good question. I'm not sure we have done such testing in the past, but for the 2-GPU case we get 85-90% of the expected bandwidth: ~19.5 GiB/s, where ucx_perftest alone reports ~22.5 GiB/s (note that ucx_perftest reports GB/s and not GiB/s; I've done the conversion myself so we're comparing at the right scale):
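
For reference, the unit conversion mentioned above can be sanity-checked with a few lines of Python (a sketch of the GB/s-to-GiB/s arithmetic only, not part of either benchmark):

```python
# ucx_perftest reports decimal units (1 MB = 10**6 bytes), while the
# cudf_merge benchmark reports binary GiB/s (1 GiB = 2**30 bytes).

def mb_per_s_to_gib_per_s(mb_per_s: float) -> float:
    """Convert decimal MB/s (as printed by ucx_perftest) to binary GiB/s."""
    return mb_per_s * 10**6 / 2**30

# 1 GB/s is roughly 0.93 GiB/s, so numbers in GB/s always look ~7% larger
# than the same bandwidth expressed in GiB/s.
print(round(mb_per_s_to_gib_per_s(1000.0), 3))
```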

ucx_perftest 2xV100

```
$ ucx_perftest -t tag_bw -m cuda -s 1000000000 -n 1000 & ucx_perftest -t tag_bw -m cuda -s 1000000000 -n 1000 localhost
[1] 1875533
[1729719349.121679] [dgx13:1875534:0]  perftest.c:793  UCX  WARN  CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
[1729719349.121731] [dgx13:1875533:0]  perftest.c:793  UCX  WARN  CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 127.0.0.1:33582
+----------------------------------------------------------------------------------------------------------+
| API:          protocol layer                                                                              |
| Test:         tag match bandwidth                                                                         |
| Data layout:  (automatic)                                                                                 |
| Send memory:  cuda                                                                                        |
| Recv memory:  cuda                                                                                        |
| Message size: 1000000000                                                                                  |
| Window size:  32                                                                                          |
+----------------------------------------------------------------------------------------------------------+
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0]    57      3.855  18078.963  18078.963   52750.50   52750.50     55     55
[thread 0]    82  41114.179  41216.364  25133.049   23138.24   37945.03     24     40
[thread 0]   107  41118.761  41216.316  28890.821   23138.27   33009.60     24     35
[thread 0]   132  41119.304  41216.402  31225.212   23138.22   30541.80     24     32
[thread 0]   157  41119.606  41216.240  32816.140   23138.31   29061.14     24     30
[thread 0]   182  41119.606  41216.278  33970.005   23138.29   28074.01     24     29
[thread 0]   207  41119.798  41216.326  34845.164   23138.27   27368.91     24     29
[thread 0]   232  41119.923  41216.316  35531.711   23138.27   26840.09     24     28
[thread 0]   257  41120.042  41216.516  36084.708   23138.16   26428.77     24     28
[thread 0]   282  41120.065  41216.278  36539.634   23138.29   26099.72     24     27
[thread 0]   307  41120.199  41216.364  36920.475   23138.24   25830.50     24     27
[thread 0]   332  41120.195  41216.278  37243.954   23138.29   25606.15     24     27
[thread 0]   357  41120.195  41216.478  37522.142   23138.18   25416.31     24     27
[thread 0]   382  41120.199  41216.326  37763.908   23138.27   25253.59     24     26
[thread 0]   407  41120.263  41216.354  37975.975   23138.25   25112.57     24     26
[thread 0]   432  41120.337  41216.364  38163.497   23138.24   24989.17     24     26
[thread 0]   457  41120.364  41216.316  38330.501   23138.27   24880.30     24     26
[thread 0]   482  41120.434  41216.440  38480.186   23138.20   24783.52     24     26
[thread 0]   507  41120.435  41216.288  38615.103   23138.29   24696.93     24     26
[thread 0]   532  41120.435  41216.316  38737.340   23138.27   24618.99     24     26
[thread 0]   557  41120.453  41216.402  38848.609   23138.22   24548.48     24     26
[thread 0]   582  41120.453  41216.316  38950.314   23138.27   24484.38     24     26
[thread 0]   607  41120.510  41216.278  39043.641   23138.29   24425.86     24     26
[thread 0]   632  41120.543  41216.326  39129.585   23138.27   24372.21     24     26
[thread 0]   657  41120.543  41216.402  39208.992   23138.22   24322.85     24     26
[thread 0]   682  41120.550  41216.240  39282.572   23138.31   24277.29     24     25
[thread 0]   707  41120.577  41216.316  39350.950   23138.27   24235.10     24     25
[thread 0]   732  41120.577  41216.478  39414.664   23138.18   24195.93     24     25
[thread 0]   757  41120.568  41216.240  39474.161   23138.31   24159.46     24     25
[thread 0]   782  41120.568  41216.326  39529.857   23138.27   24125.42     24     25
[thread 0]   807  41120.567  41216.316  39582.102   23138.27   24093.57     24     25
[thread 0]   832  41120.553  41216.478  39631.211   23138.18   24063.72     24     25
[thread 0]   857  41120.577  41216.240  39677.449   23138.31   24035.68     24     25
[thread 0]   882  41120.581  41216.402  39721.070   23138.22   24009.28     24     25
[thread 0]   907  41120.581  41216.316  39762.284   23138.27   23984.39     24     25
[thread 0]   932  41120.581  41216.326  39801.288   23138.27   23960.89     24     25
[thread 0]   957  41120.591  41216.316  39838.253   23138.27   23938.66     24     25
[thread 0]   982  41120.599  41216.316  39873.336   23138.27   23917.60     24     25
Final:       1000  41120.606 114493.397  41216.497    8329.51   23138.17      9     24
```

Also note that a DGX-1 doesn't have an NVSwitch connecting all GPUs, so when scaling to all of them we expect to be limited by the ConnectX-4 bandwidth.

@orliac

orliac commented Oct 24, 2024

Thanks @pentschev for the quick feedback.

No, I'm using a full node in exclusive mode for these tests. There are 8 NUMA nodes with 8 physical CPU cores each, and the GPUs are connected in pairs to the same NUMA node. I can provide the full lstopo output if of interest.

Note that we may have an issue on our side with respect to affinity, as nothing actually forces Slurm to allocate CPU cores on the same NUMA node as the GPU when you request a single GPU. But in this case I'm using the full node, and at least the list of CPU cores is not empty, so I assume it works as expected on that side. But to be checked.
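
One quick way to check which cores Slurm actually handed the job, and to compare them with the CPU affinity column of nvidia-smi topo -m, is from Python itself (a generic Linux sketch, not part of the benchmark):

```python
import os

# Cores this process is allowed to run on, as constrained by Slurm/cgroups
# (Linux-only API).
allowed = sorted(os.sched_getaffinity(0))
print(f"{len(allowed)} allowed cores: {allowed}")

# For the 4-GPU node above, GPUs 0-1 report affinity 24-31 and GPUs 2-3
# report 40-47, so on a full-node allocation both ranges should appear here.
```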

export UCX_TLS=^rc changes nothing.

export UCX_TLS=^cuda_ipc allows me to run the 4-GPU case, but gives (as expected) lower bandwidths and does not help with the repeatability:

2 GPUs
[1729749691.952192] [kh060:2253630:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729749695.174229] [kh060:2253630:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_TLS=^cuda_ipc UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729749696.726614] [kh060:2253657:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729749696.732299] [kh060:2253658:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729749696.961931] [kh060:2253658:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_TLS=^cuda_ipc UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729749696.969666] [kh060:2253657:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_TLS=^cuda_ipc UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729749696.981722] [kh060:2253658:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749696.989279] [kh060:2253657:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749697.003612] [kh060:2253630:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#2 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749697.019853] [kh060:2253658:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749697.023013] [kh060:2253657:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749697.025036] [kh060:2253657:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729749697.025051] [kh060:2253658:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729749697.030642] [kh060:2253630:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729749697.054524] [kh060:2253658:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749697.064794] [kh060:2253657:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749697.066477] [kh060:2253657:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729749697.067184] [kh060:2253658:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
cuDF merge benchmark
--------------------------------------------------------------------------------------------------------------
Device(s)                 | [0, 1]
Chunks per device         | 1
Rows per chunk            | 100000000
Total data processed      | 59.60 GiB
Data processed per iter   | 5.96 GiB
Row matching fraction     | 0.3
==============================================================================================================
Wall-clock                | 2.59 s
Bandwidth                 | 11.39 GiB/s
Throughput                | 22.99 GiB/s
==============================================================================================================
Run                       | Wall-clock                | Bandwidth                 | Throughput
0                         | 206.63 ms                 | 15.48 GiB/s               | 28.85 GiB/s
1                         | 510.53 ms                 | 5.01 GiB/s                | 11.68 GiB/s
2                         | 234.18 ms                 | 13.01 GiB/s               | 25.45 GiB/s
3                         | 285.94 ms                 | 10.00 GiB/s               | 20.85 GiB/s
4                         | 226.55 ms                 | 13.60 GiB/s               | 26.31 GiB/s
5                         | 207.24 ms                 | 15.42 GiB/s               | 28.76 GiB/s
6                         | 216.79 ms                 | 14.46 GiB/s               | 27.49 GiB/s
7                         | 207.29 ms                 | 15.41 GiB/s               | 28.75 GiB/s
8                         | 277.84 ms                 | 10.38 GiB/s               | 21.45 GiB/s
9                         | 214.85 ms                 | 14.68 GiB/s               | 27.74 GiB/s

4 GPUs
[1729749706.301045] [kh060:2253823:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729749709.522614] [kh060:2253823:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_TLS=^cuda_ipc UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729749712.463292] [kh060:2253839:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729749712.475283] [kh060:2253841:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729749712.476138] [kh060:2253840:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729749712.480276] [kh060:2253842:0]     ucp_context.c:2190 UCX  INFO  Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1729749712.720902] [kh060:2253840:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_TLS=^cuda_ipc UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729749712.723496] [kh060:2253841:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_TLS=^cuda_ipc UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729749712.723721] [kh060:2253842:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_TLS=^cuda_ipc UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729749712.729463] [kh060:2253839:0]          parser.c:2314 UCX  INFO  UCX_* env variables: UCX_TLS=^cuda_ipc UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1729749712.731856] [kh060:2253841:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.737099] [kh060:2253839:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.741351] [kh060:2253840:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.744331] [kh060:2253842:0]      ucp_worker.c:1888 UCX  INFO  ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.754557] [kh060:2253823:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#2 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.771086] [kh060:2253841:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.774935] [kh060:2253839:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.776781] [kh060:2253840:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.778541] [kh060:2253842:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.780713] [kh060:2253839:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729749712.780713] [kh060:2253841:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729749712.780795] [kh060:2253842:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729749712.780814] [kh060:2253840:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729749712.786425] [kh060:2253823:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729749712.812465] [kh060:2253842:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.822722] [kh060:2253841:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.824464] [kh060:2253841:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729749712.825180] [kh060:2253842:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729749712.913412] [kh060:2253840:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.913563] [kh060:2253841:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#8 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.921241] [kh060:2253839:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#6 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.922209] [kh060:2253840:0]      ucp_worker.c:1888 UCX  INFO      ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_bond_0:1)  rma(rc_mlx5/mlx5_bond_0:1)  am(rc_mlx5/mlx5_bond_0:1)  stream(rc_mlx5/mlx5_bond_0:1)  ka(rc_mlx5/mlx5_bond_0:1)
[1729749712.923049] [kh060:2253839:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#7 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729749712.923769] [kh060:2253840:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#8 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729749712.924376] [kh060:2253840:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#9 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
[1729749712.924412] [kh060:2253841:0]      ucp_worker.c:1888 UCX  INFO    ucp_context_0 intra-node cfg#9 tag(rc_mlx5/mlx5_2:1)  rma(rc_mlx5/mlx5_2:1)  am(rc_mlx5/mlx5_2:1)  stream(rc_mlx5/mlx5_2:1)  ka(ud_mlx5/mlx5_2:1)
cuDF merge benchmark
--------------------------------------------------------------------------------------------------------------
Device(s)                 | [0, 1, 2, 3]
Chunks per device         | 1
Rows per chunk            | 100000000
Total data processed      | 119.21 GiB
Data processed per iter   | 11.92 GiB
Row matching fraction     | 0.3
==============================================================================================================
Wall-clock                | 5.47 s
Bandwidth                 | 6.94 GiB/s
Throughput                | 21.78 GiB/s
==============================================================================================================
Run                       | Wall-clock                | Bandwidth                 | Throughput
0                         | 391.79 ms                 | 10.15 GiB/s               | 30.43 GiB/s
1                         | 742.05 ms                 | 5.02 GiB/s                | 16.06 GiB/s
2                         | 731.00 ms                 | 5.00 GiB/s                | 16.31 GiB/s
3                         | 682.10 ms                 | 5.45 GiB/s                | 17.48 GiB/s
4                         | 608.11 ms                 | 6.12 GiB/s                | 19.60 GiB/s
5                         | 518.30 ms                 | 7.44 GiB/s                | 23.00 GiB/s
6                         | 539.24 ms                 | 6.98 GiB/s                | 22.11 GiB/s
7                         | 388.66 ms                 | 10.26 GiB/s               | 30.67 GiB/s
8                         | 398.94 ms                 | 9.98 GiB/s                | 29.88 GiB/s
9                         | 468.55 ms                 | 8.28 GiB/s                | 25.44 GiB/s

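To put a number on the repeatability problem, the per-run bandwidths from the 4-GPU table above can be summarized with the standard library (the values are copied from that table; treating the coefficient of variation as "jitter" is just an illustration):

```python
import statistics

# Per-run bandwidth (GiB/s) from the 4-GPU UCX_TLS=^cuda_ipc run above.
bw = [10.15, 5.02, 5.00, 5.45, 6.12, 7.44, 6.98, 10.26, 9.98, 8.28]

mean = statistics.mean(bw)
spread = statistics.pstdev(bw)
cv = spread / mean  # coefficient of variation; ~0.27 here, i.e. ~27% jitter

print(f"mean={mean:.2f} GiB/s, stdev={spread:.2f} GiB/s, cv={cv:.0%}")
```
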
Yes, if you can access a H100 node and reproduce the case that would be a great point of comparison. Thanks again.

@pentschev

I was able to reproduce the variability consistently on an H100 node, with both partial and full node allocations, so presumably this is not related to what I initially thought. I was also able to reproduce some errors preventing the 4-GPU run from succeeding; by the looks of it on my end they occurred during the establishment of endpoints, where I've previously observed flakiness in Dask clusters when a large number of endpoints are created simultaneously. However, I was unable to observe errors exactly like the ones you posted.

I do not have a lead yet as to what happens in H100s, my first guess is that it's related to suboptimal paths or the lack of assigning proper affinity to each process. I'll see if I can do more testing tomorrow or early next week.

To be honest, this is a benchmark that, to my knowledge, is not often used; I haven't touched it myself in probably 2 years. Would you mind briefly describing how you came across it and why you are interested in this one specifically?
