Describe the bug
On our NVIDIA BlueField-2, the mlx5_0 memory domain cannot be opened:
$ ~/local/ucx-bisect/bin/ucx_info -T
[1741378519.572699] [<hostname>-dpu:3871684:0] ib_mlx5.c:610 UCX ERROR mlx5_0: both WC and NC_DEDICATED UAR allocation types are not supported
[1741378519.591187] [<hostname>-dpu:3871684:0] ib_mlx5.c:610 UCX ERROR mlx5_0: both WC and NC_DEDICATED UAR allocation types are not supported
# < failed to open memory domain mlx5_0 >
#
# System topology
#
# +--------+----------+
# |        |          |
# |   MB/s |   mlx5_0 |
# |        |          |
# +--------+----------+
# |        |          |
# | mlx5_0 |        - |
# |        |          |
# +--------+----------+
#
# NUMA memory latency
#
# +--------+----------+
# |        |          |
# | device |   mlx5_0 |
# |        |          |
# +--------+----------+
# |        |          |
# |   nsec |    100.0 |
# |        |          |
# +--------+----------+
# Memory latency is calculated according to the CPU affinity
This is a regression: The first failing commit is ce38486.
I am pretty sure this is also the underlying cause of a failure to import memory handles on the DPU:
(Reproducer) Run make cpu && ./cpu on the CPU and make dpu && ./dpu on the DPU, then copy the mkey from the CPU shell to the DPU shell.
However, that case has been failing since 684f818, so I am not sure how closely the two are related.
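A bisection like the one that pointed to ce38486 can be automated with git bisect run. The following is only a sketch: the function names, the UCX_INFO variable and the file name check.sh are hypothetical, and the rebuild step is left out because it depends on the configure flags in use. It treats a commit as bad when the ucx_info -T output contains the "failed to open memory domain" line quoted above.

```shell
#!/bin/sh
# Hypothetical helper for `git bisect run` (names are illustrative only).

# Returns 0 if stdin contains the failure marker seen in this report.
md_open_failed() {
    grep -q 'failed to open memory domain'
}

# Exit status in the convention of `git bisect run`:
#   0 = good commit, 1 = bad commit.
# UCX_INFO is an assumed variable pointing at the freshly built binary;
# it falls back to whatever `ucx_info` is found in PATH.
bisect_check() {
    if "${UCX_INFO:-ucx_info}" -T 2>&1 | md_open_failed; then
        return 1
    fi
    return 0
}

# Assumed usage, with the rebuild step left as a placeholder:
#   git bisect start <bad> <good>
#   git bisect run sh -c '<rebuild UCX> && . ./check.sh && bisect_check'
# A build that fails should exit with 125 so that git skips the commit
# instead of marking it bad.
```

Note that the check only looks at the output text, so a binary that is missing or crashes without printing the marker would be reported as good.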
Steps to Reproduce
Command line: ucx_info -T
UCX version used + UCX configure flags: ce38486 or v1.18.0.
Setup and versions
OS version + CPU architecture: Ubuntu 20.04.3 LTS on aarch64
cat /etc/issue: Ubuntu 20.04.3 LTS \n \l
uname -a: Linux <hostname>-dpu 5.4.0-1023-bluefield #26-Ubuntu SMP PREEMPT Wed Dec 1 23:59:51 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
For NVIDIA BlueField SmartNIC, cat /etc/mlnx-release: DOCA_v1.2.1_BlueField_OS_Ubuntu_20.04-5.4.0-1023-bluefield-5.5-2.1.7.0-3.8.5.12027-1.signed-aarch64
For RDMA/IB/RoCE related issues:
Driver version:
MLNX_OFED version (ofed_info -s): MLNX_OFED_LINUX-5.5-2.1.7.0:
ibstat
CA 'mlx5_0'
    CA type: MT41686
    Number of ports: 1
    Firmware version: 24.35.3502
    Hardware version: 1
    Node GUID: 0x1070fd03002e730a
    System image GUID: 0x1070fd03002e7308
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 60
        LMC: 0
        SM lid: 60
        Capability mask: 0xa651e84a
        Port GUID: 0x1070fd03002e730a
        Link layer: InfiniBand
I'm happy to produce any further diagnostics you might need.
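For anyone wanting to gather the same set of diagnostics on another machine, a small wrapper along these lines might help. It is a sketch only: the function names and the output file name are made up, and it simply runs the commands already listed in this report, skipping any that are not installed.

```shell
#!/bin/sh
# Run a command if it is installed, printing a header line first so the
# individual outputs can be told apart in the combined report.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '=== %s ===\n' "$*"
        "$@" 2>&1
    else
        printf '=== %s: not installed ===\n' "$1"
    fi
}

# Collect the setup information requested in the issue template.
collect_diagnostics() {
    run_if_present uname -a
    run_if_present cat /etc/issue
    run_if_present cat /etc/mlnx-release
    run_if_present ofed_info -s
    run_if_present ibstat
    run_if_present ibv_devinfo -vv
    run_if_present ucx_info -d
}

# Assumed usage (the file name is arbitrary):
#   collect_diagnostics > ucx-report.txt
```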