SEGV in libfabric when using SST on Summit #2485
Comments
Unfortunately ADIOS2 does not check the return value of the libfabric calls in source/adios2/toolkit/sst/dp/rdma_dp.c. I hacked the source to check for an error and it appears that the call to fi_domain on line 228 of that file is failing with error 22, invalid argument.
After some digging I discovered the FI_LOG_LEVEL environment variable. Setting it to FI_LOG_LEVEL=debug I get the following (and the SEGV in a different location!)
The message "unsupported endpoint type" appears regularly in this output; possibly this is the smoking gun?
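To quantify how often a message like this appears, the debug output can be captured and the matching lines counted. The following is only a minimal sketch: the helper names `run_with_fi_debug` and `count_messages` are hypothetical, and the only details taken from this thread are the `FI_LOG_LEVEL=debug` setting and the quoted message text.

```python
import os
import subprocess


def run_with_fi_debug(cmd):
    """Run cmd with FI_LOG_LEVEL=debug set and return its combined output.

    stderr is merged into stdout so that log lines are captured
    regardless of which stream they were written to.
    """
    env = dict(os.environ, FI_LOG_LEVEL="debug")
    proc = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,  # the program under test may crash; keep its output anyway
    )
    return proc.stdout


def count_messages(log_text, needles):
    """Return {needle: number of log lines containing it}, case-insensitively."""
    counts = {needle: 0 for needle in needles}
    for line in log_text.splitlines():
        lowered = line.lower()
        for needle in needles:
            if needle.lower() in lowered:
                counts[needle] += 1
    return counts
```

On Summit this might be called as `count_messages(run_with_fi_debug(["jsrun", "-n", "1", "./test"]), ["unsupported endpoint type"])`. A high count by itself does not prove the message is the cause; it only shows where to look first.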
The lack of error checking is an issue that I will address. Given the version of libfabric, it's possible that this is an MR_CACHE issue, as there seem to be some issues caused by the default MR_CACHE being used with the rxm;verbs provider. I am working to verify this now, but I am having some trouble accessing Summit. If this is an MR_CACHE issue, it should be resolvable either by using libfabric 1.9.0 or by setting the environment variable
Hi Philip, thanks for the reply. Unfortunately export FI_MR_CACHE_MAX_COUNT=0 did not seem to solve the issue:
I notice that my install of libfabric 1.11 was built by spack for one of our dependencies. The system has a libfabric 1.7 module. If I build ADIOS against this version I no longer get the error, suggesting it is an issue either with the more recent libfabric or with the way it was built.
Is it possible that the install of 1.11 simply was not built with the correct interfaces? Comparing the configure script between the Summit 1.7 module and the spack 1.11 install:
we observe that 1.7 was built only with "sockets" whereas 1.11 was built with "rxm", "mrail", "verbs" but not "sockets".
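A comparison like this can also be done mechanically by extracting the provider flags from each configure command line. The sketch below is illustrative only: the function name `enabled_providers` and the two sample command lines are hypothetical stand-ins (the real configure lines are not reproduced here), and it assumes the `--enable-<provider>[=value]` / `--disable-<provider>` flag style used by libfabric's configure script. Providers that are not mentioned are treated as not enabled, which is a simplification, since a real configure run may auto-detect them.

```python
import re

# Matches --enable-NAME, --enable-NAME=VALUE and --disable-NAME.
FLAG_RE = re.compile(r"--(enable|disable)-([A-Za-z0-9_]+)(?:=(\S+))?")


def enabled_providers(configure_line, providers):
    """Return the subset of providers explicitly enabled on configure_line.

    Flags for names outside `providers` (for example --enable-shared)
    are ignored. Later flags override earlier ones.
    """
    enabled = set()
    for action, name, value in FLAG_RE.findall(configure_line):
        if name not in providers:
            continue
        if action == "enable" and value != "no":
            enabled.add(name)
        else:
            enabled.discard(name)
    return enabled


if __name__ == "__main__":
    providers = {"sockets", "verbs", "rxm", "mrail"}
    # Hypothetical command lines, not the actual Summit or spack ones.
    line_1_7 = "./configure --enable-sockets"
    line_1_11 = "./configure --enable-rxm --enable-mrail --enable-verbs --disable-sockets"
    old = enabled_providers(line_1_7, providers)
    new = enabled_providers(line_1_11, providers)
    print("only in 1.7 build: ", sorted(old - new))
    print("only in 1.11 build:", sorted(new - old))
```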
I would recommend building libfabric 1.9.0; there seems to be some incompatibility with 1.11.0 that I am investigating. The system install of libfabric 1.7.0 does not offer RDMA support, so SST is falling back to sockets support instead.
I rebuilt with 1.9 and indeed the problem seems to be fixed. Thank you for your help!
When running the unit tests I came across an issue with SST on Summit:
The test is run on a single node interactive session as:
jsrun -n 1 ./test
Note that this is using the system-installed version of ADIOS 2.5.0. The error occurs in the following code, which is executed in a separate thread:
As the module doesn't have debug symbols, I hand-built a version of 2.5.0 circa mid-March (git commit f23e72c) and built my library against it. This gives us more information:
Thus it appears that the libfabric domain pointer is null, causing the SEGV in fi_endpoint when libfabric tries to dereference it.
The bug also appears if I run the sst_conn_tool provided with ADIOS:
I also tested the latest 2.6.0 git revision (e9b41b1, October 7th) and encountered the same issue with sst_conn_tool:
Desktop (please complete the following information):
Summit with modules:
The system installed libfabric version is 1.11.0.
ADIOS2 built as: