libfabric 1.6+: Document SST Work-Arounds #1134

Merged on Nov 3, 2021 (4 commits)
18 changes: 18 additions & 0 deletions docs/source/backends/adios2.rst
@@ -160,6 +160,24 @@ Ignore the 30GB initialization phases.
.. image:: ./memory_groupbased_nosteps.png
   :alt: Memory usage of group-based iteration without using steps


Known Issues
------------

.. warning::

   Nov 1st, 2021 (`ADIOS2 2887 <https://github.com/ornladios/ADIOS2/issues/2887>`__):
   The fabric selection in ADIOS2 was designed for libfabric 1.6.
   With newer versions of libfabric, a workaround is needed to guide the selection of a functional fabric for RDMA support.

   The following environment variables can be set as workarounds on Cray systems when working with ADIOS2 SST:

   .. code-block:: bash

      export FABRIC_IFACE=mlx5_0   # ADIOS SST: select interface (1 NIC on Summit)
      export FI_OFI_RXM_USE_SRX=1  # libfabric: use shared receive context from MSG provider
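
As a minimal sketch, assuming the job script runs in a POSIX shell, the two variables above could be applied only when they are not already set, so that site-specific values take precedence.
The function name ``apply_sst_workarounds`` is hypothetical and not part of ADIOS2 or libfabric; the default values are the ones listed above and may need adjusting on other systems.

.. code-block:: bash

   # Hypothetical helper: set the SST workaround variables unless already defined
   apply_sst_workarounds() {
       # Interface used by ADIOS2 SST (value from the Summit example above)
       export FABRIC_IFACE="${FABRIC_IFACE:-mlx5_0}"
       # Ask libfabric to use the shared receive context of the MSG provider
       export FI_OFI_RXM_USE_SRX="${FI_OFI_RXM_USE_SRX:-1}"
   }

   apply_sst_workarounds

   # Print the effective values so that they end up in the job log
   echo "FABRIC_IFACE=${FABRIC_IFACE}"
   echo "FI_OFI_RXM_USE_SRX=${FI_OFI_RXM_USE_SRX}"

Calling the helper before launching the application keeps the settings in one place and leaves any value exported earlier by the user or the site environment untouched.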
Contributor

Should we also add FI_PROVIDER=verbs? That's the fabric provider that ADIOS2 successfully selects with libfabric 1.6.2 on Summit.
I'm not really sure about the significance of FI_PSM2_DISCONNECT=1 mentioned here and here. I'm currently testing whether that one makes a difference.

Member Author

@ax3l, Nov 2, 2021

I am not sure about this; that would be a question for ornladios/ADIOS2#2887. Can you please raise it there?

Contributor

Done. That said, these variables did not give me successful runs at 1024 nodes (unlike the runs I recently had with libfabric 1.6), so I'll next need smaller-scale debug runs to see what's going on.



Selected References
-------------------
