
UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence #10474

Open
wants to merge 21 commits into master
Conversation

@michal-shalev (Contributor) commented Feb 5, 2025:

What?

Ensure Strong Fence is used only in scenarios where it is genuinely required.

  • Add unflushed_lanes to EP struct
  • Add fence_seq to EP and worker structs
  • Add UCP_FENCE_MODE_EP_BASED fence mode
  • Implement per-EP fences
  • Handle fence during one-sided operations (instead of worker fence)
  • Add a correctness test for EP-based fence mode

Why?

The current implementation of ucp_worker_fence always uses a Strong fence regardless of whether multiple lanes were used for operations.
This inefficiency occurs because the system lacks runtime information on which lanes were used.
This leads to suboptimal performance in Single-Rail scenarios, even though Weak Fence would suffice.

How?

To dynamically select between Strong and Weak Fence modes, this PR introduces a mechanism that tracks lane usage at runtime and applies the appropriate fencing at the EP level.
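As a sketch of the selection rule described above (names are illustrative and simplified, not the actual UCX internals): each endpoint keeps a bitmap of lanes that carry unflushed operations, and a strong fence is only needed when more than one bit is set.

```c
#include <stdint.h>

/* Illustrative sketch only; these names are simplified stand-ins for the
 * PR's unflushed_lanes tracking, not the real UCX API. */
typedef uint64_t lane_map_t;

/* A bitmap with zero or one bit set means at most one lane carries
 * unflushed operations. */
static int is_pow2_or_zero(lane_map_t map)
{
    return (map & (map - 1)) == 0;
}

/* A strong fence is required only when operations went out on multiple
 * lanes; otherwise a weak (single-lane) fence suffices. */
static int needs_strong_fence(lane_map_t unflushed_lanes)
{
    return !is_pow2_or_zero(unflushed_lanes);
}
```

With this rule, a Single-Rail run (one lane ever dirty) never pays for a strong fence.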

@michal-shalev michal-shalev added the WIP-DNM Work in progress / Do not review label Feb 5, 2025
@michal-shalev michal-shalev self-assigned this Feb 5, 2025
@michal-shalev michal-shalev marked this pull request as draft February 5, 2025 09:12
@michal-shalev michal-shalev changed the title from "UCP/RMA/FLUSH: Add unflushed_lanes to ucp_ep" to "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence" Feb 5, 2025
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch 3 times, most recently from 16299af to f438a28 on February 9, 2025 17:29
@michal-shalev michal-shalev changed the title from "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence" to "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence (POC)" Feb 9, 2025
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from 480b81a to e7a3f46 on February 9, 2025 18:06
@michal-shalev michal-shalev changed the title from "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence (POC)" to "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence (PoC)" Feb 9, 2025
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from e7a3f46 to 089e5f7 on February 10, 2025 08:38
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from 089e5f7 to 9ce878a on February 10, 2025 09:14
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from 9ce878a to 506f621 on February 10, 2025 10:05
@brminich (Contributor) commented:
can we also avoid fence if all previously issued operations are already completed?

@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch 2 times, most recently from 9a69c5d to 1c0d9f0 on February 20, 2025 00:04
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from 1c0d9f0 to e6279da on February 23, 2025 12:28
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from e6279da to 4e485d8 on February 23, 2025 13:02
test_ep_based_fence_atomic();
}

UCP_INSTANTIATE_TEST_CASE_TLS(test_ucp_ep_based_fence, all, "all")
Contributor:
What if we use UCP_INSTANTIATE_TEST_CASE, to test different transports?

Contributor (Author):
[screenshot] With UCP_INSTANTIATE_TEST_CASE, tcp gets stuck; it looks like the same issue with the unimplemented ucp_wireup_ep_flush(), like the althca issue.

@@ -545,6 +545,10 @@ typedef struct ucp_ep_ext {
arrived before the first one */
} am;

ucp_lane_map_t unflushed_lanes; /* Bitmap of lanes which have unflushed
Contributor:
Can't we put this into flush_state, which is in a union with ep_match? Is this because we use RMA operations before ucp_ep_flush_state_reset?

Contributor (Author):
Exactly

@@ -378,11 +378,11 @@ static ucs_config_field_t ucp_context_config_table[] = {
"another thread, or incoming active messages, but consumes more resources.",
ucs_offsetof(ucp_context_config_t, flush_worker_eps), UCS_CONFIG_TYPE_BOOL},

{"FENCE_MODE", "auto",
{"FENCE_MODE", "strong",
Contributor:
I'd keep auto, because now strong may be relaxed to weak even if it was requested explicitly.

Contributor (Author):
@yosefe what do you think about keeping "auto"?
EP-based will be the default fence mode in the future, and currently strong fence is the default anyway, because proto_enable is true and max_rma_lanes is 1 by default.

Contributor:
Now "strong" does what "auto" did before the change, and there is no way to request a real "strong" fence.

Contributor:
Can we keep auto as the default, which will select strong/weak based on the protocol version as today?
We can probably remove it after ep_based is completed, including the queue.

Contributor (Author):
@yosefe Yes, thanks

((context->config.ext.max_rma_lanes > 1) ||
context->config.ext.proto_enable));
if ((context->config.ext.fence_mode == UCP_FENCE_MODE_STRONG) &&
((context->config.ext.max_rma_lanes == 1) ||
Contributor:
This check is not correct for proto v2: even if max_rma_lanes == 1, proto v2 may use different TLs for different message sizes.

Contributor (Author):
This is why there's a || and not a && between the conditions; or maybe I didn't understand.

Contributor:
seems it should be && instead

Contributor (Author):
Actually, according to De Morgan's law, it should be &&
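As a quick illustration of the identity being applied here (helper names are invented for the example, not part of the PR): negating (max_rma_lanes > 1) || proto_enable yields (max_rma_lanes == 1) && !proto_enable, which is why the || has to become a &&.

```c
/* De Morgan: !(a || b) == (!a && !b).
 * With a = (max_rma_lanes > 1) and b = proto_enable, the negated guard
 * becomes (max_rma_lanes == 1) && !proto_enable. */
static int negated_or(int a, int b)
{
    return !(a || b);
}

static int and_of_negations(int a, int b)
{
    return !a && !b;
}
```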

Contributor (Author):
done

void *request;

request = ucp_ep_flush_internal(ep, 0, &ucp_request_null_param, NULL,
ucp_ep_flushed_callback, "ep_fence_strong");
Contributor:
we have to make sure that UCT_FLUSH_FLAG_REMOTE is passed to uct_ep_flush. probably better to pass it explicitly from this fence function somehow

Contributor (Author):
done

@@ -2232,7 +2232,7 @@ static ucs_status_t ucp_fill_config(ucp_context_h context,
}

if ((context->config.ext.fence_mode == UCP_FENCE_MODE_STRONG) &&
((context->config.ext.max_rma_lanes == 1) ||
((context->config.ext.max_rma_lanes == 1) &&
Contributor:
you can remove one pair of brackets now

Contributor (Author):
done

Contributor (Author):
I returned the "auto" fence mode


}

int is_fence_required() {
return sender().ep()->ext->fence_seq < sender().ep()->worker->fence_seq;
Contributor:
get_ep_fence_seq() < get_worker_fence_seq()

Contributor (Author):
done

test_ucp_memheap::init();
}

uint32_t get_worker_fence_seq() {
Contributor:
In tests we are usually not that verbose; IMO worker_seq() and ep_seq() would be enough.

Contributor (Author):
done

void do_fence() {
uint32_t worker_fence_seq_before = get_worker_fence_seq();
sender().fence();
uint32_t worker_fence_seq_after = get_worker_fence_seq();
Contributor:
worker_fence_seq_after and ep_fence_seq_after look redundant

Contributor (Author):
done

perform_nbx(op, sbuf, size, target, rkey);
do_fence();

bool strong_fence_happened = is_fence_required() && is_strong_fence();
Contributor:
strong_fence_expected

Contributor (Author):
done

rbuf.rkey(sender(), rkey);

if (op == OP_ATOMIC) {
perform_nbx_with_fence(op, sbuf.ptr(), ATOMIC_SIZE,
Contributor:
maybe check for 32 and 64 bits?

Contributor (Author):
done

} \
\
if (_status != UCS_OK) { \
_ret = UCS_STATUS_PTR(_status); \
Contributor:
  • let's not assume the caller will use goto; pass _failed instead of _ret, _err_label
  • maybe make it a compound statement expression and return the status

like we do in ucp_request_get_param(). Or better, an inline function?

Contributor (Author):
done

@michal-shalev michal-shalev requested a review from gleon99 March 9, 2025 09:17
Comment on lines 2234 to 2236
}

if (context->config.ext.fence_mode == UCP_FENCE_MODE_AUTO) {
Contributor:
if (context->config.ext.fence_mode == UCP_FENCE_MODE_EP_BASED) {
    if (context->config.ext.proto_enable) {
        ucs_error("..");
        goto ..
    }
} else if (context->config.ext.fence_mode == UCP_FENCE_MODE_AUTO) {
    ...
}


Contributor (Author):
[screenshot] If fence_mode is UCP_FENCE_MODE_EP_BASED but proto_enable is true, worker_fence_mode isn't set to UCP_FENCE_MODE_EP_BASED; how about this? [screenshot]

#define ucp_ep_handle_fence_if_required(_ep, _status, _ret, _err_label) \
{ \
if ((_ep)->ext->fence_seq < (_ep)->worker->fence_seq) { \
if (!ucs_is_pow2_or_zero((_ep)->ext->unflushed_lanes)) { \
Contributor:
Invert the order of the checks, and use ucs_likely/ucs_unlikely.

Contributor (Author):
Added ucs_likely/unlikely

@@ -81,6 +81,21 @@ typedef uint16_t ucp_ep_flags_t;
#define ucp_ep_refcount_assert(_ep, _type_refcount, _cmp, _val) \
ucp_ep_refcount_field_assert(_ep, refcounts._type_refcount, _cmp, _val)

#define ucp_ep_handle_fence_if_required(_ep, _status, _ret, _err_label) \
{ \
if ((_ep)->ext->fence_seq < (_ep)->worker->fence_seq) { \
Contributor:
unlikely

Contributor (Author):
done

@@ -177,7 +194,7 @@ UCS_TEST_P(test_ucp_fence64, atomic_add_fadd) {
&test_ucp_fence64::blocking_fadd<uint64_t>);
}

UCS_TEST_P(test_ucp_fence64, atomic_add_fadd_strong, "FENCE_MODE=strong") {
UCS_TEST_P(test_ucp_fence64, atomic_add_fadd_strong) {
Contributor:
Why was FENCE_MODE=strong removed?

Contributor (Author):
it was the default, reverting

perform_nbx(op, sbuf, size, target, rkey);
do_fence();

bool strong_fence_expected = is_fence_required() && is_strong_fence();
Contributor:
Do we really have to check is_fence_required() here? We just did a fence in line 575.

Contributor (Author):
Yes. The call to do_fence() only increments the worker's fence sequence; the actual fencing happens inside the next operation (when the next perform_nbx() call checks and applies the fencing logic).
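A minimal model of that lazy-fence behavior (field names follow the PR's fence_seq counters; the helper functions are invented for illustration):

```c
#include <stdint.h>

typedef struct worker {
    uint32_t fence_seq;
} worker_t;

typedef struct ep {
    uint32_t  fence_seq;
    worker_t *worker;
} ep_t;

/* do_fence() analogue: only bumps the worker counter; no fencing yet. */
static void worker_fence(worker_t *worker)
{
    worker->fence_seq++;
}

/* Checked at the start of the next operation on each endpoint. */
static int ep_fence_required(const ep_t *ep)
{
    return ep->fence_seq < ep->worker->fence_seq;
}

/* Applied inside the next operation; catches the EP up to the worker. */
static void ep_apply_fence(ep_t *ep)
{
    /* ... issue the weak or strong fence on the transport here ... */
    ep->fence_seq = ep->worker->fence_seq;
}
```

Endpoints that never issue another operation after the fence call never pay for it.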


if (op == OP_ATOMIC) {
perform_nbx_with_fence(op, sbuf.ptr(), sizeof(uint32_t),
(uint64_t)rbuf.ptr(), rkey);
Contributor:
can we pass rbuf.ptr() also as void*?

Contributor (Author):
AFAIU rbuf.ptr() cannot be passed as void* for remote_addr, because ucp_put_nbx() explicitly requires a uint64_t remote address.

flush_workers();
}
private:
static constexpr uint64_t TEST_BUF_SIZE = 1000000;
Contributor:
size_t

Contributor (Author):
done

Comment on lines +581 to +585
if (strong_fence_expected) {
EXPECT_EQ(worker_fence_seq(), ep_fence_seq());
} else {
EXPECT_NE(worker_fence_seq(), ep_fence_seq());
}
Contributor:
Is there any data-validation test that the fence really forced ordering of operations on the target side?
For example, that after flushing, rbuf is equal to the 2nd sbuf that was written after the fence (this would require sbuf1 and sbuf2, before and after the fence).

Contributor (Author):
This test does not explicitly validate that the fence enforces ordering on the target side. However, test_ucp_rma_order::test_ordering() already verifies ordering on the target side, and I added a variant to validate EP-based fencing.

@gleon99 gleon99 requested a review from ofirfarjun7 March 9, 2025 14:30
@michal-shalev michal-shalev requested a review from yosefe March 9, 2025 20:29
@@ -27,6 +27,11 @@ static ucs_status_t ucp_proto_put_offload_short_progress(uct_pending_req_t *self
ucs_status_t status;
uct_rkey_t tl_rkey;

if (!(req->flags & UCP_REQUEST_FLAG_PROTO_INITIALIZED)) {
@ofirfarjun7 (Contributor) commented Mar 10, 2025:
Maybe we can avoid this branch by using an array of two masks?
WDYT @yosefe, would it benefit performance?

Contributor (Author):
Could you please elaborate? @ofirfarjun7

@michal-shalev (Contributor Author) commented Mar 10, 2025:
Following Ofir's suggestion, I committed a change that updates unflushed_lanes using a signed boolean mask (-!!), without adding a branch. WDYT? @ofirfarjun7 @yosefe
1cfc7d8

Comment on lines 32 to 35
req->send.ep->ext->unflushed_lanes |=
UCS_BIT(spriv->super.lane) &
-!!(req->flags & UCP_REQUEST_FLAG_PROTO_INITIALIZED);
req->flags |= UCP_REQUEST_FLAG_PROTO_INITIALIZED;
Contributor:
how does the assembly compare? seems more complicated IMO

@michal-shalev (Contributor Author) commented Mar 10, 2025:
I used Godbolt to check:

#include <stdint.h>

#define FLAG 0x8
#define LANE 0x6

void update_with_branch(uint64_t *value, uint64_t flags) {
    if (flags & FLAG) {
        *value |= LANE;
    }
}

void update_branchless(uint64_t *value, uint64_t flags) {
    *value |= LANE & -!!(flags & FLAG);
}
update_with_branch:
        and     esi, 8
        je      .L1
        or      QWORD PTR [rdi], 6
.L1:
        rep ret
update_branchless:
        sal     rsi, 60
        sar     rsi, 63
        and     esi, 6
        or      QWORD PTR [rdi], rsi
        ret

This shows that update_with_branch includes a conditional branch (je .L1),
while update_branchless avoids branching entirely by using bitwise operations.
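Beyond comparing the assembly, the two variants can be checked for functional equivalence over a range of inputs; this standalone harness mirrors the snippet above (the helper variants_agree() is added here for testing only):

```c
#include <stdint.h>

#define FLAG 0x8
#define LANE 0x6

static void update_with_branch(uint64_t *value, uint64_t flags)
{
    if (flags & FLAG) {
        *value |= LANE;
    }
}

static void update_branchless(uint64_t *value, uint64_t flags)
{
    /* -!!(x) yields an all-ones mask when x is nonzero, zero otherwise */
    *value |= LANE & -!!(flags & FLAG);
}

/* Returns 1 iff both variants produce the same result for these inputs. */
static int variants_agree(uint64_t initial, uint64_t flags)
{
    uint64_t a = initial;
    uint64_t b = initial;

    update_with_branch(&a, flags);
    update_branchless(&b, flags);
    return a == b;
}
```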

Contributor:
But actually we have two flags; can you check in Godbolt something like:

#include <stdint.h>

#define FLAG1 0x8
#define FLAG2 0x20

void update_with_branch(uint64_t *value, uint64_t flags) {
    if (flags & FLAG1) {
        *value |= FLAG2;
    }
}

void update_branchless(uint64_t *value, uint64_t flags) {
    *value |= FLAG2 & -!!(flags & FLAG1);
}

Contributor (Author):
I've updated my original comment @yosefe

Contributor:
BTW, maybe we could update unflushed_lanes unconditionally?

Contributor (Author):
I think it's the best solution so far, pushed another commit
