
UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence #10474

Open
wants to merge 21 commits into master
Conversation

@michal-shalev (Contributor) commented Feb 5, 2025:

What?

Ensure Strong Fence is used only in scenarios where it is genuinely required.

  • Add unflushed_lanes to EP struct
  • Add fence_seq to EP and worker structs
  • Add UCP_FENCE_MODE_EP_BASED fence mode
  • Implement per-EP fences
  • Handle fence during one-sided operations (instead of worker fence)
  • Add a correctness test for EP-based fence mode

Why?

The current implementation of ucp_worker_fence always uses a Strong fence regardless of whether multiple lanes were used for operations.
This inefficiency occurs because the system lacks runtime information on which lanes were used.
This leads to suboptimal performance in Single-Rail scenarios, even though Weak Fence would suffice.

How?

To dynamically select between Strong and Weak Fence modes, this PR introduces a mechanism that tracks lane usage at runtime and applies the appropriate fencing at the EP level.
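As a sketch of the selection rule described above (names are illustrative and simplified, not the actual UCX internals): each endpoint keeps a bitmap of lanes that carry unflushed operations, and a strong fence is only needed when more than one bit is set.

```c
#include <stdint.h>

/* Illustrative sketch only; these names are simplified stand-ins for the
 * PR's unflushed_lanes tracking, not the real UCX API. */
typedef uint64_t lane_map_t;

/* A bitmap with zero or one bit set means at most one lane carries
 * unflushed operations. */
static int is_pow2_or_zero(lane_map_t map)
{
    return (map & (map - 1)) == 0;
}

/* A strong fence is required only when operations went out on multiple
 * lanes; otherwise a weak (single-lane) fence suffices. */
static int needs_strong_fence(lane_map_t unflushed_lanes)
{
    return !is_pow2_or_zero(unflushed_lanes);
}
```

With this rule, a Single-Rail run (one lane ever dirty) never pays for a strong fence.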

@michal-shalev michal-shalev added the WIP-DNM Work in progress / Do not review label Feb 5, 2025
@michal-shalev michal-shalev self-assigned this Feb 5, 2025
@michal-shalev michal-shalev marked this pull request as draft February 5, 2025 09:12
@michal-shalev michal-shalev changed the title from "UCP/RMA/FLUSH: Add unflushed_lanes to ucp_ep" to "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence" Feb 5, 2025
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch 3 times, most recently from 16299af to f438a28 on February 9, 2025 17:29
@michal-shalev michal-shalev changed the title from "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence" to "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence (POC)" Feb 9, 2025
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from 480b81a to e7a3f46 on February 9, 2025 18:06
@michal-shalev michal-shalev changed the title from "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence (POC)" to "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence (PoC)" Feb 9, 2025
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from e7a3f46 to 089e5f7 on February 10, 2025 08:38
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from 089e5f7 to 9ce878a on February 10, 2025 09:14
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from 9ce878a to 506f621 on February 10, 2025 10:05
@brminich (Contributor) commented:
can we also avoid fence if all previously issued operations are already completed?

@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch 2 times, most recently from 9a69c5d to 1c0d9f0 on February 20, 2025 00:04
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from 1c0d9f0 to e6279da on February 23, 2025 12:28
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from e6279da to 4e485d8 on February 23, 2025 13:02
test_ep_based_fence_atomic();
}

UCP_INSTANTIATE_TEST_CASE_TLS(test_ucp_ep_based_fence, all, "all")
Contributor:
What if we use UCP_INSTANTIATE_TEST_CASE, to test different transports?

Contributor (Author):
[screenshot] With UCP_INSTANTIATE_TEST_CASE, tcp gets stuck; it looks like the same issue with the unimplemented ucp_wireup_ep_flush(), like the althca issue.

@@ -545,6 +545,10 @@ typedef struct ucp_ep_ext {
arrived before the first one */
} am;

ucp_lane_map_t unflushed_lanes; /* Bitmap of lanes which have unflushed
Contributor:
Can't we put this into flush_state, which is in a union with ep_match? Is this because we use RMA operations before ucp_ep_flush_state_reset?

Contributor (Author):
Exactly

@@ -378,11 +378,11 @@ static ucs_config_field_t ucp_context_config_table[] = {
"another thread, or incoming active messages, but consumes more resources.",
ucs_offsetof(ucp_context_config_t, flush_worker_eps), UCS_CONFIG_TYPE_BOOL},

{"FENCE_MODE", "auto",
{"FENCE_MODE", "strong",
Contributor:
I'd keep auto, because now strong may be relaxed to weak even if it was requested explicitly.

Contributor (Author):
@yosefe what do you think about keeping "auto"?
EP-based will be the default fence mode in the future, and currently strong fence is the default anyway, because proto_enable is true and max_rma_lanes is 1 by default.

Contributor:
Now "strong" does what "auto" did before the change, and there is no way to request a real "strong" fence.

Contributor:
Can we keep auto as the default, which will select strong/weak based on the protocol version as today?
We can probably remove it after ep_based is completed, including the queue.

Contributor (Author):
@yosefe Yes, thanks

((context->config.ext.max_rma_lanes > 1) ||
context->config.ext.proto_enable));
if ((context->config.ext.fence_mode == UCP_FENCE_MODE_STRONG) &&
((context->config.ext.max_rma_lanes == 1) ||
Contributor:
This check is not correct for proto v2: even if max_rma_lanes == 1, proto v2 may use different TLs for different message sizes.

Contributor (Author):
This is why there's a || and not a && between the conditions; or maybe I didn't understand.

Contributor:
seems it should be && instead

Contributor (Author):
Actually, according to De Morgan's law, it should be &&
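As a quick illustration of the identity being applied here (helper names are invented for the example, not part of the PR): negating (max_rma_lanes > 1) || proto_enable yields (max_rma_lanes == 1) && !proto_enable, which is why the || has to become a &&.

```c
/* De Morgan: !(a || b) == (!a && !b).
 * With a = (max_rma_lanes > 1) and b = proto_enable, the negated guard
 * becomes (max_rma_lanes == 1) && !proto_enable. */
static int negated_or(int a, int b)
{
    return !(a || b);
}

static int and_of_negations(int a, int b)
{
    return !a && !b;
}
```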

Contributor (Author):
done

void *request;

request = ucp_ep_flush_internal(ep, 0, &ucp_request_null_param, NULL,
ucp_ep_flushed_callback, "ep_fence_strong");
Contributor:
we have to make sure that UCT_FLUSH_FLAG_REMOTE is passed to uct_ep_flush. probably better to pass it explicitly from this fence function somehow

Contributor (Author):
done

@@ -2232,7 +2232,7 @@ static ucs_status_t ucp_fill_config(ucp_context_h context,
}

if ((context->config.ext.fence_mode == UCP_FENCE_MODE_STRONG) &&
((context->config.ext.max_rma_lanes == 1) ||
((context->config.ext.max_rma_lanes == 1) &&
Contributor:
you can remove one pair of brackets now

Contributor (Author):
done

Contributor (Author):
I returned the "auto" fence mode


}

int is_fence_required() {
return sender().ep()->ext->fence_seq < sender().ep()->worker->fence_seq;
Contributor:
get_ep_fence_seq() < get_worker_fence_seq()

Contributor (Author):
done

test_ucp_memheap::init();
}

uint32_t get_worker_fence_seq() {
Contributor:
In tests we are usually not that verbose; IMO worker_seq() and ep_seq() would be enough.

Contributor (Author):
done

void do_fence() {
uint32_t worker_fence_seq_before = get_worker_fence_seq();
sender().fence();
uint32_t worker_fence_seq_after = get_worker_fence_seq();
Contributor:
worker_fence_seq_after and ep_fence_seq_after look redundant

Contributor (Author):
done

perform_nbx(op, sbuf, size, target, rkey);
do_fence();

bool strong_fence_happened = is_fence_required() && is_strong_fence();
Contributor:
strong_fence_expected

Contributor (Author):
done

rbuf.rkey(sender(), rkey);

if (op == OP_ATOMIC) {
perform_nbx_with_fence(op, sbuf.ptr(), ATOMIC_SIZE,
Contributor:
maybe check for 32 and 64 bits?

Contributor (Author):
done

} \
\
if (_status != UCS_OK) { \
_ret = UCS_STATUS_PTR(_status); \
Contributor:
  • let's not assume the caller will use goto; pass _failed instead of _ret, _err_label
  • maybe make it a compound statement expression and return the status

like we do in ucp_request_get_param(). Or better, an inline function?

Contributor (Author):
done

@michal-shalev michal-shalev requested a review from gleon99 March 9, 2025 09:17
Comment on lines 2234 to 2236
}

if (context->config.ext.fence_mode == UCP_FENCE_MODE_AUTO) {
Contributor:
if (context->config.ext.fence_mode == UCP_FENCE_MODE_EP_BASED) {
    if (context->config.ext.proto_enable) {
        ucs_error("..");
        goto ..
    }
} else if (context->config.ext.fence_mode == UCP_FENCE_MODE_AUTO) {
    ...
}


Contributor (Author):
[screenshot] If fence_mode is UCP_FENCE_MODE_EP_BASED but proto_enable is true, worker_fence_mode isn't set to UCP_FENCE_MODE_EP_BASED; how about this? [screenshot]

#define ucp_ep_handle_fence_if_required(_ep, _status, _ret, _err_label) \
{ \
if ((_ep)->ext->fence_seq < (_ep)->worker->fence_seq) { \
if (!ucs_is_pow2_or_zero((_ep)->ext->unflushed_lanes)) { \
Contributor:
Invert the order of the checks, and use ucs_likely/ucs_unlikely.

Contributor (Author):
Added ucs_likely/unlikely

@@ -81,6 +81,21 @@ typedef uint16_t ucp_ep_flags_t;
#define ucp_ep_refcount_assert(_ep, _type_refcount, _cmp, _val) \
ucp_ep_refcount_field_assert(_ep, refcounts._type_refcount, _cmp, _val)

#define ucp_ep_handle_fence_if_required(_ep, _status, _ret, _err_label) \
{ \
if ((_ep)->ext->fence_seq < (_ep)->worker->fence_seq) { \
Contributor:
unlikely

Contributor (Author):
done

@@ -177,7 +194,7 @@ UCS_TEST_P(test_ucp_fence64, atomic_add_fadd) {
&test_ucp_fence64::blocking_fadd<uint64_t>);
}

UCS_TEST_P(test_ucp_fence64, atomic_add_fadd_strong, "FENCE_MODE=strong") {
UCS_TEST_P(test_ucp_fence64, atomic_add_fadd_strong) {
Contributor:
Why was FENCE_MODE=strong removed?

Contributor (Author):
it was the default, reverting

perform_nbx(op, sbuf, size, target, rkey);
do_fence();

bool strong_fence_expected = is_fence_required() && is_strong_fence();
Contributor:
Do we really have to check is_fence_required() here? We just did a fence in line 575.

Contributor (Author):
Yes. The call to do_fence() only increments the worker's fence sequence; the actual fencing happens inside the next operation (when the next perform_nbx() call checks and applies the fencing logic).
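A minimal model of that lazy-fence behavior (field names follow the PR's fence_seq counters; the helper functions are invented for illustration):

```c
#include <stdint.h>

typedef struct worker {
    uint32_t fence_seq;
} worker_t;

typedef struct ep {
    uint32_t  fence_seq;
    worker_t *worker;
} ep_t;

/* do_fence() analogue: only bumps the worker counter; no fencing yet. */
static void worker_fence(worker_t *worker)
{
    worker->fence_seq++;
}

/* Checked at the start of the next operation on each endpoint. */
static int ep_fence_required(const ep_t *ep)
{
    return ep->fence_seq < ep->worker->fence_seq;
}

/* Applied inside the next operation; catches the EP up to the worker. */
static void ep_apply_fence(ep_t *ep)
{
    /* ... issue the weak or strong fence on the transport here ... */
    ep->fence_seq = ep->worker->fence_seq;
}
```

Endpoints that never issue another operation after the fence call never pay for it.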


if (op == OP_ATOMIC) {
perform_nbx_with_fence(op, sbuf.ptr(), sizeof(uint32_t),
(uint64_t)rbuf.ptr(), rkey);
Contributor:
can we pass rbuf.ptr() also as void*?

Contributor (Author):
AFAIU rbuf.ptr() cannot be passed as void* for remote_addr, because ucp_put_nbx() explicitly requires a uint64_t remote address.

flush_workers();
}
private:
static constexpr uint64_t TEST_BUF_SIZE = 1000000;
Contributor:
size_t

Contributor (Author):
done

Comment on lines +581 to +585
if (strong_fence_expected) {
EXPECT_EQ(worker_fence_seq(), ep_fence_seq());
} else {
EXPECT_NE(worker_fence_seq(), ep_fence_seq());
}
Contributor:
Is there any data-validation test that the fence really forced ordering of operations on the target side?
For example, that after flushing, rbuf is equal to the 2nd sbuf that was written after the fence (this would require sbuf1 and sbuf2, before and after the fence).

Contributor (Author):
This test does not explicitly validate that the fence enforces ordering on the target side. However, test_ucp_rma_order::test_ordering() already verifies ordering on the target side, and I added a variant to validate EP-based fencing.

@gleon99 gleon99 requested a review from ofirfarjun7 March 9, 2025 14:30
@michal-shalev michal-shalev requested a review from yosefe March 9, 2025 20:29
@@ -27,6 +27,11 @@ static ucs_status_t ucp_proto_put_offload_short_progress(uct_pending_req_t *self
ucs_status_t status;
uct_rkey_t tl_rkey;

if (!(req->flags & UCP_REQUEST_FLAG_PROTO_INITIALIZED)) {
@ofirfarjun7 (Contributor) commented Mar 10, 2025:
Maybe we can avoid this branch by using an array of two masks?
WDYT @yosefe, would it benefit performance?

Contributor (Author):
Could you please elaborate? @ofirfarjun7

@michal-shalev (Contributor Author) commented Mar 10, 2025:
Following Ofir's suggestion, I committed a change that updates unflushed_lanes using a signed boolean mask (-!!), without adding a branch. WDYT? @ofirfarjun7 @yosefe
1cfc7d8

Comment on lines 32 to 35
req->send.ep->ext->unflushed_lanes |=
UCS_BIT(spriv->super.lane) &
-!!(req->flags & UCP_REQUEST_FLAG_PROTO_INITIALIZED);
req->flags |= UCP_REQUEST_FLAG_PROTO_INITIALIZED;
Contributor:
how does the assembly compare? seems more complicated IMO

@michal-shalev (Contributor Author) commented Mar 10, 2025:
I used Godbolt to check:

#include <stdint.h>

#define FLAG 0x8
#define LANE 0x6

void update_with_branch(uint64_t *value, uint64_t flags) {
    if (flags & FLAG) {
        *value |= LANE;
    }
}

void update_branchless(uint64_t *value, uint64_t flags) {
    *value |= LANE & -!!(flags & FLAG);
}
update_with_branch:
        and     esi, 8
        je      .L1
        or      QWORD PTR [rdi], 6
.L1:
        rep ret
update_branchless:
        sal     rsi, 60
        sar     rsi, 63
        and     esi, 6
        or      QWORD PTR [rdi], rsi
        ret

This shows that update_with_branch includes a conditional branch (je .L1),
while update_branchless avoids branching entirely by using bitwise operations.
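Beyond comparing the assembly, the two variants can be checked for functional equivalence over a range of inputs; this standalone harness mirrors the snippet above (the helper variants_agree() is added here for testing only):

```c
#include <stdint.h>

#define FLAG 0x8
#define LANE 0x6

static void update_with_branch(uint64_t *value, uint64_t flags)
{
    if (flags & FLAG) {
        *value |= LANE;
    }
}

static void update_branchless(uint64_t *value, uint64_t flags)
{
    /* -!!(x) yields an all-ones mask when x is nonzero, zero otherwise */
    *value |= LANE & -!!(flags & FLAG);
}

/* Returns 1 iff both variants produce the same result for these inputs. */
static int variants_agree(uint64_t initial, uint64_t flags)
{
    uint64_t a = initial;
    uint64_t b = initial;

    update_with_branch(&a, flags);
    update_branchless(&b, flags);
    return a == b;
}
```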

Contributor:
But actually we have two flags; can you check in Godbolt something like:

#include <stdint.h>

#define FLAG1 0x8
#define FLAG2 0x20

void update_with_branch(uint64_t *value, uint64_t flags) {
    if (flags & FLAG1) {
        *value |= FLAG2;
    }
}

void update_branchless(uint64_t *value, uint64_t flags) {
    *value |= FLAG2 & -!!(flags & FLAG1);
}

Contributor (Author):
I've updated my original comment @yosefe

Contributor:
BTW, maybe we could update unflushed_lanes unconditionally?

Contributor (Author):
I think it's the best solution so far, pushed another commit
