-
Notifications
You must be signed in to change notification settings - Fork 441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence #10474
base: master
Are you sure you want to change the base?
UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence #10474
Conversation
16299af
to
f438a28
Compare
480b81a
to
e7a3f46
Compare
e7a3f46
to
089e5f7
Compare
089e5f7
to
9ce878a
Compare
9ce878a
to
506f621
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also avoid fence if all previously issued operations are already completed?
9a69c5d
to
1c0d9f0
Compare
1c0d9f0
to
e6279da
Compare
e6279da
to
4e485d8
Compare
test_ep_based_fence_atomic(); | ||
} | ||
|
||
UCP_INSTANTIATE_TEST_CASE_TLS(test_ucp_ep_based_fence, all, "all") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we, to test for different transports, use UCP_INSTANTIATE_TEST_CASE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -545,6 +545,10 @@ typedef struct ucp_ep_ext { | |||
arrived before the first one */ | |||
} am; | |||
|
|||
ucp_lane_map_t unflushed_lanes; /* Bitmap of lanes which have unflushed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can't put this into flush_state
which union with ep_match
?
this is because we use RMA operations before ucp_ep_flush_state_reset
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly
src/ucp/core/ucp_context.c
Outdated
@@ -378,11 +378,11 @@ static ucs_config_field_t ucp_context_config_table[] = { | |||
"another thread, or incoming active messages, but consumes more resources.", | |||
ucs_offsetof(ucp_context_config_t, flush_worker_eps), UCS_CONFIG_TYPE_BOOL}, | |||
|
|||
{"FENCE_MODE", "auto", | |||
{"FENCE_MODE", "strong", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd keep auto
, because now strong may be relaxed to weak even if requested explicitly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yosefe what do you think about keeping "auto"?
EP based will be the default fence mode in the future and currently strong fence is the default anyways because proto_enable
is true and max_rma_lanes
is 1 by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now "strong" doing what "auto" did before change and no way to set real "strong"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we keep auto, as the default, that will select strong/weak based on protocol version as today?
we can probably remove it after ep_based is comleted including the queue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yosefe Yes, thanks
src/ucp/core/ucp_context.c
Outdated
((context->config.ext.max_rma_lanes > 1) || | ||
context->config.ext.proto_enable)); | ||
if ((context->config.ext.fence_mode == UCP_FENCE_MODE_STRONG) && | ||
((context->config.ext.max_rma_lanes == 1) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check is not correct for proto v2. Even if max_rma_lanes==1
protov2 may use different tls for different message sizes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is why there's a ||
and not &&
between the conditions, or maybe I didn't understand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems it should be &&
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, according to De Morgan's law, it should be &&
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/ucp/rma/flush.c
Outdated
void *request; | ||
|
||
request = ucp_ep_flush_internal(ep, 0, &ucp_request_null_param, NULL, | ||
ucp_ep_flushed_callback, "ep_fence_strong"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have to make sure that UCT_FLUSH_FLAG_REMOTE
is passed to uct_ep_flush. probably better to pass it explicitly from this fence function somehow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/ucp/core/ucp_context.c
Outdated
@@ -2232,7 +2232,7 @@ static ucs_status_t ucp_fill_config(ucp_context_h context, | |||
} | |||
|
|||
if ((context->config.ext.fence_mode == UCP_FENCE_MODE_STRONG) && | |||
((context->config.ext.max_rma_lanes == 1) || | |||
((context->config.ext.max_rma_lanes == 1) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove one pair of brackets now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I returned the "auto" fence mode
src/ucp/core/ucp_context.c
Outdated
@@ -378,11 +378,11 @@ static ucs_config_field_t ucp_context_config_table[] = { | |||
"another thread, or incoming active messages, but consumes more resources.", | |||
ucs_offsetof(ucp_context_config_t, flush_worker_eps), UCS_CONFIG_TYPE_BOOL}, | |||
|
|||
{"FENCE_MODE", "auto", | |||
{"FENCE_MODE", "strong", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now "strong" doing what "auto" did before change and no way to set real "strong"
test/gtest/ucp/test_ucp_rma.cc
Outdated
} | ||
|
||
int is_fence_required() { | ||
return sender().ep()->ext->fence_seq < sender().ep()->worker->fence_seq; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_ep_fence_seq() < get_worker_fence_seq()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
test/gtest/ucp/test_ucp_rma.cc
Outdated
test_ucp_memheap::init(); | ||
} | ||
|
||
uint32_t get_worker_fence_seq() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in tests we are usually not that verbose, imo worker_seq(), ep_seq() would be enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
test/gtest/ucp/test_ucp_rma.cc
Outdated
void do_fence() { | ||
uint32_t worker_fence_seq_before = get_worker_fence_seq(); | ||
sender().fence(); | ||
uint32_t worker_fence_seq_after = get_worker_fence_seq(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
worker_fence_seq_after, ep_fence_seq_after looks redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
test/gtest/ucp/test_ucp_rma.cc
Outdated
perform_nbx(op, sbuf, size, target, rkey); | ||
do_fence(); | ||
|
||
bool strong_fence_happened = is_fence_required() && is_strong_fence(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strong_fence_expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
test/gtest/ucp/test_ucp_rma.cc
Outdated
rbuf.rkey(sender(), rkey); | ||
|
||
if (op == OP_ATOMIC) { | ||
perform_nbx_with_fence(op, sbuf.ptr(), ATOMIC_SIZE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe check for 32 and 64 bits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/ucp/core/ucp_ep.h
Outdated
} \ | ||
\ | ||
if (_status != UCS_OK) { \ | ||
_ret = UCS_STATUS_PTR(_status); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- let's not assume the caller will use goto, pass _failed instead of _ret,_err_label
- maybe make it compound statement expression and return status
like we do in ucp_request_get_param()
or better inline function than?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/ucp/core/ucp_context.c
Outdated
} | ||
|
||
if (context->config.ext.fence_mode == UCP_FENCE_MODE_AUTO) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ((context->config.ext.fence_mode == UCP_FENCE_MODE_EP_BASED) {
if (context->config.ext.proto_enable) {
ucs_error("..");
goto ..
}
} else if (context->config.ext.fence_mode == UCP_FENCE_MODE_AUTO) {
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/ucp/core/ucp_ep.h
Outdated
#define ucp_ep_handle_fence_if_required(_ep, _status, _ret, _err_label) \ | ||
{ \ | ||
if ((_ep)->ext->fence_seq < (_ep)->worker->fence_seq) { \ | ||
if (!ucs_is_pow2_or_zero((_ep)->ext->unflushed_lanes)) { \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inverse order, use likely/unlikely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added ucs_likely/unlikely
src/ucp/core/ucp_ep.h
Outdated
@@ -81,6 +81,21 @@ typedef uint16_t ucp_ep_flags_t; | |||
#define ucp_ep_refcount_assert(_ep, _type_refcount, _cmp, _val) \ | |||
ucp_ep_refcount_field_assert(_ep, refcounts._type_refcount, _cmp, _val) | |||
|
|||
#define ucp_ep_handle_fence_if_required(_ep, _status, _ret, _err_label) \ | |||
{ \ | |||
if ((_ep)->ext->fence_seq < (_ep)->worker->fence_seq) { \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unlikely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
test/gtest/ucp/test_ucp_fence.cc
Outdated
@@ -177,7 +194,7 @@ UCS_TEST_P(test_ucp_fence64, atomic_add_fadd) { | |||
&test_ucp_fence64::blocking_fadd<uint64_t>); | |||
} | |||
|
|||
UCS_TEST_P(test_ucp_fence64, atomic_add_fadd_strong, "FENCE_MODE=strong") { | |||
UCS_TEST_P(test_ucp_fence64, atomic_add_fadd_strong) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why removed FENCE_MODE=strong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was the default, reverting
perform_nbx(op, sbuf, size, target, rkey); | ||
do_fence(); | ||
|
||
bool strong_fence_expected = is_fence_required() && is_strong_fence(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really have to check is_fence_required() here, we just did fence in line 575?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the call to do_fence()
only increments the worker’s fence sequence, but the actual fencing only happens inside the next operation (when the next perform_nbx()
call checks and applies fencing logic).
|
||
if (op == OP_ATOMIC) { | ||
perform_nbx_with_fence(op, sbuf.ptr(), sizeof(uint32_t), | ||
(uint64_t)rbuf.ptr(), rkey); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we pass rbuf.ptr() also as void*?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test/gtest/ucp/test_ucp_rma.cc
Outdated
flush_workers(); | ||
} | ||
private: | ||
static constexpr uint64_t TEST_BUF_SIZE = 1000000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
if (strong_fence_expected) { | ||
EXPECT_EQ(worker_fence_seq(), ep_fence_seq()); | ||
} else { | ||
EXPECT_NE(worker_fence_seq(), ep_fence_seq()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any data-validation test that fence really forced ordering of operations on target side?
for example that after flushing, rbuf is equal to the 2nd sbuf that was done after the fence (for this will need sbuf1 and sbuf2 before and after the fence).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test does not explicitly validate that the fence enforces ordering on the target side. However, test_ucp_rma_order::test_ordering()
already verifies ordering on the target side, and I added a variant to validate EP-based fencing.
src/ucp/rma/put_offload.c
Outdated
@@ -27,6 +27,11 @@ static ucs_status_t ucp_proto_put_offload_short_progress(uct_pending_req_t *self | |||
ucs_status_t status; | |||
uct_rkey_t tl_rkey; | |||
|
|||
if (!(req->flags & UCP_REQUEST_FLAG_PROTO_INITIALIZED)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can avoid this branch using array of two masks?
WDYT @yosefe? will it benefit performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please elaborate? @ofirfarjun7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following Ofir's suggestion, I committed a change to update unflushed_lanes
branchlessly using a signed boolean mask (-!!) without adding a branch, WDYT? @ofirfarjun7 @yosefe
1cfc7d8
src/ucp/rma/put_offload.c
Outdated
req->send.ep->ext->unflushed_lanes |= | ||
UCS_BIT(spriv->super.lane) & | ||
-!!(req->flags & UCP_REQUEST_FLAG_PROTO_INITIALIZED); | ||
req->flags |= UCP_REQUEST_FLAG_PROTO_INITIALIZED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does the assembly compare? seems more complicated IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used Godbolt to check:
#define FLAG 0x8
#define LANE 0x6
void update_with_branch(uint64_t *value, uint64_t flags) {
if (flags & FLAG) {
*value |= LANE;
}
}
void update_branchless(uint64_t *value, uint64_t flags) {
*value |= LANE & -!!(flags & FLAG);
}
update_with_branch:
and esi, 8
je .L1
or QWORD PTR [rdi], 6
.L1:
rep ret
update_branchless:
sal rsi, 60
sar rsi, 63
and esi, 6
or QWORD PTR [rdi], rsi
ret
This shows that update_with_branch
includes a conditional branch (je .L1
),
while update_branchless
avoids branching entirely by using bitwise operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but actually we have 2 flags, can you check in godbolt something like
#define FLAG1 0x8
#define FLAG2 0x20
void update_with_branch(uint64_t *value, uint64_t flags) {
if (flags & FLAG1) {
*value |= FLAG2;
}
}
void update_branchless(uint64_t *value, uint64_t flags) {
*value |= FLAG1 & -!!(flags & FLAG2);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated my original comment @yosefe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, maybe we could update unflushed_lanes unconditionally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's the best solution so far, pushed another commit
What?
Ensure Strong Fence is used only in scenarios where it is genuinely required.
unflushed_lanes
to EP structfence_seq
to EP and worker structsUCP_FENCE_MODE_EP_BASED
fence modeWhy?
The current implementation of
ucp_worker_fence
always uses a Strong fence regardless of whether multiple lanes were used for operations.This inefficiency occurs because the system lacks runtime information on which lanes were used.
This leads to suboptimal performance in Single-Rail scenarios, even though Weak Fence would suffice.
How?
To dynamically select between Strong and Weak Fence modes, this PR introduces a mechanism that tracks lane usage at runtime and applies the appropriate fencing at the EP level.