-
Notifications
You must be signed in to change notification settings - Fork 298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-16039 object: fix EC aggregation wrong peer address #14593
Conversation
Ticket title is 'wrong peer address used for EC aggregation when dkey switched' |
@@ -2202,11 +2213,12 @@ agg_dkey(daos_handle_t ih, vos_iter_entry_t *entry, | |||
DP_UOID(agg_entry->ae_oid), DP_KEY(&agg_entry->ae_dkey), | |||
agg_entry->ae_is_leader ? "yes" : "no"); | |||
agg_reset_dkey_entry(&agg_param->ap_agg_entry, entry); | |||
rc = agg_get_obj_handle(agg_entry, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: looks moving the peer location setting code in agg_get_obj_handle() to a separate function and call it here would be more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, how about make another cleanup patch, this PR only fix functional logic and make it fix the problem as fast as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
src/tests/suite/daos_obj_ec.c
Outdated
DAOS_OBJ_EC_AGG_LEADER_DIFF | DAOS_FAIL_ALWAYS, 0, NULL); | ||
|
||
print_message("wait for 30 seconds for EC aggregation.\n"); | ||
sleep(30); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks old tests used FORCE_EC_AGG fault injection to reduce test time, any reason we replace it with sleep(30)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
old test use FORCE_EC_AGG also with sleep(30), FORCE_EC_AGG flag seems not really useful and will cause EC agg fail with DER_INPROGRESS
And this test case need another fail loc on server side DAOS_OBJ_EC_AGG_LEADER_DIFF, seems cannot set two fail loc at the same time, so I did not use FORCE_EC_AGG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I was thinking the FORCE_EC_AGG is for reducing test time. As for setting two fail locs, we could change aggregation code to force aggregation for you new fail loc (similar to FORCE_EC_AGG).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the ci test report an failure, should be just the ec agg not finished yet.
I'll change as what you suggested force agg for the new fail loc
@@ -2202,11 +2213,12 @@ agg_dkey(daos_handle_t ih, vos_iter_entry_t *entry, | |||
DP_UOID(agg_entry->ae_oid), DP_KEY(&agg_entry->ae_dkey), | |||
agg_entry->ae_is_leader ? "yes" : "no"); | |||
agg_reset_dkey_entry(&agg_param->ap_agg_entry, entry); | |||
rc = agg_get_obj_handle(agg_entry, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
src/tests/suite/daos_obj_ec.c
Outdated
DAOS_OBJ_EC_AGG_LEADER_DIFF | DAOS_FAIL_ALWAYS, 0, NULL); | ||
|
||
print_message("wait for 30 seconds for EC aggregation.\n"); | ||
sleep(30); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I was thinking the FORCE_EC_AGG is for reducing test time. As for setting two fail locs, we could change aggregation code to force aggregation for you new fail loc (similar to FORCE_EC_AGG).
Test stage Functional Hardware Medium Verbs Provider completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14593/1/testReport/ |
0673518
to
202a57c
Compare
Test stage Functional Hardware Medium UCX Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14593/3/execution/node/1620/log |
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14593/3/execution/node/1424/log |
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14593/3/testReport/ |
202a57c
to
4e1d82c
Compare
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14593/4/testReport/ |
Test stage Functional Hardware Medium UCX Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14593/5/execution/node/1578/log |
|
||
ec_verify_parity_data(&req, "d_key", "a_key", (daos_size_t)0, | ||
full_update_size, verify_data, DAOS_TX_NONE, true); | ||
trigger_and_wait_ec_aggreation_2dkeys(arg, &oid, 1, "d_key1", "d_key2", "a_key", 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to be clear, because this was discussed in chat, there is no way to trigger this with one failure and 1 parity target? Multiple dkeys alone is not a sufficient condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multiple dkey is not a sufficient cond, need multiple dkeys share a parity shard and with different parity shard, select the shared shard as agg leader.
"no way to trigger this with one failure" possibly trigger with one failure, but with less possibility.
Fix EC aggregation wrong peer address when multiple dkeys on same EC agg leader shard. Change existing test cases to cover that case. Required-githooks: true Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
Required-githooks: true Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
4e1d82c
to
ab3b6cb
Compare
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14593/6/execution/node/1183/log |
Test stage Functional Hardware Medium UCX Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14593/8/execution/node/1444/log |
several failed case is cart_ctl failure due to https://daosio.atlassian.net/browse/DAOS-16008 |
DAOS-16039 object: fix EC aggregation wrong peer address (#14593) DAOS-16009 rebuild: fix O_TRUNC file size related handling DAOS-15056 rebuild: add rpt to the rgt list properly (#13862) DAOS-15517 rebuild: refine lock handling for rpt list (#14064) DAOS-13812 container: fix destroy vs lookup (#12757) DAOS-15627 dtx: redunce stack usage for DTX resync to avoid overflow (#14189) DAOS-14845 rebuild: do not wait for EC agg for reclaim (#13610) Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com> Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com> Signed-off-by: Jeff Olivier <jeffolivier@google.com> Signed-off-by: Wang, Di <wddi218@gmail.com> Signed-off-by: Di Wang <di.wang@intel.com> Signed-off-by: Wang Shilong <shilong.wang@intel.com> Signed-off-by: Fan Yong <fan.yong@intel.com>
DAOS-16039 object: fix EC aggregation wrong peer address (#14593) DAOS-16009 rebuild: fix O_TRUNC file size related handling DAOS-15056 rebuild: add rpt to the rgt list properly (#13862) DAOS-15517 rebuild: refine lock handling for rpt list (#14064) DAOS-13812 container: fix destroy vs lookup (#12757) DAOS-15627 dtx: redunce stack usage for DTX resync to avoid overflow (#14189) DAOS-14845 rebuild: do not wait for EC agg for reclaim (#13610) Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com> Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com> Signed-off-by: Jeff Olivier <jeffolivier@google.com> Signed-off-by: Wang, Di <wddi218@gmail.com> Signed-off-by: Di Wang <di.wang@intel.com> Signed-off-by: Wang Shilong <shilong.wang@intel.com> Signed-off-by: Fan Yong <fan.yong@intel.com>
…14593) * DAOS-16039 object: fix EC aggregation wrong peer address Fix EC aggregation wrong peer address when multiple dkeys on same EC agg leader shard. Change existing test cases to cover that case. Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
Fix EC aggregation wrong peer address when multiple dkeys on same EC agg leader shard.
Change existing test cases to cover that case.
Required-githooks: true
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: