DAOS-15627 dtx: redunce stack usage for DTX resync to avoid overflow #14189
Ticket title is 'osa/offline_reintegration.py:OSAOfflineReintegration.test_osa_offline_reintegration_without_checksum - /usr/bin/daos_engine exited: signal: aborted (core dumped)'
Force-pushed from fec26d5 to 5c782bc
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14189/4/execution/node/1409/log
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14189/4/execution/node/1455/log
Dynamically allocated some large variables used for DTX iteration and resync, to avoid potential ULT stack overflow. The patch also adjusts the DTX re-index ULT stop logic so that DTX resync cannot stop it too early, before DTX re-index has completed. Signed-off-by: Fan Yong <fan.yong@intel.com>
Force-pushed from 5c782bc to f0a845e
 * by DTX may be not released because DTX resync was in-progress at that time. When arrive
 * here, DTX resync must has completed globally. Let's release related resource.
 */
if (unlikely(cont_child->sc_dtx_delay_reset == 1)) {
It seems the DTX resync is possibly not caused by rebuild? (DTX resync may start even though no rebuild happened.)
If the rebuild has not started, or is even disabled, which part can do this kind of resource freeing?
If the DTX resync is triggered by opening the container, then the DTX table cannot be released, since the opened container will use it. The related resources will be released via DTX aggregation step by step.
@@ -1743,9 +1747,15 @@ stop_dtx_reindex_ult(struct ds_cont_child *cont)
if (dtx_cont_opened(cont))
How can this happen? I think we should assert(!dtx_cont_opened()), right?
I looked at dtx_cont_close() more closely; it looks like dtx_cont_open() could be called before stop_dtx_reindex_ult(), so we need to check dtx_cont_opened() here.
Maybe it's better to move this check out of stop_dtx_reindex_ult() (e.g. check dtx_cont_opened() before calling stop_dtx_reindex_ult() in dtx_cont_close()) and put an assert here, since stop_dtx_reindex_ult() is called in cont_child_stop() as well.
stop_dtx_reindex_ult() may be called after DTX resync has completed globally, so there may be a race with container open. It may not be impossible to adjust the related logic to reduce the race, but checking here is safer and simpler.
 */
if (unlikely(cont_child->sc_dtx_delay_reset == 1)) {
	stop_dtx_reindex_ult(cont_child, true);
	vos_dtx_cache_reset(cont_child->sc_hdl, false);
It still looks complicated to me. cont_child_stop() is only called when:
- Destroying a container.
- Destroying a pool.
- Shutting down the engine.
So any long-running ULT accessing the container needs to be aborted in cont_child_stop(). Why don't we simply move all the stuff in cont_child_destroy_one() into cont_child_stop()? That includes:
- Stop the resync ULT.
- Stop the reindex ULT.
- Stop the scrubbing ULT.
- Stop the rebuild container scan.
In the current implementation, aren't these cleanups duplicated (or missed) in the pool destroy and engine shutdown cases?
After discussion with Fanyong, I now understand the 'dtx_delay_reset' flag better.
It looks to me that "dtx cont resync" and "dtx cont open" both need to access the 'cache' generated by "dtx cont reindex", and we don't enforce the order of "stop resync", "stop reindex", and "cont close".
So my proposal is to introduce a user reference for the 'cache': "dtx cont resync", "dtx cont reindex", and "dtx cont open" would all hold a refcount on the cache, and once the refcount drops to zero, the cache would be reset. Does that sound simpler? Of course, I'm fine with the current approach either way.
Just discussed the related logic with Niu. Some of the above cleanup work is necessary, but it is not related to this patch and would not simplify it. We can handle it in other ticket(s) later.
@@ -815,6 +815,10 @@ cont_child_stop(struct ds_cont_child *cont_child)
 * never be started at all
 */
cont_child->sc_stopping = 1;

/* Stop DTX reindex by force. */
stop_dtx_reindex_ult(cont_child, true);
So, it is ok to stop re-index ULT before stop re-sync ULT?
When we decide to force-stop the re-index ULT, it means that we want to stop the container; in that case we do not care about any in-progress DTX resync that may fail as a result.
@@ -1658,7 +1658,7 @@ ds_cont_local_open(uuid_t pool_uuid, uuid_t cont_hdl_uuid, uuid_t cont_uuid,
DF_UUID": %d\n", DP_UUID(cont_uuid), hdl->sch_cont->sc_open);

hdl->sch_cont->sc_open--;
dtx_cont_close(hdl->sch_cont);
dtx_cont_close(hdl->sch_cont, true);
I don't see why it needs a 'force' flag for this dtx_cont_close().
The caller of dtx_cont_close() may destroy the container after calling dtx_cont_close(); in that case it will use "force" mode.
@@ -1743,9 +1747,15 @@ stop_dtx_reindex_ult(struct ds_cont_child *cont)
if (dtx_cont_opened(cont))
The cache is at the VOS level and attached to the VOS container. Usually we will not reset the cache unless the VOS container is closed. But because the cache may take too much DRAM, it is expected to release that DRAM even if the VOS container is opened but has no upper-layer user. So a reference count on such a cache may be inconvenient.
There are several possible cases that may trigger stop_dtx_reindex_ult(); moving the check of dtx_cont_opened() into the callers' logic would generate more repeated code. Also, if some new caller is introduced in the future, it would need to handle the open status by itself. If instead we hide the check inside stop_dtx_reindex_ult(), then no caller needs to care about it, so the current implementation seems more transparent for the callers. On the other hand, cont_close may yield, so the race between open and close exists. It seems cleaner to me to hide the race handling inside {start,stop}_dtx_reindex_ult() instead of checking these things everywhere.
Ping @daos-stack/daos-gatekeeper, thanks!
/* Give chance to DTX reindex ULT for exit. */
while (unlikely(cont->sc_dtx_reindex))
	ABT_thread_yield();

/* Make sure checksum scrubbing has stopped */
ABT_mutex_lock(cont->sc_mutex);
Why do we release the lock and take the lock again after removing the loop?
DAOS-16039 object: fix EC aggregation wrong peer address (#14593)
DAOS-16009 rebuild: fix O_TRUNC file size related handling
DAOS-15056 rebuild: add rpt to the rgt list properly (#13862)
DAOS-15517 rebuild: refine lock handling for rpt list (#14064)
DAOS-13812 container: fix destroy vs lookup (#12757)
DAOS-15627 dtx: redunce stack usage for DTX resync to avoid overflow (#14189)
DAOS-14845 rebuild: do not wait for EC agg for reclaim (#13610)
Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com>
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Signed-off-by: Wang, Di <wddi218@gmail.com>
Signed-off-by: Di Wang <di.wang@intel.com>
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Fan Yong <fan.yong@intel.com>
Dynamically allocated some large variables used for DTX iteration and resync, to avoid potential ULT stack overflow.
The patch also adjusts the DTX re-index ULT stop logic so that DTX resync cannot stop it too early, before DTX re-index has completed.
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used, or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: