-
Notifications
You must be signed in to change notification settings - Fork 59
fix(bulk-load): update cleanup download task #570
Conversation
@@ -590,12 +590,15 @@ error_code replica_bulk_loader::remove_local_bulk_load_dir(const std::string &bu | |||
} | |||
|
|||
// ThreadPool: THREAD_POOL_REPLICATION | |||
void replica_bulk_loader::cleanup_download_task() | |||
bool replica_bulk_loader::cleanup_download_task() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The returned value is never used.
cleanup_download_task
is running in REPLICATION thread pool, it can't be blocked, otherwise critical tasks will be blocked too.- Do we need to loop over to wait when the returned value of
CLEANUP_TASK
is false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This CLEANUP_TASK
is widely used in replica_context.cpp
, context switch cleanup
functions. You can see cancel
function in task.cpp
, if it cost too long to cancel the function, it will assert. It is the answer of Q2.
In this case, CLEANUP_TASK
will not return false, so the returned value is not used. I just would like use the marco.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you ensure the returned value is always true, keep that dassert
there.
I'd more like to see a program dies because of the bad assumption (such as the assert here) rather than the program being slow due to an unknown reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I think you misunderstand my view.
CLEANUP_TASK_ALWAYS won't wait while cancel the task, CLEANUP_TASK provide force
parameter, if force = true
, cancel task will wait for this task finished, otherwise not.
when cancel downloading task, I would like to wait for task finished, so I should use CLEANUP_TASK and set force = true
. You can see the cancel function below:
rdsn/src/runtime/task/task.cpp
Lines 310 to 369 in 113e7e6
bool task::cancel(bool wait_until_finished, /*out*/ bool *finished /*= nullptr*/) | |
{ | |
task_state READY_STATE = TASK_STATE_READY; | |
task *current_tsk = get_current_task(); | |
bool finish = false; | |
bool succ = false; | |
if (current_tsk != this) { | |
if (_state.compare_exchange_strong( | |
READY_STATE, TASK_STATE_CANCELLED, std::memory_order_relaxed)) { | |
succ = true; | |
finish = true; | |
} else { | |
task_state old_state = READY_STATE; | |
if (old_state == TASK_STATE_CANCELLED) { | |
succ = false; // this cancellation fails | |
finish = true; | |
} else if (old_state == TASK_STATE_FINISHED) { | |
succ = false; | |
finish = true; | |
} else if (wait_until_finished) { | |
_wait_for_cancel = true; | |
bool r = wait_on_cancel(); | |
dassert( | |
r, | |
"wait failed, it is only possible when task runs for more than 0x0fffffff ms"); | |
succ = false; | |
finish = true; | |
} else { | |
succ = false; | |
finish = false; | |
} | |
} | |
} else { | |
// task cancel itself | |
// for timer task, we should set _wait_for_cancel flag to | |
// prevent timer task from enqueueing again | |
_wait_for_cancel = true; | |
} | |
if (current_tsk != nullptr) { | |
current_tsk->spec().on_task_cancel_post.execute(current_tsk, this, succ); | |
} | |
if (succ) { | |
spec().on_task_cancelled.execute(this); | |
signal_waiters(); | |
// we call clear_callback only cancelling succeed. | |
// otherwise, task will successfully exececuted and clear_callback will be called | |
// in "exec_internal". | |
clear_non_trivial_on_task_end(); | |
} | |
if (finished) | |
*finished = finish; | |
return succ; | |
} |
When wait_until_finished = true
, and current task is not task who will be canceled, if cancel cost too long, it will assert on L335, otherwise, finished
will always be true.
#define CLEANUP_TASK(task_, force)
{
task_ptr t = task_;
if (t != nullptr) {
bool finished;
t->cancel(force, &finished);
if (!finished && !dsn_task_is_running_inside(task_.get()))
return false;
task_ = nullptr;
}
}
As a result, finished
will always be true, CLEANUP_TASK will not return false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if cancel cost too long, it will assert on L335, otherwise, finished will always be true.
Yes. If cancellation waits for more than 0xFFFFFFF ms (this number sucks), it asserts. But this means the cancellation will block for a very long time. 0xFFFFFFF ms = 268,435 seconds!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the worst case, actually it will not cost such a long time.
When a task is created, its states is READY, while it is executed, its state is RUNNING, and its state is FINISH when it is finished. From the code in my previous comment, only the task whose state is RUNNING should call function wait_on_cancel
. Each downloading task only download one file, run in threadpool THREAD_POOL_REPLICATION_LONG
whose worker_count is 8. For each partition, there are only at most eight downloading tasks are running, it won't cost such a long time.
Besides, clean up downloading tasks will only be used when downloading succeed, pause bulk load and cleanup context, tasks are usually be finished in most cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the code, cancel tasks without waiting for it.
This pull request fix a bug while cleanup download task.
It used to use
CLEANUP_TASK_ALWAYS
marco to cleanup download task, it will cancel task without waiting, the task must be not running at this time. However, when pause or cancel bulk load, it is possible that downloading task is still running, this pull request fix this problem.