Skip to content
This repository was archived by the owner on Jun 23, 2022. It is now read-only.

fix(bulk-load): update cleanup download task #570

Merged
merged 3 commits into from
Jul 27, 2020

Conversation

hycdong
Copy link
Contributor

@hycdong hycdong commented Jul 23, 2020

This pull request fix a bug while cleanup download task.
It used to use CLEANUP_TASK_ALWAYS marco to cleanup download task, it will cancel task without waiting, the task must be not running at this time. However, when pause or cancel bulk load, it is possible that downloading task is still running, this pull request fix this problem.

@hycdong hycdong marked this pull request as ready for review July 23, 2020 07:42
@@ -590,12 +590,15 @@ error_code replica_bulk_loader::remove_local_bulk_load_dir(const std::string &bu
}

// ThreadPool: THREAD_POOL_REPLICATION
void replica_bulk_loader::cleanup_download_task()
bool replica_bulk_loader::cleanup_download_task()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The returned value is never used.
  2. cleanup_download_task is running in REPLICATION thread pool, it can't be blocked, otherwise critical tasks will be blocked too.
  3. Do we need to loop over to wait when the returned value of CLEANUP_TASK is false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This CLEANUP_TASK is widely used in replica_context.cpp, context switch cleanup functions. You can see cancel function in task.cpp, if it cost too long to cancel the function, it will assert. It is the answer of Q2.
In this case, CLEANUP_TASK will not return false, so the returned value is not used. I just would like use the marco.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you ensure the returned value is always true, keep that dassert there.

I'd more like to see a program dies because of the bad assumption (such as the assert here) rather than the program being slow due to an unknown reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think you misunderstand my view.
CLEANUP_TASK_ALWAYS won't wait while cancel the task, CLEANUP_TASK provide force parameter, if force = true, cancel task will wait for this task finished, otherwise not.
when cancel downloading task, I would like to wait for task finished, so I should use CLEANUP_TASK and set force = true. You can see the cancel function below:

bool task::cancel(bool wait_until_finished, /*out*/ bool *finished /*= nullptr*/)
{
task_state READY_STATE = TASK_STATE_READY;
task *current_tsk = get_current_task();
bool finish = false;
bool succ = false;
if (current_tsk != this) {
if (_state.compare_exchange_strong(
READY_STATE, TASK_STATE_CANCELLED, std::memory_order_relaxed)) {
succ = true;
finish = true;
} else {
task_state old_state = READY_STATE;
if (old_state == TASK_STATE_CANCELLED) {
succ = false; // this cancellation fails
finish = true;
} else if (old_state == TASK_STATE_FINISHED) {
succ = false;
finish = true;
} else if (wait_until_finished) {
_wait_for_cancel = true;
bool r = wait_on_cancel();
dassert(
r,
"wait failed, it is only possible when task runs for more than 0x0fffffff ms");
succ = false;
finish = true;
} else {
succ = false;
finish = false;
}
}
} else {
// task cancel itself
// for timer task, we should set _wait_for_cancel flag to
// prevent timer task from enqueueing again
_wait_for_cancel = true;
}
if (current_tsk != nullptr) {
current_tsk->spec().on_task_cancel_post.execute(current_tsk, this, succ);
}
if (succ) {
spec().on_task_cancelled.execute(this);
signal_waiters();
// we call clear_callback only cancelling succeed.
// otherwise, task will successfully exececuted and clear_callback will be called
// in "exec_internal".
clear_non_trivial_on_task_end();
}
if (finished)
*finished = finish;
return succ;
}

When wait_until_finished = true, and current task is not task who will be canceled, if cancel cost too long, it will assert on L335, otherwise, finished will always be true.

#define CLEANUP_TASK(task_, force)                                                                 
     {                                                                                              
         task_ptr t = task_;                                                                        
         if (t != nullptr) {                                                                        
             bool finished;                                                                         
             t->cancel(force, &finished);                                                           
             if (!finished && !dsn_task_is_running_inside(task_.get()))                              
                 return false;                                                                      
             task_ = nullptr;                                                                       
         }                                                                                          
     }                     

As a result, finished will always be true, CLEANUP_TASK will not return false.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if cancel cost too long, it will assert on L335, otherwise, finished will always be true.

Yes. If cancellation waits for more than 0xFFFFFFF ms (this number sucks), it asserts. But this means the cancellation will block for a very long time. 0xFFFFFFF ms = 268,435 seconds!

Copy link
Contributor Author

@hycdong hycdong Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the worst case, actually it will not cost such a long time.
When a task is created, its states is READY, while it is executed, its state is RUNNING, and its state is FINISH when it is finished. From the code in my previous comment, only the task whose state is RUNNING should call function wait_on_cancel. Each downloading task only download one file, run in threadpool THREAD_POOL_REPLICATION_LONG whose worker_count is 8. For each partition, there are only at most eight downloading tasks are running, it won't cost such a long time.
Besides, clean up downloading tasks will only be used when downloading succeed, pause bulk load and cleanup context, tasks are usually be finished in most cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the code, cancel tasks without waiting for it.

@levy5307 levy5307 merged commit 8a56c95 into XiaoMi:master Jul 27, 2020
@hycdong hycdong deleted the fix-download-task branch July 28, 2020 06:08
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants