Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limits for add and recover balance data or zone balance jobs #4104

Merged
merged 10 commits into from
Apr 7, 2022

Conversation

panda-sheep
Copy link
Contributor

@panda-sheep panda-sheep commented Mar 31, 2022

What type of PR is this?

  • bug
  • feature
  • enhancement

What problem(s) does this PR solve?

Issue(s) number:

Description:

Only affects balance data or zone balance job

1) When adding a job
If a space has failed balance data or zone balance jobs
You cannot add new balance data or zone balance jobs to this space, 
you must first recover the failed balance data or zone balance jobs.
If the job keeps failing, you can stop the job first, and then add a new job of this type

2) When recovering the job
For  balance data or zone balance job.
If you recover a failed job, you can recover.
If you recover a stopped job, compare it by starttime to see if there is an updated failed job and finished job of this type after the stopped job.
If there is, it cannot be recovered, otherwise it can be recovered

For the same type of job, when recovering multiple stopped jobs at a time. 
Only recover the latest stopped job. 
 
The same type of STOPPED job exists in the following form:
job1(stopped) job2(failed)
recover job job1 failed
recover job job2 success

job1(stopped) job2(fininshed) job3(stopped)
recover job job1 failed
recover job job3 success
recover job job1, job3 Only job3 can recover

How do you solve it?

Special notes for your reviewer, ex. impact of this fix, design document, etc:

Checklist:

Tests:

  • Unit test(positive and negative cases)
  • Function test
  • Performance test
  • N/A

Affects:

  • Documentation affected (Please add the label if documentation needs to be modified.)
  • Incompatibility (If it breaks the compatibility, please describe it and add the label.)
  • If it's needed to cherry-pick (If cherry-pick to some branches is required, please label the destination version(s).)
  • Performance impacted: Consumes more CPU/Memory

Release notes:

Please confirm whether to be reflected in release notes and how to describe:

ex. Fixed the bug .....

@panda-sheep panda-sheep requested a review from a team as a code owner March 31, 2022 04:31
@Sophie-Xie Sophie-Xie added this to the v3.1.0 milestone Mar 31, 2022
src/clients/meta/MetaClient.cpp Outdated Show resolved Hide resolved
src/meta/processors/job/AdminJobProcessor.cpp Outdated Show resolved Hide resolved
src/meta/processors/job/JobDescription.h Show resolved Hide resolved
src/meta/processors/job/AdminJobProcessor.cpp Outdated Show resolved Hide resolved
src/meta/processors/job/JobManager.cpp Outdated Show resolved Hide resolved
@Sophie-Xie Sophie-Xie added the cherry-pick-v3.1 PR: need cherry-pick to this version label Apr 1, 2022
@critical27
Copy link
Contributor

Ready?

Copy link
Contributor

@critical27 critical27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, @liwenhui-soul please confirm if this PR satisfy.

@panda-sheep
Copy link
Contributor Author

Ready?

yeah

critical27
critical27 previously approved these changes Apr 1, 2022
@panda-sheep panda-sheep requested a review from pengweisong April 1, 2022 09:06
@codecov-commenter
Copy link

Codecov Report

Merging #4104 (fe9d964) into master (5d96c60) will decrease coverage by 0.01%.
The diff coverage is 67.20%.

@@            Coverage Diff             @@
##           master    #4104      +/-   ##
==========================================
- Coverage   85.10%   85.08%   -0.02%     
==========================================
  Files        1339     1324      -15     
  Lines      132245   131714     -531     
==========================================
- Hits       112544   112066     -478     
+ Misses      19701    19648      -53     
Impacted Files Coverage Δ
src/clients/meta/MetaClient.cpp 76.41% <0.00%> (-0.03%) ⬇️
src/clients/meta/MetaClient.h 92.30% <ø> (ø)
src/common/graph/Response.h 48.99% <ø> (ø)
src/common/thread/test/GenericThreadPoolTest.cpp 53.42% <0.00%> (-30.14%) ⬇️
src/common/thread/test/GenericWorkerTest.cpp 54.16% <0.00%> (-44.45%) ⬇️
src/graph/executor/ExecutionError.h 0.00% <ø> (ø)
src/graph/executor/Executor.cpp 81.32% <ø> (+0.31%) ⬆️
src/graph/executor/Executor.h 100.00% <ø> (ø)
src/graph/executor/StorageAccessExecutor.cpp 100.00% <ø> (ø)
src/graph/executor/StorageAccessExecutor.h 60.00% <ø> (ø)
... and 272 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update daaed0a...fe9d964. Read the comment docs.

liwenhui-soul
liwenhui-soul previously approved these changes Apr 7, 2022
@panda-sheep panda-sheep changed the title If there are failed or stopped data balance or zone balance job, must firstly recover it add and recover balance data or zone balance job Apr 7, 2022
@panda-sheep panda-sheep changed the title add and recover balance data or zone balance job Limits for add and recover balance data or zone balance jobs Apr 7, 2022
@critical27
Copy link
Contributor

So complicated ... I'm confusing ... Several question, majorly related to RECOVER:

  1. If we could recover multiple job, one failed, one recovered, this is weird.
  2. Do we really need to support recover STOPPED job? @liuyu85cn PTAL

@panda-sheep
Copy link
Contributor Author

panda-sheep commented Apr 7, 2022

So complicated ... I'm confusing ... Several question, majorly related to RECOVER:

  1. If we could recover multiple job, one failed, one recovered, this is weird.
  2. Do we really need to support recover STOPPED job? @liuyu85cn PTAL
  1. Multiple jobs can be recovered at one time. For multiple jobs of the same type, it is only necessary to recover the job with the latest starttime as much as possible.
  2. The stop job can be recovered. Just need some conditions to limit whether it can be recovered.

@critical27
Copy link
Contributor

critical27 commented Apr 7, 2022

So complicated ... I'm confusing ... Several question, majorly related to RECOVER:

  1. If we could recover multiple job, one failed, one recovered, this is weird.
  2. Do we really need to support recover STOPPED job? @liuyu85cn PTAL
  1. Multiple jobs can be recovered at one time. For multiple jobs of the same type, it is only necessary to recover the job with the latest starttime as much as possible.
  2. The stop job can be recovered. Just need some conditions to limit whether it can be recovered.

Yeah, some of the recovered jobs actually failed, this does not make sense. I mean we could restrict only recover one job is allowed later if necessary, we could simplify the code, and user won't be confused.

BTW, if we are recovering one job, and recover failed, what message do we get from console? "Recovered job failed"?

@panda-sheep
Copy link
Contributor Author

panda-sheep commented Apr 7, 2022

So complicated ... I'm confusing ... Several question, majorly related to RECOVER:

  1. If we could recover multiple job, one failed, one recovered, this is weird.
  2. Do we really need to support recover STOPPED job? @liuyu85cn PTAL
  1. Multiple jobs can be recovered at one time. For multiple jobs of the same type, it is only necessary to recover the job with the latest starttime as much as possible.
  2. The stop job can be recovered. Just need some conditions to limit whether it can be recovered.

Yeah, some of the recovered jobs actually failed, this does not make sense. I mean we could restrict only recover one job is allowed later if necessary, we could simplify the code, and user won't be confused.

BTW, if we are recovering one job, and recover failed, what message do we get from console? "Recovered job failed"?

  1. Yes, we can support it later. Only one jobId can be recovered at a time. This makes the error clearer.
  2. At present, recover jobs only return how many jobs can be recovered.
    When recover multiple jobids at a time, a certain jobId cannot be executed and no error is reported.
    BTW, if we are recovering one job, and recover failed, what message do we get from console? "Recovered job failed"?
    So there may be no job to recover and no error to send.
    So here is the advantage of specifying only one jobid at a time when recovering

@Sophie-Xie Sophie-Xie merged commit 07cbbac into vesoft-inc:master Apr 7, 2022
@panda-sheep panda-sheep deleted the balance_data_job branch April 7, 2022 10:50
Sophie-Xie pushed a commit that referenced this pull request Apr 7, 2022
* If there are failed or stopped data balance or zone balance job, must firstly recover it

* address wenhaocs's comments

* address comments

* add more comments

* recover job

* add more ut

* adjust comment format

* adjust comment format

* adjust comment format
yixinglu pushed a commit that referenced this pull request Apr 7, 2022
* fix apply outdate membership change (#4107)

* fix apply outdate membership change

* fix transfer leader to '':0 cause crash

Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>

* add code for part peers backward compatible (#4101)

* add code for part peers backward compatible

* add balance keys

* fix bug: pass tests

* add gflag && ignore

* change int to int64_t

* chage to emplace

Co-authored-by: panda-sheep <59197347+panda-sheep@users.noreply.github.com>
Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>

* fixed (#4116)

Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>

* move KW_CLEAR to unresolved keyword (#4118)

* Limits for add and recover balance data or zone balance jobs (#4104)

* If there are failed or stopped data balance or zone balance job, must firstly recover it

* address wenhaocs's comments

* address comments

* add more comments

* recover job

* add more ut

* adjust comment format

* adjust comment format

* adjust comment format

Co-authored-by: Doodle <13706157+critical27@users.noreply.github.com>
Co-authored-by: pengwei.song <90180021+pengweisong@users.noreply.github.com>
Co-authored-by: panda-sheep <59197347+panda-sheep@users.noreply.github.com>
Co-authored-by: haifei.zhao <32253291+zhaohaifei@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick-v3.1 PR: need cherry-pick to this version ready for review ready-for-testing PR: ready for the CI test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants