stability: retry truncating sst files upon failure #484

Merged
merged 11 commits into from
May 27, 2019

Conversation

Contributor

@zyguan zyguan commented May 13, 2019

What problem does this PR solve?

SST files may be deleted after compaction, so TruncateSSTFile may fail unexpectedly.

This PR closes: #501

What is changed and how it works?

Add retry logic: the truncate operation is retried up to retryLimit times before an error is returned.

Check List

Tests

  • E2E test
  • Stability test

Code changes

  • Has Helm charts change
  • Has Go code change
  • Has CI related scripts change
  • Has documents change

Side effects

  • Breaking backward compatibility

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Does this PR introduce a user-facing change?:

NONE

sst files may be deleted after compaction
stdout, stderr, err := exec("find", "/var/lib/tikv/db", "-name", "*.sst", "-o", "-name", "*.save")
if err != nil {
glog.Errorf("list sst files: stderr=%s err=%s", stderr, err.Error())
return errors.Annotate(err, "list sst files")
Contributor

Suggested change
- return errors.Annotate(err, "list sst files")
+ continue

}
}
if len(sst) == 0 {
return errors.New("cannot find a sst file")
Contributor

Suggested change
- return errors.New("cannot find a sst file")
+ glog.Error("cannot find a sst file")
+ continue

return errors.Annotate(err, "list sst files")
}
retryCount := 0
for ; retryCount < retryLimit; retryCount++ {
Contributor

Suggested change
- for ; retryCount < retryLimit; retryCount++ {
+ for ; retryCount < retryLimit; retryCount++ {
+     time.Sleep(10*time.Second)


sstCandidates := make(map[string]bool)
sstCandidates := make(map[string]bool)
Contributor

delete the blanks

Contributor Author

The extra indent is there because this line is now inside the retry loop.

glog.Errorf("truncate sst file: stderr=%s err=%s", stderr, err.Error())
return errors.Annotate(err, "truncate sst file")
if retryCount == retryLimit {
return errors.New("failed to truncate sst file after " + strconv.Itoa(retryLimit) + " trials")
Contributor

The error log needs the opts.Namespace and opts.Cluster fields.

Contributor Author

I added annotations to the log calls. However, it is the caller who passes the opts argument, so the caller must already know this information; I don't think these fields need to be added to the returned error.

Contributor

But the caller has not added these fields to its error log either.
ref: https://github.com/pingcap/tidb-operator/blob/master/tests/failover.go#L74-L81

Contributor Author

I added some error logs to the case func. If you think every error should be annotated with additional fields, it would be better to do that in a separate PR, because most methods of operatorActions do not do so.

@weekface
Contributor

/run-e2e-tests

@zyguan
Contributor Author

zyguan commented May 20, 2019

/run-e2e-tests

@weekface weekface added the test/stability stability tests label May 21, 2019
@weekface
Contributor

/run-e2e-tests

tennix
tennix previously approved these changes May 24, 2019
Member

@tennix tennix left a comment

The code needs to be formatted; the rest LGTM.

@weekface
Contributor

@zyguan PTAL

@weekface
Contributor

/run-e2e-tests

@zyguan
Contributor Author

zyguan commented May 27, 2019

/run-e2e-tests

@weekface weekface merged commit 8360887 into pingcap:master May 27, 2019
Labels
test/stability stability tests
Development

Successfully merging this pull request may close these issues.

stability test: the TiKV instance didn't crash when we truncate a sst file of it
4 participants