Add MemberDowngrade failpoint #19038
base: main
Conversation
Skipping CI for Draft Pull Request.
I am getting the following error sometimes. @serathius do you know where this error usually comes from?
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files: see 31 files with indirect coverage changes.
@@ Coverage Diff @@
## main #19038 +/- ##
==========================================
+ Coverage 68.72% 68.81% +0.08%
==========================================
Files 420 420
Lines 35627 35640 +13
==========================================
+ Hits 24483 24524 +41
+ Misses 9713 9692 -21
+ Partials 1431 1424 -7
Continue to review the full report in Codecov by Sentry.
	return nil, err
}
targetVersion := semver.Version{Major: v.Major, Minor: v.Minor - 1}
numberOfMembersToDowngrade := rand.Int()%len(clus.Procs) + 1
Why not downgrade all members? Trying to think whether there are any benefits to testing a partial downgrade. Technically, a partial downgrade is just a subset of the full downgrade procedure.
I think the benefit is to verify that correctness is never broken, no matter how many members are downgraded.
I am thinking that we should explicitly verify the two cases: full downgrade and partial downgrade (see the sketch below).
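A minimal sketch of how both cases could be exercised explicitly, assuming the failpoint keeps its random element; pickDowngradeCount is a hypothetical helper for illustration, not code from this PR:

```go
package sketch

import "math/rand"

// pickDowngradeCount is a hypothetical helper: across many runs it exercises
// both the full-downgrade case (every member) and the partial-downgrade case
// (a random subset that leaves at least one member on the newer version).
func pickDowngradeCount(clusterSize int) int {
	if clusterSize <= 1 || rand.Intn(2) == 0 {
		return clusterSize // full downgrade
	}
	return rand.Intn(clusterSize-1) + 1 // partial downgrade: 1 .. clusterSize-1
}
```

Logging which case was picked would also make a failing run easier to classify.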
Most likely the WAL record is somehow corrupted again. Added more debug info in #19067.
Overall looks good to me, please mark this PR as ready for review when you feel comfortable.
Updated the error message. I tried |
I did not get time to dig into the robustness test's use case of reading the WAL file, but my immediate feeling is that the reason might be that an inappropriate snapshot {Index: 0, Term: 0} is always set for the WAL. etcd/tests/robustness/report/wal.go Line 110 in e0bbea9
If the v2 snapshot files have already been rotated, in other words the very first snapshot files have been purged, then there is data loss from the v2store perspective (you are not reading the WAL records right after the v2 snapshot index), although there isn't data loss from the v3store perspective, and you will definitely see this error. etcd/server/storage/wal/wal.go Lines 488 to 493 in e0bbea9
A thought to double-check this... set a big value for both …
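To make the point about the always-zero snapshot concrete, here is a rough sketch of reading the WAL starting from the newest v2 snapshot instead of always using {Index: 0, Term: 0}. It is an illustration only, not the robustness-report code, and the import paths assume the main-branch layout:

```go
package sketch

import (
	"go.uber.org/zap"

	"go.etcd.io/etcd/server/v3/etcdserver/api/snap"
	"go.etcd.io/etcd/server/v3/storage/wal"
	"go.etcd.io/etcd/server/v3/storage/wal/walpb"
	"go.etcd.io/raft/v3/raftpb"
)

// readEntriesFromLatestSnapshot opens the member's WAL at the newest v2
// snapshot (if any) rather than at {Index: 0, Term: 0}, so entries compacted
// away before that snapshot are no longer expected to be present.
func readEntriesFromLatestSnapshot(lg *zap.Logger, snapDir, walDir string) ([]raftpb.Entry, error) {
	walSnap := walpb.Snapshot{} // fall back to reading from the very beginning
	if s, err := snap.New(lg, snapDir).Load(); err == nil {
		walSnap.Index, walSnap.Term = s.Metadata.Index, s.Metadata.Term
	}
	w, err := wal.OpenForRead(lg, walDir, walSnap)
	if err != nil {
		return nil, err
	}
	defer w.Close()
	_, _, ents, err := w.ReadAll()
	return ents, err
}
```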
@ahrtr Thanks for the suggestion.
The WAL files shouldn't have rotated yet; otherwise it would fail to find the WAL files that match the snap index. etcd/server/storage/wal/wal.go Lines 405 to 408 in 0966b4d
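For reference, the lookup being described works roughly like this: WAL file names encode a sequence number and the first raft index the file may contain, and the reader walks backwards to the last file whose start index is <= the snapshot index; if that file has been purged, the open fails. The following is an illustrative re-implementation under those assumptions, not the actual wal package code:

```go
package sketch

import (
	"strconv"
	"strings"
)

// parseWALName splits a WAL file name of the form
// "<16-hex-digit seq>-<16-hex-digit index>.wal" into its two parts.
func parseWALName(name string) (seq, index uint64, ok bool) {
	if !strings.HasSuffix(name, ".wal") {
		return 0, 0, false
	}
	parts := strings.SplitN(strings.TrimSuffix(name, ".wal"), "-", 2)
	if len(parts) != 2 {
		return 0, 0, false
	}
	seq, err1 := strconv.ParseUint(parts[0], 16, 64)
	index, err2 := strconv.ParseUint(parts[1], 16, 64)
	return seq, index, err1 == nil && err2 == nil
}

// searchIndex returns the position of the last WAL file whose starting index
// is <= snapIndex; if no such file remains (it was rotated and purged),
// there is nothing to open and the read fails.
func searchIndex(names []string, snapIndex uint64) (int, bool) {
	for i := len(names) - 1; i >= 0; i-- {
		if _, index, ok := parseWALName(names[i]); ok && index <= snapIndex {
			return i, true
		}
	}
	return -1, false
}
```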
Please use the
Also I tried to test it locally, but couldn't reproduce it. Please provide detailed steps; I may try it when I get bandwidth.
Refer to
I added my local patch to force test this failpoint only.
# download 3.5 release with
make test-e2e-release
make gofail-enable && make build
cd tests/robustness
go test -run TestRobustnessExploratory -v --count 100 --failfast --timeout 1h
The error msg is:
and the log dump is
I reproduced this locally.
Initial analysis
The direct reason is that some WAL entries were somehow missing right before the snapshot. The snapshot index in this case is 11281; the WAL records in the range 10523 ~ 11281 were missing. Confirmed that the WAL entries in this range were not missing on the other two nodes.
The good news is that the data across the three members is still consistent,
Next step of the investigation
etcd/server/etcdserver/raft.go Lines 180 to 184 in 9fa35e5
cc @siyuanfoundation @fuweid @serathius
Proposal on how to replay WAL in robustness test
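One possible way to do that replay (my assumption about the direction, not the PR's actual plan) is the same pattern etcdserver uses when it restarts: seed a raft MemoryStorage with the snapshot first, then the hard state, then the WAL entries read after the snapshot index. Import paths assume the externalized raft module used on main:

```go
package sketch

import (
	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// replayIntoMemoryStorage rebuilds an in-memory raft log from a snapshot,
// the persisted hard state, and the WAL entries that follow the snapshot.
func replayIntoMemoryStorage(snapshot raftpb.Snapshot, st raftpb.HardState, ents []raftpb.Entry) (*raft.MemoryStorage, error) {
	ms := raft.NewMemoryStorage()
	if err := ms.ApplySnapshot(snapshot); err != nil {
		return nil, err // the snapshot establishes the starting index/term
	}
	if err := ms.SetHardState(st); err != nil {
		return nil, err
	}
	if err := ms.Append(ents); err != nil {
		return nil, err
	}
	return ms, nil
}
```

Replaying this way, the robustness report would presumably only need the entries after each member's latest snapshot rather than a complete log from index 1.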
Confirmed that it has nothing to do with downgrade. Even just stopping & starting multiple members one by one can reproduce this "issue". I think the reason is that the WAL records do not get flushed to disk when the member is being stopped in a high-traffic scenario. Note that just stopping & starting one member can't reproduce it, because the check allows 1 failure for a 3-member cluster. See below: etcd/tests/robustness/report/wal.go Line 64 in 9fa35e5
Note I do not see any issue from the end user's perspective. When some WAL entries are lost because they were not flushed to disk, the member, once started again, will automatically re-sync the data from the leader or receive a snapshot (confirmed).
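Stated as code, the tolerance mentioned above is just a small predicate; this is a hypothetical restatement of the rule (losing the unflushed WAL tail on at most a minority of members is acceptable), not the actual check in tests/robustness/report/wal.go:

```go
package sketch

// canTolerateWALReadFailures captures the rule discussed above: a minority of
// members may have an unreadable/incomplete WAL, because those members simply
// re-sync from the leader (or receive a snapshot) after restarting.
// For a 3-member cluster this allows exactly 1 such member.
func canTolerateWALReadFailures(clusterSize, failedReads int) bool {
	return failedReads <= (clusterSize-1)/2
}
```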
On top of my previous comment,
Thanks @ahrtr for the investigation! @serathius do we need all the records for the robustness test, or are the continuous records before the gap enough?
Yes, currently robustness tests assume all records are persisted. Sorry, I assumed that the error came from etcd, not from robustness tests.
	return nil, err
}
member.Config().ExecPath = e2e.BinPath.EtcdLastRelease
err = patchArgs(member.Config().Args, "initial-cluster-state", "existing")
It isn't necessary to patch the flag --initial-cluster-state, because the data-dir and wal-dir have already been created & initialised. It won't do any harm, but I suggest removing it for simplicity.
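For context, a hypothetical shape of the patchArgs helper used in the diff above (the real helper in this PR may differ); it rewrites an existing --flag=value argument in place and fails loudly if the flag is absent:

```go
package sketch

import (
	"fmt"
	"strings"
)

// patchArgs replaces the value of an existing "--flag=value" argument in the
// member's command line. Returning an error when the flag is missing keeps a
// typo in the flag name from being silently ignored.
func patchArgs(args []string, flag, newValue string) error {
	prefix := "--" + flag + "="
	for i, arg := range args {
		if strings.HasPrefix(arg, prefix) {
			args[i] = prefix + newValue
			return nil
		}
	}
	return fmt.Errorf("--%s flag not found", flag)
}
```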
Signed-off-by: Siyuan Zhang <sizhang@google.com>
LGTM
Thanks @siyuanfoundation for the nice work!
I'd like to get #19095 merged first, to avoid the "failed to read WAL" flake.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ahrtr, siyuanfoundation
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.
#17118