server: remove KeyValue unmarshal failure panic and send corrupt alarm request instead #13426
Conversation
#12845 may be a related issue.
#13067 can also be another related issue.
@ptabor Hi Piotr, can you give some suggestions on this patch?
What do you mean by 'corrupted data in storage'? From the community meeting, I thought that the after-restart state was correct and the system was able to continue after restart...

This is a big tradeoff between holding execution (the Corrupt alarm causes all calls to keep failing with the Corrupt state) vs. auto-healing by restart. If there is a risk of 'storage corruption', then holding execution seems the right choice. What I'm afraid of is that there might be some class of users who are unaware of the problem because of automated restarts, and this change might lead to accumulating failures at scale when it gets released. In summary:
Force-pushed from 03a99cc to 80ede8b
Makes sense. I added it as an experimental flag now. Regarding the corruption details, I am going to collect more information and post it here later.
Force-pushed from 4114577 to 2437616
	for i, v := range vals {
		var kv mvccpb.KeyValue
		if err := kv.Unmarshal(v); err != nil {
-			lg.Panic("failed to unmarshal mvccpb.KeyValue", zap.Error(err))
+			s.store.lg.Fatal("failed to unmarshal mvccpb.KeyValue", zap.Error(err))
If we Fatal here (instead of Warn), we still don't propagate the error to the error channel?
We do. From what I have seen, Fatal does not trigger a Panic.
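For context, a minimal sketch of the direction the PR title describes, not the actual etcd internals: on an unmarshal failure, log a warning and hand the error back to the caller so it can raise a CORRUPT alarm instead of panicking. The restoreChunk and reportCorrupt names are assumptions for illustration.

package mvcc // illustrative package name

import (
	"go.etcd.io/etcd/api/v3/mvccpb" // etcd v3.5 module layout assumed
	"go.uber.org/zap"
)

// restoreChunk mirrors the loop in the diff above, but instead of panicking
// on a corrupt value it warns, reports the corruption, and returns the error
// so the server can raise an AlarmType_CORRUPT alarm and stop serving.
// restoreChunk and reportCorrupt are illustrative names, not etcd's real API.
func restoreChunk(lg *zap.Logger, vals [][]byte, reportCorrupt func(error)) error {
	for i, v := range vals {
		var kv mvccpb.KeyValue
		if err := kv.Unmarshal(v); err != nil {
			lg.Warn("failed to unmarshal mvccpb.KeyValue",
				zap.Int("index", i), zap.Error(err))
			reportCorrupt(err) // e.g. queue a corrupt-alarm request
			return err
		}
		_ = kv // normal restore work would continue here with kv
	}
	return nil
}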
Force-pushed from 2437616 to 500f3c8
Force-pushed from bef8a45 to 4664e04
Force-pushed from 4664e04 to 88fc2b9
The Red Hat team just hit this today; full log below.
Cross-linked here: the impact can be more pronounced if some critical k8s object gets corrupted. There is a "Corrupted/bitflipped serialized API data present in etcd" issue, kubernetes/kubernetes#69579, which the Amazon team reported starting from 2018. I suspect our health check system auto-remedies the panic errors with a restart. Does the symptom look similar at ByteDance? @wilsonwang371
At ByteDance, we saw a similar issue, but the data was not corrected. I think the reason might be that when the corrupted etcd instance was restarted, it still believed it could recover its state without a full data resync.
Are we able to enforce a full resync in case the same failure is detected after restart?
Not yet. We prioritized the mitigations (deleting the noticeably corrupted kv data), so the data may be lost. I will come back to this thread if there are any new findings. Our team is doing some log scanning on the clusters that faced the issue before, to narrow down where the bitflip issue happened in etcd.
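For illustration, a hedged sketch of such an offline scan, assuming the etcd v3.5 module layout and that it runs against a copy of a member's backend file (member/snap/db), never a live one; it only reports values in the mvcc "key" bucket that no longer unmarshal as mvccpb.KeyValue.

package main

import (
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
	"go.etcd.io/etcd/api/v3/mvccpb"
)

// Scans a copy of an etcd backend file for mvcc values that fail to
// unmarshal. Opens the file read-only and only prints findings.
func main() {
	db, err := bolt.Open(os.Args[1], 0400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("key")) // mvcc key-value bucket
		if b == nil {
			return fmt.Errorf("bucket %q not found", "key")
		}
		return b.ForEach(func(k, v []byte) error {
			var kv mvccpb.KeyValue
			if err := kv.Unmarshal(v); err != nil {
				fmt.Printf("corrupt value at revision key %x: %v\n", k, err)
			}
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}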
This makes a lot of sense to me given the situation we are observing.
@wilsonwang371 are you saying that etcd never came back up (corrupted)? We saw etcd panic as above and then restart fine, but I believe the problem still exists. kube-controller-manager, for example, is crashlooping, unable to update its lease. I think this aligns with some of @chaochn47's observations.
Let me rephrase what we saw at ByteDance:
Yes, for the record: on the peer where we saw the panic, we replaced the member, and it appears the cluster has now reconciled and is working as expected.
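For reference, a hedged sketch of that member replacement using the clientv3 Cluster API; the endpoint, member ID, and peer URL are placeholders, and the same steps are usually done with etcdctl member remove / member add. The new member rebuilds its data from the leader when started with --initial-cluster-state=existing.

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// Replace a suspect member: remove it first, then add a fresh member that
// will receive a snapshot from the leader when it joins.
func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://healthy-member:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	const badMemberID = 0x1234abcd5678ef90 // placeholder: look it up via MemberList
	if _, err := cli.MemberRemove(ctx, badMemberID); err != nil {
		log.Fatal(err)
	}
	if _, err := cli.MemberAdd(ctx, []string{"https://new-member:2380"}); err != nil {
		log.Fatal(err)
	}
}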
Some update here: with my patch, we can observe the reported data corruption issue. However, after some further discussion, we need several things here.
The first thing is done in the patch. What do you guys think?
Force-pushed from 759e6c2 to 20b9774
@chaochn47 @hexfusion @gyuho @ptabor Do you guys think calling …? I assume this will clean up all the WAL. When we restart that failed etcd instance, it will rebuild its WAL & BoltDB data from the leader and recover. However, I haven't confirmed this completely yet.
Force-pushed from 20b9774 to 9643999
Force-pushed from 9643999 to ab801f1
The root cause might be here: #13505
If #13505 is the root cause, we can check in that pull request and close this one.
We found in our enterprise environment that, once in a while, unmarshaling of KeyValue data fails due to corruption. This triggers an etcd panic, and finally we end up having corrupted data in storage.
To find the root cause of this issue, we need to: