
server: remove KeyValue unmarshal failure panic and send corrupt alarm request instead #13426

Closed
wilsonwang371 wants to merge 4 commits

Conversation

wilsonwang371
Contributor

We found in our enterprise environment that, once in a while, unmarshaling of KeyValue data fails because of corruption. This triggers an etcd panic, and we end up with corrupted data in storage.

To find the root cause of this issue, we need to:

  1. report the unmarshal error instead of panicking
  2. report an unhealthy state so that the admin can take action (restore data, save logs) and we can further investigate what is happening (see the sketch below)
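
A minimal sketch of the idea (the function and the alarmer hook below are my own illustrative names, not the code in this PR): on an unmarshal failure, log a warning and raise the CORRUPT alarm for the member instead of panicking, so the failure is reported rather than hidden behind a restart.

    package mvcc

    import (
        "go.etcd.io/etcd/api/v3/mvccpb"
        "go.uber.org/zap"
    )

    // alarmer is a hypothetical hook standing in for etcdserver's alarm
    // machinery (which activates pb.AlarmType_CORRUPT via a pb.AlarmRequest).
    type alarmer interface {
        RaiseCorruptAlarm()
    }

    // decodeKVs skips and reports corrupt values instead of panicking.
    func decodeKVs(lg *zap.Logger, a alarmer, vals [][]byte) []mvccpb.KeyValue {
        kvs := make([]mvccpb.KeyValue, 0, len(vals))
        for _, v := range vals {
            var kv mvccpb.KeyValue
            if err := kv.Unmarshal(v); err != nil {
                // Report the failure instead of calling lg.Panic: log it and
                // raise the CORRUPT alarm so an admin can intervene.
                lg.Warn("failed to unmarshal mvccpb.KeyValue", zap.Error(err))
                a.RaiseCorruptAlarm()
                continue
            }
            kvs = append(kvs, kv)
        }
        return kvs
    }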

@wilsonwang371
Contributor Author

#12845 may be a related issue.

@wilsonwang371
Contributor Author

#13067 may also be a related issue.

@wilsonwang371
Contributor Author

@ptabor Hi Piotr, can you give some suggestions on this patch?

@ptabor
Contributor

ptabor commented Oct 29, 2021

What do you mean by 'corrupted data in storage'? From the community meeting, I thought that the after-restart state was correct and the system was able to continue after the restart...

This is a big tradeoff between halting execution (a Corrupt alarm causes all calls to keep failing with a Corrupt state) vs. auto-healing by restart. If there is a risk of 'storage corruption', then halting execution seems the right choice. What I'm afraid of is that there might be some class of users who are unaware of the problem because of automated restarts, and this change might lead to accumulating failures at scale when it gets released.

In summary:

  • +1 for expanding logging.
  • An alarm instead of panic() should IMHO be flag-gated to make it an opt-in behavior (unless there is a frequent path that leads to data corruption in the panic() case - but in general, etcd should be ready to have its execution interrupted at any point in time).

@wilsonwang371 force-pushed the watcher_panic_fix branch 2 times, most recently from 03a99cc to 80ede8b on October 29, 2021 at 18:57.
@wilsonwang371
Contributor Author

> What do you mean by 'corrupted data in storage'? From the community meeting, I thought that the after-restart state was correct and the system was able to continue after the restart...
>
> This is a big tradeoff between halting execution (a Corrupt alarm causes all calls to keep failing with a Corrupt state) vs. auto-healing by restart. If there is a risk of 'storage corruption', then halting execution seems the right choice. What I'm afraid of is that there might be some class of users who are unaware of the problem because of automated restarts, and this change might lead to accumulating failures at scale when it gets released.
>
> In summary:
>
>   • +1 for expanding logging.
>   • An alarm instead of panic() should IMHO be flag-gated to make it an opt-in behavior (unless there is a frequent path that leads to data corruption in the panic() case - but in general, etcd should be ready to have its execution interrupted at any point in time).

Makes sense. I have added it as an experimental flag now.

Regarding the corruption details, I am going to collect more information and post it here later.
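
For illustration only, an opt-in gate in the embed config could look roughly like the sketch below; the flag name and struct field are assumptions of mine, not the exact ones added in this PR.

    package main

    import (
        "flag"
        "fmt"
    )

    // config is an illustrative stand-in for the relevant part of embed.Config.
    type config struct {
        ExperimentalCorruptAlarmOnDecodeError bool
    }

    func main() {
        var cfg config
        fs := flag.NewFlagSet("etcd", flag.ExitOnError)
        // Hypothetical flag: the new behavior stays off unless the operator opts in.
        fs.BoolVar(&cfg.ExperimentalCorruptAlarmOnDecodeError,
            "experimental-corrupt-alarm-on-decode-error", false,
            "raise a CORRUPT alarm instead of panicking when a stored KeyValue fails to unmarshal")
        fs.Parse([]string{"--experimental-corrupt-alarm-on-decode-error=true"})
        fmt.Println(cfg.ExperimentalCorruptAlarmOnDecodeError) // true
    }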

@wilsonwang371 force-pushed the watcher_panic_fix branch 7 times, most recently from 4114577 to 2437616 on October 29, 2021 at 23:59.
  for i, v := range vals {
      var kv mvccpb.KeyValue
      if err := kv.Unmarshal(v); err != nil {
-         lg.Panic("failed to unmarshal mvccpb.KeyValue", zap.Error(err))
+         s.store.lg.Fatal("failed to unmarshal mvccpb.KeyValue", zap.Error(err))
Contributor

If we Fatal here (instead of Warn), we still don't propagate the error to the error channel?

Contributor Author

We do. From what I have seen, Fatal does not trigger a panic.
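
For reference on the zap semantics being discussed (my reading of go.uber.org/zap, not something changed by this patch): Panic logs the entry and then calls panic(), while Fatal logs the entry and then calls os.Exit(1) by default, so it stops the process without a panic or stack unwinding.

    package main

    import (
        "errors"

        "go.uber.org/zap"
    )

    func main() {
        lg := zap.NewExample()
        err := errors.New("proto: KeyValue: illegal tag 0 (wire type 0)")

        // Fatal logs and then calls os.Exit(1) by default: no panic, no stack
        // trace from the runtime, and deferred functions do not run.
        lg.Fatal("failed to unmarshal mvccpb.KeyValue", zap.Error(err))

        // Panic would log and then call panic(); unreachable here because the
        // Fatal call above already terminated the process.
        lg.Panic("failed to unmarshal mvccpb.KeyValue", zap.Error(err))
    }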

server/embed/config.go (review thread outdated and resolved)
@wilsonwang371
Contributor Author

@ptabor @gyuho Currently, I am working on a patch for internal testing. I will make sure this can correctly report a health problem in case a real KeyValue decode error happens.

@wilsonwang371 force-pushed the watcher_panic_fix branch 2 times, most recently from bef8a45 to 4664e04 on November 3, 2021 at 18:03.
@hexfusion
Contributor

The Red Hat team just hit this today; full log below.

{"level":"panic","ts":"2021-11-16T15:30:24.774Z","caller":"mvcc/watchable_store.go:414","msg":"failed to unmarshal mvccpb.KeyValue","error":"proto: KeyValue: illegal tag 0 (wire type 0)","stacktrace":"go.etcd.io/etcd/server/v3/mvcc.kvsToEvents\n\t/remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:414\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers\n\t/remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:359\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop\n\t/remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:222"}
panic: failed to unmarshal mvccpb.KeyValue

goroutine 209 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc013779bc0, 0xc06da42a80, 0x1, 0x1)
        /remote-source/cachito-gomod-with-deps/deps/gomod/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc0000aefa0, 0x122a229, 0x23, 0xc06da42a80, 0x1, 0x1)
        /remote-source/cachito-gomod-with-deps/deps/gomod/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/mvcc.kvsToEvents(0xc0000aefa0, 0xc00047aa40, 0xc02176a000, 0x2f9e, 0x32aa, 0xc02051a000, 0x2f9e, 0x32aa, 0x12, 0x7fffffffffffffff, ...)
        /remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:414 +0x296
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers(0xc00047aa00, 0x0)
        /remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:359 +0x3e8
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop(0xc00047aa00)
        /remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:222 +0x1e8
created by go.etcd.io/etcd/server/v3/mvcc.newWatchableStore
        /remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:95 +0x3a7
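
For what it's worth, the "illegal tag 0 (wire type 0)" error in the log above means the value's leading byte does not decode to a valid protobuf field tag (tag 0 is never legal), i.e. the bytes handed to Unmarshal are corrupt. A minimal reproduction, assuming the go.etcd.io/etcd/api/v3/mvccpb package:

    package main

    import (
        "fmt"

        "go.etcd.io/etcd/api/v3/mvccpb"
    )

    func main() {
        var kv mvccpb.KeyValue
        // A leading zero byte is not a valid protobuf field tag, so decoding
        // fails with the same kind of error as in the report above.
        err := kv.Unmarshal([]byte{0x00})
        fmt.Println(err) // proto: KeyValue: illegal tag 0 (wire type 0)
    }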

@chaochn47
Member

chaochn47 commented Nov 18, 2021

Cross-linking here: the impact can be more pronounced if some critical k8s object gets corrupted.

There is an etcd-related issue, kubernetes/kubernetes#69579 ("Corrupted/bitflipped serialized API data present"), that the Amazon team reported starting in 2018.

I suspect our health check system auto-remedies the panic errors with a restart. Does the symptom look similar at ByteDance? @wilsonwang371

@wilsonwang371
Contributor Author

> Cross-linking here: the impact can be more pronounced if some critical k8s object gets corrupted.
>
> There is an etcd-related issue, kubernetes/kubernetes#69579 ("Corrupted/bitflipped serialized API data present"), that the Amazon team reported starting in 2018.
>
> I suspect our health check system auto-remedies the panic errors with a restart. Does the symptom look similar at ByteDance? @wilsonwang371

At ByteDance, we saw a similar issue, but the data was not corrected. I think the reason might be that when the corrupted etcd instance was restarted, it still believed it could recover its state without a full data resync.

@wilsonwang371
Contributor Author

> Cross-linking here: the impact can be more pronounced if some critical k8s object gets corrupted.
>
> There is an etcd-related issue, kubernetes/kubernetes#69579 ("Corrupted/bitflipped serialized API data present"), that the Amazon team reported starting in 2018.
>
> I suspect our health check system auto-remedies the panic errors with a restart. Does the symptom look similar at ByteDance? @wilsonwang371

Are we able to enforce a full resync in case the same failure is detected after a restart?

@chaochn47
Member

> Are we able to enforce a full resync in case the same failure is detected after a restart?

Not yet. We prioritized the mitigation (deleting the noticeably corrupted kv data), so that data may be lost.

I will come back to this thread if there are any new findings. Our team is doing some log scanning on the clusters that previously faced the issue to narrow down where the bitflip happened in etcd.

@hexfusion
Contributor

> Cross-linking here: the impact can be more pronounced if some critical k8s object gets corrupted.
> There is an etcd-related issue, kubernetes/kubernetes#69579 ("Corrupted/bitflipped serialized API data present"), that the Amazon team reported starting in 2018.
> I suspect our health check system auto-remedies the panic errors with a restart. Does the symptom look similar at ByteDance? @wilsonwang371

> At ByteDance, we saw a similar issue, but the data was not corrected. I think the reason might be that when the corrupted etcd instance was restarted, it still believed it could recover its state without a full data resync.

This makes a lot of sense to me, given the situation we are observing.

@hexfusion
Contributor

hexfusion commented Nov 18, 2021

> At ByteDance, we saw a similar issue, but the data was not corrected. I think the reason might be that when the corrupted etcd instance was restarted, it still believed it could recover its state without a full data resync.

@wilsonwang371 are you saying that etcd never came back up (i.e., stayed corrupted)? We saw etcd panic as above and then restart fine, but I believe the problem still exists. kube-controller-manager, for example, is crash-looping, unable to update its lease. I think this aligns with some of @chaochn47's observations.

@wilsonwang371
Contributor Author

> At ByteDance, we saw a similar issue, but the data was not corrected. I think the reason might be that when the corrupted etcd instance was restarted, it still believed it could recover its state without a full data resync.
>
> @wilsonwang371 are you saying that etcd never came back up (i.e., stayed corrupted)? We saw etcd panic as above and then restart fine, but I believe the problem still exists. kube-controller-manager, for example, is crash-looping, unable to update its lease. I think this aligns with some of @chaochn47's observations.

Let me rephrase what we saw at ByteDance:

  1. KeyValue decoding failed, which triggered a panic.
  2. The etcd instance restarted after the panic.
  3. The restarted etcd still believed it could catch up without a full resync, so the corrupted KeyValue still existed.
  4. Read requests to etcd could return stale data on the restarted instance.

@hexfusion
Contributor

Yes, for the record: on the peer where we saw the panic, we replaced the member, and it appears the cluster has now reconciled and is working as expected.

@wilsonwang371
Contributor Author

wilsonwang371 commented Nov 19, 2021

An update here: with my patch, we can observe the data corruption issue being reported. However, after some further discussion, we need several things here:

  1. report the health issue (DONE)
  2. save important logs for us to do some further investigation (Partially done)
  3. do a full resync if we can, since the current state of the etcd instance is no longer valid (Optional)

The first item is done in the patch.
Regarding the second item, I am going to add more logging so that we can extract more useful information the next time this happens.
For the third, we may want to add a flag so that even if the KeyValue decode fails, we can force a resync and the failed etcd instance can recover.

What do you guys think?

@wilsonwang371
Contributor Author

wilsonwang371 commented Nov 24, 2021

@chaochn47 @hexfusion @gyuho @ptabor

Do you guys think calling CleanupWAL() before the panic will work in case of a KeyValue decode failure?

I assume this will clean up the entire WAL. When we restart the failed etcd instance, it will rebuild its WAL and BoltDB data from the leader and recover. However, I have not completely confirmed this yet.

@wilsonwang371
Contributor Author

Root cause might be here: #13505

@wilsonwang371
Contributor Author

If #13505 is the root cause, we can check in that pull request and close this one.
