
server: remove KeyValue unmarshal failure panic and send corrupt alarm request instead #13426

Closed
wilsonwang371 wants to merge 4 commits

Conversation

wilsonwang371
Contributor

We found in our enterprise environment that, once in a while, unmarshaling of KeyValue data fails because of corruption. This triggers an etcd panic, and we end up with corrupted data in storage.

To find the root cause of this issue, we need to:

  1. report the unmarshal error instead of panicking
  2. report an unhealthy state so that the admin can take action (restore data, save logs) and we can further investigate what is happening (see the sketch below)
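
A minimal sketch of the idea (the function and the alarmer hook below are my own illustrative names, not the code in this PR): on an unmarshal failure, log a warning and raise the CORRUPT alarm for the member instead of panicking, so the failure is reported rather than hidden behind a restart.

    package mvcc

    import (
        "go.etcd.io/etcd/api/v3/mvccpb"
        "go.uber.org/zap"
    )

    // alarmer is a hypothetical hook standing in for etcdserver's alarm
    // machinery (which activates pb.AlarmType_CORRUPT via a pb.AlarmRequest).
    type alarmer interface {
        RaiseCorruptAlarm()
    }

    // decodeKVs skips and reports corrupt values instead of panicking.
    func decodeKVs(lg *zap.Logger, a alarmer, vals [][]byte) []mvccpb.KeyValue {
        kvs := make([]mvccpb.KeyValue, 0, len(vals))
        for _, v := range vals {
            var kv mvccpb.KeyValue
            if err := kv.Unmarshal(v); err != nil {
                // Report the failure instead of calling lg.Panic: log it and
                // raise the CORRUPT alarm so an admin can intervene.
                lg.Warn("failed to unmarshal mvccpb.KeyValue", zap.Error(err))
                a.RaiseCorruptAlarm()
                continue
            }
            kvs = append(kvs, kv)
        }
        return kvs
    }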

@wilsonwang371
Contributor Author

#12845 may be a related issue.

@wilsonwang371
Contributor Author

#13067 may also be a related issue.

@wilsonwang371
Contributor Author

@ptabor Hi Piotr, can you give some suggestions on this patch?

@ptabor
Contributor

ptabor commented Oct 29, 2021

What do you mean by 'corrupted data in storage'? From the community meeting, I thought that the after-restart state was correct and the system was able to continue after the restart...

This is a big tradeoff between halting execution (a Corrupt alarm causes all calls to keep failing with a Corrupt state) vs. auto-healing by restart. If there is a risk of 'storage corruption', then halting execution seems the right choice. What I'm afraid of is that there might be some class of users who are unaware of the problem because of automated restarts, and this change might lead to accumulating failures at scale when it gets released.

In summary:

  • +1 for expanding logging.
  • An alarm instead of panic() should IMHO be flag-gated to make it an opt-in behavior (unless there is a frequent path that leads to data corruption in the panic() case - but in general, etcd should be ready to have its execution interrupted at any point in time).

@wilsonwang371 force-pushed the watcher_panic_fix branch 2 times, most recently from 03a99cc to 80ede8b on October 29, 2021 at 18:57.
@wilsonwang371
Contributor Author

> What do you mean by 'corrupted data in storage'? From the community meeting, I thought that the after-restart state was correct and the system was able to continue after the restart...
>
> This is a big tradeoff between halting execution (a Corrupt alarm causes all calls to keep failing with a Corrupt state) vs. auto-healing by restart. If there is a risk of 'storage corruption', then halting execution seems the right choice. What I'm afraid of is that there might be some class of users who are unaware of the problem because of automated restarts, and this change might lead to accumulating failures at scale when it gets released.
>
> In summary:
>
>   • +1 for expanding logging.
>   • An alarm instead of panic() should IMHO be flag-gated to make it an opt-in behavior (unless there is a frequent path that leads to data corruption in the panic() case - but in general, etcd should be ready to have its execution interrupted at any point in time).

Makes sense. I have added it as an experimental flag now.

Regarding the corruption details, I am going to collect more information and post it here later.
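
For illustration only, an opt-in gate in the embed config could look roughly like the sketch below; the flag name and struct field are assumptions of mine, not the exact ones added in this PR.

    package main

    import (
        "flag"
        "fmt"
    )

    // config is an illustrative stand-in for the relevant part of embed.Config.
    type config struct {
        ExperimentalCorruptAlarmOnDecodeError bool
    }

    func main() {
        var cfg config
        fs := flag.NewFlagSet("etcd", flag.ExitOnError)
        // Hypothetical flag: the new behavior stays off unless the operator opts in.
        fs.BoolVar(&cfg.ExperimentalCorruptAlarmOnDecodeError,
            "experimental-corrupt-alarm-on-decode-error", false,
            "raise a CORRUPT alarm instead of panicking when a stored KeyValue fails to unmarshal")
        fs.Parse([]string{"--experimental-corrupt-alarm-on-decode-error=true"})
        fmt.Println(cfg.ExperimentalCorruptAlarmOnDecodeError) // true
    }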

@wilsonwang371 force-pushed the watcher_panic_fix branch 7 times, most recently from 4114577 to 2437616 on October 29, 2021 at 23:59.
  for i, v := range vals {
      var kv mvccpb.KeyValue
      if err := kv.Unmarshal(v); err != nil {
-         lg.Panic("failed to unmarshal mvccpb.KeyValue", zap.Error(err))
+         s.store.lg.Fatal("failed to unmarshal mvccpb.KeyValue", zap.Error(err))
Contributor

If we Fatal here (instead of Warn), we still don't propagate the error to the error channel?

Contributor Author

We do. From what I have seen, Fatal does not trigger a panic.
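
For reference on the zap semantics being discussed (my reading of go.uber.org/zap, not something changed by this patch): Panic logs the entry and then calls panic(), while Fatal logs the entry and then calls os.Exit(1) by default, so it stops the process without a panic or stack unwinding.

    package main

    import (
        "errors"

        "go.uber.org/zap"
    )

    func main() {
        lg := zap.NewExample()
        err := errors.New("proto: KeyValue: illegal tag 0 (wire type 0)")

        // Fatal logs and then calls os.Exit(1) by default: no panic, no stack
        // trace from the runtime, and deferred functions do not run.
        lg.Fatal("failed to unmarshal mvccpb.KeyValue", zap.Error(err))

        // Panic would log and then call panic(); unreachable here because the
        // Fatal call above already terminated the process.
        lg.Panic("failed to unmarshal mvccpb.KeyValue", zap.Error(err))
    }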

server/embed/config.go (review thread outdated and resolved)
@wilsonwang371
Contributor Author

@ptabor @gyuho Currently, I am working on a patch for internal testing. I will make sure this can correctly report a health problem in case a real KeyValue decode error happens.

@wilsonwang371 force-pushed the watcher_panic_fix branch 2 times, most recently from bef8a45 to 4664e04 on November 3, 2021 at 18:03.
@hexfusion
Contributor

The Red Hat team just hit this today; full log below.

{"level":"panic","ts":"2021-11-16T15:30:24.774Z","caller":"mvcc/watchable_store.go:414","msg":"failed to unmarshal mvccpb.KeyValue","error":"proto: KeyValue: illegal tag 0 (wire type 0)","stacktrace":"go.etcd.io/etcd/server/v3/mvcc.kvsToEvents\n\t/remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:414\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers\n\t/remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:359\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop\n\t/remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:222"}
panic: failed to unmarshal mvccpb.KeyValue

goroutine 209 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc013779bc0, 0xc06da42a80, 0x1, 0x1)
        /remote-source/cachito-gomod-with-deps/deps/gomod/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc0000aefa0, 0x122a229, 0x23, 0xc06da42a80, 0x1, 0x1)
        /remote-source/cachito-gomod-with-deps/deps/gomod/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/mvcc.kvsToEvents(0xc0000aefa0, 0xc00047aa40, 0xc02176a000, 0x2f9e, 0x32aa, 0xc02051a000, 0x2f9e, 0x32aa, 0x12, 0x7fffffffffffffff, ...)
        /remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:414 +0x296
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers(0xc00047aa00, 0x0)
        /remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:359 +0x3e8
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop(0xc00047aa00)
        /remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:222 +0x1e8
created by go.etcd.io/etcd/server/v3/mvcc.newWatchableStore
        /remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:95 +0x3a7
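
For what it's worth, the "illegal tag 0 (wire type 0)" error in the log above means the value's leading byte does not decode to a valid protobuf field tag (tag 0 is never legal), i.e. the bytes handed to Unmarshal are corrupt. A minimal reproduction, assuming the go.etcd.io/etcd/api/v3/mvccpb package:

    package main

    import (
        "fmt"

        "go.etcd.io/etcd/api/v3/mvccpb"
    )

    func main() {
        var kv mvccpb.KeyValue
        // A leading zero byte is not a valid protobuf field tag, so decoding
        // fails with the same kind of error as in the report above.
        err := kv.Unmarshal([]byte{0x00})
        fmt.Println(err) // proto: KeyValue: illegal tag 0 (wire type 0)
    }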

@chaochn47
Member

chaochn47 commented Nov 18, 2021

Cross-linking here: the impact can be more pronounced if some critical k8s object gets corrupted.

There is an etcd-related issue, kubernetes/kubernetes#69579 ("Corrupted/bitflipped serialized API data present"), that the Amazon team reported starting in 2018.

I suspect our health check system auto-remedies the panic errors with a restart. Does the symptom look similar at ByteDance? @wilsonwang371

@wilsonwang371
Contributor Author

> Cross-linking here: the impact can be more pronounced if some critical k8s object gets corrupted.
>
> There is an etcd-related issue, kubernetes/kubernetes#69579 ("Corrupted/bitflipped serialized API data present"), that the Amazon team reported starting in 2018.
>
> I suspect our health check system auto-remedies the panic errors with a restart. Does the symptom look similar at ByteDance? @wilsonwang371

At ByteDance, we saw a similar issue, but the data was not corrected. I think the reason might be that when the corrupted etcd instance was restarted, it still believed it could recover its state without a full data resync.

@wilsonwang371
Contributor Author

> Cross-linking here: the impact can be more pronounced if some critical k8s object gets corrupted.
>
> There is an etcd-related issue, kubernetes/kubernetes#69579 ("Corrupted/bitflipped serialized API data present"), that the Amazon team reported starting in 2018.
>
> I suspect our health check system auto-remedies the panic errors with a restart. Does the symptom look similar at ByteDance? @wilsonwang371

Are we able to enforce a full resync in case the same failure is detected after a restart?

@chaochn47
Member

> Are we able to enforce a full resync in case the same failure is detected after a restart?

Not yet. We prioritized the mitigation (deleting the noticeably corrupted kv data), so that data may be lost.

I will come back to this thread if there are any new findings. Our team is doing some log scanning on the clusters that previously faced the issue to narrow down where the bitflip happened in etcd.

@hexfusion
Contributor

> Cross-linking here: the impact can be more pronounced if some critical k8s object gets corrupted.
> There is an etcd-related issue, kubernetes/kubernetes#69579 ("Corrupted/bitflipped serialized API data present"), that the Amazon team reported starting in 2018.
> I suspect our health check system auto-remedies the panic errors with a restart. Does the symptom look similar at ByteDance? @wilsonwang371

> At ByteDance, we saw a similar issue, but the data was not corrected. I think the reason might be that when the corrupted etcd instance was restarted, it still believed it could recover its state without a full data resync.

This makes a lot of sense to me, given the situation we are observing.

@hexfusion
Contributor

hexfusion commented Nov 18, 2021

> At ByteDance, we saw a similar issue, but the data was not corrected. I think the reason might be that when the corrupted etcd instance was restarted, it still believed it could recover its state without a full data resync.

@wilsonwang371 are you saying that etcd never came back up (i.e., stayed corrupted)? We saw etcd panic as above and then restart fine, but I believe the problem still exists. kube-controller-manager, for example, is crash-looping, unable to update its lease. I think this aligns with some of @chaochn47's observations.

@wilsonwang371
Contributor Author

> At ByteDance, we saw a similar issue, but the data was not corrected. I think the reason might be that when the corrupted etcd instance was restarted, it still believed it could recover its state without a full data resync.
>
> @wilsonwang371 are you saying that etcd never came back up (i.e., stayed corrupted)? We saw etcd panic as above and then restart fine, but I believe the problem still exists. kube-controller-manager, for example, is crash-looping, unable to update its lease. I think this aligns with some of @chaochn47's observations.

Let me rephrase what we saw at ByteDance:

  1. KeyValue decoding failed, which triggered a panic.
  2. The etcd instance restarted after the panic.
  3. The restarted etcd still believed it could catch up without a full resync, so the corrupted KeyValue still existed.
  4. Read requests to etcd could return stale data on the restarted instance.

@hexfusion
Contributor

Yes, for the record: on the peer where we saw the panic, we replaced the member, and it appears the cluster has now reconciled and is working as expected.

@wilsonwang371
Contributor Author

wilsonwang371 commented Nov 19, 2021

An update here: with my patch, we can observe the data corruption issue being reported. However, after some further discussion, we need several things here:

  1. report the health issue (DONE)
  2. save important logs for us to do some further investigation (Partially done)
  3. do a full resync if we can, since the current state of the etcd instance is no longer valid (Optional)

The first item is done in the patch.
Regarding the second item, I am going to add more logging so that we can extract more useful information the next time this happens.
For the third, we may want to add a flag so that even if the KeyValue decode fails, we can force a resync and the failed etcd instance can recover.

What do you guys think?

@wilsonwang371
Contributor Author

wilsonwang371 commented Nov 24, 2021

@chaochn47 @hexfusion @gyuho @ptabor

Do you guys think calling CleanupWAL() before the panic will work in case of a KeyValue decode failure?

I assume this will clean up the entire WAL. When we restart the failed etcd instance, it will rebuild its WAL and BoltDB data from the leader and recover. However, I have not completely confirmed this yet.

@wilsonwang371
Contributor Author

Root cause might be here: #13505

@wilsonwang371
Contributor Author

If #13505 is the root cause, we can check in that pull request and close this one.
