-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix watch event loss #17555
Fix watch event loss #17555
Conversation
d9a3dc8
to
4025f36
Compare
/retest (Will only retest workflows that have failed by default) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The change is small and it's easy to backport. Thanks
Thanks for the fix. The fix is simple and makes sense to me, but the 230+ lines of e2e test is a little over complicated to me. Are you able to create a simple unit test to simulate the case that w.ch is full instead? etcd/server/storage/mvcc/watcher_group.go Lines 254 to 260 in b81936d
Also when I read the source code, I found there maybe another slightly related issue. As the code showed above, when etcd/server/storage/mvcc/watcher_group.go Lines 228 to 236 in b81936d
|
Yeah, I have confirmed it would cause compacted watcher still exists in the original watcher group after Also I observed that slow watchers metrics is 829 even if I only opened 801 watchers. Haven't yet dig into that but it could be related.
I can submit another separate PR to fix it. |
I would like to keep the existing e2e test to ensure no regression of etcd watch behavior after the fix gets in. I would simplify it by removing unnecessary code paths. Unit test is good to have and I am working on it. |
Please ping me back when you resolve all comments. |
OK |
I think we might need to do more changes in the watch code soon. Let's go with a minimal fix proposed by the original PR to address the issue. @chaochn47 can you simplify the e2e test? I should be able to help with if needed. |
4025f36
to
7ff1da9
Compare
Ping @ahrtr @serathius |
Have you considered using a gofailpoint to simulate watch stream being clogged? I managed to reproduce the issue using a following test:
Where |
Thanks! I have updated the integration test based on the one you provided! Added unit test, integration test with failpoint and e2e test together and let you guys pick which one is the best. |
1712f23
to
12514b9
Compare
The e2e test is still a little over complicated to me. It will definitely be painful for other contributors to update/maintain such e2e test in future. I think it should be part of the robustness test, which already supports generating traffic, watching, compaction and verifying results (including watchResponses), why do you have to write similar complex test in e2e test suite? But I won't insist on that. |
12514b9
to
4ec7409
Compare
Okay. This e2e test is removed from the PR and can be added in robustness or performance test suite later. This test was created originally to simulate kubernetes traffic. With unit test and integration test, it's enough to capture regression. |
4ec7409
to
f4421bb
Compare
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with two very minor comments, which can be resolved separately.
We can let this PR in for now if there is no other comment. cc @serathius
Signed-off-by: Chao Chen <chaochn@amazon.com>
f4421bb
to
405862e
Compare
@@ -250,6 +251,63 @@ func TestWatchCompacted(t *testing.T) { | |||
} | |||
} | |||
|
|||
func TestWatchNoEventLossOnCompact(t *testing.T) { | |||
oldChanBufLen, oldMaxWatchersPerSync := chanBufLen, maxWatchersPerSync |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how relevant is maxWatchersPerSync to the issue and this test as len(watchers) < 4
.
What about the case where len(watchers) > maxWatchersPerSync
as pointed out in #17555 (comment) ? I haven't verified it, but I expect that unremoved watcher from s.unsynced
will cause syncWatchers
to return != 0, causing function to be called earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how relevant is maxWatchersPerSync to the issue and this test as
len(watchers) < 4
.
It seems to be true. Confirmed that it always runs into the if branch (see below), and confirmed that it has no any impact on the test case.
But it isn't a big deal, and also from another prospective it should be OK to explicitly set a value to ensure len(wg.watchers) < maxWatchers
although it's already true
by default.
etcd/server/storage/mvcc/watcher_group.go
Lines 225 to 227 in 63e394d
if len(wg.watchers) < maxWatchers { | |
return wg, wg.chooseAll(curRev, compactRev) | |
} |
What about the case where len(watchers) > maxWatchersPerSync as pointed out in #17555 (comment) ?
Suggest to discuss & fix it separately. We may want to do minor local code refactor. @chaochn47 are you able to continue to work on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
confirmed that it has no any impact on the test case.
I mean that the even I don't change maxWatchersPerSync
in the test case , the test case could also reproduce the issue without the fix and the issue disappeared after applying the patch in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good
// if its next revision of events are compacted and no lost events sent to client. | ||
func TestV3NoEventsLostOnCompact(t *testing.T) { | ||
if integration.ThroughProxy { | ||
t.Skip("grpc proxy currently does not support requesting progress notifications") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think this comment is relevant.
@chaochn47 do you have bandwidth to backport this to 3.5 and 3.4? We may need to release a 3.4 and 3.5 patch soon. |
Signed-off-by: Chao Chen <chaochn@amazon.com>
Signed-off-by: Chao Chen <chaochn@amazon.com>
@ahrtr @jmhbnz sure. Just provided backports PRs. Could you please take a look? |
…-loss-after-compaction [release-3.4] backport fix watch event loss after compaction #17555
Fix #17529
@fuweid @serathius @ahrtr
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.