Rebalance causes some partition streams to be duplicated #280
Comments
Trying to describe my understanding of the problem in more detail: there are 2 lists of
When a topic-partition is revoked:
When a new topic-partition is assigned, a new stream is created regardless of whether there was still a matching request in
I'm having a hard time reproducing the issue consistently. I corrected the previous message to reflect that the gap during which
Another case I suspect is when
To be the safest possible, should
Folks, sorry for the delay here. I've not been able to find the time to fix this, but I am writing down how this should be fixed because we will work on this soon:
Diagnosis
Possible fix
Obviously the fix should include a test with a reproduction, etc. @narma has helpfully ported a test case from fs2-kafka which we can use with attribution: https://gist.github.com/narma/b63b5ec99d0b722f7658efd7111f09c0
Any updates on this? We've seen this behavior in our service too.
Anyway, this is critical for our company, so I'm working on implementing Itamar's fix instructions. I may need help with the testing part, but that can be discussed in the future PR.
When a rebalance happens, there is a chance that some partitions that are revoked and immediately re-assigned to the same consumer get duplicated (all messages received twice in the consumer stream).
Here's how to reproduce the problem (my code is not open source):
First add those 2 `println` in `newPartitionStream`, when a partition stream is created then ended:
`Start stream for ...`
✅ `Start stream for ...` on node B and 150x `Finish stream for ...` on node A
✅ `Finish stream for ...` on node A followed by 300x `Start stream for ...` on node A

This seems to happen because `endRevoked` only stops partition streams that have a pending `Request`, which is not all of them. In normal cases, this is okay because those other streams are stopped by `handleRequests` when a new request is created and it realizes the partition is no longer assigned. But in this particular case of rebalance, the partitions are reassigned right after being revoked, so `handleRequests` doesn't help. Instead, a new partition stream is created and the old one keeps running too.

One way to solve this might be to defer the revoke/assign logic while we're in the middle of rebalancing? Not sure if it's safe enough though.
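
To make the failure mode described above concrete, here is a toy model in plain Scala. It is only a sketch under stated assumptions: `PartitionStream`, `State`, `endRevokedPendingOnly`, `endRevokedAll`, `assign` and `RebalanceModel` are hypothetical names invented for illustration and are not the library's actual internals. It assumes that revocation ends only the streams with a pending request and that assignment always starts a fresh stream, and it contrasts that with ending every stream of a revoked partition.

```scala
// Toy model only: all types and names here are hypothetical, not the library's internals.
final case class PartitionStream(id: Int, partition: Int, hasPendingRequest: Boolean)

final case class State(streams: Vector[PartitionStream], nextId: Int) {
  // Behaviour described in the issue: only streams with a pending request are ended.
  def endRevokedPendingOnly(revoked: Set[Int]): State =
    copy(streams = streams.filterNot(s => revoked(s.partition) && s.hasPendingRequest))

  // One possible alternative (an assumption, not the maintainers' fix):
  // end every stream whose partition was revoked.
  def endRevokedAll(revoked: Set[Int]): State =
    copy(streams = streams.filterNot(s => revoked(s.partition)))

  // Assignment always starts a fresh stream for each assigned partition.
  def assign(assigned: Set[Int]): State =
    assigned.toVector.sorted.foldLeft(this) { (st, p) =>
      st.copy(
        streams = st.streams :+ PartitionStream(st.nextId, p, hasPendingRequest = true),
        nextId = st.nextId + 1
      )
    }

  // Partitions currently served by more than one live stream.
  def duplicated: Set[Int] =
    streams.groupBy(_.partition).collect { case (p, ss) if ss.size > 1 => p }.toSet
}

object RebalanceModel {
  // Partition 0 has a pending request; partition 1 does not.
  val initial: State = State(
    Vector(
      PartitionStream(0, 0, hasPendingRequest = true),
      PartitionStream(1, 1, hasPendingRequest = false)
    ),
    nextId = 2
  )

  // Both partitions are revoked and immediately re-assigned to the same consumer.
  val buggy: State = initial.endRevokedPendingOnly(Set(0, 1)).assign(Set(0, 1))
  val fixed: State = initial.endRevokedAll(Set(0, 1)).assign(Set(0, 1))
}
```

In this model, `RebalanceModel.buggy.duplicated` is `Set(1)`: the old stream for partition 1 had no pending request, so it survives the revoke and runs alongside the newly created one. `RebalanceModel.fixed.duplicated` is empty. Whether ending all revoked streams is safe in the real run loop is a separate question that the model does not answer.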