-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not allow batch poll and rebalance to happen in parallel #176
Do not allow batch poll and rebalance to happen in parallel #176
Conversation
should this PR be against master or main? |
5d8f018
to
0c5ad2e
Compare
0c5ad2e
to
e00886e
Compare
@michaelglass |
I will brush my memory on how rebalance is happening actually happening and will look into this hopefully soon. But, just for clarification, why would it matter for the client whether it sees messages from already unassigned partitions? They'd be processed, perhaps even twice (by this instance and by the new one) so we don't seem to be losing messages? |
Thank you for taking a look!
Yep, agreed, we're not too worried about losing messages. For our particular use case processing a message twice would be fine, but processing a series of messages for a particular partition twice would be a problem. I.e., we'd be fine with this sequence of messages: We believe that by committing after every offset we should be able to get this guarantee from Kafka, but only if we can be sure we will stop processing messages for a partition when we loose it as an assignment. |
If we allow a batch poll and rebalance event to run in parallel then there is no way to tell if a poll that starts before a rebalance and returns after the rebalance completes returns messages from the old assigned partition set or the new one. The hw-kafka-client user will then not know whether they should disregard the returned message set or process it. This makes it so the batch message polling runs in the same lock the consumer event callback does. That way we can be sure a poll for messages and a rebalance event will never be running in parallel.
e00886e
to
c431dc8
Compare
Kafka client only triggers callbacks when one of the
For example, whe using the sync api, Heartbeats are an entirely separate mechanism to polls, they are happening in the backgrounds constantly. The defaults based on https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md is that client sends heartbeat every 3 seconds, while broker treats client as gone after 45 seconds. But this is unrelated to rebalance callback. (While here, a timeout setting does related to poll/rebalance is I believe a good practice is to just rely on (In hw-kafka-client speak - I think the sync API should be preferred for explicit control. Note that message fetching still happens in the background by rdkafka, so the sync API is not slower) |
Our understanding
The rebalance callback happens in two threads:
is that right? if not this PR makes no sense!
This ensures that the async callback waits for a poll to complete before calling.
Problem
We were seeing poll responses with messages from partitions that had been recently revoked. This PR helped (minimize but not completely resolve) those cases.
Some context
If we allow a batch poll and rebalance event to run in parallel then
there is no way to tell if a poll that starts before a rebalance and
returns after the rebalance completes returns messages from the old
assigned partition set or the new one. The hw-kafka-client user will
then not know whether they should disregard the returned message set or
process it.
The solution
This makes it so the batch message polling runs in the same lock the
consumer event callback does. That way we can be sure a poll for
messages and a rebalance event will never be running in parallel.