
Do not allow batch poll and rebalance to happen in parallel #176

Merged: 1 commit into haskell-works:main on Oct 23, 2021

Conversation

@michaelglass (Contributor) commented on May 10, 2021

Our understanding

The rebalance callback happens in two threads:

  • asynchronously (via heartbeat?)
  • synchronously (before the poll happens).

Is that right? If not, this PR makes no sense!

This PR ensures that the asynchronous callback waits for any in-flight poll to complete before running.

Problem

We were seeing poll responses with messages from partitions that had been recently revoked. This PR helped minimize, but not completely resolve, those cases.

Some context

If we allow a batch poll and a rebalance event to run in parallel, then there is no way to tell whether a poll that starts before a rebalance and returns after the rebalance completes contains messages from the old assigned partition set or the new one. The hw-kafka-client user then cannot know whether to disregard the returned message set or process it.

The solution

This makes the batch message polling run under the same lock as the consumer event callback. That way we can be sure a poll for messages and a rebalance event will never run in parallel.
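To make the mechanism concrete, here is a minimal sketch of the locking idea, not the actual hw-kafka-client internals: the names PollGuard, pollBatchLocked, and rebalanceLocked are hypothetical, and the only point is that the batch poll and the rebalance callback share one MVar, so they serialize against each other.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, withMVar)

-- Hypothetical guard shared by the batch poll and the rebalance callback.
newtype PollGuard = PollGuard (MVar ())

newPollGuard :: IO PollGuard
newPollGuard = PollGuard <$> newMVar ()

-- The batch poll holds the lock for the whole duration of the poll.
pollBatchLocked :: PollGuard -> IO [msg] -> IO [msg]
pollBatchLocked (PollGuard lock) poll = withMVar lock (const poll)

-- The rebalance callback takes the same lock, so it waits for any
-- in-flight poll to finish before partitions are revoked or assigned.
rebalanceLocked :: PollGuard -> IO () -> IO ()
rebalanceLocked (PollGuard lock) callback = withMVar lock (const callback)
```

Under this arrangement a poll that has already started finishes, and its result is delivered against the old assignment, before the revoke runs; that is exactly the ordering the PR relies on.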

@michaelglass (Contributor, Author) commented on May 12, 2021

should this PR be against master or main?

@ghost force-pushed the poll-batch-or-rebalance branch from 5d8f018 to 0c5ad2e on May 12, 2021 14:26
@michaelglass changed the base branch from main to master on May 12, 2021 14:27
@michaelglass changed the base branch from master to main on May 18, 2021 08:22
@michaelglass force-pushed the poll-batch-or-rebalance branch from 0c5ad2e to e00886e on May 18, 2021 08:23
@AlexeyRaga (Member) commented

@michaelglass main, everything is main now. Master has retired and no more masters in this brave new world...

@AlexeyRaga (Member) commented

I will brush up my memory on how the rebalance actually happens and will look into this, hopefully soon.

But, just for clarification, why would it matter to the client whether it sees messages from already-unassigned partitions? They'd be processed, perhaps even twice (by this instance and by the new one), so we don't seem to be losing messages?

@ghost commented on May 18, 2021

Thank you for taking a look!

> They'd be processed, perhaps even twice (by this instance and by the new one), so we don't seem to be losing messages?

Yep, agreed, we're not too worried about losing messages.

For our particular use case processing a message twice would be fine, but processing a series of messages for a particular partition twice would be a problem.

I.e., we'd be fine with this sequence of messages: A B B C C D
But not with this one: A B C B C D

We believe that by committing after every offset we should be able to get this guarantee from Kafka, but only if we can be sure we will stop processing messages for a partition when we lose it as an assignment.
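For illustration, here is a sketch of the commit-after-every-offset pattern described above, assuming hw-kafka-client's Kafka.Consumer API (pollMessage, commitOffsetMessage); error handling and shutdown are elided, and processRecord stands in for the application's handler:

```haskell
import Data.ByteString (ByteString)
import Kafka.Consumer

-- Poll one message at a time and commit its offset synchronously before
-- polling the next one, so after a rebalance a partition can only be
-- replayed from its last committed offset.
processLoop
  :: KafkaConsumer
  -> (ConsumerRecord (Maybe ByteString) (Maybe ByteString) -> IO ())
  -> IO ()
processLoop consumer processRecord = do
  res <- pollMessage consumer (Timeout 1000)
  case res of
    Left _err -> pure ()   -- timeouts and transient errors ignored here
    Right rec -> do
      processRecord rec
      _ <- commitOffsetMessage OffsetCommit consumer rec
      pure ()
  processLoop consumer processRecord
```

This only yields the A B B C C D behaviour (rather than A B C B C D) if a revoke really cannot interleave with an in-flight poll, which is what the locking in this PR is about.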

@ghost force-pushed the poll-batch-or-rebalance branch from e00886e to c431dc8 on May 25, 2021 09:03
@robinp commented on Sep 26, 2021

The Kafka client only triggers callbacks when one of the poll-like methods is invoked.

These callbacks will [in addition to rd_kafka_poll] also be triggered by rd_kafka_flush(), rd_kafka_consumer_poll(), and any other functions that serve queues. (from the excellent https://github.com/edenhill/librdkafka/blob/master/INTRODUCTION.md)

For example, when using the sync API, rd_kafka_consumer_poll will be called, and the callbacks trigger at that time. The async API also schedules background rd_kafka_poll calls, which can also trigger the rebalance callback; this is probably where you see a race.

Heartbeats are an entirely separate mechanism from polls; they happen constantly in the background. The defaults, per https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md, are that the client sends a heartbeat every 3 seconds, while the broker treats the client as gone after 45 seconds. But this is unrelated to the rebalance callback.

(While here, a timeout setting that does relate to poll/rebalance is max.poll.interval.ms, whose default is 5 minutes. The client should poll for messages at least this frequently, otherwise it might be kicked out of the group without ever receiving the rebalance callback: if it doesn't poll, there is no way to deliver the callback.)

I believe good practice is to just rely on rd_kafka_consumer_poll, so you get events together with messages, and not do any rd_kafka_poll in the background. If you have long-running message processing that exceeds the 5 minutes, adjust the batch sizes or cut down the message processing time instead. Background polls for events should be a last resort.

(In hw-kafka-client speak: I think the sync API should be preferred, for explicit control. Note that message fetching still happens in the background in rdkafka, so the sync API is not slower.)
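As a rough sketch of that recommendation in hw-kafka-client terms (the exact property helpers here are assumptions from memory; in particular callbackPollMode/CallbackPollModeSync and extraProp may differ by version, so check the Kafka.Consumer docs):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Kafka.Consumer

-- Assumed ConsumerProperties: deliver callbacks from the consumer's own
-- poll (sync mode) instead of a background poll thread, and make the
-- max.poll.interval.ms budget explicit.
consumerProps :: ConsumerProperties
consumerProps =
     brokersList [BrokerAddress "localhost:9092"]
  <> groupId (ConsumerGroupId "example-group")
  <> callbackPollMode CallbackPollModeSync      -- assumed setter name
  <> extraProp "max.poll.interval.ms" "300000"  -- librdkafka default, 5 minutes
```

The intent is simply what the comment above describes: one loop calls pollMessage, and both messages and rebalance events are delivered from that loop.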

@AlexeyRaga AlexeyRaga merged commit 5315765 into haskell-works:main Oct 23, 2021