
Do not allow batch poll and rebalance to happen in parallel #176

Merged: 1 commit into haskell-works:main on Oct 23, 2021

Conversation

@michaelglass (Contributor) commented on May 10, 2021

Our understanding

The rebalance callback happens in two threads:

  • asynchronously (via heartbeat?)
  • synchronously (before the poll happens).

Is that right? If not, this PR makes no sense!

This PR ensures that the asynchronous callback waits for any in-flight poll to complete before running.

Problem

We were seeing poll responses with messages from partitions that had been recently revoked. This PR helped minimize, but not completely resolve, those cases.

Some context

If we allow a batch poll and a rebalance event to run in parallel, then there is no way to tell whether a poll that starts before a rebalance and returns after the rebalance completes contains messages from the old assigned partition set or the new one. The hw-kafka-client user then cannot know whether to disregard the returned message set or process it.

The solution

This makes the batch message polling run under the same lock as the consumer event callback. That way we can be sure a poll for messages and a rebalance event will never run in parallel.
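To make the mechanism concrete, here is a minimal sketch of the locking idea, not the actual hw-kafka-client internals: the names PollGuard, pollBatchLocked, and rebalanceLocked are hypothetical, and the only point is that the batch poll and the rebalance callback share one MVar, so they serialize against each other.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, withMVar)

-- Hypothetical guard shared by the batch poll and the rebalance callback.
newtype PollGuard = PollGuard (MVar ())

newPollGuard :: IO PollGuard
newPollGuard = PollGuard <$> newMVar ()

-- The batch poll holds the lock for the whole duration of the poll.
pollBatchLocked :: PollGuard -> IO [msg] -> IO [msg]
pollBatchLocked (PollGuard lock) poll = withMVar lock (const poll)

-- The rebalance callback takes the same lock, so it waits for any
-- in-flight poll to finish before partitions are revoked or assigned.
rebalanceLocked :: PollGuard -> IO () -> IO ()
rebalanceLocked (PollGuard lock) callback = withMVar lock (const callback)
```

Under this arrangement a poll that has already started finishes, and its result is delivered against the old assignment, before the revoke runs; that is exactly the ordering the PR relies on.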

@michaelglass (Contributor, Author) commented on May 12, 2021

should this PR be against master or main?

@ghost force-pushed the poll-batch-or-rebalance branch from 5d8f018 to 0c5ad2e on May 12, 2021 14:26
@michaelglass changed the base branch from main to master on May 12, 2021 14:27
@michaelglass changed the base branch from master to main on May 18, 2021 08:22
@michaelglass force-pushed the poll-batch-or-rebalance branch from 0c5ad2e to e00886e on May 18, 2021 08:23
@AlexeyRaga (Member) commented

@michaelglass main, everything is main now. Master has retired and no more masters in this brave new world...

@AlexeyRaga (Member) commented

I will brush up my memory on how the rebalance actually happens and will look into this, hopefully soon.

But, just for clarification, why would it matter to the client whether it sees messages from already-unassigned partitions? They'd be processed, perhaps even twice (by this instance and by the new one), so we don't seem to be losing messages?

@ghost commented on May 18, 2021

Thank you for taking a look!

> They'd be processed, perhaps even twice (by this instance and by the new one), so we don't seem to be losing messages?

Yep, agreed, we're not too worried about losing messages.

For our particular use case processing a message twice would be fine, but processing a series of messages for a particular partition twice would be a problem.

I.e., we'd be fine with this sequence of messages: A B B C C D
But not with this one: A B C B C D

We believe that by committing after every offset we should be able to get this guarantee from Kafka, but only if we can be sure we will stop processing messages for a partition when we lose it as an assignment.
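For illustration, here is a sketch of the commit-after-every-offset pattern described above, assuming hw-kafka-client's Kafka.Consumer API (pollMessage, commitOffsetMessage); error handling and shutdown are elided, and processRecord stands in for the application's handler:

```haskell
import Data.ByteString (ByteString)
import Kafka.Consumer

-- Poll one message at a time and commit its offset synchronously before
-- polling the next one, so after a rebalance a partition can only be
-- replayed from its last committed offset.
processLoop
  :: KafkaConsumer
  -> (ConsumerRecord (Maybe ByteString) (Maybe ByteString) -> IO ())
  -> IO ()
processLoop consumer processRecord = do
  res <- pollMessage consumer (Timeout 1000)
  case res of
    Left _err -> pure ()   -- timeouts and transient errors ignored here
    Right rec -> do
      processRecord rec
      _ <- commitOffsetMessage OffsetCommit consumer rec
      pure ()
  processLoop consumer processRecord
```

This only yields the A B B C C D behaviour (rather than A B C B C D) if a revoke really cannot interleave with an in-flight poll, which is what the locking in this PR is about.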

@ghost force-pushed the poll-batch-or-rebalance branch from e00886e to c431dc8 on May 25, 2021 09:03
@robinp commented on Sep 26, 2021

The Kafka client only triggers callbacks when one of the poll-like methods is invoked.

These callbacks will [in addition to rd_kafka_poll] also be triggered by rd_kafka_flush(), rd_kafka_consumer_poll(), and any other functions that serve queues. (from the excellent https://github.com/edenhill/librdkafka/blob/master/INTRODUCTION.md)

For example, when using the sync API, rd_kafka_consumer_poll will be called, and the callbacks trigger at that time. The async API also schedules background rd_kafka_poll calls, which can also trigger the rebalance callback; this is probably where you see a race.

Heartbeats are an entirely separate mechanism from polls; they happen constantly in the background. The defaults, per https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md, are that the client sends a heartbeat every 3 seconds, while the broker treats the client as gone after 45 seconds. But this is unrelated to the rebalance callback.

(While here, a timeout setting that does relate to poll/rebalance is max.poll.interval.ms, whose default is 5 minutes. The client should poll for messages at least this frequently, otherwise it might be kicked out of the group without ever receiving the rebalance callback: if it doesn't poll, there is no way to deliver the callback.)

I believe good practice is to just rely on rd_kafka_consumer_poll, so you get events together with messages, and not do any rd_kafka_poll in the background. If you have long-running message processing that exceeds the 5 minutes, adjust the batch sizes or cut down the message processing time instead. Background polls for events should be a last resort.

(In hw-kafka-client speak: I think the sync API should be preferred, for explicit control. Note that message fetching still happens in the background in rdkafka, so the sync API is not slower.)
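As a rough sketch of that recommendation in hw-kafka-client terms (the exact property helpers here are assumptions from memory; in particular callbackPollMode/CallbackPollModeSync and extraProp may differ by version, so check the Kafka.Consumer docs):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Kafka.Consumer

-- Assumed ConsumerProperties: deliver callbacks from the consumer's own
-- poll (sync mode) instead of a background poll thread, and make the
-- max.poll.interval.ms budget explicit.
consumerProps :: ConsumerProperties
consumerProps =
     brokersList [BrokerAddress "localhost:9092"]
  <> groupId (ConsumerGroupId "example-group")
  <> callbackPollMode CallbackPollModeSync      -- assumed setter name
  <> extraProp "max.poll.interval.ms" "300000"  -- librdkafka default, 5 minutes
```

The intent is simply what the comment above describes: one loop calls pollMessage, and both messages and rebalance events are delivered from that loop.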

@AlexeyRaga AlexeyRaga merged commit 5315765 into haskell-works:main Oct 23, 2021