consumer.consume return unassigned messages if rebalance happens during the call #1013

Closed

olejorgenb opened this issue Jan 6, 2021 · 2 comments

@olejorgenb
Description

When asking for a large number of messages using consumer.consume with a large timeout, unassigned messages are returned if a rebalance happens during that consume call.

How to reproduce

import typing as ty
import socket
import confluent_kafka as ck

def create_bare_ck_consumer(
        topic: str,
        uri: str,
        group: str,
        max_poll_interval_ms: int,
        auto_offset_reset: str = "latest",
):
    topics = [topic]

    params = {
        "bootstrap.servers": uri,
        "enable.auto.commit": False,
        "auto.offset.reset": auto_offset_reset,
        "session.timeout.ms": 30*1000,
        "client.id": f"{socket.gethostname()}@rdkafka-{ck.__version__}",
        "group.id": group,
        "max.poll.interval.ms": max_poll_interval_ms
        # "debug": "consumer,cgrp",  # https://github.com/edenhill/librdkafka/blob/master/INTRODUCTION.md#debug-contexts
    }

    kafka_consumer = ck.Consumer(
        **params,
    )

    def on_assigned(consumer: ck.Consumer, assigned):
        consumer.assign(assigned)
        print(f"We were assigned {len(assigned)} partitions.")
        print("Rebalance finished (on_assigned)")

    def on_revoked(*args):
        print("Rebalance started (on_revoked)")

    kafka_consumer.subscribe(
        topics,
        on_assigned,
        on_revoked,
    )

    return kafka_consumer

def main():
    print(f"{ck.version()=}")
    print(f"{ck.libversion()=}")

    consumer = create_bare_ck_consumer(
        topic="the-topic",
        uri="localhost:9092",
        group="the-group",
        max_poll_interval_ms=7 * 60 * 1000
    )
    try:
        msgs = []
        msgs_without_errors = []
        while len(msgs) == 0:
            print("Start consuming")
            msgs = consumer.consume(3000, timeout=10)
            print(f"Done consuming -> {len(msgs)}")

            msgs_without_errors: ty.List[ck.Message] = []

            for msg in msgs:
                if (error := msg.error()) is not None:
                    print(f"Got error when polling: {error.code()}")
                else:
                    msgs_without_errors.append(msg)

        assignment = consumer.assignment()
        print("Assignments:")
        for tp in assignment:
            print(tp.topic, tp.partition)

        for msg in msgs_without_errors:
            if ck.TopicPartition(msg.topic(), msg.partition()) not in assignment:
                print("Received unassigned message!", msg.topic(), msg.partition())
    finally:
        if consumer:
            consumer.close()


if __name__ == '__main__':
    main()

When starting two of the above processes in fairly rapid succession, the first is assigned all 128 partitions, and then a rebalance is triggered before consume has returned any messages. On my machine there needs to be a sufficient number of messages available to trigger the behavior.

The output of the first process is:

ck.version()=('1.5.0', 17104896)
ck.libversion()=('1.5.0', 17105151)
Start consuming
We were assigned 128 partitions.
Rebalance finished (on_assigned)
Rebalance started (on_revoked)
We were assigned 64 partitions.
Rebalance finished (on_assigned)
Done consuming -> 1654
Assignments:
the-topic 0
the-topic 1
the-topic 2
...
the-topic 61
the-topic 62
the-topic 63
Received unassigned message! the-topic 64
Received unassigned message! the-topic 64
Received unassigned message! the-topic 64
Received unassigned message! the-topic 64
Received unassigned message! the-topic 64
Received unassigned message! the-topic 65
Received unassigned message! the-topic 65
Received unassigned message! the-topic 65
Received unassigned message! the-topic 65
...

According to #435 (comment), the internal queue should be cleared on rebalance.

The same result happens if I remove the assign call in the on_assigned callback (it is handled internally now).

rkqu is not filtered after rd_kafka_consume_batch_queue completes, and I guess librdkafka doesn't touch that queue on assign:

n = (Py_ssize_t)rd_kafka_consume_batch_queue(rkqu,

If this is not considered a bug, the behavior should at least be documented. Or is it something I have overlooked?
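
In the meantime, a possible workaround (just a sketch based on the repro above, reusing its ck and ty imports; drop_unassigned is a helper name I made up, not library API) is to re-check the assignment after consume() returns and discard messages from partitions that are no longer assigned:

def drop_unassigned(
        consumer: ck.Consumer,
        msgs: ty.List[ck.Message],
) -> ty.List[ck.Message]:
    # Snapshot the assignment *after* consume() has returned, so a rebalance
    # that happened during the call is already reflected in it.
    assigned = {(tp.topic, tp.partition) for tp in consumer.assignment()}
    return [
        msg for msg in msgs
        if msg.error() is None
        and (msg.topic(), msg.partition()) in assigned
    ]

This only avoids processing messages this consumer no longer owns; the new owner of the revoked partitions will re-read them from the last committed offset, and another rebalance can still race with the assignment() call.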

PS: A similar issue was reported for node-rdkafka, another librdkafka wrapper: Blizzard/node-rdkafka#638 (comment)

Checklist

Please provide the following information:

  • confluent-kafka-python and librdkafka version: ck.version()=('1.5.0', 17104896), ck.libversion()=('1.5.0', 17105151)
  • Apache Kafka broker version: 2.4.1 (zookeeper: 3.6.0)
  • Client configuration: (see description)
  • Operating system: linux
  • Provide client logs (with 'debug': '..' as necessary)
  • Provide broker log excerpts
  • Critical issue
@edenhill
Contributor

Yeah, this is a known issue with the batch consume interface and a reason why we recommend not using it.
It should really be fixed.
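
For reference, a minimal sketch of consuming via the single-message poll() loop instead of the batch interface (the poll_batch helper and its parameters are illustrative, not from this thread):

import time
import typing as ty

import confluent_kafka as ck

def poll_batch(consumer: ck.Consumer, max_messages: int, timeout: float) -> ty.List[ck.Message]:
    # Collect up to max_messages within `timeout` seconds using poll(), which
    # returns at most one message per call and serves the rebalance callbacks
    # in between, so a rebalance takes effect between individual messages.
    msgs: ty.List[ck.Message] = []
    deadline = time.monotonic() + timeout
    while len(msgs) < max_messages:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        msg = consumer.poll(remaining)
        if msg is None:
            break
        msgs.append(msg)
    return msgs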

@jliunyu
Contributor

jliunyu commented Apr 21, 2021

The fix was merged.
