
Lost messages in plainPartitionedSource on partition reassignment #382

Closed · edrevo opened this issue Dec 18, 2017 · 12 comments

@edrevo (Contributor) commented Dec 18, 2017

I am currently seeing the following behavior, using reactive-kafka 0.18 (a sketch of this setup follows the list):

  1. I am sending 2 messages to a Kafka topic, which fall in two different partitions (partition 0 and partition 3)
  2. A Kafka plainPartitionedSource is created and joins the consumer group at 22:51:04
  3. A second Kafka plainPartitionedSource is created and joins the consumer group at 22:51:13
  4. A partition rebalance happens between 22:51:28 and 22:51:34 (I see several "RequestMessages from topic/partition X already registered by other stage" warnings)
  5. Partition 0 stays with the first consumer
  6. Partition 3 moves to the second consumer
  7. The message to partition 0 is never emitted by the Source
  8. The message to partition 3 is correctly emitted by the second consumer
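For reference, a minimal sketch of that setup. The topic name, group id, bootstrap servers, and wiring are illustrative assumptions, not details from the report:

    import akka.actor.ActorSystem
    import akka.kafka.{ConsumerSettings, Subscriptions}
    import akka.kafka.scaladsl.Consumer
    import akka.stream.ActorMaterializer
    import akka.stream.scaladsl.Sink
    import org.apache.kafka.common.serialization.StringDeserializer

    implicit val system: ActorSystem = ActorSystem("issue-382")
    implicit val mat: ActorMaterializer = ActorMaterializer()

    val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("issue-382-group")

    // Each call creates one member of the consumer group; the second join
    // triggers the rebalance in step 4.
    def startConsumer(): Consumer.Control =
      Consumer
        .plainPartitionedSource(settings, Subscriptions.topics("test-topic"))
        .flatMapMerge(breadth = 16, { case (_, partitionSource) => partitionSource })
        .to(Sink.foreach(record => println(record.value)))
        .run()

    val first = startConsumer()   // joins at 22:51:04, initially owns all partitions
    val second = startConsumer()  // joins at 22:51:13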

Upon inspecting the reactive-kafka's source code, here is what I think is happening in the first consumer:

  1. All partitions are assigned to the consumer, which triggers partitionAssignedCB (SubSourceLogic:78), which in turn pumps and emits a Source for each partition
  2. The Source for partition 0 receives a pull, which sends a RequestMessages to the Kafka consumer actor (SubSourceLogic:235), but has not yet received the messages associated with that request
  3. The new consumer joins the consumer group, and Kafka first revokes all partitions and then assigns the new ones
  4. Revoking all partitions causes the Source for partition 0 to be cancelled (SubSourceLogic:99)
  5. The Source for partition 0 calls completeStage (SubSourceLogic:222), regardless of whether it was mid-request (requested = true) or not
  6. The Kafka actor performs a poll for partition 0, since it has an outstanding request for it, and sends the resulting messages to the Source that was just closed (KafkaConsumerActor:346)

This causes the Kafka consumer to mark the message in partition 0 as received, but nobody actually received it.

Does this sound like a reasonable explanation? cc @13h3r, @elkozmon, @patriknw, @rgcase since you are authors of the SubSourceLogic file.

I am happy to provide a PR to fix this, but I'm not sure what the best solution is (I'm pretty new to Kafka and Akka Streams). Is it reasonable to emit all pending messages before closing the Source, or is that a big no-no?
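To make the question concrete, here is a self-contained toy stage (a sketch, not the actual SubSourceLogic code) showing what "emit all pending messages before closing" could look like at the Akka Streams level, using GraphStageLogic.emitMultiple to flush a buffer and complete only afterwards:

    import akka.stream.{Attributes, Outlet, SourceShape}
    import akka.stream.stage.{GraphStage, GraphStageLogic, OutHandler}

    // A toy source holding already-received elements. Instead of calling
    // completeStage() immediately (and dropping them), it emits the buffered
    // elements as demand arrives and completes afterwards.
    class DrainBeforeComplete[T](buffered: List[T]) extends GraphStage[SourceShape[T]] {
      val out: Outlet[T] = Outlet("DrainBeforeComplete.out")
      override val shape: SourceShape[T] = SourceShape(out)

      override def createLogic(attrs: Attributes): GraphStageLogic =
        new GraphStageLogic(shape) {
          setHandler(out, new OutHandler {
            override def onPull(): Unit =
              // emitMultiple pushes each buffered element as it is pulled,
              // then runs the callback - here, completing the stage.
              emitMultiple(out, buffered.iterator, () => completeStage())
          })
        }
    }

Running Source.fromGraph(new DrainBeforeComplete(List(1, 2, 3))).runWith(Sink.foreach(println)) prints all three buffered elements before the stream completes, rather than losing them.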

@rafalmag (Contributor) commented Jan 5, 2018

@edrevo Could you please create an automated test that reproduces this issue?

@edrevo (Author) commented Jan 9, 2018

I don't think so, unfortunately. The test case would need very fine-grained control over the execution timing of each actor, and I wouldn't know how to control that in Akka / Akka Streams.

@rafalmag (Contributor) commented
Could you try to make a branch with changes in the "main" code - adding Thread.sleeps or, even better, latches - just to demonstrate the issue? Or maybe just post the debug logs from the client when the issue actually happens?
If the issue is reproducible, it would be a huge problem, as the "at least once" guarantee would be gone.

I was thinking about using plainPartitionedSource and coupling it with custom transactional producers to try to achieve "exactly once" Kafka-to-Kafka processing (something similar to Kafka Streams - https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/).
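For what it's worth, later Alpakka Kafka versions ship a Transactional API for exactly this Kafka-to-Kafka pattern (it did not exist in 0.18). A rough sketch, with topic names, servers, and the transactional id as placeholders:

    import akka.actor.ActorSystem
    import akka.kafka.{ConsumerSettings, ProducerMessage, ProducerSettings, Subscriptions}
    import akka.kafka.scaladsl.Transactional
    import org.apache.kafka.clients.producer.ProducerRecord
    import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}

    implicit val system: ActorSystem = ActorSystem("eos")

    val consumerSettings =
      ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
        .withBootstrapServers("localhost:9092")
        .withGroupId("eos-group")

    val producerSettings =
      ProducerSettings(system, new StringSerializer, new StringSerializer)
        .withBootstrapServers("localhost:9092")

    Transactional
      .source(consumerSettings, Subscriptions.topics("source-topic"))
      .map { msg =>
        // The consumed offset travels with the record and is committed in the
        // same Kafka transaction as the write below.
        ProducerMessage.single(
          new ProducerRecord("sink-topic", msg.record.key, msg.record.value),
          msg.partitionOffset)
      }
      .runWith(Transactional.sink(producerSettings, "eos-example-transactional-id"))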

edrevo added commits to edrevo/reactive-kafka that referenced this issue Jan 11, 2018
@edrevo (Author) commented Jan 11, 2018

@rafalmag here is a very dirty repro of the problem. To run it:

git clone git@github.com:edrevo/reactive-kafka.git
cd reactive-kafka
git checkout issue-382
sbt "testOnly *MyTest*"

Keep in mind that the issue won't happen in every run of the tests, since it is a timing issue and the test isn't forcing any specific timing.

@asflierl commented
I think I'm encountering the same issue using Consumer.committablePartitionedSource.

@ennru (Member) commented May 23, 2018

@edrevo I've tried the tests in your dirty repo with current code - both go green. Are they flaky or did something improve already?

@edrevo (Author) commented May 23, 2018

@ennru, the timing conditions needed to repro the bug are quite tricky, so the test case is just "probabilistic": it sets up the scenario for the bug to appear, but the bug may or may not show up. I've run it on several machines, and reproducing the bug with that test is hit or miss.

ennru pushed a commit that referenced this issue May 29, 2018
Mitigate the problem of closing partitions with pending requests. Related to #382.
@asflierl commented Jun 8, 2018

FWIW, I am no longer running into lost messages around partition rebalancing in my tests with Consumer.committablePartitionedSource using 0.21.

@ennru (Member) commented Jun 8, 2018

Thanks for reporting, I hope others experience the same!

@GrigorievNick commented Aug 10, 2018

Hi, I have a very similar issue, which can be reproduced simply by closing the downstream of a per-partition source:

    // Assumes a Kafka `producer`, a `topic`, an implicit ActorSystem `system`
    // and a materializer are already in scope (as in the original test).
    import java.util.concurrent.TimeUnit
    import scala.concurrent.duration.FiniteDuration
    import akka.kafka.{ConsumerSettings, Subscriptions}
    import akka.kafka.scaladsl.Consumer
    import akka.stream.scaladsl.Sink
    import org.apache.kafka.clients.producer.ProducerRecord

    (0 to 100000).foreach(i => producer.send(new ProducerRecord(topic, i.toString)))
    producer.flush()

    Consumer
      .committablePartitionedSource[String, String](
        ConsumerSettings(system, None, None), // deserializers are taken from config
        Subscriptions.topics(topic))
      .log(topic)
      .delay(FiniteDuration(10, TimeUnit.MILLISECONDS))
      .runForeach {
        case (tp, source) =>
          source
            .map(_.record.offset())
            .log(tp.toString)
            // .map(_ => throw new IllegalArgumentException) // failing shows the same loss
            .take(10) // completes each per-partition source after 10 elements
            .runWith(Sink.ignore)
      }
    Thread.sleep(30000)

In the log you will see that every time a sub-source completes, the next partitioned source starts from offset max.poll.records + 1 when max.poll.records > 10.

If you uncomment the map that throws the exception, the situation is the same, except that the 10 messages are not read first.

I am using version 0.22 of the library.
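(To make the offset jump easier to observe, the poll batch size can be tuned on the consumer settings; the value below is an arbitrary example, not from the report:)

    import org.apache.kafka.clients.consumer.ConsumerConfig

    val settings = ConsumerSettings(system, None, None)
      .withProperty(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "50") // poll batches of 50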

@GrigorievNick commented
Today I checked version 1.0.4, and I can confirm that the issue is fixed.
I am also 99% sure that "Consumer might skip offsets" #336 is fixed as well.

ennru added this to the 1.0-M1 milestone Jul 3, 2019
@ennru (Member) commented Jul 3, 2019

Thank you for reporting this. So the fix from #589 solved your case.
Closing.
