[improve][broker] Unblock stuck Key_Shared subscription after consumer reconnect #21396

ghost · 2023-10-19T05:49:09Z

Motivation

A Key_Shared subscription can get stuck when a message is left unacknowledged because of a processing error.

The setup is:

14 consumers listen to a non-partitioned Key_Shared topic A;
if any of them encounters a corrupted message, it restarts after some time, so for a short period there are only 13 consumers;
consumers are also disconnected in a similar fashion during a rolling restart;
batching is disabled on the producer and on the consumer (enableBatchIndexAcknowledgment = false).

The flow is:

1. the producer sends a corrupted message (message 1) to the topic;
2. the producer sends correct messages (message 2 and message 3) to the topic;
3. consumer 1 fails to process the corrupted message 1;
4. the corrupted message 1 is recorded for replay (PersistentDispatcherMultipleConsumers#redeliveryMessages);
5. consumer 1 keeps processing further messages for some time (e.g. message 2 is processed successfully);
6. after some time consumer 1 stops;
7. consumer 2 picks up the corrupted message 1 from the replay set (PersistentDispatcherMultipleConsumers#readMoreEntries) and fails too;
8. consumer 1 starts up again and becomes a recently joined consumer (PersistentStickyKeyDispatcherMultipleConsumers#recentlyJoinedConsumers);
9. consumer 1 waits for message 1 to be acknowledged by any consumer so that it is removed from the recently joined consumers (PersistentStickyKeyDispatcherMultipleConsumers#removeConsumersFromRecentJoinedConsumers) and can receive message 3;
10. steps 4-9 repeat for every consumer until message 1 is acknowledged, so none of the restarted consumers receives any messages.

As a result, leaving message 1 unacknowledged, followed by a restart of consumer 1, leads to the topic being fully blocked.
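
The blocking described in the flow above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: positions are plain numbers, and the class and method names (RecentlyJoinedModel, join, canDispatch, ack) are hypothetical and are not the broker's actual code. It assumes that a recently joined consumer may only receive messages published before the read position at which it joined, and that it leaves the recently joined set once the mark-delete position catches up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

/** Toy model only: not Pulsar's dispatcher. All names here are hypothetical. */
final class RecentlyJoinedModel {
    // Highest position such that it and every earlier position is acknowledged.
    private long markDeletePosition = 0;
    // Acknowledgments that arrived ahead of the mark-delete position.
    private final TreeSet<Long> individuallyAcked = new TreeSet<>();
    // Consumer name -> read position at the moment the consumer joined.
    private final Map<String, Long> recentlyJoined = new LinkedHashMap<>();

    void join(String consumer, long readPositionAtJoin) {
        // A consumer is restricted only if unacked messages exist before the read position.
        if (readPositionAtJoin > markDeletePosition + 1) {
            recentlyJoined.put(consumer, readPositionAtJoin);
        }
    }

    boolean canDispatch(String consumer, long position) {
        Long joinedAt = recentlyJoined.get(consumer);
        // Restricted consumers only get messages older than their join position.
        return joinedAt == null || position < joinedAt;
    }

    void ack(long position) {
        individuallyAcked.add(position);
        // Advance the mark-delete position over any contiguous run of acks.
        while (individuallyAcked.remove(markDeletePosition + 1)) {
            markDeletePosition++;
        }
        final long next = markDeletePosition + 1;
        // Release consumers whose join position has been reached.
        recentlyJoined.values().removeIf(joinedAt -> joinedAt <= next);
    }

    public static void main(String[] args) {
        RecentlyJoinedModel topic = new RecentlyJoinedModel();
        topic.ack(2);                // message 2 is processed, message 1 stays unacked
        topic.join("consumer-1", 3); // consumer 1 restarts before message 3 is read
        System.out.println(topic.canDispatch("consumer-1", 3)); // blocked while message 1 is unacked
        topic.ack(1);                // some consumer finally acknowledges message 1
        System.out.println(topic.canDispatch("consumer-1", 3)); // no longer restricted
    }
}
```

In this model a single unacknowledged message keeps the mark-delete position from advancing, so every consumer that rejoins afterwards stays restricted. That matches the stuck behaviour described above.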

I checked the code, and I suppose it is implemented this way to preserve message ordering.

Modifications

I used MessageRedeliveryController to track whether the not-acked message has already been re-sent to a consumer. Until it has been, I block sending other messages with the same key hash. Once the not-acked message is sent, the other messages are allowed to be sent as well.
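
The idea can be sketched roughly as follows. This is an illustration only: PendingReplayTracker and its methods are hypothetical names, not the real MessageRedeliveryController API, and the sketch tracks only key hashes rather than ledger and entry positions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not the MessageRedeliveryController API. */
final class PendingReplayTracker {
    // Key hash -> number of not-acked messages still waiting to be re-sent.
    private final Map<Integer, Integer> pendingByKeyHash = new HashMap<>();

    /** Record that a not-acked message with this key hash must be replayed. */
    void markForReplay(int keyHash) {
        pendingByKeyHash.merge(keyHash, 1, Integer::sum);
    }

    /** Record that one replayed message with this key hash was sent to a consumer. */
    void markResent(int keyHash) {
        // Returning null from the remapping function removes the entry.
        pendingByKeyHash.computeIfPresent(keyHash, (k, n) -> n > 1 ? n - 1 : null);
    }

    /** New messages may be sent only when no replay is pending for their key hash. */
    boolean canSendNewMessage(int keyHash) {
        return !pendingByKeyHash.containsKey(keyHash);
    }

    public static void main(String[] args) {
        PendingReplayTracker tracker = new PendingReplayTracker();
        int keyHash = 42; // arbitrary example value
        tracker.markForReplay(keyHash);
        System.out.println(tracker.canSendNewMessage(keyHash)); // held back behind the replay
        tracker.markResent(keyHash);
        System.out.println(tracker.canSendNewMessage(keyHash)); // released once the replay is sent
    }
}
```

Messages with other key hashes are never held back in this sketch, so only the keys that share a hash with the not-acked message wait for the replay.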

Verifying this change

Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end deployment with large payloads (10MB)
Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

Documentation

doc
doc-required
doc-not-needed
doc-complete

Matching PR in forked repository

PR in forked repository: nborisov#1

lhotari · 2023-10-19T12:04:25Z

.../apache/pulsar/broker/service/persistent/PersistentStickyKeyDispatcherMultipleConsumers.java

+            entriesWithSameKey.stream()
+                    .filter(entryWithTheSameKey -> !entriesForC.contains(entryWithTheSameKey))
+                    .forEach(entryToReplay -> {


btw. In the Pulsar code base, the Java Streams API is avoided in performance hotspots to reduce GC pressure. I'm not sure if that helps in practice, but that's one reason why plain for loops are preferred. :)

ghost · 2023-10-20

Good note, refactored.
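
For illustration, the stream pipeline quoted in the review could be written as a plain loop along these lines. This is a sketch, not the refactored code from the PR: entries are modelled as strings, and because the body of the original forEach is not shown, collecting into a list stands in for it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative stand-in; the element type and the loop body are assumptions. */
final class ReplayLoopSketch {
    static List<String> entriesToReplay(List<String> entriesWithSameKey, Set<String> entriesForC) {
        List<String> toReplay = new ArrayList<>();
        // Plain for loop instead of stream().filter(...).forEach(...).
        for (String entry : entriesWithSameKey) {
            if (!entriesForC.contains(entry)) {
                toReplay.add(entry); // stand-in for the forEach body, which is not shown
            }
        }
        return toReplay;
    }

    public static void main(String[] args) {
        List<String> sameKey = List.of("e1", "e2", "e3");
        Set<String> forConsumer = Set.of("e2");
        System.out.println(entriesToReplay(sameKey, forConsumer));
    }
}
```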

liudezhi2098 · 2023-10-23T09:15:38Z

Will this change affect the ordering of consumption in the Key_Shared subscription mode?

ghost · 2023-11-10T19:03:59Z

The changes could break message ordering. Closing the MR.

github-actions bot added the doc-required label Oct 19, 2023

[improve][broker] Unblock stuck Key_Shared subscription after consumer reconnect

573700a

ghost force-pushed the unblock_stuck_keyshared_after_consumer_connect branch from 67a2b08 to 573700a Compare October 19, 2023 10:20

lhotari reviewed Oct 19, 2023

lhotari requested review from hangc0276, AnonHxy, codelipenghui and massakam October 19, 2023 12:04

lhotari assigned ghost Oct 19, 2023

Technoboy- added the ready-to-test label Oct 20, 2023

Ivan S added 2 commits October 20, 2023 08:07

[improve][broker] streams replaced with foreaches

7d1e4fe

[improve][broker] codestyle fixes

44a6a0f

ghost closed this Nov 10, 2023

ghost deleted the unblock_stuck_keyshared_after_consumer_connect branch November 10, 2023 19:04

This pull request was closed.
