[Bug] Subscription consumption stuck on consumer reconnect #21199
Comments
Based on my testing, the bug reproduces only when there are multiple consumers on the subscription. When there's a single consumer, everything works as expected.
The name of the field says it's "numberOfEntriesSinceFirstNotAckedMessage" (apparently readPosition - markDeletePosition). That suggests it counts both successfully processed and unprocessed messages since the first error.
@Technoboy-, what do you mean by "the owner broker dump file"? broker.conf?
I've just tested the unblockStuckSubscriptionEnabled=true flag and it doesn't help.
In-depth description of this issue.
Can normal consumption be restored after unloading the topic or restarting the owner broker?
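For anyone who wants to try that, here is a minimal sketch of triggering a topic unload with the Java admin client; the admin URL and topic name below are placeholders, not values from this issue:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class UnloadTopic {
    public static void main(String[] args) throws Exception {
        // Placeholder admin endpoint; substitute the owner broker's admin URL.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            // Unloading closes the topic on its current broker and lets it be
            // re-acquired, which recreates the dispatcher state for its subscriptions.
            admin.topics().unload("persistent://public/default/my-topic");
        }
    }
}
```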
@equanz @codelipenghui Do you think that the PIP-282 changes in #21406 and #21953 address this issue?
I think not. In the case from #21199 (comment): see lines 484 to 499 at commit 66e1a06.
I cannot conclude that this behavior is incorrect (isn't this part of the feature that preserves ordering?).
It seems that this issue might be addressed by the PIP-282 changes in #21953 together with other PRs: #23226 (merged) and #23231 (in progress).
I've created PIP-379: Key_Shared Draining Hashes for Improved Message Ordering as a proposal to address such issues. |
Search before asking
Version
Pulsar broker: 2.8.4
Java Pulsar client: 2.8.4
Minimal reproduce step
Non-partitioned topic. Batching is disabled on both the producer and the consumers. No acknowledge timeout. 5 subscriptions, each with 12 consumers.
One consumer of one subscription fails to process a message and doesn't ack it.
After a failure, I give the consumer one more minute to try to process other messages and ack them if they are processed successfully. After that minute, I recreate the consumer and try to reprocess the messages, which would help if the error was transient (a sketch of this setup is below).
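A minimal sketch of the consumer side described above, for illustration only: the service URL, topic, and subscription names are placeholders, the Key_Shared subscription type is an assumption based on the per-key stalling described below, and the producer (with batching disabled) is not shown.

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionType;

public class ConsumerRecreateSketch {

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder service URL
                .build();

        Consumer<byte[]> consumer = newConsumer(client);
        long firstFailureAt = -1;

        while (true) {
            // Use a receive timeout so the recreate check below runs even when idle.
            Message<byte[]> msg = consumer.receive(5, TimeUnit.SECONDS);
            if (msg != null) {
                try {
                    process(msg);              // application logic; may fail for one message
                    consumer.acknowledge(msg); // only successfully processed messages are acked
                } catch (Exception e) {
                    // Do not ack; keep consuming other messages for about a minute.
                    if (firstFailureAt < 0) {
                        firstFailureAt = System.currentTimeMillis();
                    }
                }
            }
            // Roughly one minute after the first failure, recreate the consumer so the
            // unacked message is redelivered (helps if the error was transient).
            if (firstFailureAt > 0
                    && System.currentTimeMillis() - firstFailureAt >= TimeUnit.MINUTES.toMillis(1)) {
                consumer.close();
                consumer = newConsumer(client);
                firstFailureAt = -1;
            }
        }
    }

    private static Consumer<byte[]> newConsumer(PulsarClient client) throws Exception {
        return client.newConsumer(Schema.BYTES)
                .topic("persistent://public/default/my-topic")  // placeholder topic
                .subscriptionName("sub-1")                      // one of the 5 subscriptions
                .subscriptionType(SubscriptionType.Key_Shared)  // assumption: keyed dispatch across 12 consumers
                // No ackTimeout is configured, matching "no acknowledge timeout" above.
                .subscribe();
    }

    private static void process(Message<byte[]> msg) {
        // placeholder for the real processing logic
    }
}
```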
What did you expect to see?
I expected to see the subscription backlog consumed further by the consumer with 1 failed message and by the other 11 consumers.
What did you see instead?
If a consumer fails to process one message, processing of all other messages with other keys stalls as well, including on the other 11 consumers of the subscription.
All the other subscriptions and their consumers of the topic continue processing as expected.
As a symptom, I see that the stuck subscription has "waitingReadOp" : false and "subscriptionHavePendingRead" : false, while the other subscriptions have these fields set to true (see the attached stats and the sketch after them).
stats.txt
stats-internal.txt
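For reference, a minimal sketch of pulling the same cursor fields with the Java admin client; the admin URL and topic name are placeholders:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistentTopicInternalStats;

public class PrintCursorStats {
    public static void main(String[] args) throws Exception {
        // Placeholder admin endpoint and topic name.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            PersistentTopicInternalStats stats =
                    admin.topics().getInternalStats("persistent://public/default/my-topic");
            // Each subscription is backed by a cursor; the stuck one shows
            // waitingReadOp=false while the healthy ones show true.
            stats.cursors.forEach((subscription, cursor) ->
                    System.out.println(subscription
                            + " waitingReadOp=" + cursor.waitingReadOp
                            + " subscriptionHavePendingRead=" + cursor.subscriptionHavePendingRead
                            + " numberOfEntriesSinceFirstNotAckedMessage="
                            + cursor.numberOfEntriesSinceFirstNotAckedMessage));
        }
    }
}
```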
Anything else?
The message rate is about 50 messages per second. The same scenario with only a few messages per minute (1, 2, or 5) works as expected, so I believe there might be a race condition.
Are you willing to submit a PR?