[Bug] Consumer Stops Receiving Messages with Large Backlogs Post-Processing #22435

Closed · 2 of 3 tasks · Fixed by #22454
KannarFr opened this issue Apr 4, 2024 · 6 comments
Labels: type/bug (The PR fixed a bug or issue reported a bug)

KannarFr (Contributor) commented Apr 4, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

3.2.2

Minimal reproduce step

Open a consumer on a subscription with a very large backlog by passing an "old" MessageId as the start position.
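
For context, here is a minimal sketch of this reproduction step using the Pulsar Java client, assuming a Reader (which is backed by a NonDurable subscription) started from an old MessageId; the service URL, topic, and class name are placeholders:

import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;

public class BacklogRepro {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // A Reader is backed by a NonDurable subscription and starts from the given MessageId.
        Reader<byte[]> reader = client.newReader()
                .topic("persistent://public/default/my-topic")
                .startMessageId(MessageId.earliest) // or a stored "old" MessageId
                .create();

        while (true) {
            Message<byte[]> msg = reader.readNext();
            // Track progress; once the backlog is drained, readNext() should block
            // until new messages arrive instead of stalling indefinitely.
            System.out.println("received " + msg.getMessageId());
        }
    }
}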

What did you expect to see?

Consume all backlogged messages, then wait for upcoming messages once the backlog is drained.

What did you see instead?

Note that the following paragraph mentions a NonDurable consumer.

Initially, everything functions correctly: the process begins with a backlog of, say, 100,000 messages, which gradually decreases to 0 as consumption approaches the present ("now"). However, once the backlog is fully processed, the consumer unexpectedly stops receiving new messages, so the backlog starts growing again. The consumer is still connected. This behavior is consistently reproducible on my topics that hold a substantial amount of data.
I've also noticed that when the starting point (since) is relatively close to the current time (now), the problem does not occur.

Anything else?

We do not know whether this bug was introduced in v3.2.2; we did not see it before. We are currently rolling back the brokers to 3.2.1 to confirm this.

Otherwise, it may be related to #22191.

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@KannarFr added the type/bug label on Apr 4, 2024
@KannarFr changed the title from "[Bug] Consumer Stops Receiving Messages with Large Backlogs Post-Processing in v3.2.2" to "[Bug] Consumer Stops Receiving Messages with Large Backlogs Post-Processing" on Apr 4, 2024
KannarFr (Contributor, Author) commented Apr 4, 2024

I confirm bumping the client to v3.2.2 doesn't fix the issue.

I also confirm that using a Durable consumer does not reproduce the issue, so it might be related to #22191. I'll roll back to v3.2.1 tomorrow to confirm.
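
For reference, a sketch of what the Durable-mode variant of the same check could look like, assuming the standard Java client builder API; the class, subscription, and topic names are placeholders:

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionMode;

public class DurableCheck {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Durable subscription: the cursor position is persisted on the broker.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/my-topic")
                .subscriptionName("my-durable-sub")
                .subscriptionMode(SubscriptionMode.Durable) // NonDurable is the mode that shows the problem
                .subscribe();

        // Rewind to the same "old" position used in the NonDurable case.
        consumer.seek(MessageId.earliest);

        while (true) {
            consumer.acknowledge(consumer.receive());
        }
    }
}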

KannarFr (Contributor, Author) commented Apr 5, 2024

I confirm that I can't reproduce the issue using v3.2.1 brokers.

lhotari (Member) commented Apr 6, 2024

I'm trying to understand the logic of #22191 and the earlier changes regarding "backlogged" cursors: PRs #19343, #6766, #4066, and #162. The change made in #4066 to inactive cursors in the checkBackloggedCursor method seems suspicious, and might be the reason why a hack such as #9789 was needed.

In any case, it seems that #22191 should be reverted as the first step. However, the proper fix seems to be to sort out various issues in this area.

lhotari (Member) commented Apr 6, 2024

One of the problems is a possible race condition here:

ledger.getScheduledExecutor()
        .schedule(() -> checkForNewEntries(op, callback, ctx),
                config.getNewEntriesCheckDelayInMillis(), TimeUnit.MILLISECONDS);

In ManagedLedger, tasks are executed on 2 threads: the executor thread and the scheduler thread.
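To illustrate the class of problem this can cause, here is a hypothetical sketch of a check-then-act race between a scheduler thread and an executor thread; all names in it are made up for the example and are not the actual Pulsar code paths. If the notification runs before the flag is set, the wake-up is lost and reading never resumes, which would match the "connected but stalled" symptom described above.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the actual Pulsar internals.
public class TwoThreadRaceSketch {
    private volatile boolean waitingForNewEntries = false;

    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Runs on the scheduler thread (analogous to a delayed new-entries check).
    void checkForNewEntries() {
        if (!hasMoreEntries()) {
            // Race window: an entry appended on the executor thread right here
            // fires its notification before the flag below is set, so nothing
            // ever re-triggers the read.
            waitingForNewEntries = true;
        }
    }

    // Runs on the executor thread when a new entry is appended.
    void notifyEntriesAvailable() {
        if (waitingForNewEntries) {
            waitingForNewEntries = false;
            dispatchRead();
        }
        // If the flag is not yet set, the notification is silently lost.
    }

    private boolean hasMoreEntries() {
        return false; // pretend the backlog is fully drained
    }

    private void dispatchRead() {
        System.out.println("read dispatched");
    }

    public static void main(String[] args) throws Exception {
        TwoThreadRaceSketch sketch = new TwoThreadRaceSketch();
        // The check runs on one thread while the notification arrives on another.
        sketch.scheduler.schedule(sketch::checkForNewEntries, 10, TimeUnit.MILLISECONDS);
        sketch.executor.submit(sketch::notifyEntriesAvailable);

        Thread.sleep(100);
        sketch.scheduler.shutdown();
        sketch.executor.shutdown();
    }
}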

Reminds me of this old comment on an experimental PR: https://github.com/apache/pulsar/pull/11387/files#r693112234

Technoboy- (Contributor) commented:

> In any case, it seems that #22191 should be reverted as the first step

If we revert it, the OOM issue will come back.

lhotari (Member) commented Apr 8, 2024

> If we revert it, the OOM issue will come back.

Great work on the fix in #22454, @Technoboy-. It looks like removing the #22191 changes to addWaitingCursor is something that should be done there, since the changes to addWaitingCursor weren't relevant after all. The other changes made in #22454 will fix the OOM that #22191 attempted to address. That's great!
