[pulsar-broker] Allow broker to discover and unblock stuck subscription #9789

Merged
merged 4 commits into from
Mar 10, 2021

Conversation

rdhabalia
Contributor

Motivation

We have frequently seen subscriptions get stuck on various topics: the broker stops dispatching messages even though the consumer has available permits and there are no pending reads (example: #9788). This can be caused by a regression or by an unknown issue when expiry runs. One workaround is to manually unload and reload the topic, which is not feasible when this happens frequently across many topics. Instead, the broker should be able to discover such stuck subscriptions and unblock them.
The example below shows a subscription with available permits > 0 and no pending reads, whose cursor read position is not moving forward, so the backlog keeps building until the topic is unloaded. This happens frequently, for reasons that are not yet known:

STATS-INTERNAL:
"sub1" : {
      "markDeletePosition" : "11111111:15520",
      "readPosition" : "11111111:15521",
      "waitingReadOp" : false,
      "pendingReadOps" : 0,
      "messagesConsumedCounter" : 115521,
      "cursorLedger" : 585099247,
      "cursorLedgerLastEntry" : 597,
      "individuallyDeletedMessages" : "[]",
      "lastLedgerSwitchTimestamp" : "2021-02-25T19:55:50.357Z",
      "state" : "Open",
      "numberOfEntriesSinceFirstNotAckedMessage" : 1,
      "totalNonContiguousDeletedMessagesRange" : 0,

STATS:
"sub1" : {
      "msgRateOut" : 0.0,
      "msgThroughputOut" : 0.0,
      "msgRateRedeliver" : 0.0,
      "msgBacklog" : 30350,
      "blockedSubscriptionOnUnackedMsgs" : false,
      "msgDelayed" : 0,
      "unackedMessages" : 0,
      "type" : "Shared",
      "msgRateExpired" : 0.0,
      "consumers" : [ {
        "msgRateOut" : 0.0,
        "msgThroughputOut" : 0.0,
        "msgRateRedeliver" : 0.0,
        "consumerName" : "C1",
        "availablePermits" : 723,
        "unackedMessages" : 0,
        "blockedConsumerOnUnackedMsgs" : false,
        "metadata" : { },
        "connectedSince" : "2021-02-25T19:55:50.358285Z",

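Image: https://user-images.githubusercontent.com/2898254/109894631-ab62d980-7c42-11eb-8dcc-a1a5f4f5d14e.png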
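
The symptoms in the stats above can be expressed as a simple predicate. The following is a minimal sketch only, not the code in this PR: `SubscriptionSnapshot` and `StuckSubscriptionHeuristic` are hypothetical names, the fields mirror the stats keys shown above, and a zero `msgRateOut` is used as a stand-in (an assumption) for "read position is not moving forward".

```java
// Hypothetical snapshot of the stats fields shown above; not a Pulsar class.
final class SubscriptionSnapshot {
    final int availablePermits;   // "availablePermits" of the consumer(s)
    final int pendingReadOps;     // "pendingReadOps" from stats-internal
    final boolean waitingReadOp;  // "waitingReadOp" from stats-internal
    final long msgBacklog;        // "msgBacklog" from stats
    final double msgRateOut;      // "msgRateOut" from stats

    SubscriptionSnapshot(int availablePermits, int pendingReadOps, boolean waitingReadOp,
                         long msgBacklog, double msgRateOut) {
        this.availablePermits = availablePermits;
        this.pendingReadOps = pendingReadOps;
        this.waitingReadOp = waitingReadOp;
        this.msgBacklog = msgBacklog;
        this.msgRateOut = msgRateOut;
    }
}

// Hypothetical heuristic: flags a subscription that could dispatch but does not.
final class StuckSubscriptionHeuristic {
    static boolean looksStuck(SubscriptionSnapshot s) {
        return s.availablePermits > 0   // consumer is able to receive messages
                && s.pendingReadOps == 0 // no read is in flight
                && !s.waitingReadOp      // cursor is not waiting for new entries
                && s.msgBacklog > 0      // there is something to deliver
                && s.msgRateOut == 0.0;  // assumption: proxy for a stalled read position
    }
}
```

With the values from the example (`availablePermits` 723, `pendingReadOps` 0, `waitingReadOp` false, `msgBacklog` 30350, `msgRateOut` 0.0) this predicate returns true. A consumer with zero or negative permits is deliberately not flagged, since the broker has no room to dispatch to it.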

Modification

Add the capability for the broker to periodically check whether a subscription is stuck and unblock it if needed. The check is controlled by a flag; for the initial release it is disabled by default (it can be enabled by default in a later release).

Result

This helps the broker handle stuck subscriptions, and it logs a message for later debugging.
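
As a rough illustration of the flag-gated check described above, here is a minimal sketch. It is not the implementation in this PR: `StuckAwareDispatcher` and `StuckSubscriptionChecker` are hypothetical names, the `enabled` field stands in for the `unblockStuckSubscriptionEnabled` setting, and the in-memory `log` list stands in for the broker's logger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a subscription dispatcher; not Pulsar's Dispatcher interface.
interface StuckAwareDispatcher {
    String name();
    boolean looksStuck();  // e.g. permits available, no pending read, backlog present
    void triggerRead();    // re-issue a read so dispatching resumes
}

final class StuckSubscriptionChecker {
    private final boolean enabled;              // stands in for unblockStuckSubscriptionEnabled
    final List<String> log = new ArrayList<>(); // stands in for the broker log

    StuckSubscriptionChecker(boolean enabled) {
        this.enabled = enabled;
    }

    // Meant to be called from an already-scheduled periodic task.
    // Returns the number of subscriptions that were unblocked.
    int checkAndUnblock(List<StuckAwareDispatcher> dispatchers) {
        if (!enabled) {
            return 0; // feature flag off: do nothing
        }
        int unblocked = 0;
        for (StuckAwareDispatcher d : dispatchers) {
            if (d.looksStuck()) {
                log.add("Unblocking stuck subscription " + d.name());
                d.triggerRead();
                unblocked++;
            }
        }
        return unblocked;
    }
}
```

Keeping the flag check at the top of the method means a disabled broker pays only the cost of one boolean test per pass.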

@rdhabalia rdhabalia added this to the 2.8.0 milestone Mar 4, 2021
@rdhabalia rdhabalia self-assigned this Mar 4, 2021
@@ -120,11 +120,13 @@ public int hashCode() {

@Override
public boolean equals(Object obj) {
if (obj == null) {
Contributor

The null scenario is already covered by the instanceof check
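
To illustrate the point of this review comment: in Java, `null instanceof T` evaluates to false, so an `equals` method that starts with an `instanceof` check does not need a separate null check. The class below is a hypothetical example (`TopicKey` is not the class touched by this diff).

```java
// Hypothetical value class used only to demonstrate the instanceof behavior.
final class TopicKey {
    private final String name;

    TopicKey(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object obj) {
        // No explicit null check needed: `null instanceof TopicKey` is false.
        if (!(obj instanceof TopicKey)) {
            return false;
        }
        return name.equals(((TopicKey) obj).name);
    }

    @Override
    public int hashCode() {
        return name.hashCode();
    }
}
```

Calling `new TopicKey("a").equals(null)` returns false without throwing.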

Comment on lines +278 to +279
# Broker periodically checks if subscription is stuck and unblock if flag is enabled. (Default is disabled)
unblockStuckSubscriptionEnabled=false
Contributor

From the description of the flag, the broker will periodically check whether a subscription is stuck and unblock it. Should we tell users what the default frequency is and how to change it?

Contributor Author

This check depends on the rate and is performed in the same stats-update task, so this feature doesn't require additional configuration.

@lhotari
Member

lhotari commented Mar 5, 2021

We have been frequently seeing issue where subscription gets stuck on different topics and broker is not dispatching messages though consumer has available-permits and no pending reads (example #9788). It can happen due to regression bug or unknown issue when expiry runs.

This could be some sort of locked-up state that is prevented by @merlimat's change #9787. @rdhabalia do you have an environment where you could check whether #9787 helps to fix the issue you are seeing?

@rdhabalia
Contributor Author

@lhotari this is more of a functional issue (mostly related to the expiry check and a regression bug we saw in the past with key-shared subscriptions). It isn't caused by a deadlock (validated with a thread dump) or by slower processing, because the subscription stays stuck until someone manually unloads the topic. So #9787 won't exactly address this issue.

@rdhabalia
Contributor Author

/pulsarbot run-failure-checks

@codelipenghui codelipenghui merged commit 8d9a2ab into apache:master Mar 10, 2021
@devinbost
Contributor

devinbost commented Mar 16, 2021

It seems like this is somewhat of a band-aid. How do we find the root cause of why subscriptions are getting stuck? This seems like it might be related to #6054.
@rdhabalia

@devinbost
Contributor

@rdhabalia Thanks for doing this work. It looks like it's going to help a lot with cases where the subscriptions stop receiving messages.

@devinbost
Contributor

I realized that this won't help when negative permits are occurring... So, there's still more work to do to unblock stuck subscriptions.

eolivelli pushed a commit that referenced this pull request May 13, 2021
…on (#9789)

6 participants