Not acking without ackDeadline override #314

Closed
orourkedd opened this issue Nov 2, 2018 · 19 comments
Labels
api: pubsub · priority: p2 · 🚨 This issue needs some love. · type: bug

Comments

@orourkedd commented Nov 2, 2018

Environment details

  • OS: alpine-node:8.9
  • Node.js version: 8.9
  • npm version: 5.6.0
  • @google-cloud/pubsub version: 0.20.1

Steps to reproduce

  1. Create a topic with a 60-second acknowledgement deadline
  2. Process messages slowly
  3. Observe that sometimes 500 messages are acknowledged correctly after a deployment; the rest are not acknowledged, even though I confirmed that ack() is being called.

Here's my original subscription code that works (I simplified it). Using this code, messages are removed from the queue as expected after being ack'd:

const pubsub = new PubSub();
const subscription = pubsub.subscription("foo", {
  flowControl: {
    maxMessages: 5
  }
});

subscription.on(`message`, async message => {
  await doSomething(message)  // this operation takes ~1 second
  message.ack()
})

Because I needed to slow down processing, I added a delay. After adding this delay, 500 messages would be removed from the queue after being ack'd. Other messages (thousands of them) would also be ack'd, but wouldn't be removed from the queue. Eventually, the same messages would come back around to be processed multiple times.

const pubsub = new PubSub();
const subscription = pubsub.subscription("foo", {
  flowControl: {
    maxMessages: 5
  }
});

subscription.on(`message`, async message => {
  await sleep(Math.random() * 4000) // this sleeps for an average of 2 seconds
  await doSomething(message) // this operation takes ~1 second
  message.ack() // I verified that this is being called
})
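
(The sleep helper used above isn't shown in the snippet; presumably it is just a promise-based timeout, something along these lines:)

// Hypothetical helper assumed by the snippets in this issue:
// resolve after ms milliseconds.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));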

After reading the source for this library, I noticed this line: https://github.com/googleapis/nodejs-pubsub/blob/master/src/subscriber.js#L281

I added this line to the code and things started working again:

subscription.ackDeadline = 60 * 1000; // match 60 second ack deadline of the topic

It appears that the default 10000ms ackDeadline is somehow causing a problem in my case.

Full working code here:

const pubsub = new PubSub();
const subscription = pubsub.subscription("foo", {
  flowControl: {
    maxMessages: 5
  }
});

subscription.ackDeadline = 60 * 1000;

subscription.on(`message`, async message => {
  await sleep(Math.random() * 4000) // this sleeps for an average of 2 seconds
  await doSomething(message) // this operation takes ~1 second
  message.ack() // I verified that this is being called
})

Here is a chart from Stackdriver showing the timeline of events:
[Stackdriver chart]

@JustinBeckwith JustinBeckwith added triage me I really want to be triaged. 🚨 This issue needs some love. labels Nov 3, 2018
@callmehiphop callmehiphop self-assigned this Nov 12, 2018
@JustinBeckwith JustinBeckwith added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed triage me I really want to be triaged. 🚨 This issue needs some love. labels Dec 2, 2018
@zxbodya commented Dec 6, 2018

I had the same issue after upgrading from 0.18.0 to 0.19.1. Back then I did not have much time to investigate, so I reverted and forgot about it for a while.

Today I tried to deploy a new version using 0.21.1 and hit exactly the same issue as before: some messages are not acknowledged.

And indeed, it looks to be related to delaying message processing before acknowledging… I have another service with an almost identical setup, consuming messages from the same topics but without a noticeable delay before acknowledgement, and everything works fine there.

@callmehiphop (Contributor)

We've been working on refactoring the Subscription class (#388). If anyone has some spare time and wouldn't mind testing to see if it resolves this issue, that would be most excellent!

@callmehiphop (Contributor)

Just put out a new release (v0.23.0); if anyone could verify whether this issue persists, that would be really helpful. I've been pretty unsuccessful in trying to reproduce it myself.

@zxbodya commented Jan 17, 2019

will try to reproduce it tomorrow

@callmehiphop (Contributor)

@zxbodya thanks!

@zxbodya commented Jan 18, 2019

Tried to run my app with the new release… unfortunately it did not work as I expected.

But now I see some errors in the log: Error: 4 DEADLINE_EXCEEDED: Deadline Exceeded, which makes me think there is something else to look into.

I will try to debug it, and maybe build a smaller app to reproduce the issue.

@zxbodya commented Jan 21, 2019

Tried to write a small app to reproduce the issue - no luck, in the small app it works…

I feel I am missing something, but I don't have much of an idea what exactly might be wrong.

In my app I have something like this:

// Imports added for completeness; the `Message` type comes from @google-cloud/pubsub
// (exact export location varies by version), and `logger`, `subName1`, `subName2`
// are defined elsewhere in the app.
import { PubSub, Message } from '@google-cloud/pubsub';
import { merge, fromEvent } from 'rxjs';
import { tap, delay, mergeMap } from 'rxjs/operators';

const pubsub = new PubSub({
  projectId: process.env.GOOGLE_PUBSUB_PROJECT_ID,
});

const sub1 = pubsub.subscription(subName1);
sub1.on('error', (err: Error) => {
  logger.error({ err }, 'subscription error');
});

const sub2 = pubsub.subscription(subName2);
sub2.on('error', (err: Error) => {
  logger.error({ err }, 'subscription error');
});

merge(
  fromEvent<Message>(sub1, 'message'),
  fromEvent<Message>(sub2, 'message'),
  // more subscriptions here, from around 30 different topics
)
    .pipe(
      tap(message => {
        logger.debug(
          { messageId: message.id, messageData: message.data.toString() },
          'incoming'
        );
      }),
      delay(3000), // or around 60000 for subscriptionExtraAdded
      mergeMap(message => {
        message.ack();
        logger.debug({ messageId: message.id }, 'ack');
        return [];
      })
    )
    .subscribe();

And in the small app it works fine… But in the real app it does not, and what I see in the logs looks weird to me:

Received 3 messages: one from webshopOrderCreated, 2 from subscriptionExtraAdded

[2019-01-21T01:16:43.044Z] DEBUG: incoming (messageId=260044166173615, messageAttributes={})
    subscriptionName: subscriptionExtraAdded
[2019-01-21T01:16:43.047Z] DEBUG: incoming (messageId=260043926601384, messageAttributes={})
    subscriptionName: subscriptionExtraAdded
[2019-01-21T01:16:43.148Z] DEBUG: incoming (messageId=260041231106144, messageAttributes={})
    subscriptionName: webshopOrderCreated

Tried to acknowledge webshopOrderCreated message (.ack() was called)

[2019-01-21T01:16:46.255Z] DEBUG: ack (messageId=260041231106144, messageAttributes={})

errors from subscriptionExtraAdded subscription

[2019-01-21T01:17:43.142Z] ERROR: subscription error
    subscriptionName: subscriptionExtraAdded
    --
    error: {
      "code": 4,
      "metadata": {
        "_internal_repr": {}
      },
      "details": "Deadline Exceeded"
    }
[2019-01-21T01:17:43.248Z] ERROR: subscription error
    subscriptionName: webshopOrderCreated
    --
    error: {
      "code": 4,
      "metadata": {
        "_internal_repr": {}
      },
      "details": "Deadline Exceeded"
    }

Tried to acknowledge subscriptionExtraAdded message (.ack() was called)

[2019-01-21T01:17:46.070Z] DEBUG: ack (messageId=260043926601384, messageAttributes={})
[2019-01-21T01:17:46.070Z] DEBUG: ack (messageId=260044166173615, messageAttributes={})

more errors:

[2019-01-21T01:17:50.691Z] ERROR: subscription error
    subscriptionName: subscriptionExtraAdded
    --
    error: {
      "code": 4,
      "metadata": {
        "_internal_repr": {}
      },
      "details": "Deadline Exceeded"
    }

# errors below are the same as the previous one (everything except the timestamp removed)
[2019-01-21T01:17:59.050Z]
[2019-01-21T01:18:06.410Z]
[2019-01-21T01:18:09.188Z]
[2019-01-21T01:18:09.365Z]
[2019-01-21T01:18:16.676Z]
[2019-01-21T01:18:17.989Z]
[2019-01-21T01:18:18.099Z]
[2019-01-21T01:18:22.393Z]
[2019-01-21T01:18:29.556Z]
[2019-01-21T01:18:36.722Z]
[2019-01-21T01:18:41.777Z]

10 minutes passed and the messages were redelivered:

[2019-01-21T01:26:53.482Z] DEBUG: incoming (messageId=260043926601384, messageAttributes={})
    subscriptionName: subscriptionExtraAdded
[2019-01-21T01:26:53.484Z] DEBUG: incoming (messageId=260044166173615, messageAttributes={})
    subscriptionName: subscriptionExtraAdded
[2019-01-21T01:26:53.573Z] DEBUG: incoming (messageId=260041231106144, messageAttributes={})
    subscriptionName: webshopOrderCreated

Error in webshopOrderCreated (a weird one, because there is no data except the code):

[2019-01-21T01:26:55.101Z] ERROR: subscription error
    subscriptionName: webshopOrderCreated
    --
    error: {
      "code": 4
    }

Tried to acknowledge webshopOrderCreated message (.ack() was called)

[2019-01-21T01:26:56.623Z] DEBUG: ack (messageId=260041231106144, messageAttributes={})
[2019-01-21T01:27:53.584Z] ERROR: subscription error
    subscriptionName: subscriptionExtraAdded
    --
    error: {
      "code": 4,
      "metadata": {
        "_internal_repr": {}
      },
      "details": "Deadline Exceeded"
    }
 
# for the errors below, error details removed because they are the same
 
[2019-01-21T01:27:53.674Z] ERROR: subscription error
    subscriptionName: webshopOrderCreated
[2019-01-21T01:27:53.686Z] ERROR: subscription error
    subscriptionName: subscriptionExtraAdded
[2019-01-21T01:27:53.776Z] ERROR: subscription error
    subscriptionName: webshopOrderCreated
[2019-01-21T01:27:53.788Z] ERROR: subscription error
    subscriptionName: subscriptionExtraAdded
[2019-01-21T01:27:53.878Z] ERROR: subscription error
    subscriptionName: webshopOrderCreated
[2019-01-21T01:27:53.889Z] ERROR: subscription error
    subscriptionName: subscriptionExtraAdded
[2019-01-21T01:27:53.977Z] ERROR: subscription error
    subscriptionName: webshopOrderCreated
[2019-01-21T01:27:53.992Z] ERROR: subscription error
    subscriptionName: subscriptionExtraAdded
[2019-01-21T01:27:54.078Z] ERROR: subscription error
    subscriptionName: webshopOrderCreated
[2019-01-21T01:27:54.095Z] ERROR: subscription error
    subscriptionName: subscriptionExtraAdded
[2019-01-21T01:27:54.180Z] ERROR: subscription error
    subscriptionName: webshopOrderCreated
[2019-01-21T01:27:54.196Z] ERROR: subscription error
    subscriptionName: subscriptionExtraAdded
[2019-01-21T01:27:54.280Z] ERROR: subscription error
    subscriptionName: webshopOrderCreated
[2019-01-21T01:27:54.296Z] ERROR: subscription error
    subscriptionName: subscriptionExtraAdded
[2019-01-21T01:27:54.380Z] ERROR: subscription error
    subscriptionName: webshopOrderCreated
[2019-01-21T01:27:54.398Z] ERROR: subscription error
    subscriptionName: subscriptionExtraAdded
[2019-01-21T01:27:54.481Z] ERROR: subscription error

....


A lot more errors, and messages redelivered every 10 minutes, each time starting with that unusual error for webshopOrderCreated.

Then, errors like this for subscriptions where there were no messages:

[2019-01-21T01:51:41.286Z] ERROR: subscription error
    subscriptionName: referralPointReceived
    --
    error: {
      "code": 14,
      "metadata": {
        "_internal_repr": {}
      },
      "details": "The service was unable to fulfill your request. Please try again. [code=8a75]"
    }

.... a lot of the same errors from different topics omitted here

[2019-01-21T02:26:41.610Z] ERROR: subscription error
    subscriptionName: subscriptionCreated
    --
    error: {
      "code": 14,
      "metadata": {
        "_internal_repr": {}
      },
      "details": "The service was unable to fulfill your request. Please try again. [code=8a75]"
    }

No more log messages from this point…

@callmehiphop any suggestions on where to look for the possible cause?

Btw, looking at this log, it looks like a subscription might end up in a "not working" state… are there specific errors that need to be handled manually to prevent this from happening?

@callmehiphop (Contributor)

@zxbodya actually there isn't really a good way to detect if the subscription crashed. I'm going to put together a PR that will hopefully improve the quality of the errors emitted, as well as add a hook to tell when the subscription has closed.

@callmehiphop (Contributor)

@zxbodya I just cut a new release (0.24.0); if you upgrade, you should see more detailed error messages, which will hopefully help us track down that deadline error you're seeing.

  • BatchError refers to failed acknowledge or modifyAckDeadline requests.
  • ChannelError means we were unable to establish a connection with the PubSub servers.
  • StatusError refers to non-recoverable streaming errors: something went wrong (bad credentials, bad subscription, etc.).

Additionally, Subscriptions now emit a close event that will fire when the subscription closes. In the event of a non-recoverable error, we will auto-close the subscription for you.

subscription.on('close', () => { /* */ });
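
A minimal sketch of how the error and close handlers might be wired up together (illustrative only; the err.code / err.message fields are assumed from the gRPC-style errors shown in the logs above, and the recovery strategy is up to the application):

// Illustrative sketch, not library documentation.
subscription.on('error', err => {
  // Errors are assumed to carry a gRPC-style status code (e.g. 4 = DEADLINE_EXCEEDED).
  console.error('subscription error', err.code, err.message);
});

subscription.on('close', () => {
  // Fires when the subscription shuts down, including the auto-close on
  // non-recoverable errors; recreate the subscription or exit here.
  console.warn('subscription closed');
});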

@callmehiphop (Contributor)

I think the original issue here might be resolved; we're not really seeing any chatter about it. I'm going to close it, but if anyone starts to experience this again, let me know and I'll re-open.

@zxbodya if the secondary issue you mentioned persists, let me know and I'll open a new issue to investigate.

Thanks all!

@zxbodya commented Feb 5, 2019

Tried to reproduce the issue I had using the latest version - indeed, it looks to be fixed in v0.24.1.

@dinvlad commented Feb 7, 2019

Unfortunately, 0.24.1 didn't seem to resolve the issue for us.

It worked when we had ~80 messages in the queue: they were processed with only a minimal number of "expired" acks, as reported through the subscription/pull_ack_message_operation_count metric in Stackdriver.

However, with 4k+ messages in the queue and all the same settings (flowControl.maxMessages set to 2 and the default ackDeadline of 10 sec) and the same average message processing time (~2 min between when the callback receives and when it acks a message), every message ack is indicated as expired after a short while (~15-30 min after starting the subscriber), and so acks keep getting ignored.

We've tried varying flowControl.maxMessages (1, 2, 4), setting allowExcessMessages: false, and adjusting ackDeadline in a wide range (10, 300, 7200, 720000 sec), and nothing seems to help. We're kinda out of options here. The only thing that helps is restarting the subscriber every 10 min or so, but that's ugly…
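
For reference, a rough sketch of that restart workaround (the subscription name, openSubscription(), handleMessage, and the interval constant are all hypothetical placeholders, not part of the library):

// Rough sketch of the "restart the subscriber every ~10 minutes" workaround.
const RESTART_INTERVAL_MS = 10 * 60 * 1000; // hypothetical value

function openSubscription() {
  const sub = pubsub.subscription('my-subscription', {
    flowControl: { maxMessages: 2 }
  });
  sub.on('message', handleMessage); // handleMessage does the ~2 min of work, then message.ack()
  return sub;
}

let sub = openSubscription();
setInterval(async () => {
  await sub.close();        // stop the current streaming pull
  sub = openSubscription(); // reopen so a fresh stream is established
}, RESTART_INTERVAL_MS);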

@dinvlad commented Feb 7, 2019

I think I found the smoking gun: streaming pull subscriptions stop being restarted internally after ~30 min (according to streaming_pull_message_operation_count):
[screenshot: streaming_pull_message_operation_count over time]
The timing coincides with that of all acks turning expired (according to pull_ack_message_operation_count):
[screenshot: pull_ack_message_operation_count over time]
Perhaps modAck uses the streaming pull subscription to modify ack deadlines, so when streaming pull subscriptions are no longer active, it stops modifying deadlines, resulting in the behavior we observe.

@callmehiphop (Contributor)

@dinvlad interesting! Thanks for the graphs, those are super helpful! modAck does not use streamingPull; we send both those and ack requests through regular RPCs. However, if you're seeing modAck, ack, and streamingPull fail, I think that might be a different issue than the one being described here. In fact, it sounds eerily similar to #11 - could you share a bit more info about your environment?

@dinvlad commented Feb 7, 2019

Thanks @callmehiphop - the interesting point is that this service is running on-prem (to shuttle files between on-prem and GCS), instead of GKE. I guess it may be subject to similar network-related limitations.

@dinvlad commented Feb 7, 2019

It appears that only versions starting from 0.23.0 are affected by this behavior. With 0.22.2, streaming pulls don't stop, and while we still get occasional "expired" acks, they only happen in a fraction of cases (instead of 100%).

@callmehiphop (Contributor)

@dinvlad would you mind opening a new issue with the details of what you're seeing?

@dinvlad commented Feb 7, 2019

Apologies - #468

@callmehiphop (Contributor)

@dinvlad thanks! That should help us solve your issue more efficiently.

@google-cloud-label-sync google-cloud-label-sync bot added the api: pubsub Issues related to the googleapis/nodejs-pubsub API. label Jan 31, 2020
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Apr 6, 2020