
Duplicate messages exactly every 15 minutes #68

Closed
ctavan opened this issue Feb 13, 2018 · 11 comments
Labels
  • api: pubsub (Issues related to the googleapis/nodejs-pubsub API.)
  • priority: p2 (Moderately-important priority. Fix may not be included in next release.)
  • 🚨 This issue needs some love.
  • status: blocked (Resolving the issue is dependent on other work.)
  • type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@ctavan
Contributor

ctavan commented Feb 13, 2018

Environment details

  • OS: Kubernetes Engine
  • Node.js version: 6.3.0
  • npm version: –
  • @google-cloud/pubsub version: 0.16.2

Steps to reproduce

This is most likely related to the discussion in #2 (comment) (and the following comments in that thread). However, since that discussion has somewhat faded out, I want to report my findings in a new issue.

My subscriber is consuming messages at a rate of roughly 500/s and it is receiving small batches of duplicate messages exactly every 15 minutes. Those batches typically contain between 100 and 400 duplicate messages. Here's a plot of the number of duplicates over time:

[plot: number of duplicate messages over time]

Most of the duplicates are being delivered to my subscriber within less than a second. Here's a histogram of the durations between redeliveries in milliseconds:

[histogram: time between redeliveries, in milliseconds]

As you can see, the batches of duplicates coincide with spikes in the Stackdriver graphs for StreamingPull Operations and StreamingPull Acknowledge Requests (please note that Stackdriver shows Berlin time while the above graph shows UTC, hence the 1-hour time difference):

[Stackdriver graphs: StreamingPull Operations and StreamingPull Acknowledge Requests]

From the comments in the other thread I did not really understand whether the behavior we see is actually expected. What's the reason for this to happen precisely every 15 minutes?

Even though the absolute number of duplicates is well below 1%, this still looks pretty odd, unexpected and unnecessary. I'd love to understand better what's causing this issue and how it could potentially be fixed.
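
For reference, a minimal sketch of how counts like the ones plotted above can be collected (simplified, and not the exact instrumentation used here; subscription stands for the client's Subscription object): remember the last delivery time per message.id and log the gap whenever an id shows up again.

const lastSeen = new Map(); // message.id -> timestamp of the last delivery (ms)

subscription.on('message', (message) => {
    const now = Date.now();
    const previous = lastSeen.get(message.id);

    if (previous !== undefined) {
        // Redelivery of an already-seen message: this gap feeds the histogram above.
        console.log(`duplicate ${message.id} redelivered after ${now - previous} ms`);
    }
    lastSeen.set(message.id, now);

    message.ack();
});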

/cc @kir-titievsky @callmehiphop @rossj

@callmehiphop
Contributor

I believe the lifespan of each stream created via streamingPull is 15 minutes; in my own testing I've also seen messages get redelivered when one stream closes and another opens. That being said, I believe the stream is just reaching the end of its lifespan, and when a new stream is opened to take its place, the PubSub service redelivers the messages.

@kir-titievsky does that sound like the expected behavior? Is there something on the client we should be doing differently to prevent these redeliveries?
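
In the meantime, and independent of what the client does, Pub/Sub delivery is at-least-once, so consumers that cannot tolerate duplicates usually add their own idempotency check. A rough sketch of that idea (the in-memory Map and the 30-minute expiry are arbitrary choices, not part of this library, and handleMessage is a placeholder for the application's own processing):

const processed = new Map(); // message.id -> time it was processed (ms)
const EXPIRY_MS = 30 * 60 * 1000; // arbitrary retention window

subscription.on('message', (message) => {
    const now = Date.now();

    // Drop old entries so the map does not grow without bound.
    for (const [id, ts] of processed) {
        if (now - ts > EXPIRY_MS) processed.delete(id);
    }

    if (processed.has(message.id)) {
        message.ack(); // already handled, just ack the redelivered copy
        return;
    }

    processed.set(message.id, now);
    handleMessage(message); // application-specific processing (placeholder)
    message.ack();
});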

@callmehiphop added the priority: p2 and type: question labels on Feb 13, 2018
@rossj

rossj commented Feb 13, 2018

From my reading of the code, the Node subscription client assumes that it will not receive messages with ack ids that it is already leasing. This is evident from how it tracks and clears ack ids in the internal inventory array, with no check for an ack id that is already present.

IMHO the current client is broken in this regard: after receiving a duplicate message / ack id, calling ack() or nack() on either message instance will result in the subscription client acking / nacking the id while at the same time continuing to lease it until ack() / nack() is called on the other message instance. Even worse, if a duplicate message is received while the inventory is at its configured maximum size, the ack id will be nacked right away, even though it is also still being leased.

I've monkey-patched my own subscription's on('message') handler to completely ignore ack ids that are already being leased, and this has been working well for me: my downstream code receives no duplicates. I also haven't had any problems maintaining rather long leases of about an hour.

// `sub` is a Subscription from @google-cloud/pubsub 0.16.x. Wrap its internal
// openConnection_ so the pool's 'message' handler can be replaced with one
// that drops ack ids we are already leasing.
const openConnectionOld = sub.openConnection_;
sub.openConnection_ = function () {
    openConnectionOld.call(this, ...arguments);
    const pool = this.connectionPool;

    pool.removeAllListeners('message');
    pool.on('message', (message) => {
        // If we already have the ackId leased, ignore
        if (this.inventory_.lease.indexOf(message.ackId) >= 0) {
            console.log(`Already leasing ack ID ${message.ackId}. Ignoring...`);
            return;
        }

        if (!this.hasMaxMessages_()) {
            this.emit('message', this.leaseMessage_(message));
            return;
        }

        if (!pool.isPaused) {
            pool.pause();
        }

        message.nack();
    });
};

I could submit a PR, but I'm not sure if the duplicate messages are indicative of some other, upstream problem. Perhaps the ack id inventory would be better implemented using a Set instead of an array with indexOf checks.
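
For illustration, with the leased ack ids held in a Set the duplicate check becomes an O(1) lookup. This is only a sketch of the idea, not how the current client is implemented:

// Sketch only: a Set of leased ack ids instead of an array with indexOf checks.
const leased = new Set(); // ack ids currently leased (hypothetical inventory)

function onPoolMessage(message) {
    if (leased.has(message.ackId)) {
        return; // already leasing this ack id, ignore the redelivered copy
    }
    leased.add(message.ackId);
    // ...lease the message and emit it to the application as usual...
}

function onAckOrNack(ackId) {
    leased.delete(ackId); // release the id once it has been acked or nacked
}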

@callmehiphop
Contributor

@rossj I agree with you, however the client was built with specifications from the PubSub team and this particular issue has been brought up several times before. I think we want input from the PubSub team on what the correct solution to this issue actually is.

@callmehiphop
Contributor

@kir-titievsky just a friendly ping :)

@kir-titievsky

@ctavan @rossj This should get better this week. The behavior had to do with how the service treated messages that were sent to a client but not yet acknowledged when a streamingPull connection was torn down and rebuilt. A feature rolling out right now (and expected to be completely rolled out tomorrow) should ensure that nothing special -- like duplicate delivery -- happens when connections are rebuilt.

@ctavan Might you check in on your metrics in two days and tell me if you still see this?

@stephenplusplus added the status: blocked label on Feb 26, 2018
@ctavan
Contributor Author

ctavan commented Feb 26, 2018

Thanks @kir-titievsky, we'll check again in a few days!

Any chance that the changes might affect #73 as well?

@callmehiphop
Contributor

@ctavan have you been experiencing this issue lately?

@ctavan
Contributor Author

ctavan commented Mar 21, 2018

I'm currently out of office. I'll try to reproduce as soon as possible, but it might take me a few more days.

@callmehiphop
Contributor

callmehiphop commented Mar 21, 2018 via email

@stephenplusplus added the type: bug label and removed the type: question label on Apr 2, 2018
@ctavan
Contributor Author

ctavan commented Apr 13, 2018

@callmehiphop @kir-titievsky I have finally found the time to check this again.

First of all I re-ran the test using @google-cloud/pubsub@^0.16.4 and saw similar behavior, except that the duplicates no longer appeared at strict 15-minute intervals but at more irregular intervals that were, however, still exact multiples of 1 minute!

[plot: duplicate batches over time with @google-cloud/pubsub@^0.16.4]

I have then upgraded to @google-cloud/pubsub@^0.18.0 and I've been running the consumer for more than one hour without a single duplicate since the upgrade.

So even though I cannot confirm that there was a server-side fix for the issue (it still persists with the old version of the Node client), I will go ahead and close this issue, since it has been gone since the latest refactoring of this client.

Thanks for your support!

@ctavan closed this as completed on Apr 13, 2018
@callmehiphop
Contributor

@ctavan glad to hear it, thanks for all your help!

stephenplusplus pushed a commit to stephenplusplus/nodejs-pubsub that referenced this issue Aug 31, 2018
* chore: lock files maintenance

google-cloud-label-sync bot added the api: pubsub label on Jan 31, 2020
yoshi-automation added the 🚨 This issue needs some love. label on Apr 6, 2020