messages sit in queue until GKE pod with subscriber gets reset #11

Closed
stephenplusplus opened this issue Dec 13, 2017 · 35 comments
Labels

  • api: pubsub (Issues related to the googleapis/nodejs-pubsub API.)
  • priority: p1 (Important issue which blocks shipping the next release. Will be fixed prior to next release.)
  • release blocking (Required feature/issue must be fixed prior to next release.)
  • 🚨 (This issue needs some love.)
  • triaged for GA
  • type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@stephenplusplus
Contributor

From @ShahNewazKhan on October 1, 2017 9:3

Environment details

  • OS: Debian GNU/Linux 8.9 (jessie) [K8s pod based on dockerfile gcr.io/google_appengine/base]
  • Node.js version: 6.11.3
  • npm version: 5.4.2
  • google-cloud/pubsub version: 0.14.2

Steps to reproduce

  1. Spin up nodejs pubsub publisher to topic1 in GKE pod 1
  2. Spin up nodejs pubsub subscriber to subscription to topic1 in GKE pod 2
  3. Publish messages to topic1

I am facing an intermittent issue where pubsub messages are sitting in the queue and not being delivered to the subscriber in GKE pod 2. Only when I delete the GKE pod 2 subscriber and restart the pod does the message get delivered.

Copied from original issue: googleapis/google-cloud-node#2640

@stephenplusplus
Contributor Author

From @callmehiphop on October 1, 2017 15:50

We've seen a number of reports of messages not being delivered in k8s. I believe this issue is being investigated internally, although I do not know the status. @lukesneeringer have we heard any news regarding this?

@stephenplusplus
Contributor Author

From @eyalse on October 1, 2017 22:33

I'm suffering from the same issue at the moment :(

stephenplusplus added the api: pubsub and status: blocked (Resolving the issue is dependent on other work.) labels on Dec 13, 2017
@stephenplusplus
Contributor Author

From @ApeNox on October 2, 2017 8:5

Suffering from the same issue too. Please provide a fix, as these tools are used in production.

@stephenplusplus
Contributor Author

From @eyalse on October 2, 2017 9:26

@callmehiphop (@lukesneeringer) hey, any update? As mentioned, these tools (k8s and pubsub) are used in production.

@stephenplusplus
Contributor Author

From @callmehiphop on October 2, 2017 14:54

I don't have any official updates, but a new patch release was made this morning that might resolve the issues you're seeing.

@stephenplusplus
Contributor Author

From @ShahNewazKhan on October 2, 2017 20:25

@callmehiphop I have done some preliminary testing with the google-cloud/pubsub patch version 0.14.3 released this morning, and it looks promising so far.

I have not been able to reproduce the issue yet, but I will need to run full end-to-end tests to confirm.

@stephenplusplus
Contributor Author

From @callmehiphop on October 2, 2017 20:34

@ShahNewazKhan that's great, please keep us posted! 😃

@stephenplusplus
Contributor Author

From @ShahNewazKhan on October 3, 2017 1:42

@callmehiphop I have been able to replicate the issue with google-cloud/pubsub patch 0.14.3 in a slightly different use case.

Environment details

  • OS: Debian GNU/Linux 8.9 (jessie) [K8s pod based on dockerfile gcr.io/google_appengine/base]
  • Node.js version: 6.11.3
  • npm version: 5.4.2
  • google-cloud/pubsub version: 0.14.3

Steps to reproduce

  1. Spin up nodejs pubsub publisher to topic1 in GKE pod 1
  2. Spin up nodejs pubsub subscriber to subscription to topic1 in GKE pod 2
  3. Reset GKE pod 1 [pubsub publisher app]
  4. Publish messages to topic1

At this point the message remains stuck in the pubsub queue until I reset the GKE pod 2 [pubsub subscriber app]

@stephenplusplus
Contributor Author

From @ShahNewazKhan on October 10, 2017 21:41

Just checking in for updates on this issue.

@stephenplusplus
Contributor Author

From @callmehiphop on October 11, 2017 18:27

@ShahNewazKhan We believe this is a GKE issue, and because of that I can't comment on whether it's being worked on or when it will be fixed. I'm really sorry for the inconvenience.

@stephenplusplus
Contributor Author

From @ehacke on October 16, 2017 1:25

We may be having similar issues, not sure. @ShahNewazKhan what version of GKE are you on?

@stephenplusplus
Contributor Author

From @ShahNewazKhan on October 16, 2017 2:10

@ehacke

GKE: 1.6.10-gke.1
Kubernetes: 1.5.6

@stephenplusplus
Contributor Author

From @kir-titievsky on October 26, 2017 18:23

Question for those who'd reported this: is there any chance you had no messages published or delivered for 10 minutes or longer before you started publishing and accumulating them in the backlog?

@stephenplusplus
Contributor Author

From @ShahNewazKhan on October 26, 2017 19:2

@kir-titievsky I can confirm that the published messages sit in the subscription queue only when the publisher has been inactive longer than 10 minutes.

@stephenplusplus
Contributor Author

From @kir-titievsky on October 26, 2017 19:47

Thanks @ShahNewazKhan . My guess here is this: by default, GCE suspends inactive connections after 10 minutes [1]. Since Pub/Sub relies on a persistent streamingPull connection, this connection would get suspended if no messages flow for 10 minutes. This condition was not properly detected by Pub/Sub. This was fixed as of 2017-10-20 by shutting down affected streamingPull connections. The server-initiated shutdown should now trigger the client library to rebuild the connection.

Can those of you affected check if the issue persists?

[1] https://cloud.google.com/compute/docs/troubleshooting#communicatewithinternet
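
If idle connections are the trigger, one client-side mitigation is to track when a message was last received and react before the 10-minute mark. The following is a minimal sketch, not part of the Pub/Sub client: `createIdleWatchdog`, `maxIdleMs`, `onIdle`, and the `restartSubscription` callback in the usage comment are all hypothetical names, and the 8-minute threshold is an assumption chosen only because it is below 10 minutes.

```javascript
// Hypothetical helper; none of these names come from @google-cloud/pubsub.
// It records when a message was last seen and calls onIdle once the
// subscriber has been quiet for at least maxIdleMs.
function createIdleWatchdog({ maxIdleMs, onIdle, now = Date.now }) {
  let lastSeen = now();
  return {
    // Call this from the subscription's 'message' handler.
    touch() {
      lastSeen = now();
    },
    // Call this periodically, e.g. from setInterval.
    // Returns true if onIdle was invoked.
    check() {
      const idleFor = now() - lastSeen;
      if (idleFor < maxIdleMs) return false;
      lastSeen = now(); // reset so onIdle does not fire on every check
      onIdle(idleFor);
      return true;
    },
  };
}

// Possible wiring (restartSubscription is a placeholder you would write):
// const watchdog = createIdleWatchdog({
//   maxIdleMs: 8 * 60 * 1000,
//   onIdle: restartSubscription,
// });
// subscription.on('message', (message) => { watchdog.touch(); /* ... */ });
// setInterval(() => watchdog.check(), 60 * 1000);
```

Note that a watchdog like this cannot tell a suspended connection from a topic that simply has no traffic, so it will also fire during genuinely quiet periods.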

@stephenplusplus
Contributor Author

From @ShahNewazKhan on October 31, 2017 22:56

@kir-titievsky Can you clarify what you mean by 'server-initiated shutdown'? Does this mean that the inactive Pub/Sub streamingPull connections are now being shut down instead of being suspended by GCE?

I still intermittently notice messages sitting in the queue. Do I have to update the Pub/Sub client to the latest version to handle the streamingPull connection rebuilds?

Thanks in advance!

@stephenplusplus
Contributor Author

I'm marking this as blocked, since it sounds like GKE is the party responsible for any progress on this. @callmehiphop does this sound right?

@stephenplusplus
Contributor Author

From @callmehiphop on November 27, 2017 15:37

@stephenplusplus I believe it does!

stephenplusplus changed the title from "Pubsub messages sit in queue until GKE pod with subscriber gets reset" to "messages sit in queue until GKE pod with subscriber gets reset" on Dec 13, 2017
@kir-titievsky

@ShahNewazKhan: Pub/Sub servers now close streamingPull connections regularly, with a timeout shorter than GCE's 10-minute limit. This allows the client library to quickly rebuild the connections, making sure that none are stuck in a suspended state.

@stephenplusplus
Contributor Author

Please let us know if there are still any issues. For now, this sounds like it's resolved.

@thomas-hilaire

Hello,

I continue to encounter this issue regularly. As in your case, my pod stops receiving messages after an inactive period. As soon as I restart the pod, all messages are delivered to the new pod instance.

Do I need to update anything to get the fix explained by @kir-titievsky? How can I investigate this issue?

Thanks!

Environment details:

  • google-cloud/pubsub: 0.15.0
  • node: 8.4
$ kubectl version

Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T23:15:59Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.5-gke.0", GitCommit:"2c2a807131fa8708abc92f3513fe167126c8cce5", GitTreeState:"clean", BuildDate:"2017-12-19T20:05:45Z", GoVersion:"go1.8.3b4", Compiler:"gc", Platform:"linux/amd64"}

@janisto

janisto commented Feb 7, 2018

We have experienced the same issue in the past few weeks with new deployments. Older deployments seem to work fine.

A bit annoying to restart the pods daily.

Environment details:

  • google-cloud/pubsub: 0.13.1
  • node: 6.11

@janisto

janisto commented Feb 8, 2018

I think this was caused by grpc 1.8.4. I added grpc 1.7.3 as a dependency and so far everything seems to be working fine.
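
For readers who want to try the same workaround, pinning grpc as a direct dependency could look roughly like the package.json fragment below. This is only a sketch: the pubsub version shown is the one from the environment details above, and whether the pinned grpc is the copy the client actually loads depends on how npm resolves the tree in your project.

```json
{
  "dependencies": {
    "@google-cloud/pubsub": "0.13.1",
    "grpc": "1.7.3"
  }
}
```

After reinstalling, `npm ls grpc` shows which grpc versions ended up in the dependency tree.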

danoscarmike added the release blocking (Required feature/issue must be fixed prior to next release.) and triaged for GA labels on Feb 8, 2018
@theacodes

Related: googleapis/google-cloud-python#4737

@domparry

I have also solved this issue by adding grpc 1.7.3 as a dependency.

andres-arana added a commit to GlobalFishingWatch/pipe-reports that referenced this issue Apr 11, 2018
As per
googleapis/nodejs-pubsub#11 (comment),
the timeout problems we are having on the reporting pipeline can be
solved by fixing grpc to 1.7.3. We are trying this out to see if it
fixes it on our case.
@Alexandredc

Hi, any news on this issue? I have encountered the same problem.

I moved from Google Compute Engine to Kubernetes. I have 2 pubsub topics. One is used frequently (i.e. many messages are pushed) and has no problem. The other is used less frequently (a message is pushed every 30 minutes), and after a few minutes pubsub stops receiving messages.

@callmehiphop
Contributor

@alexandreawe what version of the PubSub client are you experiencing these issues with?

@Alexandredc

@callmehiphop I'm using v0.18.0. I have fixed this issue by looping every 15 minutes and restarting the subscription.

@barrettc

I'm also seeing this behavior with the latest version of google pubsub as of this comment, 0.22.2. We have two apps running in GKE communicating with each other via pubsub, and the subscribers just stop receiving messages until the pods are restarted. At this point I guess I'm looking at looping and restarting the subscription every 15 minutes as described above, but this feels very hacky.

@dinvlad

dinvlad commented Feb 7, 2019

I'd like to chime in and point out that we're experiencing a very similar issue in our on-prem service, where streaming pull connections stop "restarting every 10 min" after a few times (typically in about 30 min). That seems to be correlated with modAcks/acks not working as expected (i.e. all acks are indicated as "expired" after streaming pulls stop). This is described in #314 (comment)

Additionally, we've experienced a very similar issue with the Java client on GKE a few months ago, and it was resolved as a "server-side" problem on behalf of Google. Is there a chance we're seeing it pop up again here?

EDIT: It appears that only versions 0.23.0 and up are affected. With 0.22.2, only a fraction of acks turns up "expired", and streaming pulls don't stop.

sduskis added the priority: p1 (Important issue which blocks shipping the next release. Will be fixed prior to next release.) label and removed the priority: p2 (Moderately-important priority. Fix may not be included in next release.) label on Apr 16, 2019
@sduskis

sduskis commented Apr 16, 2019

@plamut, this is a similar problem to the one you're seeing on Python.

@camsjams

camsjams commented Jul 10, 2019

@alexandreawe @barrettc
Just for my sake and perhaps other lurkers, how are you restarting your subscriptions?

Are you doing something straightforward like close and then open within a setInterval like so:

// assuming `pubsub` is an already-initialized Pub/Sub client
let topic = pubsub.topic('dogs');

// get the subscription from the topic defined above
let subscription = topic.subscription('my-subscription');
setInterval(async () => {
    await subscription.close();
    subscription.open();
}, 15 * 60 * 1000);

I would love to hear about your experiences ~six months later and if this has been reliable enough.

Thanks!
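
One gap in a bare `setInterval` loop like the one above is that a failed or slow `close()` can cause an unhandled rejection or overlap with the next tick. The sketch below is one way to guard against that. It assumes only what the snippet above assumes, namely a `subscription` object with `close()` (returning a promise) and `open()`. `createRestarter` and its `log` parameter are hypothetical names, not part of the client library.

```javascript
// Sketch only: wraps the close/open cycle so that restarts never overlap
// and a failure is logged instead of surfacing as an unhandled rejection.
function createRestarter(subscription, log = console.error) {
  let restarting = false;
  return async function restart() {
    if (restarting) return false; // previous restart still in progress
    restarting = true;
    try {
      await subscription.close();
      subscription.open();
      return true;
    } catch (err) {
      log('subscription restart failed:', err);
      return false;
    } finally {
      restarting = false;
    }
  };
}

// Possible wiring, using the 15-minute interval mentioned in this thread:
// const restart = createRestarter(subscription);
// const timer = setInterval(restart, 15 * 60 * 1000);
// clearInterval(timer); // on shutdown
```

If `close()` fails, the subscription may be left in an unknown state, so a real deployment would probably also want an alert on the logged error.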

@barrettc

@camsjams For our particular issue, it turns out we had a code path in which messages were not being properly acknowledged, and the buildup of those messages created the perception that the subscriber had stopped working. In other words, we had a stupid programmer error. I had forgotten about the comment I wrote back in December searching for answers, but ultimately we did not have the problem described in this issue.

@camsjams

Thank you for your response. In our code we have a thin wrapper mediating PubSub activity, so it "always" acknowledges.
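
For anyone checking their own code for the unacknowledged-message path described above, a thin wrapper of the kind mentioned here might look like the following. This is a guess at the general shape, not the actual wrapper used by anyone in this thread. It assumes received messages expose `ack()` and `nack()`, and `withGuaranteedAck` is a hypothetical name.

```javascript
// Sketch only: ensures every message is either acked (handler succeeded)
// or nacked (handler threw or rejected), so no code path leaves a message
// outstanding.
function withGuaranteedAck(handler, log = console.error) {
  return async function onMessage(message) {
    try {
      await handler(message);
      message.ack();
    } catch (err) {
      log('handler failed, nacking for redelivery:', err);
      message.nack();
    }
  };
}

// Possible wiring:
// subscription.on('message', withGuaranteedAck(async (message) => {
//   /* process message.data */
// }));
```

Nacking on failure means the message is redelivered, so a handler that always fails on a given message will see it repeatedly.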

google-cloud-label-sync bot added the api: pubsub (Issues related to the googleapis/nodejs-pubsub API.) label on Jan 31, 2020