Frequent gRPC StatusCode.UNAVAILABLE errors #2683
@forsberg As you can see from the stack trace, this comes from
@nathanielmanistaatgoogle I can consistently reproduce an UNAVAILABLE error when a connection goes stale. Is there any way to avoid this, short of retrying on failures?
That library installation seems to have gone horribly wrong. We had an earlier version using the grpc.beta interface, and I suspect something went wrong when we installed this into the same virtualenv. Will investigate that on Monday.
@forsberg
@nathanielmanistaatgoogle See #2693 and #2699. What is the recommended way to deal with this for stale connections?
Small update: we have fixed our borked google-cloud-python install to actually use e1fbb6bc, but we're still seeing roughly the same rate of UNAVAILABLE errors; retrying always succeeds on the first attempt.
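Since the transient UNAVAILABLE here always succeeds on the next attempt, a bounded retry around the failing call is the usual stopgap. Below is a minimal sketch; the `Unavailable` class and `flaky_publish` function are stand-ins for illustration only (in real code you would catch `grpc.RpcError` and check that its status code is `StatusCode.UNAVAILABLE`):

```python
import time


class Unavailable(Exception):
    """Stand-in for an RPC error carrying StatusCode.UNAVAILABLE."""


def retry_unavailable(call, attempts=3, delay=0.0):
    """Invoke `call`, retrying only on Unavailable, up to `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Unavailable:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            time.sleep(delay)


# Example: a flaky call that fails once (stale connection), then succeeds.
state = {"calls": 0}

def flaky_publish():
    state["calls"] += 1
    if state["calls"] < 2:
        raise Unavailable("channel went stale")
    return "message-id-1"

result = retry_unavailable(flaky_publish, attempts=3)
```

A production version would use growing delays between attempts rather than a fixed one.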
@nathanielmanistaatgoogle Bump /cc @geigerj This is the issue I was referring to about GAPIC retry strategies |
@dhermes You can configure this on the GAPIC layer; see the comment here for details. It actually looks like we already retry by default on
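For reference, the generated (GAPIC) clients of that era read their retry behavior from a client-config dictionary shipped alongside the client. The sketch below shows the general shape of such a config; the method, group, and parameter names here are illustrative, not copied from the actual pubsub client config:

```python
# Illustrative shape of a GAPIC client config (names approximate).
PUBLISHER_CLIENT_CONFIG = {
    "interfaces": {
        "google.pubsub.v1.Publisher": {
            "retry_codes": {
                # Status codes in this group are retried automatically.
                "idempotent": ["DEADLINE_EXCEEDED", "UNAVAILABLE"],
                "non_idempotent": [],
            },
            "retry_params": {
                "default": {
                    "initial_retry_delay_millis": 100,
                    "retry_delay_multiplier": 1.3,
                    "max_retry_delay_millis": 60000,
                    "initial_rpc_timeout_millis": 60000,
                    "rpc_timeout_multiplier": 1.0,
                    "max_rpc_timeout_millis": 60000,
                    "total_timeout_millis": 600000,
                },
            },
            "methods": {
                "Publish": {
                    "timeout_millis": 60000,
                    "retry_codes_name": "idempotent",
                    "retry_params_name": "default",
                },
            },
        }
    }
}

codes = (PUBLISHER_CLIENT_CONFIG["interfaces"]
         ["google.pubsub.v1.Publisher"]["retry_codes"]["idempotent"])
```

The point of interest for this thread is whether UNAVAILABLE appears in the retry-codes group that the Publish method is wired to.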
@geigerj I believe the correct link for "retry by default on UNAVAILABLE for Pub/Sub Publish" is this one, because the one you provided no longer works. We're using:
What seems strange is that the default retry you mentioned doesn't seem to work, and I have checked that the file
Updated:
Updated 2:
when using a stale connection. Retrying fixes the issue. Package versions:
(google-cloud-pubsub installed from Git master)
Just wanted to add to the above that disabling gRPC "fixed" the issue.
@dhermes: does that code of yours hit a particular host? If so, are you able to reproduce the problem against any other host? If you're able to hit that host with unauthenticated RPCs, are you able to reproduce the defect in the absence of authentication? If you are able to observe the traffic at a low level (with Wireshark or something like it), is there anything obviously the matter? Obviously the expected behavior is that if you hold a
@dhermes: when I run this code of yours I get
I think I'm seeing a similar issue. My goal is to keep a connection to speech.googleapis.com open at all times, so that whenever a user wants to say something, they can enter a 'y' through the terminal and then speak instantly; otherwise, establishing the connection takes about 4 seconds on our architecture. However, it seems that the connection closes after a while. Would this issue be the cause? I have taken Google's streaming Python example code and modified it for my purposes.
@dakrawczyk I think you might be running into the 1-minute limit for streaming. See: https://cloud.google.com/speech/limits#content You said...
Do you know if that connection overhead is on the
@daspecster I don't think it's the 1-minute limit for streaming. I know what you're talking about, but I'm not actually streaming until the user enters 'y' and the record/request/response streams are started and used. My understanding is that I'm only creating the channel to begin with, and that doesn't count against the 1-minute timeout. Also, I only end up streaming for about 10 seconds at a time. I am building an embedded system using a Samsung ARTIK 710 running Debian.
On my MacBook it is basically instant. When it runs on my embedded architecture it takes about 4 seconds.
Ok, good to know! I just realized that your code is actually not using this library. You're using the gRPC library directly. If you want you can ping me on https://googlecloud-community.slack.com. I've spent some time in Speech so I might be able to help get you going. |
@dakrawczyk Hello! Looks like you're using the sample that I wrote, so I'm here to take blame / responsibility ^_^; From the symptoms you describe (the error happens over a time span longer than the streaming limits, the stream is restarted a lot, the auth thing mentioned above), my guess is that the access token is expiring. If that's the case, you might be able to fix this by modifying the sample to refresh the token. Let me know how that works. I haven't looked at the google-cloud-python code, but perhaps it's a similar issue?
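If the expiring token is indeed the cause, one way to keep a channel "warm" without hitting the expiry is to rebuild it shortly before the token lifetime (about an hour for Google OAuth2 access tokens) runs out. This is a generic sketch, not the sample's actual code; `make_channel` is a hypothetical factory that, in real use, would wrap `grpc.secure_channel` (or the google-auth equivalent) with fresh credentials:

```python
import time


class RefreshingChannel(object):
    """Hold a long-lived channel, rebuilding it once it is older than a TTL.

    `make_channel` is a hypothetical, injected factory; `clock` is injectable
    so the behavior can be exercised without real waiting.
    """

    def __init__(self, make_channel, ttl_seconds=3000, clock=time.time):
        self._make_channel = make_channel
        self._ttl = ttl_seconds
        self._clock = clock
        self._channel = None
        self._created_at = None

    def get(self):
        now = self._clock()
        if self._channel is None or now - self._created_at >= self._ttl:
            # Rebuild before the access token backing the channel expires.
            self._channel = self._make_channel()
            self._created_at = now
        return self._channel


# Demo with a fake clock and a counting factory.
fake_now = [0.0]
made = []

def make_channel():
    made.append(object())
    return made[-1]

chan = RefreshingChannel(make_channel, ttl_seconds=3000,
                         clock=lambda: fake_now[0])
first = chan.get()     # creates the channel
fake_now[0] = 100.0
second = chan.get()    # still fresh: same channel object
fake_now[0] = 3500.0
third = chan.get()     # past the TTL: rebuilt
```

Rebuilding proactively (e.g. from a background thread) avoids paying the multi-second connection setup on the user-facing path.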
Also - the sample has since been updated to use the google-auth package, which should also fix that issue. |
@jerjou Thank you! Trying this out now :] |
@jerjou I've updated to the newer sample code that uses the google-auth package. I am still under the impression that the channel closes after a certain amount of time, probably because the token's validity expires. Here is the error I receive when trying to send/receive data on the channel after some time, once the channel has closed.
A couple of questions:
I can't reopen the channel when the user wants to record, because it has a few seconds of delay and that's not the experience we're going for.
@nathanielmanistaatgoogle I already gave a deterministic reproduction. I am happy to chat with you off the thread about how to set up the credentials needed for this, or we could work together (I'll need your expertise) to create a gRPC service that doesn't require auth to accomplish the same goal.
@lukesneeringer @dhermes The issue that @nathanielmanistaatgoogle referenced (grpc/grpc#11043) was fixed on June 8. Is this still an issue?
@bjwatson Checking right now
The example still fails:
This fails in Python 2.7 with
@nathanielmanistaatgoogle, it looks like the gRPC fix was insufficient for this issue. Do you have any insight into what else might be going wrong? FYI @lukesneeringer
Hello! :-) In this case, I have been in the process of making a radical update to the PubSub library (#3637) to add significant performance improvements and a new surface, which we hope to launch soon. As such, I am clearing out issues on the old library. It is my sincere goal to do a better job of being on top of issues in the future. As the preceding paragraph implies, I am closing this issue. If the revamped library does not solve your issue, however, please feel free to reopen. Thanks!
@lukesneeringer From my recollection, this wasn't Pub/Sub-specific?
That's correct, I even reproduced it with
Yeah, I was firing through everything with an
I am removing all the "api: X" labels from this issue since issue automation is coming. Although really this should just be moved to
Reproduced the Bigtable issue with
The error looks like:
Hi guys, I'm using the following packages:
and I'm running on a Linux machine (Ubuntu 16.04.3 LTS). Exception:
I just got this error while running a job on Google ML Engine.
Honestly, I'm not sure how to solve this other than retrying the job. Edit: rerunning does not seem to help; I keep getting the same error.
Hi, I'm getting this error on a Pub/Sub consumer. I managed to get a "not so pretty" workaround, using a policy like this that replicates the deadline_exceeded handling in google.cloud.pubsub_v1.subscriber.policy.thread.Policy.on_exception.
In the receive-message function I have code like
The problem is that when the resource is truly UNAVAILABLE, we will not be aware of it. UPDATE: As noted here by @makrusak and here by @rclough, this hack causes high CPU usage, leaving your consumer practically useless (available only intermittently). So basically this trades one problem for another: your consumer does not die, but you will have to restart the worker that executes it often.
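To avoid the busy-retry loop that pegs the CPU, the standard fix is exponential backoff with a cap (often plus jitter), sleeping between reconnection attempts instead of retrying immediately. A stdlib-only sketch of the delay schedule (the function name and defaults are illustrative, not part of the library):

```python
import random


def backoff_delays(initial=0.1, multiplier=2.0, maximum=30.0, jitter=False):
    """Yield retry delays: initial * multiplier**n, capped at `maximum`.

    With jitter=True, each delay is drawn uniformly from [0, current_delay],
    which spreads reconnection attempts across clients.
    """
    delay = initial
    while True:
        yield random.uniform(0, delay) if jitter else delay
        delay = min(delay * multiplier, maximum)


delays = backoff_delays(initial=0.5, multiplier=2.0, maximum=8.0)
first_five = [next(delays) for _ in range(5)]
# first_five == [0.5, 1.0, 2.0, 4.0, 8.0]
```

The consumer would `time.sleep(next(delays))` before each reconnection attempt and reset the generator after a successful pull, so a truly UNAVAILABLE resource costs at most one cheap probe every `maximum` seconds instead of a hot loop.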
I might be getting a similar problem on Spanner, trying to read ranges with an index. I will need to test whether it's my code or not.
I think that with all the work @dhermes did on pubsub, this should be resolved. I'm going to go ahead and close this, but if it's still reproducible with the latest version we can re-open.
Using the current codebase from master branch (e1fbb6b), with GRPC, we sometimes (0.5% of requests, approximately) see the following exception:
Retrying this seems to always succeed.
Should application code have to care about this kind of error and retry? Or is this a bug in google-cloud-pubsub code?
Package versions installed:
Note: Everything google-cloud* comes from git master.
This is on Python 2.7.3
Traceback: