Frequent gRPC StatusCode.UNAVAILABLE errors #2683

Closed
forsberg opened this issue Nov 4, 2016 · 61 comments

Labels: grpc, priority: p2 (Moderately-important priority. Fix may not be included in next release.), type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

@forsberg commented Nov 4, 2016

Using the current codebase from the master branch (e1fbb6b) with gRPC, we sometimes (approximately 0.5% of requests) see the following exception:

 AbortionError(code=StatusCode.UNAVAILABLE, details="{"created":"@1478255129.468798425","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":157,"grpc_status":14,"referenced_errors":[{"created":"@1478255129.468756939","description":"EOF","file":"src/core/lib/iomgr/tcp_posix.c","file_line":235}]}"))

Retrying this seems to always succeed.

Should application code have to care about this kind of error and retry? Or is this a bug in google-cloud-pubsub code?

Package versions installed:

gapic-google-logging-v2==0.10.1
gapic-google-pubsub-v1==0.10.1
google-api-python-client==1.5.4
google-cloud==0.20.0
google-cloud-bigquery==0.20.0
google-cloud-bigtable==0.20.0
google-cloud-core==0.20.0
google-cloud-datastore==0.20.1
google-cloud-dns==0.20.0
google-cloud-error-reporting==0.20.0
google-cloud-language==0.20.0
google-cloud-logging==0.20.0
google-cloud-monitoring==0.20.0
google-cloud-pubsub==0.20.0
google-cloud-resource-manager==0.20.0
google-cloud-storage==0.20.0
google-cloud-translate==0.20.0
google-cloud-vision==0.20.0
google-gax==0.14.1
googleapis-common-protos==1.3.5
grpc-google-iam-v1==0.10.1
grpc-google-logging-v2==0.10.1
grpc-google-pubsub-v1==0.10.1
grpcio==1.0.0

Note: Everything google-cloud* comes from git master.

This is on Python 2.7.3

Traceback:

  File "ospdatasubmit/pubsub.py", line 308, in _flush
    publish_response = self.pubsub_client.Publish(publish_request, self._publish_timeout)
  File "grpc/beta/_client_adaptations.py", line 305, in __call__
    self._request_serializer, self._response_deserializer)
  File "grpc/beta/_client_adaptations.py", line 203, in _blocking_unary_unary
    raise _abortion_error(rpc_error_call)
@dhermes (Contributor) commented Nov 4, 2016

@forsberg As you can see from the stack trace, this comes from grpc.beta (the beta interface). We haven't used the beta interface for some time. How are you installing the library?

@dhermes (Contributor) commented Nov 4, 2016

@nathanielmanistaatgoogle I can consistently reproduce an Unavailable error when a connection goes stale. Is there any way to avoid this, short of retrying on failures?

@forsberg (Author) commented Nov 4, 2016

That library installation seems to have gone horribly wrong. We had an earlier version using the grpc.beta interface, and I guess something went wrong when installing this one into the same virtualenv. Will investigate that on Monday.

@dhermes (Contributor) commented Nov 4, 2016

@forsberg grpc 1.0 still supports the beta interface, but none of google-cloud-python (or its dependencies) use that interface any longer. So it'd be your google-cloud-python install that's b0rked rather than your grpc install.

@forsberg changed the title from "AbortionError(code=StatusCode.UNAVAILABLE, ..." to "Frequent gRPC StatusCode.UNAVAILABLE errors" Nov 7, 2016
@forsberg added the api: datastore and api: logging labels Nov 7, 2016
@dhermes (Contributor) commented Nov 7, 2016

@nathanielmanistaatgoogle See #2693 and #2699. What is the recommended way to deal with this for stale connections?

@forsberg (Author)

Small update: We have fixed our borked google-cloud-python install to actually use e1fbb6bc, but we're still seeing roughly the same number of UNAVAILABLE errors; retrying always works on the first attempt.

@dhermes (Contributor) commented Nov 17, 2016

@nathanielmanistaatgoogle Bump

@dhermes (Contributor) commented Nov 28, 2016

@nathanielmanistaatgoogle Bump

/cc @geigerj This is the issue I was referring to about GAPIC retry strategies

@geigerj (Contributor) commented Nov 28, 2016

@dhermes You can configure this on the GAPIC layer, see comment here for details. It actually looks like we already retry by default on UNAVAILABLE for Pub/Sub Publish, but you can override the default settings to extend the timeout on retry if that's the issue?
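
For reference, here is a rough sketch of building per-call retry settings with google-gax. The BackoffSettings numbers below are illustrative (not the library defaults), and the exact generated client module/class name varies across the gapic pubsub packages in this thread, so the final Publish call is shown only as a comment:

import grpc
from google.gax import BackoffSettings, CallOptions, RetryOptions

# Retry on UNAVAILABLE with exponential backoff, for up to 60s overall.
retry = RetryOptions(
    retry_codes=[grpc.StatusCode.UNAVAILABLE],
    backoff_settings=BackoffSettings(
        initial_retry_delay_millis=100,
        retry_delay_multiplier=1.3,
        max_retry_delay_millis=10000,
        initial_rpc_timeout_millis=5000,
        rpc_timeout_multiplier=1.0,
        max_rpc_timeout_millis=30000,
        total_timeout_millis=60000,
    ),
)
options = CallOptions(retry=retry)

# Pass `options` to the generated Publish call, e.g. (module/class names
# depend on the installed gapic pubsub package version):
#   publisher_api.publish(topic, messages, options=options)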

@jgoclawski commented Dec 3, 2016

@geigerj I believe the correct link to "retry by default on UNAVAILABLE for Pub/Sub Publish" is this one, because the one you've provided no longer works.
The bug affects us too. When the connection is stale, we get exactly the same error as here (grpc._channel._Rendezvous, but caused by a Pub/Sub publish action). Once we retry, it works.

We're using:

gapic-google-pubsub-v1==0.11.1
google-cloud-core==0.21.0
google-cloud-pubsub==0.21.0
google-gax==0.15.0
googleapis-common-protos==1.5.0
grpc-google-iam-v1==0.11.1
grpc-google-pubsub-v1==0.11.1

What seems strange is that the default retry you mentioned doesn't seem to work. And I have checked that the file publisher_client_config.json is present, with correct values. I get the unhandled exception much sooner than 60s, almost immediately (haven't measured it precisely).

Updated:
It seems that I was wrong about the dependency versions, but publisher_client_config.json is the same. I don't think it will change anything, but I will switch to the newest versions and report back.
Actual versions:

gapic-google-pubsub-v1==0.10.1
google-cloud-core==0.21.0
google-cloud-pubsub==0.21.0
google-gax==0.14.1
googleapis-common-protos==1.5.0
grpc-google-iam-v1==0.10.1
grpc-google-pubsub-v1==0.10.1
grpcio==1.0.1

Updated 2:
Newest versions do not fix this issue.
I'm still getting

GaxError(RPC failed, caused by <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, {"created":"@1480777925.720435842","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":157,"grpc_status":14,"referenced_errors":[{"created":"@1480777925.720399286","description":"EOF","file":"src/core/lib/iomgr/tcp_posix.c","file_line":235}]})>)

when using a stale connection. Retrying fixes the issue.

Package versions:

gapic-google-cloud-pubsub-v1==0.14.0
google-cloud-core==0.21.0
git+https://github.com/GoogleCloudPlatform/google-cloud-python.git@a02ac500548cf9fc37f4d81033e696a0efb53f99#egg=google-cloud-pubsub&subdirectory=pubsub
google-gax==0.15.0
googleapis-common-protos==1.5.0
grpc-google-cloud-pubsub-v1==0.14.0
grpc-google-iam-v1==0.11.1
grpcio==1.0.1

(google-cloud-pubsub installed from Git master)

@jgoclawski

Just wanted to add to the above that disabling gRPC "fixed" the issue ($ export GOOGLE_CLOUD_DISABLE_GRPC=true): there's no need for manual retrying and there are no errors with stale connections.
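
For completeness, a minimal sketch of that workaround from Python, under the assumption that the flag is read at import time and the (old, pre-0.28) google.cloud.pubsub Client/Topic surface is installed; the topic name is a placeholder:

import os

# Assumption: the library checks this flag when the module is imported,
# so set it before importing google.cloud.pubsub.
os.environ['GOOGLE_CLOUD_DISABLE_GRPC'] = 'true'

from google.cloud import pubsub

client = pubsub.Client()
topic = client.topic('my-topic')  # placeholder topic name
topic.publish(b'published over the HTTP/JSON fallback')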

@nathanielmanistaatgoogle

@dhermes: does that code of yours hit a particular host? If so, are you able to reproduce the problem against any other host? If you're able to hit that host with unauthenticated RPCs, are you able to reproduce the defect in the absence of authentication? If you are able to observe the traffic at a low level (with Wireshark or something like it), is there anything obviously the matter? Obviously the expected behavior is that if you hold a grpc.Channel and don't use it for a matter of minutes, it should still be able to make RPCs. They may take slightly longer if the underlying TCP connection has been taken down, but they shouldn't fail and then immediately succeed when reattempted.

@nathanielmanistaatgoogle

@dhermes: when I run this code of yours I get StatusCode.PERMISSION_DENIED, so there's more to the reproduction of the problem than merely running that, right? Something has to happen server-side? Possibly something else has to happen locally?

@dakrawczyk

I think I'm seeing a similar issue.

My goal is to have a connection to speech.googleapis.com always open so that whenever a user wants to say something, they can enter a 'y' through the terminal and then speak instantly. Otherwise, establishing the connection seems to take about 4 seconds on our architecture.

However, it seems that the connection closes after a while. Would this issue be the cause?

I have taken google's streaming python example code and modified it for my purposes.

def main():
    # Open channel to Google Speech and keep it open indefinitely.
    with cloud_speech.beta_create_Speech_stub(
            make_channel('speech.googleapis.com', 443)) as service:
        answer = ""
        while True:

            # If we're not retrying from a failed attempt, 
            # wait for the user to send 'y' to start recording
            if answer != "retry":
                answer = raw_input("Do you want to record? y/n: ")
            #  pass through raw_input block
            #  in an attempt to retry the streaming
            #  request.
            else:
                answer = "y"

            if answer == "y":
                print("Recieved the Y")

                # For streaming audio from the microphone, there are three threads.
                # First, a thread that collects audio data as it comes in
                with record_audio(RATE, CHUNK) as buffered_audio_data:
                    # Second, a thread that sends requests with that data
                    requests = request_stream(buffered_audio_data, RATE)
                    # Third, a thread that listens for transcription responses
                    recognize_stream = service.StreamingRecognize(
                        requests, DEADLINE_SECS)

                    try:
                        listen_print_loop(recognize_stream)
                        recognize_stream.cancel()

                    except face.CancellationError:
                        pass
                    except face.AbortionError:
                        print("ABORTION ERROR RECEIVED")
                        answer = "retry"

@daspecster (Contributor)

@dakrawczyk I think you might be running into the 1 minute limit for streaming.

See: https://cloud.google.com/speech/limits#content

You said...

establishing the connection seems to take about 4 seconds on our architecture.

Do you know if that connection overhead is on the google-cloud-python side or has something to do with your architecture?

@dakrawczyk

@daspecster I don't think it's the 1 minute limit for streaming. I do know what you're talking about, but I'm not actually streaming until the user enters 'y' and the record/request/response streams are started and used. My understanding is that I'm only creating the channel to begin with, and that doesn't count against the 1 minute timeout. Also, I only end up streaming for about 10 seconds at a time.

I am building an embedded system using a Samsung ARTIK 710 running Debian.
When I run

     with cloud_speech.beta_create_Speech_stub(
            make_channel('speech.googleapis.com', 443)) as service:

on my MacBook it is basically instant.

When it runs on my embedded architecture it takes about 4 seconds.

@daspecster (Contributor)

Ok, good to know!

I just realized that your code is actually not using this library; you're using the gRPC library directly.
I don't know that this is the best issue to discuss it in, as it might get kind of drawn out.

If you want you can ping me on https://googlecloud-community.slack.com. I've spent some time in Speech so I might be able to help get you going.

@jerjou (Contributor) commented Dec 16, 2016

@dakrawczyk Hello! Looks like you're using the sample that I wrote, so I'm here to take blame / responsibility ^_^;

From the symptoms you describe (the error happens in a time span greater than the streaming limits, re-starting the stream a lot, some auth thing mentioned above), my guess is that the access token is expiring.

The make_channel function grabs an access token the first time it's run (i.e. when it creates the channel), but doesn't refresh it, so if you keep the channel open long enough, the access token's validity period expires. I imagine each time you start a new stream, grpc re-sends the auth headers, so eventually you'll start getting auth errors.

If that's the case, you might be able to fix this by modifying make_channel so that it calls get_access_token() on the fly, instead of just getting it upon channel creation.

Let me know how that works. I haven't looked at the google-cloud-python code, but perhaps it's a similar issue?

@jerjou (Contributor) commented Dec 16, 2016

Also - the sample has since been updated to use the google-auth package, which should also fix that issue.
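
For anyone adapting an older copy of the sample, a minimal sketch of a make_channel built on google-auth, so the token is attached (and refreshed) per call rather than fetched once at channel creation; the cloud-platform scope here is an assumption:

import google.auth
import google.auth.transport.grpc
import google.auth.transport.requests


def make_channel(host, port):
    """Create a gRPC channel whose credentials are refreshed automatically."""
    # Application Default Credentials; cloud-platform scope assumed here.
    credentials, _ = google.auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform'])
    http_request = google.auth.transport.requests.Request()
    target = '{}:{}'.format(host, port)
    # The auth metadata plugin inside this helper refreshes the access
    # token when it expires, instead of baking in a single token.
    return google.auth.transport.grpc.secure_authorized_channel(
        credentials, http_request, target)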

@dakrawczyk

@jerjou Thank you! Trying this out now :]

@dakrawczyk

@jerjou I've updated to the newer sample code that uses the google-auth package.

I am still under the impression that the channel closes after a certain amount of time, probably due to the token's validity expiring.

Here is the error I receive when trying to send/receive data after some time, once the channel has closed.

Traceback (most recent call last):
  File "transcribe_streaming.py", line 323, in <module>
    main()
  File "transcribe_streaming.py", line 310, in main
    listen_print_loop(recognize_stream)
  File "transcribe_streaming.py", line 239, in listen_print_loop
    for resp in recognize_stream:
  File "/usr/local/lib/python2.7/site-packages/grpc/_channel.py", line 344, in next
    return self._next()
  File "/usr/local/lib/python2.7/site-packages/grpc/_channel.py", line 335, in _next
    raise self
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, {"created":"@1482091441.761674000","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":157,"grpc_status":14,"referenced_errors":[{"created":"@1482091441.761631000","description":"EOF","file":"src/core/lib/iomgr/tcp_posix.c","file_line":235}]})>

A couple of questions:

  • Is there a way for me to poll the status of the channel and if I detect that it has closed, reopen it?
  • Or/Also a way to keep the channel open indefinitely?

I can't reopen the channel when the user wants to record, because reopening adds a few seconds of delay and that's not the experience we're going for.
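
On the first question, here is a small sketch of one way to poll readiness and rebuild the channel when it has gone stale, assuming the public grpc.channel_ready_future helper and the sample's own make_channel function:

import grpc


def ensure_channel(channel, host='speech.googleapis.com', port=443, timeout=3):
    """Return a ready channel, recreating it if the old one has gone stale."""
    try:
        # Blocks until the channel is connected, or raises on timeout.
        grpc.channel_ready_future(channel).result(timeout=timeout)
        return channel
    except grpc.FutureTimeoutError:
        # Could not (re)connect in time: build a fresh channel instead.
        return make_channel(host, port)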

@dhermes (Contributor) commented May 9, 2017

@nathanielmanistaatgoogle I already gave a deterministic reproduction. I am happy to chat with you off the thread about how to set up the credentials needed for this or we could work together (I'll need your expertise) to create a gRPC service that doesn't require auth to accomplish the same goal.

@bjwatson

@lukesneeringer @dhermes The issue that @nathanielmanistaatgoogle referenced (grpc/grpc#11043) was fixed on June 8. Is this still an issue?

@dhermes (Contributor) commented Jul 28, 2017

@bjwatson Checking right now

@dhermes (Contributor) commented Jul 28, 2017

The example still fails:

Listing all instances:
Traceback (most recent call last):
  File "bt_unavailable.py", line 30, in <module>
    main()
  File "bt_unavailable.py", line 26, in main
    list_em(client)
  File "bt_unavailable.py", line 8, in list_em
    instances, failed_locations = client.list_instances()
  File ".../google/cloud/bigtable/client.py", line 375, in list_instances
    response = self._instance_stub.ListInstances(request_pb)
  File ".../grpc/_channel.py", line 507, in __call__
    return _end_unary_response_blocking(state, call, False, deadline)
  File ".../grpc/_channel.py", line 455, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, Endpoint read failed)>

This fails in Python 2.7 with grpcio==1.4.0

@bjwatson

@nathanielmanistaatgoogle, looks like the gRPC fix was insufficient for this issue. Do you have any insight into what else might be going wrong?

FYI @lukesneeringer

@lukesneeringer (Contributor)

Hello! :-)
First, a mea culpa; I have not done as good of a job at keeping up with issues as I should have. If you are getting this (I admit it) cut and paste, it is likely because your issue sat for too long.

In this case, I have been in the process of making a radical update to the PubSub library (#3637) to add significant performance improvements and a new surface, which we hope to launch soon. As such, I am clearing out issues on the old library. It is my sincere goal to do a better job of being on top of issues in the future.

As the preceding paragraph implies, I am closing this issue. If the revamped library does not solve your issue, however, please feel free to reopen.

Thanks!

@geigerj reopened this Aug 8, 2017
@geigerj (Contributor) commented Aug 8, 2017

@lukesneeringer from my recollection, this wasn't Pub/Sub-specific?

@dhermes (Contributor) commented Aug 8, 2017

That's correct, I even reproduced it with bigtable 12 days ago.

@lukesneeringer (Contributor)

Yeah, I was firing through everything with an api: pubsub label. Thanks for reopening.

@lukesneeringer removed the api: datastore, api: logging, and api: pubsub labels Aug 8, 2017
@lukesneeringer (Contributor)

I am removing all the "api: X" labels from this issue since issue automation is coming. The grpc label is the appropriate one for tracking.

Although really this should just be moved to grpc.

@sx5640 commented Sep 18, 2017

Reproduced the Bigtable issue with google-cloud-bigtable==0.26.0 during:

def write_row(key, column_id):
    """
    Utility method for writing a row to Bigtable.
    Note that we don't store cell values - the column IDs themselves carry the data,
    so the cell value is always just an empty string.
    :param key: Key of the row to write
    :param column_id: This is the actual value we want stored.
    :return: None
    """
    row = db.table.row(key)
    row.set_cell(conf.column_family_id, column_id, '')
    row.commit()

The error looks like:

_Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, Endpoint read failed)>
at _end_unary_response_blocking (/env/local/lib/python2.7/site-packages/grpc/_channel.py:455)
at __call__ (/env/local/lib/python2.7/site-packages/grpc/_channel.py:507)
at commit (/env/local/lib/python2.7/site-packages/google/cloud/bigtable/row.py:417)
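
Until there's a real fix, here is a rough retry wrapper that callers of write_row could use. It assumes the error surfaces as a grpc.RpcError with a .code() method (as the _Rendezvous above does):

import functools
import time

import grpc


def retry_unavailable(max_attempts=3, base_delay=1.0):
    """Retry a call when gRPC reports UNAVAILABLE (e.g. a stale connection)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except grpc.RpcError as exc:
                    unavailable = exc.code() == grpc.StatusCode.UNAVAILABLE
                    if not unavailable or attempt == max_attempts - 1:
                        raise
                    # Back off a little before trying again.
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator


@retry_unavailable()
def write_row_with_retry(key, column_id):
    write_row(key, column_id)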

@ericbbraga

Hi guys,
I'm also facing this issue, but with a Pub/Sub subscription.
We started a subscription for one topic yesterday and everything was working fine.
However, today we saw this exception on the console.

I'm using the following packages:

google-auth==1.1.1
google-cloud-bigquery==0.27.0
google-cloud-core==0.27.1
google-cloud-logging==1.3.0
google-cloud-pubsub==0.28.3
google-cloud-storage==1.4.0
google-gax==0.15.15
google-resumable-media==0.2.3
googleapis-common-protos==1.5.3
grpc-google-iam-v1==0.11.3
proto-google-cloud-logging-v2==0.91.3

and I'm running on a Linux machine (Ubuntu 16.04.3 LTS).

Exception:

Exception in thread Consumer helper: consume bidirectional stream:
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/user/anaconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/anaconda3/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 248, in _blocking_consume
    self._policy.on_exception(exc)
  File "/home/user/anaconda3/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/policy/thread.py", line 135, in on_exception
    raise exception
  File "/home/user/anaconda3/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 234, in _blocking_consume
    for response in response_generator:
  File "/home/user/anaconda3/lib/python3.6/site-packages/grpc/_channel.py", line 348, in next
    return self._next()
  File "/home/user/anaconda3/lib/python3.6/site-packages/grpc/_channel.py", line 342, in _next
    raise self
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, OS Error)>

@jkurz commented Oct 7, 2017

I just got this error while running a job on Google ML Engine.

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 194, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 296, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 793, in train
    train_step_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 546, in train_step
    if sess.run(train_step_kwargs['should_log']):
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: {"created":"@1507348685.699643392","description":"EOF","file":"external/grpc/src/core/lib/iomgr/tcp_posix.c","file_line":235,"grpc_status":14}

Honestly, I'm not sure how to solve this other than retrying the job.

Edit:

Rerunning does not seem to help; I keep getting the same error.

@darofar commented Oct 19, 2017

Hi, I'm getting this error on a Pub/Sub consumer. I managed to put together a "not so pretty" workaround.

I use a policy like the one below, which replicates the deadline_exceeded handling from google.cloud.pubsub_v1.subscriber.policy.thread.Policy.on_exception.

from google.cloud.pubsub_v1.subscriber.policy.thread import Policy
import grpc

class UnavailableHackPolicy(Policy):
    def on_exception(self, exception):
        """
        There is an issue on the grpc channel that raises an UNAVAILABLE exception now and then.
        Until that issue is fixed we need to protect our consumer thread from breaking.
        https://github.com/GoogleCloudPlatform/google-cloud-python/issues/2683
        """
        unavailable = grpc.StatusCode.UNAVAILABLE
        if getattr(exception, 'code', lambda: None)() in [unavailable]:
            print("¡OrbitalHack! - {}".format(exception))
            return
        return super(UnavailableHackPolicy, self).on_exception(exception)

In the receive-message function I have code like:

subscriber = pubsub.SubscriberClient(policy_class=UnavailableHackPolicy)
subscription_path = subscriber.subscription_path(project, subscription_name)
subscriber.subscribe(subscription_path, callback=callback, flow_control=flow_control)

The problem is that when the resource is truly UNAVAILABLE, we will not be aware of it.

UPDATE: As noted here by @makrusak and here by @rclough, this hack causes high CPU usage, leaving your consumer practically useless (available only intermittently). So basically this trades one problem for another: your consumer does not die, but you will have to restart the worker that runs it often.
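
An untested variant of the same hack that at least avoids the tight retry loop by backing off before swallowing the error (it still hides a genuinely unavailable resource):

import time

import grpc
from google.cloud.pubsub_v1.subscriber.policy.thread import Policy


class BackoffUnavailablePolicy(Policy):
    """Swallow UNAVAILABLE like the hack above, but back off first."""

    def __init__(self, *args, **kwargs):
        super(BackoffUnavailablePolicy, self).__init__(*args, **kwargs)
        self._retry_delay = 1.0

    def on_exception(self, exception):
        code = getattr(exception, 'code', lambda: None)()
        if code == grpc.StatusCode.UNAVAILABLE:
            # Sleep before the consumer reopens the stream, so a flapping
            # connection does not turn into a CPU-burning retry loop.
            time.sleep(self._retry_delay)
            self._retry_delay = min(self._retry_delay * 2, 60.0)
            return
        self._retry_delay = 1.0
        return super(BackoffUnavailablePolicy, self).on_exception(exception)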

@chemelnucfin (Contributor)

I might be getting a similar problem on Spanner when trying to read ranges with an index. I will need to test whether it's my code or not.

@chemelnucfin added the type: bug label Jan 9, 2018
@theacodes (Contributor)

I think that with all the work @dhermes did on Pub/Sub, this should be resolved. I'm going to go ahead and close this, but if it's still reproducible with the latest version we can re-open.
