Receiving Error: MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNAVAILABLE #1150

Closed
MostafaOmar98 opened this issue Jun 12, 2024 · 8 comments · Fixed by #1278 · May be fixed by #1258
Labels
api: spanner Issues related to the googleapis/python-spanner API. priority: p3 Desirable enhancement or fix. May not be included in next release.

Comments

@MostafaOmar98

Hello, we have been seeing the following error:

<_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Socket closed", grpc_status:14, created_time:"2024-06-11T07:08:48.917638822+00:00"}"
>

Facts we know so far:

  1. It seems to be a transient error.
  2. It is not directly related to the Spanner server instance but rather to the connection between our application and Spanner.
  3. It is not tied to one specific query or application; it seems to happen across different queries on different services.
  4. The error could be masked with retries: https://cloud.google.com/spanner/docs/custom-timeout-and-retry. However, the rate of this error goes up and down in a way that seems arbitrary to us.

We have contacted the Google support team, and they recommended raising the issue on the client library to get more insight. We acknowledge that we can mask this transient error by implementing a retry mechanism. However, we are very interested in knowing what causes it and what factors make its rate increase or decrease. A very performance-critical service of ours is affected by this error, so we would like to keep the error rate low and stable before adding retries on top of it.
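
For reference, the retry masking mentioned in point 4 can be expressed with google.api_core's Retry configured for UNAVAILABLE. The sketch below is illustrative only: the backoff values are assumptions, and db stands for the Database object (self.db in the code example further down):

from google.api_core import exceptions, retry

# Sketch of a retry policy that masks transient UNAVAILABLE errors;
# the numeric values are illustrative, not recommendations.
unavailable_retry = retry.Retry(
    initial=0.25,    # first backoff delay, in seconds
    maximum=32.0,    # cap on the backoff delay
    multiplier=1.3,  # backoff growth factor
    timeout=60.0,    # give up after 60 seconds overall
    predicate=retry.if_exception_type(exceptions.ServiceUnavailable),
)

with db.snapshot() as snapshot:
    results = snapshot.execute_sql("SELECT 1", retry=unavailable_retry, timeout=60)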

Environment details

  • OS type and version: Debian 12.5
  • Python version: 3.10.14
  • pip version: 24.0
  • google-cloud-spanner version: 3.46.0

Steps to reproduce

  1. Run a query enough times for this transient error to occur

Code example

# init code
from google.cloud.spanner import Client, PingingPool  # imports added here for completeness

client = Client("project name")
instance = client.instance("instance name")

pool = PingingPool(
    size=20,
    default_timeout=10,
    ping_interval=300
)

# `self` refers to the reporter's SpannerDB wrapper class; `db` is the database id
self.db = instance.database(db, pool=pool)
SpannerDB.background_pool_pinging(pool)

# query execution code
query = "SELECT <> FROM <table>"
with self.db.snapshot() as snapshot:
    res = snapshot.execute_sql(query)

# background pinging pool code
def background_pool_pinging(pool):
    import threading
    import time

    def target():
        # keep the session pool warm by pinging every 10 seconds
        while True:
            pool.ping()
            time.sleep(10)

    background = threading.Thread(target=target, name='spanner-ping-pool')
    background.daemon = True
    background.start()

Stack trace

(censored internal function name/files)

_MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Socket closed", grpc_status:14, created_time:"2024-06-11T07:34:07.272354902+00:00"}"
>
  File "/opt/venv/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 170, in error_remapped_callable
    return _StreamingResponseIterator(
  File "/opt/venv/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 92, in __init__
    self._stored_first_result = next(self._wrapped)
  File "grpc/_channel.py", line 541, in __next__
    return self._next()
  File "grpc/_channel.py", line 967, in _next
    raise self
ServiceUnavailable: Socket closed
  File "starlette/applications.py", line 124, in __call__
    await self.middleware_stack(scope, receive, send)
  File "starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "starlette/middleware/base.py", line 72, in __call__
    response = await self.dispatch_func(request, call_next)
  File "starlette/middleware/base.py", line 46, in call_next
    raise app_exc
  File "starlette/middleware/base.py", line 36, in coro
    await self.app(scope, request.receive, send_stream.send)
  File "/opt/venv/lib/python3.10/site-packages/opentelemetry/instrumentation/asgi/__init__.py", line 581, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "starlette/middleware/base.py", line 72, in __call__
    response = await self.dispatch_func(request, call_next)
  File "********", line 149, in dispatch
    response = await call_next(request)
  File "starlette/middleware/base.py", line 46, in call_next
    raise app_exc
  File "starlette/middleware/base.py", line 36, in coro
    await self.app(scope, request.receive, send_stream.send)
  File "starlette/middleware/exceptions.py", line 75, in __call__
    raise exc
  File "starlette/middleware/exceptions.py", line 64, in __call__
    await self.app(scope, receive, sender)
  File "fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "starlette/routing.py", line 680, in __call__
    await route.handle(scope, receive, send)
  File "starlette/routing.py", line 275, in handle
    await self.app(scope, receive, send)
  File "starlette/routing.py", line 65, in app
    response = await func(request)
  File "********", line 35, in custom_route_handler
    response = await original_route_handler(request)
  File "fastapi/routing.py", line 231, in app
    raw_response = await run_endpoint_function(
  File "fastapi/routing.py", line 162, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "********", line 70, in ********
    return ********(
  File "********", line 125, in ********
    ******** = ********(
  File "********", line 519, in get_rocket_warehouse_legs
    spanner_ctx.spanner_conn.execute_query(
  File "********", line 122, in execute_query
    return self.execute_sql(query, max_staleness_seconds, **new_kwargs)
  File "********", line 118, in execute_sql
    return SpannerProxy(res)
  File "********", line 21, in __new__
    first = next(it)
  File "/opt/venv/lib/python3.10/site-packages/google/cloud/spanner_v1/streamed.py", line 145, in __iter__
    self._consume_next()
  File "/opt/venv/lib/python3.10/site-packages/google/cloud/spanner_v1/streamed.py", line 117, in _consume_next
    response = next(self._response_iterator)
  File "/opt/venv/lib/python3.10/site-packages/google/cloud/spanner_v1/snapshot.py", line 88, in _restart_on_unavailable
    iterator = method(request=request)
  File "/opt/venv/lib/python3.10/site-packages/google/cloud/spanner_v1/services/spanner/client.py", line 1444, in execute_streaming_sql
    response = rpc(
  File "/opt/venv/lib/python3.10/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
    return wrapped_func(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/google/api_core/timeout.py", line 120, in func_with_timeout
    return func(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 174, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
product-auto-label bot added the api: spanner Issues related to the googleapis/python-spanner API. label Jun 12, 2024
surbhigarg92 added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Jun 13, 2024
@ohmayr

ohmayr commented Jun 18, 2024

Hi @MostafaOmar98, thanks for reporting this issue! Does this error only occur over the gRPC transport? If not, can you share what error you get when using the REST transport? You can set the transport in the following way:

client = Client(..., transport="rest")

@MostafaOmar98
Author

Hey @ohmayr, thanks for your reply. I don't think the transport is publicly configurable; am I misunderstanding something?

I don't see transport as a constructor field on the Client class, and there is a comment that explicitly says the Cloud Spanner API requires the gRPC transport.
I can see that it is configurable on the internal SpannerClient class, but that one is instantiated by the Database class, and it is not configurable there either.

@harshachinta
Contributor

harshachinta commented Jul 8, 2024

@MostafaOmar98
Can you please refer to the internal bug and share the information that has been requested there?

@ppsic

ppsic commented Dec 2, 2024

Hi @harshachinta @MostafaOmar98, is this issue still open? I get the same error, although it does not involve Spanner.
This is the full error message:

Traceback (most recent call last):
  File "/Users/priyankaphadnis/Dev/SiVista/grpc_client.py", line 51, in <module>
    run()
  File "/Users/priyankaphadnis/Dev/SiVista/grpc_client.py", line 38, in run
    for response in responses:
  File "/Users/priyankaphadnis/anaconda3/envs/py39/lib/python3.9/site-packages/grpc/_channel.py", line 543, in __next__
    return self._next()
  File "/Users/priyankaphadnis/anaconda3/envs/py39/lib/python3.9/site-packages/grpc/_channel.py", line 969, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:50051 {created_time:"2024-12-01T23:14:14.304512-08:00", grpc_status:14, grpc_message:"Socket closed"}"
>

@harshachinta
Contributor

harshachinta commented Dec 2, 2024

@ppsic
The UNAVAILABLE errors can occur for multiple reasons, one of which is idle connections. When a connection to the GFE (Google Front End) is not used for a long time, it gets cleaned up. Maybe you can check whether that is what happens in your case.

  1. If that is the case, you can check this document: https://grpc.io/docs/guides/keepalive/ (see the sketch after the debug-log commands below).
  2. If you do not want keepalive pings, then this error is normally treated as a retryable error and retried from the application.

You can enable gRPC debug logs to see if there is more information on the error.

export GRPC_VERBOSITY=debug
export GRPC_TRACE=all
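
As an illustration of the keepalive settings referenced in point 1 (the endpoint and values below are assumptions for this sketch, not taken from this thread), keepalive is configured through gRPC channel options when the channel is created:

import grpc

# Illustrative keepalive channel options; see
# https://grpc.io/docs/guides/keepalive/ for what each option means.
options = [
    ("grpc.keepalive_time_ms", 30_000),          # send a keepalive ping every 30s
    ("grpc.keepalive_timeout_ms", 10_000),       # wait up to 10s for the ping ack
    ("grpc.keepalive_permit_without_calls", 1),  # allow pings while no RPC is active
]

channel = grpc.insecure_channel("localhost:50051", options=options)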

@MostafaOmar98
Author

MostafaOmar98 commented Dec 5, 2024

@ppsic, as @harshachinta mentioned, if you aren't already keeping the connection active (e.g., by doing a periodic ping in the background), this could be one of the reasons.

For us, we made sure we were constantly pinging the connections, so we never really figured out the root cause of this issue. It seems to be a transient, retryable issue, so if the rate is very low, you might be able to tolerate or retry it. We did see a massive spike in the error rate for around a month, but it went away on its own (we assumed it was a server-side change, though we never got the time to confirm that). Are you seeing an increase in the error rate, or are you seeing it for the first time, @ppsic?

@harshachinta
Contributor

@MostafaOmar98

  1. Curious to understand the mechanism you are using to constantly ping the connections. Are you setting something on the Python client to achieve this, or something else?

  2. Also, when you get time, can you help us verify whether this update https://github.com/googleapis/python-spanner/tree/grpc-keep-alive-setting works without your application handling any constant-pinging logic?

harshachinta added the priority: p3 Desirable enhancement or fix. May not be included in next release. label and removed the priority: p2 Moderately-important priority. Fix may not be included in next release. label Dec 9, 2024
@MostafaOmar98
Author

@harshachinta

  1. Yes, we are setting up a pinging pool for a long-lived client that lives throughout the lifetime of the application process. The setup follows this section of the documentation. The only difference is that we sleep 10 seconds between consecutive pings, i.e., the background_loop code is changed to:

def background_loop():
    while True:
        # (Optional) Perform other background tasks here
        pool.ping()
        time.sleep(10)  # sleep 10 seconds between consecutive pings

  2. Sure thing! Will update you if the team gets the time for it.

olavloite added a commit that referenced this issue Dec 27, 2024
UNAVAILABLE errors that occurred during the initial attempt of a
streaming RPC (StreamingRead / ExecuteStreamingSql) would not be
retried.

Fixes #1150
olavloite added a commit that referenced this issue Jan 1, 2025
UNAVAILABLE errors that occurred during the initial attempt of a
streaming RPC (StreamingRead / ExecuteStreamingSql) would not be
retried.

Fixes #1150
aakashanandg pushed a commit to aakashanandg/python-spanner that referenced this issue Jan 2, 2025
UNAVAILABLE errors that occurred during the initial attempt of a
streaming RPC (StreamingRead / ExecuteStreamingSql) would not be
retried.

Fixes googleapis#1150