[copy] fix the TimeoutError and ServerDisconnected issues in copy #11830

Merged: 5 commits into hail-is:main on May 13, 2022

Conversation

@danking (Contributor) commented May 11, 2022

cc: @daniel-goldstein, this is a tricky asyncio situation which you should also keep in mind

OK, there were two problems:

1. A timeout of 5s now appears to be too short for Google Cloud Storage. I am not sure why, but we
   time out substantially more frequently. I have observed this myself on my laptop, and just this
   morning I saw it happen to Daniel.

2. When using an `aiohttp.AsyncIterablePayload`, it is *critical* to always check whether the coroutine
   that actually writes to GCS (which is stashed in the variable `request_task`) is still
   alive. In the current `main`, we do not do this, which causes hangs (in particular, the timeout
   exceptions are never thrown, so we never retry).

To understand the second problem, you must first recall how writing works in aiogoogle. There are
two Tasks and an `asyncio.Queue`. The terms "writer" and "reader" are somewhat confusing, so let's
use left and right. The left Task has the owning reference to both the source "file" and the
destination "file". In particular, it is the *left* Task which closes both "files". Moreover, the
left Task reads chunks from the source file and places those chunks on the `asyncio.Queue`. The
right Task takes chunks off the queue and writes those chunks to the destination file.
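
For intuition, here is a minimal standalone sketch of that shape (the names `left`, `right`, `copy`, and the `None` stop sentinel are mine, not aiogoogle's actual identifiers):

import asyncio

async def left(source_chunks, queue: asyncio.Queue):
    # The left Task owns both "files": it reads chunks from the source and
    # hands them to the right Task via the queue, then sends a "stop" message.
    for chunk in source_chunks:
        await queue.put(chunk)
    await queue.put(None)  # the "stop" message

async def right(queue: asyncio.Queue, destination: list):
    # The right Task drains the queue and writes each chunk to the destination.
    while True:
        chunk = await queue.get()
        if chunk is None:
            return
        destination.append(chunk)  # stands in for the write to GCS

async def copy(source_chunks):
    queue: asyncio.Queue = asyncio.Queue(maxsize=1)  # same size limit of one
    destination: list = []
    right_task = asyncio.create_task(right(queue, destination))
    await left(source_chunks, queue)
    await right_task
    return destination

print(asyncio.run(copy([b'a', b'b', b'c'])))  # [b'a', b'b', b'c']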

This situation can go awry in two ways.

First, if the right Task encounters any kind of failure, it stops taking chunks off the
queue. Once the queue (which has a size limit of one) is full, the left Task hangs: the system
is stuck, because the left Task will wait forever for the right Task to empty the queue.

The second scenario is exactly the same except that the left Task is trying to add the "stop"
message to the queue rather than a chunk.

In either case, it is critical that the left Task waits simultaneously on the queue operation *and*
on the right Task completing. If the right Task has died, no further writes can occur and the left
Task must raise an exception. In the first scenario, we do not observe the right Task's exception
there; that happens later, when we close the `InsertObjectStream` (which represents the destination
"file").

---

I also added several types, assertions, and a few missing `async with ... as resp:` blocks.
@jigold previously requested changes May 11, 2022
@@ -101,7 +101,7 @@ def __init__(self,
         assert 'connector' not in kwargs

         if timeout is None:
-            timeout = aiohttp.ClientTimeout(total=5)
+            timeout = aiohttp.ClientTimeout(total=20)
Contributor

Can we set this timeout explicitly from the aiogoogle code? I'm worried this will give other places where we use this Client, like in Batch, long timeouts, and we don't want that.

Contributor Author

It is possible to set the timeout during construction of the session for the `StorageClient`.

I'm somewhat disinclined to use different timeouts for different parts of our system. That seems like it will be harder to keep track of when we're debugging things. I kind of think our original 5s timeout is quite aggressive. I'm not sure what to think. I just really don't want to think about multiple different parts of our system using differing lengths of timeout.
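
For concreteness, a per-client timeout in plain aiohttp looks roughly like this (the helper name and the exact wiring into the `StorageClient` are assumptions, not the actual hailtop code):

import aiohttp

async def make_storage_session() -> aiohttp.ClientSession:
    # Hypothetical wiring: only the storage client's session gets the longer
    # 20s total timeout; other clients (e.g. Batch's) keep a shorter default.
    return aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=20))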

Contributor

If that's the case, then we need to make sure every batch-driver / worker interaction has the correct timeouts. We cannot wait 20 seconds to schedule a job as that will gum up the scheduler.

return await self._session.post(
    f'https://storage.googleapis.com/upload/storage/v1/b/{bucket}/o',
    **kwargs)
assert 'data' not in params
Contributor

I assume we never reach this case in our current code...

Contributor Author

Yeah, it's totally unused.

await asyncio.wait([fut, self._request_task], return_when=asyncio.FIRST_COMPLETED)
if fut.done():
    return len(b)
raise ValueError(f'request task finished early')
Contributor

What is the implication of this Exception? Does this show up in user logs? Is it retried at all as a transient error?

Contributor Author

It's not a transient error and is not retried. It could show up anywhere someone uses aiogoogle, including the input and output of Batch.

As long as the client of this code correctly calls close on an `InsertObjectStream`, you'll see that, while handling this `ValueError`, you encountered the error produced by the `_request_task`, which is the actual cause of all this. As a result, if you're doing something like:

async def foo():
    async with await fs.create(...) as obj:
        await obj.write(...)
await retry_transient_errors(foo)

The `retry_transient_errors` will see the transient error from the `_request_task` (which will have the `ValueError` as a suppressed exception) and will appropriately retry `foo`.
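
To illustrate the chaining (a generic asyncio sketch, not the actual hailtop code; the exception type and names are made up):

import asyncio

async def request_task_body():
    raise ConnectionResetError('transient GCS failure')  # stands in for the real transient error

async def main():
    request_task = asyncio.create_task(request_task_body())
    await asyncio.sleep(0)  # let the request task fail
    try:
        try:
            raise ValueError('request task finished early')
        finally:
            # Closing the stream awaits the request task, so its exception is
            # raised *while handling* the ValueError and carries it as __context__.
            await request_task
    except ConnectionResetError as e:
        assert isinstance(e.__context__, ValueError)
        print('underlying error:', e, '| raised while handling:', e.__context__)

asyncio.run(main())

A retry wrapper that inspects the caught exception therefore sees the underlying transient error, not the `ValueError`.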

@danking mentioned this pull request May 13, 2022

@danking (Contributor Author) commented May 13, 2022

Let's try to get this merged today; I don't have meetings, so I can respond quickly to changes.

@jigold (Contributor) commented May 13, 2022

I think you need non-default timeouts in `job.py`'s `unschedule_job` and in `instance.py`'s `check_is_active_healthy`.

@jigold (Contributor) commented May 13, 2022

Also, in `worker.py` when you construct `Worker()`.

@danking (Contributor Author) commented May 13, 2022

Hmm. I really am not a fan of heterogeneous timeouts. OK, if you really think it's critical that we have 5s timeouts in Batch, then I'll just put the 20-second timeout into the `storage_client`.

@danking merged commit 5365520 into hail-is:main May 13, 2022