Confusing Backend worker failures logs on Google Cloud Run #1282

Closed
ivanzvonkov opened this issue Oct 14, 2021 · 7 comments

Context

  • torchserve version: 0.4.2
  • torch-model-archiver version: Whatever version comes with the Docker image (not sure how to check)
  • torch version: 1.9.0
  • java version: openjdk 11.0.11 2021-04-20
  • Operating System and version: docker image pytorch/torchserve:0.4.2-cpu

Your Environment

  • Installed using source? [yes/no]: No
  • Are you planning to deploy it using docker container? [yes/no]: Yes
  • Is it a CPU or GPU environment?: CPU
  • Using a default/custom handler? [If possible upload/share custom handler/model]: Custom, but getting same results with image_classifier handler
  • What kind of model is it e.g. vision, text, audio?: vision
  • Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.?
    [If public url then provide link.]: local models
  • Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs: n/a

Expected Behavior

Clearer logs about Backend worker failures.

Current Behavior

After deploying the TorchServe Docker container to Google Cloud Run, I see the regular TorchServe startup logs in the first 9 seconds:

2021-10-14 04:28:01,593 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
...
2021-10-14 04:28:02,031 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin
...
2021-10-14 04:28:02,106 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: Togo.mar
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model Togo
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model Togo
2021-10-14 04:28:02,191 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model Togo loaded.
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: Togo, count: 2
...
2021-10-14 04:28:03,773 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2021-10-14 04:28:03,776 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082

Lastly, I see this (only on Google Cloud Run):

Container Sandbox: Unsupported syscall setsockopt(0xa0,0x1,0xd,0x3e9999efcf70,0x8,0x4).

Then, without calling any API, after 3 minutes I see:

2021-10-14 04:31:14,244 [WARN ] W-9000-Togo_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-Togo_1.0-stdout
2021-10-14 04:31:08,840 [WARN ] W-9007-Ethiopia_Tigray_2020_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9007-Ethiopia_Tigray_2020_1.0-stdout
2021-10-14 04:31:13,040 [WARN ] W-9002-Rwanda_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9002-Rwanda_1.0-stdout
2021-10-14 04:31:13,040 [WARN ] W-9003-Rwanda_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9003-Rwanda_1.0-stdout
2021-10-14 04:30:56,141 [WARN ] W-9006-Ethiopia_Tigray_2020_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9006-Ethiopia_Tigray_2020_1.0-stderr
2021-10-14 04:30:56,141 [WARN ] W-9001-Togo_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9001-Togo_1.0-stderr
2021-10-14 04:30:56,340 [WARN ] W-9005-Ethiopia_Tigray_2021_w_forecaster_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9005-Ethiopia_Tigray_2021_w_forecaster_1.0-stderr
2021-10-14 04:30:56,540 [WARN ] W-9008-Kenya_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9008-Kenya_1.0-stderr
2021-10-14 04:30:49,440 [WARN ] W-9009-Kenya_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9009-Kenya_1.0-stderr
2021-10-14 04:30:49,640 [WARN ] W-9004-Ethiopia_Tigray_2021_w_forecaster_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9004-Ethiopia_Tigray_2021_w_forecaster_1.0-stderr
2021-10-14 04:30:56,640 [INFO ] W-9003-Rwanda_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9003-Rwanda_1.0-stderr
2021-10-14 04:30:56,740 [INFO ] W-9009-Kenya_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9009-Kenya_1.0-stderr
2021-10-14 04:31:14,841 [ERROR] W-9000-Togo_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker startup time out.
	at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:85)
	at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:281)
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:179)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

The models and prediction endpoint both work as expected, but since I am running TorchServe on thousands of examples, my logs are full of these "errors".
I cannot reproduce this locally, only on Google Cloud Run.
The models are loaded with BaseHandler's torch.jit.load.
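
For context, the loading path looks roughly like the minimal handler sketch below (a sketch only: the class name and the model.pt file name are illustrative, and the real BaseHandler resolves the serialized file from the .mar manifest rather than hard-coding it):

import os
import torch
from ts.torch_handler.base_handler import BaseHandler

class ExampleScriptedHandler(BaseHandler):  # illustrative name, not the actual custom handler
    def initialize(self, context):
        # model_dir is the directory where TorchServe extracted the .mar contents
        model_dir = context.system_properties.get("model_dir")
        # assumed TorchScript artifact name; BaseHandler normally reads it from the manifest
        self.model = torch.jit.load(os.path.join(model_dir, "model.pt"), map_location="cpu")
        self.model.eval()
        self.initialized = True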

Possible Solution

Could this be related to how Google Cloud Run shuts down inactive container instances?

Failure Logs [if any]

Provided above.

msaroufim added the help wanted and support labels on Oct 15, 2021
lxning (Collaborator) commented Oct 22, 2021

@ivanzvonkov According to the log, it seems that multiple models (e.g. Togo, Ethiopia_Tigray_2020, Ethiopia_Tigray_2021_w_forecaster, Rwanda, Kenya) are loaded in TorchServe. Could you please provide the following information to help us understand the scenario?

  • Were they loaded concurrently in Google Cloud? What is the CPU/memory usage of your container and host in Google Cloud?
  • Does the same loading scenario work locally?

ivanzvonkov (Author) replied:

@lxning

What's the cpu/memory usage of your container and host in google cloud?

CPU at most 32%

Memory at most 86%

Were they loaded concurrently in google cloud?

I assume yes. How can I tell for sure?

Does the same loading scenario work at local?

Yes, when I start the same docker container locally I do not see those error logs.

lxning (Collaborator) commented Oct 25, 2021

@ivanzvonkov There are two numbers in each of the above CPU/memory metrics. For example, for "container cpu utilization 99% 31.99%", I assume the container CPU usage is 99% and the host CPU usage is 31.99%. Is my assumption correct?

Overall, I think there are too many models loaded in the container. This causes model loading to be too slow and to time out. You can reduce the number of models loaded in one container, for example with a config.properties along the lines of the sketch below.
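
As an illustration only (the model name and paths are placeholders, not a recommended configuration), this limits what a single container loads:

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
model_store=/home/model-server/model-store
# load only one model in this container instead of all five
load_models=Togo.mar
# one worker per model instead of the two per model seen in the logs above
default_workers_per_model=1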

ivanzvonkov (Author) replied:

@lxning The 99%, 95%, and 50% represent the percentiles of all the measurements at that time. So "99% - 31.99%" means the highest container CPU measurement was 31.99%.

The models are all loaded and work. But this error message still comes up.

lxning (Collaborator) commented Nov 4, 2021

@ivanzvonkov TorchServe calls setsockopt when creating the UDS (Unix domain socket) connection between the frontend and the backend workers. However, there is a bug (https://github.com/google/gvisor/issues/1739) in Google Cloud's gVisor sandbox. This bug causes TorchServe's backend worker initialization to time out.
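
For reference, the arguments in the sandbox warning setsockopt(0xa0,0x1,0xd,...) can be decoded against the standard Linux socket constants (a quick check, assuming a Linux Python environment; 0xa0 is just the file descriptor):

import socket

# level 0x1 and option 0xd from the warning correspond to SOL_SOCKET / SO_LINGER on Linux
assert socket.SOL_SOCKET == 0x1
assert socket.SO_LINGER == 0xd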

lxning (Collaborator) commented Nov 4, 2021

@ivanzvonkov I am closing this ticket since the fix needs to be done in Google Cloud. Please feel free to reopen it if needed.

lxning closed this as completed on Nov 4, 2021
lxning added this to the v0.4.3 milestone on Nov 4, 2021
lxning self-assigned this on Nov 4, 2021
rysmende commented:

@ivanzvonkov The solution that worked for me was to increase the RAM for the service.
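
For example (the service name, region, and memory value below are placeholders; pick whatever fits your deployment):

gcloud run services update my-torchserve-service --memory 4Gi --region us-central1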
