Confusing Backend worker failures logs on Google Cloud Run #1282

Closed
ivanzvonkov opened this issue Oct 14, 2021 · 7 comments

Context

  • torchserve version: 0.4.2
  • torch-model-archiver version: Whatever version comes with the Docker image (not sure how to check)
  • torch version: 1.9.0
  • java version: openjdk 11.0.11 2021-04-20
  • Operating System and version: docker image pytorch/torchserve:0.4.2-cpu

Your Environment

  • Installed using source? [yes/no]: No
  • Are you planning to deploy it using docker container? [yes/no]: Yes
  • Is it a CPU or GPU environment?: CPU
  • Using a default/custom handler? [If possible upload/share custom handler/model]: Custom, but getting same results with image_classifier handler
  • What kind of model is it e.g. vision, text, audio?: vision
  • Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.?
    [If public url then provide link.]: local models
  • Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs: n/a

Expected Behavior

Clearer logs about Backend worker failures.

Current Behavior

After deploying the TorchServe Docker container to Google Cloud Run, I see the regular TorchServe startup logs in the first 9 seconds:

2021-10-14 04:28:01,593 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
...
2021-10-14 04:28:02,031 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin
...
2021-10-14 04:28:02,106 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: Togo.mar
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model Togo
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model Togo
2021-10-14 04:28:02,191 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model Togo loaded.
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: Togo, count: 2
...
2021-10-14 04:28:03,773 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2021-10-14 04:28:03,776 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082

Lastly, I see this (only on Google Cloud Run):

Container Sandbox: Unsupported syscall setsockopt(0xa0,0x1,0xd,0x3e9999efcf70,0x8,0x4).

Then, without calling any API, after 3 minutes I see:

2021-10-14 04:31:14,244 [WARN ] W-9000-Togo_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-Togo_1.0-stdout
2021-10-14 04:31:08,840 [WARN ] W-9007-Ethiopia_Tigray_2020_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9007-Ethiopia_Tigray_2020_1.0-stdout
2021-10-14 04:31:13,040 [WARN ] W-9002-Rwanda_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9002-Rwanda_1.0-stdout
2021-10-14 04:31:13,040 [WARN ] W-9003-Rwanda_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9003-Rwanda_1.0-stdout
2021-10-14 04:30:56,141 [WARN ] W-9006-Ethiopia_Tigray_2020_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9006-Ethiopia_Tigray_2020_1.0-stderr
2021-10-14 04:30:56,141 [WARN ] W-9001-Togo_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9001-Togo_1.0-stderr
2021-10-14 04:30:56,340 [WARN ] W-9005-Ethiopia_Tigray_2021_w_forecaster_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9005-Ethiopia_Tigray_2021_w_forecaster_1.0-stderr
2021-10-14 04:30:56,540 [WARN ] W-9008-Kenya_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9008-Kenya_1.0-stderr
2021-10-14 04:30:49,440 [WARN ] W-9009-Kenya_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9009-Kenya_1.0-stderr
2021-10-14 04:30:49,640 [WARN ] W-9004-Ethiopia_Tigray_2021_w_forecaster_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9004-Ethiopia_Tigray_2021_w_forecaster_1.0-stderr
2021-10-14 04:30:56,640 [INFO ] W-9003-Rwanda_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9003-Rwanda_1.0-stderr
2021-10-14 04:30:56,740 [INFO ] W-9009-Kenya_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9009-Kenya_1.0-stderr
2021-10-14 04:31:14,841 [ERROR] W-9000-Togo_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker startup time out.
	at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:85)
	at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:281)
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:179)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

The models and prediction endpoint both work as expected, but since I am running TorchServe on thousands of examples, my logs are full of these "errors".
I cannot reproduce this locally, only on Google Cloud Run.
The models are loaded with BaseHandler's torch.jit.load.
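
For context, the loading path looks roughly like the minimal handler sketch below (a sketch only: the class name and the model.pt file name are illustrative, and the real BaseHandler resolves the serialized file from the .mar manifest rather than hard-coding it):

import os
import torch
from ts.torch_handler.base_handler import BaseHandler

class ExampleScriptedHandler(BaseHandler):  # illustrative name, not the actual custom handler
    def initialize(self, context):
        # model_dir is the directory where TorchServe extracted the .mar contents
        model_dir = context.system_properties.get("model_dir")
        # assumed TorchScript artifact name; BaseHandler normally reads it from the manifest
        self.model = torch.jit.load(os.path.join(model_dir, "model.pt"), map_location="cpu")
        self.model.eval()
        self.initialized = True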

Possible Solution

Could this be related to how Google Cloud Run shuts down inactive container instances?

Failure Logs [if any]

Provided above.

msaroufim added the help wanted and support labels on Oct 15, 2021
lxning (Collaborator) commented Oct 22, 2021

@ivanzvonkov According to the log, it seems that multiple models (e.g. Togo, Ethiopia_Tigray_2020, Ethiopia_Tigray_2021_w_forecaster, Rwanda, Kenya) are loaded in TorchServe. Could you please provide the following information to help us understand the scenario?

  • Were they loaded concurrently in Google Cloud? What is the CPU/memory usage of your container and host in Google Cloud?
  • Does the same loading scenario work locally?

ivanzvonkov (Author) replied:

@lxning

What's the cpu/memory usage of your container and host in google cloud?

CPU at most 32%

Memory at most 86%

Were they loaded concurrently in google cloud?

I assume yes. How can I tell for sure?

Does the same loading scenario work at local?

Yes, when I start the same docker container locally I do not see those error logs.

lxning (Collaborator) commented Oct 25, 2021

@ivanzvonkov There are two numbers in each of the above CPU/memory metrics. For example, for "container cpu utilization 99% 31.99%", I assume the container CPU usage is 99% and the host CPU usage is 31.99%. Is my assumption correct?

Overall, I think there are too many models loaded in the container. This causes model loading to be too slow and to time out. You can reduce the number of models loaded in one container, for example with a config.properties along the lines of the sketch below.
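
As an illustration only (the model name and paths are placeholders, not a recommended configuration), this limits what a single container loads:

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
model_store=/home/model-server/model-store
# load only one model in this container instead of all five
load_models=Togo.mar
# one worker per model instead of the two per model seen in the logs above
default_workers_per_model=1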

ivanzvonkov (Author) replied:

@lxning The 99%, 95%, and 50% represent the percentiles of all the measurements at that time. So "99% - 31.99%" means the highest container CPU measurement was 31.99%.

The models are all loaded and work. But this error message still comes up.

lxning (Collaborator) commented Nov 4, 2021

@ivanzvonkov TorchServe calls setsockopt when creating the UDS (Unix domain socket) connection between the frontend and the backend workers. However, there is a bug (https://github.com/google/gvisor/issues/1739) in Google Cloud's gVisor sandbox. This bug causes TorchServe's backend worker initialization to time out.
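
For reference, the arguments in the sandbox warning setsockopt(0xa0,0x1,0xd,...) can be decoded against the standard Linux socket constants (a quick check, assuming a Linux Python environment; 0xa0 is just the file descriptor):

import socket

# level 0x1 and option 0xd from the warning correspond to SOL_SOCKET / SO_LINGER on Linux
assert socket.SOL_SOCKET == 0x1
assert socket.SO_LINGER == 0xd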

lxning (Collaborator) commented Nov 4, 2021

@ivanzvonkov I am closing this ticket since the fix needs to be done in Google Cloud. Please feel free to reopen it if needed.

lxning closed this as completed on Nov 4, 2021
lxning added this to the v0.4.3 milestone on Nov 4, 2021
lxning self-assigned this on Nov 4, 2021
rysmende commented:

@ivanzvonkov The solution that worked for me was to increase the RAM for the service.
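
For example (the service name, region, and memory value below are placeholders; pick whatever fits your deployment):

gcloud run services update my-torchserve-service --memory 4Gi --region us-central1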
