Confusing Backend worker failures logs on Google Cloud Run #1282
Comments
@ivanzvonkov According to the log, it seems that there are multiple models (e.g., togo, Ethiopia_Tigray, Rwanda, Kenya, Ethiopia_Tigray) in TorchServe. Could you please provide the following information to help us understand the scenario?
I assume yes; how can I tell for sure?
Yes, when I start the same Docker container locally I do not see those error logs.
@ivanzvonkov There are two numbers in the above CPU/memory metrics. For example, for "container cpu utilization 99% 31.99%", I assume the container CPU usage is 99% and the host CPU usage is 31.99%. Is my assumption correct? Overall, I think there are too many models loaded in the container. This makes model loading too slow, so it times out. You can reduce the number of models loaded in one container.
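If it helps, here is a minimal sketch of how the number of loaded models can be limited through TorchServe's config.properties; the model store path and archive name below are illustrative, not taken from this deployment:

```properties
# config.properties - load a single model archive instead of everything in the store
model_store=/home/model-server/model-store
# Load one specific .mar rather than load_models=all
load_models=togo.mar
# Keep the per-model worker count low on a small Cloud Run instance
default_workers_per_model=1
```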
@lxning The 99%, 95%, 50% represent the percentile of all the measurements at that time, so "99% - 31.99%" means the highest container CPU measurement was 31.99%. The models are all loaded and work, but this error message still comes up.
@ivanzvonkov TorchServe calls setsockopt to create the UDS connection between the frontend and the backend. However, there is a [bug](https://github.com/google/gvisor/issues/1739) in Google Cloud's gVisor sandbox. This bug causes the TorchServe backend worker initialization to time out.
@ivanzvonkov I am closing this ticket since the fix needs to be done on the Google Cloud side. Please feel free to reopen it if needed.
@ivanzvonkov The solution that worked for me was to increase the RAM for the service. |
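For anyone else hitting this, one way to raise the memory limit on an existing Cloud Run service is via gcloud; the service name, region, and size here are placeholders:

```sh
# Increase the memory limit for the Cloud Run service running TorchServe
gcloud run services update torchserve-service \
  --memory 4Gi \
  --region us-central1
```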
Context
Your Environment
Model location (public URL if any): local models
Expected Behavior
Clearer logs about Backend worker failures.
Current Behavior
After deploying the TorchServe Docker container to Google Cloud Run, I see the regular startup logs from TorchServe in the first 9 seconds:
Lastly (this part I see only on Google Cloud Run):
Then, without calling any API, after 3 minutes I see:
The models and prediction endpoint both work as expected, but since I am running TorchServe over thousands of examples, my logs are full of these "errors".
I cannot reproduce this locally, only on Google Cloud Run.
Models are loaded with BaseHandler's torch.jit.load (a minimal sketch is included below).
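For reference, this is roughly what the torch.jit.load path looks like in a BaseHandler-style custom handler; the class name, device choice, and file layout are illustrative, not the actual project code:

```python
import os

import torch
from ts.torch_handler.base_handler import BaseHandler


class CropModelHandler(BaseHandler):
    """Hypothetical handler mirroring BaseHandler's TorchScript loading."""

    def initialize(self, context):
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        self.device = torch.device("cpu")  # Cloud Run instances here are CPU-only

        # The serialized file name comes from the .mar archive's manifest.
        serialized_file = context.manifest["model"]["serializedFile"]
        model_path = os.path.join(model_dir, serialized_file)

        # Same call BaseHandler makes for TorchScript (.pt) models.
        self.model = torch.jit.load(model_path, map_location=self.device)
        self.model.eval()
        self.initialized = True
```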
Possible Solution
Might this be related to how Google Cloud Run shuts down inactive container instances?
Failure Logs [if any]
Provided above.