This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

PAI becomes unavailable when users open too many job detail pages #4263

Closed
Binyang2014 opened this issue Mar 9, 2020 · 1 comment

@Binyang2014
Contributor

Binyang2014 commented Mar 9, 2020

Currently, the PAI job-detail page auto-refreshes every 10 seconds. When a user submits too many job instances, every refresh takes a long time to retrieve data, and this blocks other requests sent to rest-server.

Options to solve this problem:

  1. Disable auto-refresh on the job-detail page
  2. Add a cache in rest-server
  3. Add a PAI service alert. When PAI becomes unavailable, rest-server consumes more than 4 GB of memory and its CPU usage is over 100%. Maybe we can trigger an alert for this. (Collecting rest-server logs would also be helpful.)
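
Option 2 could look roughly like the sketch below: a small time-to-live cache in front of the expensive job-detail lookup, so that repeated refreshes of the same page within a short window share one backend call. This is only an illustration, not rest-server code; `TtlCache`, the `loader` callback, and the TTL value are all hypothetical names and numbers.

```typescript
// Hypothetical sketch: none of these names come from rest-server.
type Loader<T> = (key: string) => Promise<T>;

interface Entry<T> {
  value: Promise<T>;
  expiresAt: number;
}

class TtlCache<T> {
  private entries = new Map<string, Entry<T>>();
  private loader: Loader<T>;
  private ttlMs: number;
  private now: () => number;

  constructor(loader: Loader<T>, ttlMs: number, now: () => number = Date.now) {
    this.loader = loader;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): Promise<T> {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      // Fresh entry, or a load that is still in flight: share it.
      return hit.value;
    }
    const value = this.loader(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    // Do not keep failed loads around; the next caller retries.
    value.catch(() => {
      const current = this.entries.get(key);
      if (current !== undefined && current.value === value) {
        this.entries.delete(key);
      }
    });
    return value;
  }
}
```

Storing the promise rather than the resolved value means that concurrent requests for the same job also collapse into a single load. The cache is unbounded here; a real one would need an eviction policy.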
@fanyangCS
Contributor

fanyangCS commented Mar 9, 2020

In the short term, I suggest:

  • Do not retrieve the GPU index when querying job details
  • Disable auto-refresh, or increase its interval, when the number of containers is larger than 64
  • Back off auto-refresh exponentially while the API call is failing
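
The last two suggestions can be combined into one function that decides how long the page waits before its next refresh. The following is a minimal sketch, not the webportal's actual code: the 10-second base and the 64-container threshold come from this thread, while `nextRefreshDelay`, `RefreshPolicy`, the 60-second interval for large jobs, and the 5-minute cap are assumptions made for illustration.

```typescript
// Hypothetical sketch: names and the largeJobMs / maxMs values are assumptions.
interface RefreshPolicy {
  baseMs: number;             // normal interval (10 s in the issue)
  maxMs: number;              // assumed upper bound for the back-off
  largeJobThreshold: number;  // 64 containers, from the suggestion
  largeJobMs: number | null;  // assumed interval for large jobs; null disables auto-refresh
}

const defaultPolicy: RefreshPolicy = {
  baseMs: 10000,
  maxMs: 300000,
  largeJobThreshold: 64,
  largeJobMs: 60000,
};

// Returns the delay in milliseconds before the next refresh,
// or null when auto-refresh should be disabled.
function nextRefreshDelay(
  containerCount: number,
  consecutiveFailures: number,
  policy: RefreshPolicy = defaultPolicy,
): number | null {
  const interval =
    containerCount > policy.largeJobThreshold ? policy.largeJobMs : policy.baseMs;
  if (interval === null) {
    return null;
  }
  // Double the delay for each consecutive failed API call.
  const exponent = Math.min(Math.max(consecutiveFailures, 0), 16);
  const backedOff = interval * Math.pow(2, exponent);
  // Never exceed the cap, but never go below the chosen interval either.
  return Math.min(backedOff, Math.max(policy.maxMs, interval));
}
```

A caller would schedule the next request with `setTimeout` using the returned delay, reset `consecutiveFailures` to zero after a successful response, and skip scheduling entirely when the result is `null`.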

4 participants