This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Rest-server job detail API becomes slow and CPU-intensive #5012

Closed
hzy46 opened this issue Oct 26, 2020 · 1 comment

hzy46 commented Oct 26, 2020

I compared the job detail API of v1.2.1 and current master (commit id 03d83a7) with the same test job:

protocolVersion: 2
name: admin_0c0cdf32
type: job
jobRetryCount: 0
prerequisites:
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 2000
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 4
      cpu: 20
      memoryMB: 225336
    commands:
      - sleep 1000s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
  hivedScheduler:
    jobPriorityClass: oppo
    taskRoles:
      taskrole:
        skuNum: 4
        skuType: DT

In v1.2.1:

With 1 job detail page open, the API call takes about 400~500 ms.
With 10 job detail pages open, the API call still takes about 400~500 ms, and the rest-server CPU usage stays below 10%.

In master branch:

With 1 job detail page open, the API call takes about 3~3.5 s.
With 10 job detail pages open, the API call takes over 30 s, and the rest-server CPU usage reaches 100%.
It also slows down other requests (see the following screenshot: the call to ssh is also slow):
image
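
One plausible reading of the numbers above (about 3 s for one page, over 30 s for ten) is that the handler does CPU-bound work on Node's single JavaScript thread, so concurrent requests are served one after another. The sketch below only simulates that effect; `cpuBoundHandler`, `fakeJobDetailRequest`, and `measureConcurrent` are hypothetical names, not rest-server code.

```javascript
// Simulated CPU-bound work: spins for `ms` milliseconds and blocks the event loop.
function cpuBoundHandler(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // busy-wait, standing in for expensive synchronous processing
  }
}

// Hypothetical request: async on the outside, synchronous work on the inside.
async function fakeJobDetailRequest(ms) {
  const start = Date.now();
  await null; // yield once, so all requests are "in flight" before any work runs
  cpuBoundHandler(ms);
  return Date.now() - start; // latency observed by this request
}

// Fires `n` requests at once and reports total time plus per-request latency.
async function measureConcurrent(n, ms) {
  const start = Date.now();
  const latencies = await Promise.all(
    Array.from({ length: n }, () => fakeJobDetailRequest(ms))
  );
  return { total: Date.now() - start, latencies };
}
```

Under this model the last of `n` concurrent requests waits for all the others, so its latency grows roughly linearly with `n`, and unrelated requests queued behind them are delayed as well.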

@hzy46 changed the title from "Rest-server job detail API becomes slow" to "Rest-server job detail API becomes slow and CPU-intensive" on Oct 26, 2020

suiguoxin commented Oct 29, 2020

This is caused by a repeated readFile operation introduced in PR #4958.

The issue is addressed in PR #5022.
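
For readers unfamiliar with this class of problem, here is a minimal sketch of how a repeated file read can be avoided with a small in-process cache. It is not the actual change in PR #5022; the helper names (`loadFile`, `loadFileCached`, `describeTasksUncached`, `describeTasksCached`) and the per-task read pattern are assumptions made for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Counts real disk reads so the two patterns can be compared.
let diskReads = 0;

// Hypothetical helper standing in for whatever file the handler needs.
function loadFile(filePath) {
  diskReads += 1;
  return fs.readFileSync(filePath, 'utf8');
}

// Slow pattern: the same file is read once per task instance.
function describeTasksUncached(taskCount, filePath) {
  const result = [];
  for (let i = 0; i < taskCount; i += 1) {
    result.push({ index: i, template: loadFile(filePath) });
  }
  return result;
}

// Cache keyed by path: each file is read at most once per process.
const fileCache = new Map();
function loadFileCached(filePath) {
  if (!fileCache.has(filePath)) {
    fileCache.set(filePath, loadFile(filePath));
  }
  return fileCache.get(filePath);
}

// Same output as the uncached version, but with a single disk read.
function describeTasksCached(taskCount, filePath) {
  const result = [];
  for (let i = 0; i < taskCount; i += 1) {
    result.push({ index: i, template: loadFileCached(filePath) });
  }
  return result;
}
```

A cache like this never invalidates, so it only suits files that do not change while the process is running. For a job with 2000 task instances, it reduces 2000 reads per request to one read per process.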

We tested the performance of the get job detail API after this fix on the int bed, with the following results:

image

Further discussion about convertFrameworkDetail is tracked in #5027.
