
[display event] add event watcher in database controller #4939

Merged
merged 4 commits into master from zhiyuhe/add_event_watcher on Oct 13, 2020

Conversation

hzy46
Contributor

@hzy46 hzy46 commented Sep 29, 2020

Database size control strategy:

  • The event watcher checks disk usage at startup. If usage is above 80%, it stops and exits with a non-zero code.
  • The same check runs every 60s; if usage is above 80%, the watcher again stops and exits with a non-zero code.
  • Both the 60s interval and the 80% threshold are configurable (a sketch of the check follows this list).
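A minimal sketch of this strategy in Node.js (for illustration only, not the code added in this PR); the mount path, the df-based helper, and the hard-coded constants are assumptions:

// Illustrative sketch only; constants, mount path, and helper logic are assumptions.
const { execSync } = require('child_process');

const CHECK_INTERVAL_MS = 60 * 1000;         // the "60s" interval, configurable
const MAX_DISK_USAGE_PERCENT = 80;           // the "80%" threshold, configurable
const WATCHED_PATH = '/paiInternal/storage'; // assumed internal-storage mount point

function getDiskUsagePercent(path) {
  // `df --output=pcent <path>` prints a header line and then a value like " 42%".
  const output = execSync(`df --output=pcent ${path}`).toString();
  return parseInt(output.split('\n')[1].trim().replace('%', ''), 10);
}

function assertDiskUsageHealthy() {
  const usage = getDiskUsagePercent(WATCHED_PATH);
  if (usage > MAX_DISK_USAGE_PERCENT) {
    console.error(`Disk usage ${usage}% exceeds ${MAX_DISK_USAGE_PERCENT}%; stopping event watcher.`);
    process.exit(1); // non-zero exit code makes the failure visible to the controller
  }
}

// Check once at startup, then repeat every CHECK_INTERVAL_MS.
assertDiskUsageHealthy();
setInterval(assertDiskUsageHealthy, CHECK_INTERVAL_MS);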

Problem found:

  • If a job generates too many events, it will affect all other jobs.

fix (this message repeats across multiple follow-up commits)

Add index on task uid instead of framework name and attempt index (#4938)
@hzy46 hzy46 requested a review from yqwang-ms September 29, 2020 06:48
@coveralls

coveralls commented Sep 29, 2020

Coverage Status

Coverage remained the same at 34.383% when pulling 87d4225 on zhiyuhe/add_event_watcher into 9755553 on master.


async function assertDiskUsageHealthy() {
Member

Better to limit the quota per job and globally, so it does not impact the critical path.

Contributor Author

Recorded in #4953. We can solve this problem in the future.

# Max connection number to database in cluster event watcher.
cluster-event-max-db-connection: 40
# Max disk usage in internal storage for cluster event watcher
cluster-event-watcher-max-disk-usage-percent: 80
Member

Also limit for history? Why not move non-critical things to another DB server?

Contributor Author

@hzy46 hzy46 Oct 13, 2020

Recorded in #4954. We can solve this problem in the future.
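For illustration, one way the watcher could pick up the two settings quoted above, assuming the deployment exposes them as environment variables (the variable names and defaults here are assumptions, not the project's actual wiring):

// Hypothetical wiring: assumes the YAML settings above are mapped to env vars.
const maxDbConnection = parseInt(
  process.env.CLUSTER_EVENT_MAX_DB_CONNECTION || '40', 10);
const maxDiskUsagePercent = parseInt(
  process.env.CLUSTER_EVENT_WATCHER_MAX_DISK_USAGE_PERCENT || '80', 10);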

@hzy46 hzy46 mentioned this pull request Oct 12, 2020
@hzy46 hzy46 merged commit 16f55e5 into master Oct 13, 2020
@hzy46
Contributor Author

hzy46 commented Oct 23, 2020

Event Watcher Test Cases:

  1. Test: the event watcher works properly

Submit a job that will always be in waiting status (e.g. one that requests a lot of resources). After a few minutes, check whether there is a "failed scheduling" event on the job event page.

  2. Test: the event watcher can handle a large number of events.

Submit a job with 2000+ tasks. After a few minutes, check that the event page still works properly.

  3. Test: the event watcher exits when too much disk space is used.

Exec into the internal-storage pod to check the existing disk usage:

kubectl exec -it `kubectl get po  | grep internal-storage | awk '{print $1}' ` bash
df -h

Note the usage of the loop device mounted at /paiInternal/storage.

Create a big file under /paiInternal/storage so that the volume's usage exceeds 80%.

After a few minutes, confirm that: 1. a NodeFilesystemUsage alert is shown on the webportal; 2. the event watcher has exited automatically.

Remove the big file. After a few minutes, confirm that: 1. the NodeFilesystemUsage alert is gone; 2. the event watcher works properly again, and events of new jobs appear on the webportal.

@hzy46 hzy46 deleted the zhiyuhe/add_event_watcher branch November 3, 2020 09:13