This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Alert manager based utilization enhancement #4788

Closed
wants to merge 63 commits into from

Conversation

suiguoxin
Member

@suiguoxin suiguoxin commented Aug 5, 2020

This PR is for the first part of issue #4789

  • Alert-manager: kill jobs with low GPU utilization and tag abnormal jobs
    • add virtual cluster info to job-exporter
    • configure monitoring rules in Prometheus
    • send action requests through a webhook
    • job-handler: handle webhook requests and redirect them to RestServer
    • implement a customized SMTP service in alert-handler, send alert emails to users when possible, and switch the email template to ejs
    • document how to customize alerts/actions
  • Job tags:
    • DB: job-tag table
    • RestServer:
      • getJobList : filter by tag
      • getJobDetails : with tag info
      • executionJob : tag / untag
  • WebPortal abnormal jobs: refactor with tag filter

Implementation details:

  • add VC info to job-exporter
  • configure Prometheus to monitor jobs with low GPU usage in certain virtual clusters, then send email and call the webhook (alert-handler)
  • add alert-handler as a container in alert-manager; it forwards webhook calls to the REST server to stop jobs
  • customized configuration (bearer token / alert config) should be set in services-configuration.yaml.template
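
The webhook flow listed above (an alert fires, alert-manager calls the alert-handler webhook, alert-handler asks the REST server to stop the job) could be sketched roughly as follows. This is only an illustration, not the code in this PR. The alert list with `status` and `labels` fields follows the Alertmanager webhook payload format; the `job_name` label, the `LowGpuUtilization` alert name, the `REST_SERVER_URI` placeholder, and the `executionType` endpoint path are all assumptions.

```javascript
// Hypothetical sketch: turn an Alertmanager webhook payload into stop-job requests.
// Assumptions (not taken from this PR): alerts carry a `job_name` label, and the
// REST server stops a job via PUT <base>/api/v2/jobs/<job>/executionType.

const REST_SERVER_URI = 'http://rest-server.example'; // placeholder address

// Collect the unique job names from alerts that are currently firing.
function firingJobNames(payload) {
  const names = (payload.alerts || [])
    .filter((alert) => alert.status === 'firing')
    .map((alert) => alert.labels && alert.labels.job_name)
    .filter((name) => typeof name === 'string' && name.length > 0);
  return [...new Set(names)];
}

// Build one request descriptor per job; the caller sends them with any HTTP client.
function buildStopRequests(payload, bearerToken) {
  return firingJobNames(payload).map((jobName) => ({
    method: 'PUT',
    url: `${REST_SERVER_URI}/api/v2/jobs/${encodeURIComponent(jobName)}/executionType`,
    headers: {
      Authorization: `Bearer ${bearerToken}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ value: 'STOP' }),
  }));
}

// Example payload with one firing and one resolved alert (made-up job names).
const examplePayload = {
  status: 'firing',
  alerts: [
    { status: 'firing', labels: { alertname: 'LowGpuUtilization', job_name: 'alice~train-1' } },
    { status: 'resolved', labels: { alertname: 'LowGpuUtilization', job_name: 'bob~train-2' } },
  ],
};

console.log(buildStopRequests(examplePayload, '<token>').map((r) => r.url));
```

Keeping the request building separate from the HTTP call means the mapping from alerts to actions can be checked without a running REST server; only firing alerts produce a request, so resolved alerts never stop a job.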

@coveralls

coveralls commented Aug 6, 2020

Coverage Status

Coverage remained the same at 34.383% when pulling 27408d4 on suiguoxin:prometheus into 9755553 on microsoft:master.

@suiguoxin suiguoxin force-pushed the prometheus branch 2 times, most recently from f31c93b to c7de70c Compare August 7, 2020 06:27
@suiguoxin suiguoxin force-pushed the prometheus branch 4 times, most recently from 1373b8f to 2193336 Compare August 11, 2020 05:45
docs/manual/cluster-admin/how-to-customize-alerts.md (outdated)
Comment on lines +62 to +63
tags:
- 'stopped-by-alert-manager'
Contributor

We discussed moving tags under tag-jobs. Can we achieve this?

Member Author

@suiguoxin suiguoxin Sep 28, 2020

> We discussed about move tags under tag-jobs. Can we achieve this?

YAML does allow mixing dicts and strings in the same list, but doing so makes it harder to check in the template whether tag-jobs is one of the available actions. So I suggest keeping the current schema.
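
To illustrate the trade-off described in this reply, the sketch below compares a flat list of action names with a list that mixes strings and dicts. It is written in JavaScript purely for illustration (the check discussed here happens in the config template), and both schema shapes as well as the `email-users` action name are assumptions; only `tag-jobs`, `tags`, and `stopped-by-alert-manager` come from the thread.

```javascript
// Current-style schema (assumed shape): actions are plain strings, tags sit beside them.
const flatRule = {
  actions: ['email-users', 'tag-jobs'],
  tags: ['stopped-by-alert-manager'],
};

// Proposed-style schema (assumed shape): tags nested under the tag-jobs entry,
// so the list mixes strings and single-key dicts.
const mixedRule = {
  actions: ['email-users', { 'tag-jobs': { tags: ['stopped-by-alert-manager'] } }],
};

// With a flat list, membership is a single lookup.
function hasActionFlat(rule, name) {
  return rule.actions.includes(name);
}

// With a mixed list, every entry has to be inspected by type first.
function hasActionMixed(rule, name) {
  return rule.actions.some((entry) => {
    if (typeof entry === 'string') {
      return entry === name;
    }
    return entry !== null && typeof entry === 'object' &&
      Object.prototype.hasOwnProperty.call(entry, name);
  });
}

console.log(hasActionFlat(flatRule, 'tag-jobs'));
console.log(hasActionMixed(mixedRule, 'tag-jobs'));
```

Both checks give the same answer, but the mixed form needs a type branch for every entry, which is awkward to express in a template.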

docs/manual/cluster-admin/how-to-customize-alerts.md (outdated)
src/alert-manager/config/alert-manager.md
src/alert-manager/config/alert-manager.md
src/alert-manager/src/alert-handler/controllers/job.js (outdated)
Contributor

@Binyang2014 Binyang2014 left a comment

Thanks for this enhancement.

@suiguoxin
Member Author

Duplicated by #4940

@suiguoxin suiguoxin closed this Sep 29, 2020
@suiguoxin suiguoxin deleted the prometheus branch July 8, 2021 07:31
4 participants