This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Alert manager based utilization enhancement #4788

Closed
wants to merge 63 commits into from
Changes from 2 commits
Commits
7126a7e
add vc info in framework env
suiguoxin Jul 28, 2020
79d04a9
add low task_gpu_percent rule
suiguoxin Jul 29, 2020
649f474
use job level avg gpu percent
suiguoxin Jul 30, 2020
88614c4
update
suiguoxin Jul 30, 2020
06ff087
update
suiguoxin Jul 30, 2020
d8996e3
match multiple vcs
suiguoxin Jul 30, 2020
9532bb5
add job killer router
suiguoxin Jul 30, 2020
bf6f1f0
set web_hook url
suiguoxin Jul 30, 2020
b9e081a
test bearer token
suiguoxin Jul 31, 2020
a29fbc7
update
suiguoxin Jul 31, 2020
4d7f6dd
update
suiguoxin Jul 31, 2020
5855822
read from services config
suiguoxin Jul 31, 2020
ab44150
add alert handler container
suiguoxin Aug 4, 2020
30f5cd1
Merge branch 'master' into prometheus
suiguoxin Aug 6, 2020
f1b98e1
fix lint problem
suiguoxin Aug 6, 2020
45253ed
fix job exporter test issue
suiguoxin Aug 6, 2020
13c6871
fix docker inspect test
suiguoxin Aug 6, 2020
410bc32
add vcs and percent in configuration
suiguoxin Aug 6, 2020
f2ce9ff
refine doc and configuration
suiguoxin Aug 7, 2020
5b9a3ea
add admin manual for customization of alerts
suiguoxin Aug 11, 2020
c23ad72
refactor customize alerts doc
suiguoxin Aug 12, 2020
0d67312
refine user interface / doc
suiguoxin Aug 14, 2020
5a06bce
fix
suiguoxin Aug 14, 2020
4ce27e1
Merge remote-tracking branch 'msft/master' into prometheus
suiguoxin Aug 14, 2020
928f129
update
suiguoxin Aug 14, 2020
9748b86
rename webhook-actions, email-notification
suiguoxin Aug 17, 2020
5548b7f
update
suiguoxin Aug 17, 2020
eb77dca
disable stop-job action by default
suiguoxin Aug 25, 2020
3ad08c5
merge from master
suiguoxin Sep 7, 2020
1b8584f
init send-email
suiguoxin Sep 7, 2020
3f6a57f
update
suiguoxin Sep 7, 2020
b85cde8
update
suiguoxin Sep 8, 2020
1ac48a2
update
suiguoxin Sep 8, 2020
c93e697
fix lint problem
suiguoxin Sep 8, 2020
e617418
Merge branch 'prometheus' into customize-email
suiguoxin Sep 8, 2020
8115619
update
suiguoxin Sep 8, 2020
4d62677
replace email engine with webhook and change template engine to ejs
suiguoxin Sep 8, 2020
59d1404
send email to job user when job_name in alert
suiguoxin Sep 9, 2020
0a3c2ee
merge master
suiguoxin Sep 10, 2020
26385e8
fix typo
suiguoxin Sep 10, 2020
1e509a7
refine code structure with router and controller
suiguoxin Sep 11, 2020
848832c
refine log
suiguoxin Sep 11, 2020
9a981de
refine document and configuration file
suiguoxin Sep 16, 2020
2032084
split email user and admin actions, rename receiver to admin-receiver
suiguoxin Sep 21, 2020
6e0b2e2
merge master
suiguoxin Sep 22, 2020
b5f50a4
add tag-job action
suiguoxin Sep 22, 2020
7ec94fa
move customized alerts and matching rules to service configuration
suiguoxin Sep 22, 2020
a123430
move customized receivers to service configuration
suiguoxin Sep 22, 2020
36b226c
rename alertmanager config files
suiguoxin Sep 22, 2020
d71301d
fix yaml render issue
suiguoxin Sep 23, 2020
9a2aa7a
Merge remote-tracking branch 'msft/master' into prometheus-refine
suiguoxin Sep 24, 2020
33a5e39
fix openpaidbsdk copy issue
suiguoxin Sep 24, 2020
2501741
define alert-handler return code
suiguoxin Sep 24, 2020
d968954
refine admin doc : customzed alerts
suiguoxin Sep 25, 2020
12bf1b4
resolve document confict on alert
suiguoxin Sep 25, 2020
4981dbf
change logger
suiguoxin Sep 27, 2020
6c557f2
change doc link to msft master; fix return code issue in mail action
suiguoxin Sep 27, 2020
25146d7
Merge branch 'prometheus' into prometheus-logger
suiguoxin Sep 27, 2020
5ef913c
use winston logger
suiguoxin Sep 27, 2020
1acb47c
use module alias
suiguoxin Sep 27, 2020
8910f3e
refine response check in alert-handler
suiguoxin Sep 29, 2020
9c7cac4
update example services-configuration
suiguoxin Sep 29, 2020
27408d4
Merge remote-tracking branch 'msft/master' into prometheus
suiguoxin Sep 29, 2020
4 changes: 2 additions & 2 deletions docs/manual/cluster-admin/how-to-customize-alerts.md
@@ -32,7 +32,7 @@ The action `webportal-notification` is always enabled, which means that all the

All the other actions are realized in `alert-handler`.
To make these actions available, administrators need to properly fill the corresponding fields of `alert-manager` in `service-configuration.yml`,
-the available actions list will then be saved in `cluster_cfg["alert-manager"]["actions-available"]`, please refer to [alert-manager config](https://github.com/suiguoxin/pai/tree/prometheus/src/alert-manager/config/alert-manager.md) for details of alert-manager service configuration details.
+the available actions list will then be saved in `cluster_cfg["alert-manager"]["actions-available"]`, please refer to [alert-manager config](https://github.com/microsoft/pai/tree/master/src/alert-manager/config/alert-manager.md) for details of alert-manager service configuration details.

Make sure `job_name` presents in the alert body if you want to use `email-user`, `stop-jobs`, or `tag-jobs` actions.

@@ -134,7 +134,7 @@ The source code of `alert-handler` is available [here](https://github.com/micros
### Check the dependencies of the action

As stated before, to make an action available, administrators need to provide the necessary configurations.
-Check this [folder](https://github.com/suiguoxin/pai/tree/prometheus/src/alert-manager/config) and define the dependencies' rules for your customized actions.
+Check this [folder](https://github.com/microsoft/pai/tree/master/src/alert-manager/config) and define the dependencies' rules for your customized actions.


### Render the action to webhook configurations
7 changes: 4 additions & 3 deletions src/alert-manager/src/alert-handler/controllers/mail.js
@@ -181,9 +181,6 @@ const sendEmailToUser = async (req, res) => {
console.log(
`alert-handler successfully send email to ${username} at ${userEmail}`,
);
-res.status(200).json({
-message: `alert-handler successfully send email to ${username} at ${userEmail}`,
-});
})
.catch(function (data) {
console.error('alert-handler failed to send email to user');
@@ -193,6 +190,10 @@ const sendEmailToUser = async (req, res) => {
});
});
});

+res.status(200).json({
+message: `alert-handler successfully send email to users`,
+});
}
};
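
The `mail.js` change above moves the success response out of the per-user callback so that the handler replies once per request rather than once per recipient. As a possible variation on that idea (not the code in this PR), the sketch below waits for every send attempt to settle before sending the single response, so the status code can reflect failures. `sendOne`, `sendEmailToUsers`, and `makeRes` are hypothetical names introduced for illustration; the real handler takes `(req, res)` and uses its own mail transport.

```javascript
// Hypothetical stand-in for the real mail transport (an assumption, not the PR's code).
// Resolves when the user has an email address, rejects otherwise.
const sendOne = (user) =>
  user.email
    ? Promise.resolve(`sent to ${user.name}`)
    : Promise.reject(new Error(`no email for ${user.name}`));

// Attempt every send, then respond exactly once after all attempts have settled.
const sendEmailToUsers = async (users, res) => {
  const results = await Promise.allSettled(users.map((user) => sendOne(user)));
  const failed = results.filter((r) => r.status === 'rejected').length;
  if (failed > 0) {
    return res.status(500).json({
      message: `alert-handler failed to send email to ${failed} of ${users.length} users`,
    });
  }
  return res.status(200).json({
    message: 'alert-handler successfully sent email to users',
  });
};

// Minimal fake of an Express-style response object, so the sketch runs without a server.
// It records the status code, the body, and how many times json() was called.
const makeRes = () => {
  const res = { statusCode: null, body: null, calls: 0 };
  res.status = (code) => {
    res.statusCode = code;
    return res;
  };
  res.json = (body) => {
    res.body = body;
    res.calls += 1;
    return res;
  };
  return res;
};
```

Calling `sendEmailToUsers(users, makeRes())` with any list of users should leave `calls` at 1, which is the property the diff is after. Whether a partial failure should map to a 500 or to some other code is a design choice left to the maintainers.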