add reason label to the numTotalFailedNotifications metric #3094
Force-pushed from 106cf64 to b99dad3.
notify/notify.go
Outdated
    }
    r.metrics.numTotalFailedNotifications.WithLabelValues(r.integration.Name(), statusCodeCategory).Inc()
Shouldn't this increment still be inside the if statement?
Fixed
Force-pushed from b99dad3 to eb74eb5.
@gotjosh @roidelapluie Could you please take a look to see if this looks good to you?
Some ideas to improve this code, but generally looks amazing. Thanks for coming to Prometheus ContribFest and this contribution!
notify/notify.go
Outdated
    @@ -262,7 +265,7 @@ func NewMetrics(r prometheus.Registerer) *Metrics {
         Namespace: "alertmanager",
         Name: "notifications_failed_total",
         Help: "The total number of failed notifications.",
    -}, []string{"integration"}),
    +}, []string{"integration", "statusCode"}),
Let's be concise. Perhaps `code` makes more sense?
done
notify/util.go
Outdated
    return "5xx", nil
    }

    return "", fmt.Errorf("unexpected status code %v", statusCode)
Do we really want to fail application logic if the status code is unknown? I would just return the `unknown` string here.
The Notify function that calls this does not fail the logic; it falls back to the default when the status code is unexpected. I agree that an empty string is less self-explanatory, so I changed it to `unknown`.
notify/notify.go
Outdated
    @@ -293,7 +296,7 @@ func NewMetrics(r prometheus.Registerer) *Metrics {
         "telegram",
     } {
         m.numNotifications.WithLabelValues(integration)
    -    m.numTotalFailedNotifications.WithLabelValues(integration)
    +    m.numTotalFailedNotifications.WithLabelValues(integration, "")
This does not make a lot of sense... I think we want to pre-populate the counters here with each of the 3 possible status code values: `4xx`, `5xx` and `unknown`.
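To illustrate the suggestion, here is a minimal sketch of the nested loop that creates one zero-valued series per integration and status category. `counterVec`, `touch`, `inc` and `prePopulate` are hypothetical stand-ins written against the standard library only, not the Prometheus client API; in the PR itself the equivalent call would be `WithLabelValues(integration, code)` on the real counter.

```go
package main

import "fmt"

// counterVec is a hypothetical stand-in for a labelled counter family.
// It is NOT the Prometheus client type; it only models "one series per label pair".
type counterVec struct {
	series map[[2]string]float64
}

func newCounterVec() *counterVec {
	return &counterVec{series: make(map[[2]string]float64)}
}

// touch creates the series at zero if it does not exist yet.
func (c *counterVec) touch(integration, code string) {
	key := [2]string{integration, code}
	if _, ok := c.series[key]; !ok {
		c.series[key] = 0
	}
}

// inc increments the series, creating it on first use.
func (c *counterVec) inc(integration, code string) {
	c.series[[2]string{integration, code}]++
}

// prePopulate initialises one series per (integration, category) pair,
// so every combination is exported as 0 before any failure happens.
func prePopulate(c *counterVec, integrations, codes []string) {
	for _, integration := range integrations {
		for _, code := range codes {
			c.touch(integration, code)
		}
	}
}

func main() {
	c := newCounterVec()
	// The integration names and category values here are examples only.
	prePopulate(c, []string{"email", "telegram"}, []string{"4xx", "5xx", "unknown"})
	c.inc("telegram", "5xx")
	fmt.Println("series:", len(c.series))
}
```

Pre-populating this way means a dashboard or alert rule sees every label combination from startup, rather than only after the first failure of each kind.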
done
Force-pushed from edf6a31 to 6ce9839.
notify/util.go
Outdated
    // getFailureStatusCodeCategory return the status code category for failure request
    // the status starts with 4 will return 4xx and starts with 5 will return 5xx
    // other than 4xx and 5xx input status will return an error.
    func getFailureStatusCodeCategory(statusCode int) (string, error) {
Based on how we use this, I feel we should not return an error here, but just return the default value. WDYT? It would make the code cleaner and simpler.
I would rather return an error so that callers of the function know whether the status code was translated successfully. Otherwise they would have to compare the result against failureUnknownCategoryCode, which is awkward if failureUnknownCategoryCode is not the result they want in the error case. WDYT?
Hm, well, there are no other users of this function yet, and as YAGNI goes, we should assume there will be none. I would focus on our usage:
    result, interErr := getFailureStatusCodeCategory(e.StatusCode)
    if interErr == nil {
        statusCodeCategory = result
    }
We don't want to know IF the status was translated correctly; we just want `defaultStatusCodeCategory` in that case. So I would just return `defaultStatusCodeCategory`.
Also, now that we are talking about it, `util` is not a great place for it. It's pretty generic: we want something in util only if it has more than one usage. Given the function is very small and used in only one place, why not:
- Move this function into the `notify.go` file? OR
- Copy the body of this function into the caller? It might be too shallow for a function, but up to you (:
I agree with Bartek, the function should return `4xx` if the code is in [400,500), `5xx` if it is in [500,600), or fall back to `other`.
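A minimal sketch of the shape both reviewers are asking for: no error return, just a fallback value. The function and constant names are hypothetical, and the fallback string is written as `unknown` to match the earlier comments (this comment calls it `other`); this is not the code that was merged.

```go
package main

import "fmt"

// Hypothetical label values; the names are assumptions for illustration.
const (
	category4xx     = "4xx"
	category5xx     = "5xx"
	categoryUnknown = "unknown"
)

// statusCodeCategory maps an HTTP status code to a label value.
// It never fails: anything outside [400,600) falls back to categoryUnknown.
func statusCodeCategory(statusCode int) string {
	switch {
	case statusCode >= 400 && statusCode < 500:
		return category4xx
	case statusCode >= 500 && statusCode < 600:
		return category5xx
	default:
		return categoryUnknown
	}
}

func main() {
	for _, code := range []int{404, 503, 200, 0} {
		fmt.Printf("%d -> %s\n", code, statusCodeCategory(code))
	}
}
```

With this shape the caller no longer needs the `if interErr == nil` branch; it can pass the result straight to the label.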
Fixed
LGTM just one nit (:
Last nits, thanks!
notify/util.go
Outdated
    // UserAgentHeader is the default User-Agent for notification requests
    var UserAgentHeader = fmt.Sprintf("Alertmanager/%s", version.Version)

    // PossibleFailureStatusCategory is a list of possible failure status code category
    var PossibleFailureStatusCategory = []string{failure4xxCategoryCode, failure5xxCategoryCode, failureUnknownCategoryCode}
Why is it exported? Can we make it private and move it into `notify.go`?
moved to notify.go
notify/notify.go
Outdated
    if err != nil {
        r.metrics.numTotalFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
        if e, ok := errors.Cause(err).(*ErrorWithStatusCode); ok {
not sure why we need to call Cause() here
The error returned from `r.exec(ctx, l, alerts...)` has already been wrapped once with `errors.Wrapf`, so a direct type assertion on it would not match.
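To show why the unwrapping matters, here is a small sketch using only the standard library. Note the swap: the diff uses `errors.Cause` from `github.com/pkg/errors`, while this sketch uses `fmt.Errorf` with `%w` and `errors.As`, which play the same role for stdlib-wrapped errors. The `ErrorWithStatusCode` struct below is a simplified, hypothetical version of the type named in the diff, and `wrapOnce` and `statusCodeOf` are made-up helpers.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrorWithStatusCode is a simplified, hypothetical version of the type in the diff.
type ErrorWithStatusCode struct {
	Err        error
	StatusCode int
}

func (e *ErrorWithStatusCode) Error() string {
	return fmt.Sprintf("status %d: %v", e.StatusCode, e.Err)
}

// wrapOnce adds one layer of context, as a caller further up the stack might.
func wrapOnce(err error) error {
	return fmt.Errorf("notify failed: %w", err)
}

// statusCodeOf walks the wrap chain looking for an *ErrorWithStatusCode.
func statusCodeOf(err error) (int, bool) {
	var e *ErrorWithStatusCode
	if errors.As(err, &e) {
		return e.StatusCode, true
	}
	return 0, false
}

func main() {
	inner := &ErrorWithStatusCode{Err: errors.New("bad request"), StatusCode: 400}
	wrapped := wrapOnce(inner)

	// A direct type assertion only inspects the outermost error, so it misses.
	_, direct := wrapped.(*ErrorWithStatusCode)

	code, ok := statusCodeOf(wrapped)
	fmt.Println(direct, code, ok)
}
```

The direct assertion fails on the wrapped error while the chain-walking lookup succeeds, which is the same reason the diff calls `Cause()` before asserting.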
notify/notify.go
Outdated
    @@ -662,8 +668,16 @@ func NewRetryStage(i Integration, groupName string, metrics *Metrics) *RetryStag
     func (r RetryStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
         r.metrics.numNotifications.WithLabelValues(r.integration.Name()).Inc()
         ctx, alerts, err := r.exec(ctx, l, alerts...)
    +
    +    statusCodeCategory := defaultStatusCodeCategory
For integrations that don't implement `ErrorWithStatusCode`, this could be inaccurate (e.g. the receiving side might have returned a 400 status code). It's also not correct if Alertmanager couldn't even send the notification (a template error, for instance). And the email notifier is another case that doesn't fit here.
I wonder if we shouldn't abstract away from the HTTP situation and consider a more generic label like `reason` with a handful of possible values:
- server error (e.g. 5xx)
- client error (e.g. 4xx)
- authentication error
- template
- other
How about
- DestinationServerError
- DestinationClientError
- Fatal (all code error)
- Template
- Other
Authentication errors would belong to DestinationClientError.
If an integration doesn't implement ErrorWithStatusCode, the category would be Other.
WDYT?
@simonpasquier would this work?
I think that we can start with:
- ClientError (typically templating errors, 4xx responses)
- ServerError (typically 5xx responses from the server)
- Other (e.g. receivers for which we haven't implemented ClientError/ServerError logic yet)
And we can always add more in the future where we see fit (e.g. maybe we'll want to distinguish authn failures, serialization errors, templating errors, ...)
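One way to model the three starting values is a small enum with a `String()` method, sketched below. Only `DefaultReason` and its `String()` call appear in the later diff; the other constant names, the label strings and the `reasonForStatusCode` helper are assumptions made for illustration.

```go
package main

import "fmt"

// Reason is a hypothetical enum of failure reasons used as a metric label.
type Reason int

const (
	// DefaultReason covers receivers without client/server classification yet.
	DefaultReason Reason = iota
	ClientErrorReason
	ServerErrorReason
)

// String returns the label value; the exact strings are assumptions.
func (r Reason) String() string {
	switch r {
	case ClientErrorReason:
		return "clientError"
	case ServerErrorReason:
		return "serverError"
	default:
		return "other"
	}
}

// reasonForStatusCode is a hypothetical helper mapping an HTTP status code to a Reason.
func reasonForStatusCode(code int) Reason {
	switch {
	case code >= 400 && code < 500:
		return ClientErrorReason
	case code >= 500 && code < 600:
		return ServerErrorReason
	default:
		return DefaultReason
	}
}

func main() {
	for _, code := range []int{404, 503, 0} {
		fmt.Printf("%d -> %s\n", code, reasonForStatusCode(code))
	}
}
```

Keeping the reasons in one closed set makes it cheap to add finer-grained values later (authn, serialization, templating) without touching every notifier at once.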
done
Force-pushed from 8b09c90 to 73c8b96.
Force-pushed from 0aac23f to 8940bb2.
Force-pushed from 8940bb2 to 205b802.
LGTM!
All my comments are nits - but please let's address them, and we're ready to merge. Assuming that at this point, both @simonpasquier and @bwplotka's comments are resolved and they have nothing else to add.
    @@ -662,8 +665,13 @@ func NewRetryStage(i Integration, groupName string, metrics *Metrics) *RetryStag
     func (r RetryStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
         r.metrics.numNotifications.WithLabelValues(r.integration.Name()).Inc()
         ctx, alerts, err := r.exec(ctx, l, alerts...)
    +
    +    failureReason := DefaultReason.String()
I've filed #3231 so that we can do the same for all other notifiers.
Force-pushed from 00b8741 to 25b9227.
Signed-off-by: Yijie Qin <qinyijie@amazon.com>
… when unknown status code Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Force-pushed from 25b9227 to 3f5d3a2.
The build is broken right now but the tests pass. I'll assume everything is in proper order here until we get that fixed, then merge this.
Thanks very much for your contribution!
…etheus#3094) * add reason label to the numTotalFailedNotifications metric Signed-off-by: Yijie Qin <qinyijie@amazon.com>
This follows up on the discussion in issue #2927.
Instead of creating a new metric, we add a status code label to the numTotalFailedNotifications metric. The default status code is "5xx". If an integration's receiver returns an error created with NewErrorWithStatusCode, we translate the returned status code and use it to override the label.