-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Alerting] remove internal OCC issues with alertsClient #76830
Conversation
207c886
to
4062d23
Compare
During development of elastic#75553, some issues came up with the optimistic concurrency control (OCC) we were using internally within the alertsClient, via the `version` option/property of the saved object. The referenced PR updates new fields in the alert from the taskManager task after the alertType executor runs. In some alertsClient methods, OCC is used to update the alert which are requested via user requests. And so in some cases, version conflict errors were coming up when the alert was updated by task manager, in the middle of one of these methods. Note: the SIEM function test cases stress test this REALLY well. In this PR, we remove OCC from methods that were currently using it, namely `update()`, `updateApiKey()`, `enable()`, `disable()`, and the `[un]mute[All,Instance]()` methods. Of these methods, OCC is really only _practically_ needed by `update()`, but even for that we don't support OCC in the API, yet; see: issue elastic#74381 . For cases where we know only attributes not contributing to AAD are being updated, a new function is provided that does a partial update on just those attributes, making partial updates for those attributes a bit safer. That will be used by PR elastic#75553.
4062d23
to
384d0b6
Compare
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
This is ready for review. I also tried an alternative, of wrapping all the AAD-mutating, and thus "full updates required" calls, like this: public async enable({ id }: { id: string }): Promise<void> {
return await retryForConflicts(
this.logger,
`enable(${id})`,
async () => await this.enableBasic({ id })
);
}
private async enableBasic({ id }: { id: string }): Promise<void> {
// ... body of original enable() method here
} where That felt a little gross - we'd have to make sure the "retryable" function didn't leak anything between calls. In practice, it seemed to fix the issues seen in the SIEM tests, once the alert executor started updating the new executionStatus fields. If we feel uncomfortable with the overwriting that's done in the PR at this point, we could switch one or more of the methods to use that technique instead. The |
💚 Build SucceededBuild metrics
History
To update your PR or re-run it, just comment with: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Left a note about one tweak I think is worth doing to make sure our ESO and client stay in sync, but other than that all good 👍
type PartiallyUpdateableAlertAttributes = Partial< | ||
Pick<RawAlert, AlertAttributesExcludedFromAADType> | ||
>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love it :)
export const AlertAttributesExcludedFromAAD = [ | ||
'scheduledTaskId', | ||
'muteAll', | ||
'mutedInstanceIds', | ||
'updatedBy', | ||
]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be worth linking this to the values we pass into ESO here:
attributesToExcludeFromAAD: new Set([ |
That way they can't go out of sync
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already did - easier to see if you look at the complete file instead of the diff; in particular, line 42 (of course!) - or maybe I'm misunderstanding.
At one point I had all this stuff separated across the two files, decided it would be better to have it all in one place, hopefully the type and attr names array will be easy to keep in sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, weird, not sure how I missed that 😳
Thanks Patrick
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL something like this should work:
export const AlertAttributesExcludedFromAAD = [
'scheduledTaskId',
'muteAll',
'mutedInstanceIds',
'updatedBy',
];
export type AlertAttributesExcludedFromAADType = typeof AlertAttributesExcludedFromAAD[number];
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking on these changes after having an offline conversation with @pmuellr. I think the approach in this PR will work. It gets a bit tricky in the mute / unmute instance APIs where I think we may have to keep OCC there for now. There's a few other places we could omit values further to keep the partial updates limited to what is requested to change.
I think in the future as we add version
support to our APIs, it will enhance the default behaviour introduced in this PR for those who want to make sure no other updates happen concurrently (with the caveat that updating alert status happens on each execution).
@@ -495,7 +499,6 @@ export class AlertsClient { | |||
updatedBy: username, | |||
}, | |||
{ | |||
version, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The argument above could omit partially updatable values since they don't change in this API.
await this.unsecuredSavedObjectsClient.update('alert', id, { | ||
...attributes, | ||
enabled: true, | ||
...this.apiKeyAsAlertAttributes( | ||
await this.createAPIKey(this.generateAPIKeyName(attributes.alertTypeId, attributes.name)), | ||
username | ||
), | ||
updatedBy: username, | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could omit other partially updatable attributes (minus enable
since we need it here).
await this.unsecuredSavedObjectsClient.update('alert', id, { | ||
...attributes, | ||
enabled: false, | ||
scheduledTaskId: null, | ||
apiKey: null, | ||
apiKeyOwner: null, | ||
updatedBy: await this.getUserName(), | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here:
Could omit other partially updatable attributes (minus enable since we need it here).
await partiallyUpdateAlertSavedObject(this.unsecuredSavedObjectsClient, alertId, { | ||
mutedInstanceIds, | ||
updatedBy: await this.getUserName(), | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be conflicting UX when dropping version
. If you call the API to mute two instances, there may only be one instance that ends up getting muted (last update call wins). I think we can keep version
for this scenario.
await partiallyUpdateAlertSavedObject(this.unsecuredSavedObjectsClient, alertId, { | ||
updatedBy: await this.getUserName(), | ||
mutedInstanceIds: mutedInstanceIds.filter((id: string) => id !== alertInstanceId), | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here:
This could be conflicting when dropping
version
. If you call the API to mute two instances, there may only be one instance that ends up getting muted (last update call wins). I think we can keepversion
for this scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't realize the API really only let's you change a single instance in the [un]muteInstance()
APIs. So I think you're right, we may get a race condition with customers clicking on mute/unmute of instances.
In this case, I think we should probably go with a story where we keep the existing OCC calls, but wrapper the relevant alertsClient methods so they do a small, fixed number of retries (eg, 5) with a slight delay (eg, 250ms) in the case that they get 409 conflict errors. As noted in this comment #76830 (comment) - I tried out this technique as well and it managed to get the SIEM function tests working, which is where I noticed the problem.
The reason I want to do it that way is that it should avoid having the customers see 409 conflict results because of our updates to the upcoming execution_status
fields. Customers would end up having to have their own 409 conflict retry logic around EVERY alert API.
Once we have "version" support for our APIs (or just for update()
?), we can bypass the retries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ I like such approach; keeping OCC for now with retry logic. This will be a better experience 👍
The delay could also be a random number between 0 and 250ms in case the API is called at the same time for multiple instances (ex: in a for loop per instance).
closing in lieu of #77838, based on some comments here and elsewhere |
Summary
During development of PR #75553, some issues came up with the
optimistic concurrency control (OCC) we were using internally within
the alertsClient, via the
version
option/property of the savedobject. The referenced PR updates new fields in the alert from the
taskManager task after the alertType executor runs. In some
alertsClient methods, OCC is used to update the alert which are
requested via user requests. And so in some cases, version conflict
errors were coming up when the alert was updated by the task manager
task, in the middle of one of these methods. Note: the SIEM function
test cases stress test this REALLY well.
In this PR, we remove OCC from methods that were currently using it,
namely
update()
,updateApiKey()
,enable()
,disable()
, and the[un]mute[All,Instance]()
methods. Of these methods, OCC is reallyonly practically needed by
update()
, but even for that we don'tsupport OCC in the API itself, yet; see: issue #74381
For cases where we know only attributes not contributing to AAD are
being updated, a new function is provided that does a partial update
on just those attributes, making partial updates for those attributes
a bit safer. That is used by the
*mute*()
methods and will be usedby PR #75553.
Note that the remaining methods which won't have OCC -
updateApiKey()
,enable()
anddisable()
(update()
will eventually have OCC in theAPI) are doing full updates because they update fields contributing to
AAD. That means it's possible for data to be overwritten without any
conflict notification, if these methods are called at the same time as
other methods.
Checklist
Delete any items that are not applicable to this PR.