Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alerting] remove internal OCC issues with alertsClient #76830

Closed

Conversation

pmuellr
Copy link
Member

@pmuellr pmuellr commented Sep 4, 2020

Summary

During development of PR #75553, some issues came up with the
optimistic concurrency control (OCC) we were using internally within
the alertsClient, via the version option/property of the saved
object. The referenced PR updates new fields in the alert from the
taskManager task after the alertType executor runs. In some
alertsClient methods, OCC is used to update the alert which are
requested via user requests. And so in some cases, version conflict
errors were coming up when the alert was updated by the task manager
task, in the middle of one of these methods. Note: the SIEM function
test cases stress test this REALLY well.

In this PR, we remove OCC from methods that were currently using it,
namely update(), updateApiKey(), enable(), disable(), and the
[un]mute[All,Instance]() methods. Of these methods, OCC is really
only practically needed by update(), but even for that we don't
support OCC in the API itself, yet; see: issue #74381

For cases where we know only attributes not contributing to AAD are
being updated, a new function is provided that does a partial update
on just those attributes, making partial updates for those attributes
a bit safer. That is used by the *mute*() methods and will be used
by PR #75553.

Note that the remaining methods which won't have OCC - updateApiKey(),
enable() and disable() (update() will eventually have OCC in the
API) are doing full updates because they update fields contributing to
AAD. That means it's possible for data to be overwritten without any
conflict notification, if these methods are called at the same time as
other methods.

Checklist

Delete any items that are not applicable to this PR.

@pmuellr pmuellr added Feature:Alerting v8.0.0 release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.10.0 labels Sep 4, 2020
@pmuellr pmuellr self-assigned this Sep 8, 2020
@pmuellr pmuellr force-pushed the alerting/partial-update-and-occ branch from 207c886 to 4062d23 Compare September 8, 2020 21:06
During development of elastic#75553,
some issues came up with the optimistic concurrency control (OCC) we
were using internally within the alertsClient, via the `version`
option/property of the saved object.  The referenced PR updates new
fields in the alert from the taskManager task after the alertType
executor runs.  In some alertsClient methods, OCC is used to update
the alert which are requested via user requests.  And so in some
cases, version conflict errors were coming up when the alert was
updated by task manager, in the middle of one of these methods.  Note:
the SIEM function test cases stress test this REALLY well.

In this PR, we remove OCC from methods that were currently using it,
namely `update()`, `updateApiKey()`, `enable()`, `disable()`, and the
`[un]mute[All,Instance]()` methods.  Of these methods, OCC is really
only _practically_ needed by `update()`, but even for that we don't
support OCC in the API, yet; see: issue elastic#74381 .

For cases where we know only attributes not contributing to AAD are
being updated, a new function is provided that does a partial update
on just those attributes, making partial updates for those attributes
a bit safer.  That will be used by PR elastic#75553.
@pmuellr pmuellr force-pushed the alerting/partial-update-and-occ branch from 4062d23 to 384d0b6 Compare September 8, 2020 21:08
@pmuellr pmuellr changed the title [Alerting] remove / minimize internal OCC issues with alertsClient [Alerting] remove internal OCC issues with alertsClient Sep 8, 2020
@pmuellr pmuellr marked this pull request as ready for review September 8, 2020 21:39
@pmuellr pmuellr requested a review from a team as a code owner September 8, 2020 21:39
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Copy link
Member Author

pmuellr commented Sep 8, 2020

This is ready for review.

I also tried an alternative, of wrapping all the AAD-mutating, and thus "full updates required" calls, like this:

  public async enable({ id }: { id: string }): Promise<void> {
    return await retryForConflicts(
      this.logger,
      `enable(${id})`,
      async () => await this.enableBasic({ id })
    );
  }


  private async enableBasic({ id }: { id: string }): Promise<void> {
// ... body of original enable() method here
  }

where retryForConfliicts(() will retry the requests a small fixed number of times, with a small delay between retries, only doing retries if the thrown error was a 409 conflict.

That felt a little gross - we'd have to make sure the "retryable" function didn't leak anything between calls. In practice, it seemed to fix the issues seen in the SIEM tests, once the alert executor started updating the new executionStatus fields.

If we feel uncomfortable with the overwriting that's done in the PR at this point, we could switch one or more of the methods to use that technique instead. The update()

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Build metrics

✅ unchanged

History

  • 💚 Build #72791 succeeded 207c886f558a9bb13677b6da7ce6c20b16fa9551
  • 💔 Build #72776 failed 809e7bf5cf333ecd1c3f059d907b6b9601243415

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@gmmorris gmmorris self-requested a review September 9, 2020 13:54
Copy link
Contributor

@gmmorris gmmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Left a note about one tweak I think is worth doing to make sure our ESO and client stay in sync, but other than that all good 👍

Comment on lines +910 to +912
type PartiallyUpdateableAlertAttributes = Partial<
Pick<RawAlert, AlertAttributesExcludedFromAADType>
>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love it :)

Comment on lines +12 to +17
export const AlertAttributesExcludedFromAAD = [
'scheduledTaskId',
'muteAll',
'mutedInstanceIds',
'updatedBy',
];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be worth linking this to the values we pass into ESO here:

attributesToExcludeFromAAD: new Set([

That way they can't go out of sync

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already did - easier to see if you look at the complete file instead of the diff; in particular, line 42 (of course!) - or maybe I'm misunderstanding.

At one point I had all this stuff separated across the two files, decided it would be better to have it all in one place, hopefully the type and attr names array will be easy to keep in sync.

https://github.com/elastic/kibana/blob/384d0b69e1bcaacb01b37741aaa058f40df73c30/x-pack/plugins/alerts/server/saved_objects/index.ts

Copy link
Contributor

@gmmorris gmmorris Sep 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, weird, not sure how I missed that 😳
Thanks Patrick

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL something like this should work:

export const AlertAttributesExcludedFromAAD = [
  'scheduledTaskId',
  'muteAll',
  'mutedInstanceIds',
  'updatedBy',
];
export type AlertAttributesExcludedFromAADType = typeof AlertAttributesExcludedFromAAD[number];

@mikecote mikecote self-requested a review September 9, 2020 14:17
Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking on these changes after having an offline conversation with @pmuellr. I think the approach in this PR will work. It gets a bit tricky in the mute / unmute instance APIs where I think we may have to keep OCC there for now. There's a few other places we could omit values further to keep the partial updates limited to what is requested to change.

I think in the future as we add version support to our APIs, it will enhance the default behaviour introduced in this PR for those who want to make sure no other updates happen concurrently (with the caveat that updating alert status happens on each execution).

@@ -495,7 +499,6 @@ export class AlertsClient {
updatedBy: username,
},
{
version,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The argument above could omit partially updatable values since they don't change in this API.

Comment on lines +621 to +629
await this.unsecuredSavedObjectsClient.update('alert', id, {
...attributes,
enabled: true,
...this.apiKeyAsAlertAttributes(
await this.createAPIKey(this.generateAPIKeyName(attributes.alertTypeId, attributes.name)),
username
),
updatedBy: username,
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could omit other partially updatable attributes (minus enable since we need it here).

Comment on lines +667 to +674
await this.unsecuredSavedObjectsClient.update('alert', id, {
...attributes,
enabled: false,
scheduledTaskId: null,
apiKey: null,
apiKeyOwner: null,
updatedBy: await this.getUserName(),
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here:

Could omit other partially updatable attributes (minus enable since we need it here).

Comment on lines +739 to +742
await partiallyUpdateAlertSavedObject(this.unsecuredSavedObjectsClient, alertId, {
mutedInstanceIds,
updatedBy: await this.getUserName(),
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be conflicting UX when dropping version. If you call the API to mute two instances, there may only be one instance that ends up getting muted (last update call wins). I think we can keep version for this scenario.

Comment on lines +765 to +768
await partiallyUpdateAlertSavedObject(this.unsecuredSavedObjectsClient, alertId, {
updatedBy: await this.getUserName(),
mutedInstanceIds: mutedInstanceIds.filter((id: string) => id !== alertInstanceId),
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here:

This could be conflicting when dropping version. If you call the API to mute two instances, there may only be one instance that ends up getting muted (last update call wins). I think we can keep version for this scenario.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't realize the API really only let's you change a single instance in the [un]muteInstance() APIs. So I think you're right, we may get a race condition with customers clicking on mute/unmute of instances.

In this case, I think we should probably go with a story where we keep the existing OCC calls, but wrapper the relevant alertsClient methods so they do a small, fixed number of retries (eg, 5) with a slight delay (eg, 250ms) in the case that they get 409 conflict errors. As noted in this comment #76830 (comment) - I tried out this technique as well and it managed to get the SIEM function tests working, which is where I noticed the problem.

The reason I want to do it that way is that it should avoid having the customers see 409 conflict results because of our updates to the upcoming execution_status fields. Customers would end up having to have their own 409 conflict retry logic around EVERY alert API.

Once we have "version" support for our APIs (or just for update()?), we can bypass the retries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ I like such approach; keeping OCC for now with retry logic. This will be a better experience 👍

The delay could also be a random number between 0 and 250ms in case the API is called at the same time for multiple instances (ex: in a for loop per instance).

@pmuellr
Copy link
Member Author

pmuellr commented Sep 17, 2020

closing in lieu of #77838, based on some comments here and elsewhere

@pmuellr pmuellr closed this Sep 17, 2020
@pmuellr pmuellr added the backport:skip This commit does not require backporting label Sep 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.10.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants