[eventLog] retry resource creation at initialization time #136363

Merged
merged 3 commits into main on Jul 19, 2022

Conversation

pmuellr
Member

@pmuellr pmuellr commented Jul 14, 2022

Summary

resolves #134098

Adds retry logic to the initialization of elasticsearch resources when Kibana starts up. Recently this seems to have become a more noticeable error: race conditions occur where two Kibanas initializing a new stack version race to create the event log resources.

We believe we'll see the end of these issues with some retries, chunked around the 4 resource-y sections of the initialization code.

We're using p-retry (which uses retry) to do an exponential backoff starting at 2s, then 4s, 8s, 16s, with 4 retries (so 5 actual attempted calls). Some randomness is added, since there's a race on and we don't want all the Kibanas hitting the error to retry at exactly the same time.
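As a rough sketch of what that wrapping looks like (the helper name and logger shape here are illustrative, not the actual PR code):

import pRetry from 'p-retry';

const MAX_RETRIES = 4; // 5 total attempts
const RETRY_DELAY_MS = 2000; // ~2s, then ~4s, 8s, 16s

// Hypothetical helper: wrap one resource-initialization step with retries.
async function initStepWithRetry(
  step: () => Promise<void>,
  logger: { warn(message: string): void }
): Promise<void> {
  await pRetry(step, {
    retries: MAX_RETRIES,
    minTimeout: RETRY_DELAY_MS,
    factor: 2,
    randomize: true, // jitter, so racing Kibanas don't all retry at the same moment
    onFailedAttempt: (error) => logger.warn(`initialization step failed, retrying: ${error.message}`),
  });
}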

Here's what I think is happening. The logic to create resources is basically this (a code sketch follows the list), trying to take into account that multiple Kibanas may be starting at the same time:

  • ask ES if resource exists
  • if it does, we're done
  • if it doesn't, try to create it
  • if created successfully, we're done
  • if error, see if someone else created it in the meantime
  • if so, we're done
  • if it still doesn't exist, error condition
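A minimal TypeScript sketch of that flow; the exists / create callbacks are hypothetical stand-ins for the real elasticsearch calls:

// Sketch of the create-if-missing flow described in the list above.
async function createResourceIfNotExists(
  exists: () => Promise<boolean>,
  create: () => Promise<void>
): Promise<void> {
  if (await exists()) return; // already there, we're done
  try {
    await create(); // try to create it ourselves
  } catch (err) {
    // creation failed; maybe another Kibana created it in the meantime
    if (await exists()) return;
    // it still doesn't exist: this is the error condition
    throw new Error(`resource does not exist and could not be created: ${err}`);
  }
}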

And here's a flow that would cause the error we're seeing. If, after getting an error trying to create the resource, we ask whether it exists yet and it does NOT, it may be that it just hasn't finished being created. In this case Kibana A is the Kibana that ends up creating the resource, and to Kibana B this becomes an error situation, since it thinks the resource does not exist and couldn't be created.

sequenceDiagram
    participant KibanaA as Kibana A
    participant ES as Elasticsearch
    participant KibanaB as Kibana B
    KibanaA->>+ES: does resource exist?
    ES->>-KibanaA: no
    KibanaB->>+ES: does resource exist?
    ES->>-KibanaB: no
    KibanaA->>+ES: create resource
    KibanaB->>+ES: create resource
    ES->>-KibanaB: error creating resource
    KibanaB->>+ES: does resource exist?
    ES->>-KibanaB: no
    note right of KibanaB: error condition!
    ES->>-KibanaA: resource created

With the retries in place, the "error condition!" above will cause the retry logic to kick in, so there will be a delay, and then Kibana B will start the process all over again for this resource. Presumably the resource will have been created by the time the retries run, and so Kibana B will end up seeing the resource as existing, and then it's done.

Note: whether or not this is what's actually happening in the cases we've seen kinda doesn't matter - it's something you can certainly imagine happening, so we should assume it can. We'll find out over time whether this "fixes" the existing cases, if we stop seeing the errors. If we continue to see them, then something else is broken.

Since the event log's resource initialization is designed to create resources as needed, retrying the resource creation should allow it to eventually succeed. At the very least this should lessen the number of times this occurs. We may need to play with the retry count and delays if it's still not quite enough.

@pmuellr pmuellr added release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:EventLog backport:prev-major Backport to (8.x, 8.18, 8.17, 8.16) the previous major branch and other branches in development labels Jul 14, 2022
@pmuellr pmuellr marked this pull request as ready for review July 14, 2022 12:07
@pmuellr pmuellr requested a review from a team as a code owner July 14, 2022 12:07
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr
Member Author

pmuellr commented Jul 14, 2022

@elasticmachine merge upstream

logger: Logger;
esNames: EsNames;
esAdapter: IClusterClientAdapter;
readonly logger: Logger;
Member Author

This is just some clean-up I noticed while editing the file. This structure is not used outside of the event log, but the fields should be read-only, as they are populated by a function call that creates and returns an object implementing this interface. Figured it would be helpful to make sure we don't accidentally update these fields in our code ...

Same with initialized a few lines below, and the new retryDelay uses the same pattern.
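A sketch of the pattern (placeholder types; the real interface name and full field list may differ):

// Placeholder types standing in for the real Kibana ones.
type Logger = { warn(message: string): void };
type EsNames = Record<string, string>;
type IClusterClientAdapter = unknown;

// Fields are readonly because they're only set when the context object is
// created by the factory function; nothing should reassign them afterwards.
interface EsContextFields {
  readonly logger: Logger;
  readonly esNames: EsNames;
  readonly esAdapter: IClusterClientAdapter;
  readonly initialized: boolean; // per the comment above, also readonly
  readonly retryDelay: number; // new field, same pattern
}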

@@ -12,6 +12,8 @@ import { namesMock } from './names.mock';
import { IClusterClientAdapter } from './cluster_client_adapter';
import { clusterClientAdapterMock } from './cluster_client_adapter.mock';

export const MOCK_RETRY_DELAY = 20;
Member Author

The mock uses a retry delay of 20ms, so we can run the retries without the special jest clock stuff and have the tests finish quickly; the REAL context object uses 2000ms.

I did take a stab at using the special jest clock stuff, but it looks like it's going to be difficult to get working, as we'd need to "run all the timers" at points where we currently have no control.
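Roughly, the idea is something like this (illustrative only, not the actual mock file):

// Sketch: the mocked context uses a tiny retry delay so tests that exercise
// the retries finish quickly; the real context uses a 2000ms base delay.
export const MOCK_RETRY_DELAY = 20;

export function createContextMock() {
  return {
    retryDelay: MOCK_RETRY_DELAY, // real code: 2000
    // ...other mocked fields (logger, esNames, esAdapter, etc.)
  };
}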

@pmuellr
Member Author

pmuellr commented Jul 15, 2022

buildkite test this

@ymao1 ymao1 (Contributor) left a comment

I tried verifying this by forcing an error in doesIlmPolicyExist. I did this before starting ES/Kibana locally. Noticed that I see the function getting called and the error getting thrown but I did not see any retries for this function. Then I made a change to add another console log and Kibana restarted, and then I did see the function getting called, the error getting thrown and the retries occur. Not sure why it wouldn't have retried on the first startup?

@mikecote mikecote (Contributor) left a comment

Changes LGTM, great catch!

@mikecote
Contributor

mikecote commented Jul 18, 2022

Noticed that I see the function getting called and the error getting thrown but I did not see any retries for this function.

@ymao1

I added the following code within createIlmPolicyIfNotExists (including a variable at the top of the file)

// at the top of the file:
let errorThrown = false;

  async createIlmPolicyIfNotExists(): Promise<void> {
    if (!errorThrown) {
      errorThrown = true;
      throw new Error('error!');
    }
    // ... original method body continues

I saw the warning log and nothing after, but the event log assets were created successfully on my run. Could it be that the retries are logged as warnings but nothing is logged on a successful retry, making us think it failed?

@ymao1
Contributor

ymao1 commented Jul 19, 2022

@mikecote
Ah! You're right. User error 🙈

@ymao1 ymao1 (Contributor) left a comment

LGTM!

@ymao1
Contributor

ymao1 commented Jul 19, 2022

@elasticmachine merge upstream

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@ymao1 ymao1 merged commit f6e4c2f into elastic:main Jul 19, 2022
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Jul 19, 2022
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Jul 19, 2022
@kibanamachine
Contributor

💚 All backports created successfully

Status Branch Result
7.17
8.3

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Jul 19, 2022
kibanamachine added a commit that referenced this pull request Jul 19, 2022
@pmuellr
Member Author

pmuellr commented Jul 28, 2022

I tried verifying this by forcing an error in doesIlmPolicyExist. I did this before starting ES/Kibana locally. Noticed that I see the function getting called and the error getting thrown but I did not see any retries for this function. Then I made a change to add another console log and Kibana restarted, and then I did see the function getting called, the error getting thrown and the retries occur. Not sure why it wouldn't have retried on the first startup?

I always run node scripts/build_kibana_platform_plugins after yarn kbn bootstrap, because at one point the plugin build would keep Kibana from starting, or maybe it would restart after it had started building the plugins, or something. So I'm wondering if it's possible that first run was not using the code with your explicit throw in it until after that build was done, which maybe was before your test? Seems unlikely, but it's the only thing I can guess. Or random weird node issues.

@pmuellr
Member Author

pmuellr commented Jul 28, 2022

I saw the warning log and nothing after, but the event log assets were created successfully on my run. Could it be the retries are logged as warnings and don't follow up on a successful retry, making us think it failed?

I had a thought that maybe we should add a "success" message after at least one retry, but I think I ended up deciding this would just be more noise. Now I'm not sure :-). Seems like it wouldn't be that noisy, and would probably save someone 10 minutes diagnosing a problem here. Should we open a new issue?
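Something along these lines, purely as a sketch of the suggestion (names are illustrative, not code from this PR):

// Sketch only: a manual retry loop that logs a success message when a step
// recovers after one or more failed attempts.
async function runWithRetries(
  step: () => Promise<void>,
  logger: { warn(message: string): void; info(message: string): void },
  retries = 4,
  baseDelayMs = 2000
): Promise<void> {
  for (let attempt = 1; ; attempt++) {
    try {
      await step();
      if (attempt > 1) {
        logger.info(`initialization step succeeded after ${attempt} attempts`);
      }
      return;
    } catch (err) {
      if (attempt > retries) throw err;
      logger.warn(`initialization step failed, retrying: ${err}`);
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}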

Labels
backport:prev-major Backport to (8.x, 8.18, 8.17, 8.16) the previous major branch and other branches in development ci:cloud-deploy Create or update a Cloud deployment Feature:EventLog release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.17.6 v8.3.3 v8.4.0
Development

Successfully merging this pull request may close these issues.

[ResponseOps]: error creating event log index template at startup
8 participants