[eventLog] retry resource creation at initialization time #136363

Merged
merged 3 commits into main on Jul 19, 2022

Conversation

pmuellr
Member

@pmuellr pmuellr commented Jul 14, 2022

Summary

resolves #134098

Adds retry logic to the initialization of elasticsearch resources when Kibana starts up. Recently this seems to have become a more noticeable error: race conditions occur where two Kibanas initializing a new stack version race to create the event log resources.

We believe we'll see the end of these issues with some retries, chunked around the 4 resource-y sections of the initialization code.

We're using p-retry (which uses retry) to do an exponential backoff starting at 2s, then 4s, 8s, 16s, with 4 retries (so 5 actual attempted calls). Some randomness is added, since there's a race on and we don't want all the Kibanas hitting the error to retry at exactly the same time.
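As a rough sketch of what that wrapping looks like (the helper name and logger shape here are illustrative, not the actual PR code):

import pRetry from 'p-retry';

const MAX_RETRIES = 4; // 5 total attempts
const RETRY_DELAY_MS = 2000; // ~2s, then ~4s, 8s, 16s

// Hypothetical helper: wrap one resource-initialization step with retries.
async function initStepWithRetry(
  step: () => Promise<void>,
  logger: { warn(message: string): void }
): Promise<void> {
  await pRetry(step, {
    retries: MAX_RETRIES,
    minTimeout: RETRY_DELAY_MS,
    factor: 2,
    randomize: true, // jitter, so racing Kibanas don't all retry at the same moment
    onFailedAttempt: (error) => logger.warn(`initialization step failed, retrying: ${error.message}`),
  });
}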

Here's what I think is happening. The logic to create resources is basically this (a code sketch follows the list), trying to take into account that multiple Kibanas may be starting at the same time:

  • ask ES if resource exists
  • if it does, we're done
  • if it doesn't, try to create it
  • if created successfully, we're done
  • if error, see if someone else created it in the meantime
  • if so, we're done
  • if it still doesn't exist, error condition
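A minimal TypeScript sketch of that flow; the exists / create callbacks are hypothetical stand-ins for the real elasticsearch calls:

// Sketch of the create-if-missing flow described in the list above.
async function createResourceIfNotExists(
  exists: () => Promise<boolean>,
  create: () => Promise<void>
): Promise<void> {
  if (await exists()) return; // already there, we're done
  try {
    await create(); // try to create it ourselves
  } catch (err) {
    // creation failed; maybe another Kibana created it in the meantime
    if (await exists()) return;
    // it still doesn't exist: this is the error condition
    throw new Error(`resource does not exist and could not be created: ${err}`);
  }
}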

And here's a flow that would cause the error we're seeing. If, after getting an error trying to create the resource, we ask whether it exists yet and it does NOT, it may be that it just hasn't finished being created. In this case Kibana A is the Kibana that ends up creating the resource, and to Kibana B this becomes an error situation, since it thinks the resource does not exist and couldn't be created.

sequenceDiagram
    participant KibanaA as Kibana A
    participant ES as Elasticsearch
    participant KibanaB as Kibana B
    KibanaA->>+ES: does resource exist?
    ES->>-KibanaA: no
    KibanaB->>+ES: does resource exist?
    ES->>-KibanaB: no
    KibanaA->>+ES: create resource
    KibanaB->>+ES: create resource
    ES->>-KibanaB: error creating resource
    KibanaB->>+ES: does resource exist?
    ES->>-KibanaB: no
    note right of KibanaB: error condition!
    ES->>-KibanaA: resource created

With the retries in place, the "error condition!" above will cause the retry logic to kick in, so there will be a delay, and then Kibana B will start the process all over again for this resource. Presumably the resource will have been created by the time the retries run, and so Kibana B will end up seeing the resource as existing, and then it's done.

Note: whether or not this is what's actually happening in the cases we've seen kinda doesn't matter - it's something you can certainly imagine happening, so we should assume it can. We'll find out over time whether this "fixes" the existing cases, if we stop seeing the errors. If we continue to see them, then something else is broken.

Since the event log's resource initialization is designed to create resources as needed, retrying the resource creation should allow it to eventually succeed. At the very least this should lessen the number of times this occurs. We may need to play with the retry count and delays if it's still not quite enough.

@pmuellr pmuellr added release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:EventLog backport:prev-major Backport to (8.x, 8.18, 8.17, 8.16) the previous major branch and other branches in development labels Jul 14, 2022
@pmuellr pmuellr marked this pull request as ready for review July 14, 2022 12:07
@pmuellr pmuellr requested a review from a team as a code owner July 14, 2022 12:07
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr
Member Author

pmuellr commented Jul 14, 2022

@elasticmachine merge upstream

logger: Logger;
esNames: EsNames;
esAdapter: IClusterClientAdapter;
readonly logger: Logger;
Member Author

This is just some clean-up I noticed while editing the file. This structure is not used outside of the event log, but the fields should be read-only, as they are populated by a function call that creates and returns an object implementing this interface. Figured it would be helpful to make sure we don't accidentally update these fields in our code ...

Same with initialized a few lines below, and the new retryDelay uses the same pattern.
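A sketch of the pattern (placeholder types; the real interface name and full field list may differ):

// Placeholder types standing in for the real Kibana ones.
type Logger = { warn(message: string): void };
type EsNames = Record<string, string>;
type IClusterClientAdapter = unknown;

// Fields are readonly because they're only set when the context object is
// created by the factory function; nothing should reassign them afterwards.
interface EsContextFields {
  readonly logger: Logger;
  readonly esNames: EsNames;
  readonly esAdapter: IClusterClientAdapter;
  readonly initialized: boolean; // per the comment above, also readonly
  readonly retryDelay: number; // new field, same pattern
}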

@@ -12,6 +12,8 @@ import { namesMock } from './names.mock';
import { IClusterClientAdapter } from './cluster_client_adapter';
import { clusterClientAdapterMock } from './cluster_client_adapter.mock';

export const MOCK_RETRY_DELAY = 20;
Member Author

The mock uses a retry delay of 20ms, so we can run the retries without the special jest clock stuff and have the tests finish quickly; the REAL context object uses 2000ms.

I did take a stab at using the special jest clock stuff, but it looks like it's going to be difficult to get working, as we'd need to "run all the timers" at points where we currently have no control.
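Roughly, the idea is something like this (illustrative only, not the actual mock file):

// Sketch: the mocked context uses a tiny retry delay so tests that exercise
// the retries finish quickly; the real context uses a 2000ms base delay.
export const MOCK_RETRY_DELAY = 20;

export function createContextMock() {
  return {
    retryDelay: MOCK_RETRY_DELAY, // real code: 2000
    // ...other mocked fields (logger, esNames, esAdapter, etc.)
  };
}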

@pmuellr
Member Author

pmuellr commented Jul 15, 2022

buildkite test this

@ymao1 ymao1 (Contributor) left a comment

I tried verifying this by forcing an error in doesIlmPolicyExist. I did this before starting ES/Kibana locally. Noticed that I see the function getting called and the error getting thrown but I did not see any retries for this function. Then I made a change to add another console log and Kibana restarted, and then I did see the function getting called, the error getting thrown and the retries occur. Not sure why it wouldn't have retried on the first startup?

@mikecote mikecote (Contributor) left a comment

Changes LGTM, great catch!

@mikecote
Contributor

mikecote commented Jul 18, 2022

Noticed that I see the function getting called and the error getting thrown but I did not see any retries for this function.

@ymao1

I added the following code within createIlmPolicyIfNotExists (including a variable at the top of the file)

// at the top of the file:
let errorThrown = false;

  async createIlmPolicyIfNotExists(): Promise<void> {
    if (!errorThrown) {
      errorThrown = true;
      throw new Error('error!');
    }
    // ... original method body continues

I saw the warning log and nothing after, but the event log assets were created successfully on my run. Could it be that the retries are logged as warnings but nothing is logged on a successful retry, making us think it failed?

@ymao1
Contributor

ymao1 commented Jul 19, 2022

@mikecote
Ah! You're right. User error 🙈

@ymao1 ymao1 (Contributor) left a comment

LGTM!

@ymao1
Contributor

ymao1 commented Jul 19, 2022

@elasticmachine merge upstream

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@ymao1 ymao1 merged commit f6e4c2f into elastic:main Jul 19, 2022
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Jul 19, 2022
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Jul 19, 2022
@kibanamachine
Contributor

💚 All backports created successfully

Status Branch Result
7.17
8.3

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Jul 19, 2022
kibanamachine added a commit that referenced this pull request Jul 19, 2022
@pmuellr
Member Author

pmuellr commented Jul 28, 2022

I tried verifying this by forcing an error in doesIlmPolicyExist. I did this before starting ES/Kibana locally. Noticed that I see the function getting called and the error getting thrown but I did not see any retries for this function. Then I made a change to add another console log and Kibana restarted, and then I did see the function getting called, the error getting thrown and the retries occur. Not sure why it wouldn't have retried on the first startup?

I always run node scripts/build_kibana_platform_plugins after yarn kbn bootstrap, because at one point the plugin build would keep Kibana from starting, or maybe it would restart after it had started building the plugins, or something. So I'm wondering if it's possible that first run was not using the code with your explicit throw in it until after that build was done, which maybe was before your test? Seems unlikely, but it's the only thing I can guess. Or random weird node issues.

@pmuellr
Member Author

pmuellr commented Jul 28, 2022

I saw the warning log and nothing after, but the event log assets were created successfully on my run. Could it be the retries are logged as warnings and don't follow up on a successful retry, making us think it failed?

I had a thought that maybe we should add a "success" message after at least one retry, but I think I ended up deciding this would just be more noise. Now I'm not sure :-). Seems like it wouldn't be that noisy, and would probably save someone 10 minutes diagnosing a problem here. Should we open a new issue?
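Something along these lines, purely as a sketch of the suggestion (names are illustrative, not code from this PR):

// Sketch only: a manual retry loop that logs a success message when a step
// recovers after one or more failed attempts.
async function runWithRetries(
  step: () => Promise<void>,
  logger: { warn(message: string): void; info(message: string): void },
  retries = 4,
  baseDelayMs = 2000
): Promise<void> {
  for (let attempt = 1; ; attempt++) {
    try {
      await step();
      if (attempt > 1) {
        logger.info(`initialization step succeeded after ${attempt} attempts`);
      }
      return;
    } catch (err) {
      if (attempt > retries) throw err;
      logger.warn(`initialization step failed, retrying: ${err}`);
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}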

Labels
backport:prev-major Backport to (8.x, 8.18, 8.17, 8.16) the previous major branch and other branches in development ci:cloud-deploy Create or update a Cloud deployment Feature:EventLog release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.17.6 v8.3.3 v8.4.0
Development

Successfully merging this pull request may close these issues.

[ResponseOps]: error creating event log index template at startup
8 participants