[eventLog] retry resource creation at initialization time #136363
Conversation
Pinging @elastic/response-ops (Team:ResponseOps)
@elasticmachine merge upstream
```ts
logger: Logger;
esNames: EsNames;
esAdapter: IClusterClientAdapter;
readonly logger: Logger;
```
This is just some clean-up I noticed while editing the file. This structure is not used outside of the event log, but the fields should be read-only, since they are populated by a function call that creates an object implementing this interface. Figured it would be helpful to make sure we don't accidentally update these fields in our code. Same with initialized a few lines below, and the new retryDelay uses the same pattern.
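For illustration, here's roughly the shape being discussed; the interface name and import paths below are guesses, and only the field names come from the diff and the comment above:

```ts
// Sketch only: the real interface lives in the event log plugin; the name and
// import paths here are illustrative, not copied from the source.
import type { Logger } from '@kbn/logging';
import type { EsNames } from './names';
import type { IClusterClientAdapter } from './cluster_client_adapter';

export interface EsContext {
  readonly logger: Logger;
  readonly esNames: EsNames;
  readonly esAdapter: IClusterClientAdapter;
  readonly initialized: boolean; // exposed read-only so callers can't flip it
  readonly retryDelay: number; // base delay (ms) between initialization retries
}
```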
```diff
@@ -12,6 +12,8 @@ import { namesMock } from './names.mock';
 import { IClusterClientAdapter } from './cluster_client_adapter';
 import { clusterClientAdapterMock } from './cluster_client_adapter.mock';

+export const MOCK_RETRY_DELAY = 20;
```
The mock uses a retry delay of 20ms, so we can run the retries against the real clock and have the tests finish in a short amount of time, but the REAL context object uses 2000ms.
I did take a stab at using jest's fake timers, but it looks like it would be difficult to get working, as we'd need to "run all the timers" at points where we currently have no control.
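A sketch of how a test can lean on the 20ms mock delay instead; the mock factory and the initialize() call below are guesses about the test API, not the actual code:

```ts
// Hypothetical test: with MOCK_RETRY_DELAY = 20 the retry waits run against the
// real clock and finish in tens of milliseconds, so no fake-timer choreography
// (jest.useFakeTimers / runAllTimers) is needed mid-initialization.
import { contextMock, MOCK_RETRY_DELAY } from './context.mock';

test('initialization retries after a transient failure', async () => {
  const context = contextMock.create(); // assumed factory; uses MOCK_RETRY_DELAY internally
  context.esAdapter.doesIlmPolicyExist
    .mockRejectedValueOnce(new Error('transient failure'))
    .mockResolvedValue(true);

  await context.initialize(); // assumed entry point for resource creation

  expect(context.esAdapter.doesIlmPolicyExist).toHaveBeenCalledTimes(2);
});
```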
buildkite test this
I tried verifying this by forcing an error in doesIlmPolicyExist. I did this before starting ES/Kibana locally. I noticed that the function got called and the error got thrown, but I did not see any retries for this function. Then I made a change to add another console log and Kibana restarted, and this time I did see the function getting called, the error getting thrown, and the retries occurring. Not sure why it wouldn't have retried on the first startup?
Changes LGTM, great catch!
I added the following code within
I saw the warning log and nothing after, but the event log assets were created successfully on my run. Could it be that the retries are logged as warnings and nothing is logged on a successful retry, making us think it failed?
@mikecote
LGTM!
@elasticmachine merge upstream
💚 Build Succeeded
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions? Please refer to the Backport tool documentation.
Backport PRs #136647 and #136646 were merged with the same description (cherry picked from commit f6e4c2f).
I always run
I had a thought that maybe we should add a "success" message after at least one retry, but I ended up thinking this would just be more noise. Now I'm not sure :-). It seems like it wouldn't be that noisy, and would probably save someone 10 minutes diagnosing a problem here. Should we open a new issue?
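For what it's worth, the follow-up wouldn't be much code. A minimal sketch, assuming the retry warnings come from p-retry's onFailedAttempt hook (the helper name and message text below are made up):

```ts
import pRetry from 'p-retry';
import type { Logger } from '@kbn/logging';

// Count failed attempts inside the retry wrapper and, once the operation finally
// succeeds, emit an info-level "recovered" message so an earlier warning doesn't
// read like a permanent failure.
async function initStepWithRecoveryLog<T>(
  logger: Logger,
  step: string,
  operation: () => Promise<T>
): Promise<T> {
  let failedAttempts = 0;
  const result = await pRetry(operation, {
    retries: 4,
    onFailedAttempt: (error) => {
      failedAttempts += 1;
      logger.warn(`eventLog initialization step "${step}" failed and will be retried: ${error.message}`);
    },
  });
  if (failedAttempts > 0) {
    logger.info(`eventLog initialization step "${step}" succeeded after ${failedAttempts} failed attempt(s)`);
  }
  return result;
}
```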
Summary
resolves #134098
Adds retry logic to the initialization of elasticsearch resources when Kibana starts up. Recently, it seems this has become a more noticeable error: race conditions occur where two Kibanas initializing a new stack version race to create the event log resources.
We believe we'll see the end of these issues with some retries, chunked around the 4 resource-y sections of the initialization code.
We're using [p-retry](https://github.com/sindresorhus/p-retry) (which uses [retry](https://github.com/tim-kos/node-retry)) to do an exponential backoff starting at 2s, then 4s, 8s, 16s, with 4 retries (so 5 actual attempted calls). Some randomness is added, since there's a race on and we don't want all the other Kibanas erroring to retry at the exact same time.
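A minimal sketch of that configuration; the wrapper name is illustrative, while the options are p-retry/retry's standard ones:

```ts
import pRetry from 'p-retry';
import type { Logger } from '@kbn/logging';

// node-retry computes each delay as roughly minTimeout * factor ** attempt, so
// 2000ms with factor 2 gives ~2s, 4s, 8s, 16s; randomize: true multiplies each
// delay by a random factor so racing Kibanas don't all retry in lock-step.
export async function initWithRetry<T>(
  logger: Logger,
  description: string,
  operation: () => Promise<T>
): Promise<T> {
  return await pRetry(operation, {
    retries: 4, // 4 retries => up to 5 attempts in total
    minTimeout: 2000,
    factor: 2,
    randomize: true,
    onFailedAttempt: (error) => {
      logger.warn(
        `eventLog initialization operation failed and will be retried: ${description}; ${error.retriesLeft} retries left; error: ${error.message}`
      );
    },
  });
}
```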
Here's what I think is happening. The logic to create resources is basically the following, trying to take into account that multiple Kibanas may be starting at the same time:
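A sketch of that per-resource flow (the function names are illustrative, not the actual source):

```ts
// For each resource (ILM policy, index template, and so on): check whether it
// exists, try to create it if not, and if creation fails, check once more in
// case another Kibana created it in the meantime.
async function ensureResourceExists(
  exists: () => Promise<boolean>,
  create: () => Promise<void>
): Promise<void> {
  if (await exists()) return; // already there, nothing to do

  try {
    await create(); // try to create it ourselves
  } catch (createError) {
    // creation failed; another Kibana may have just created it (or may still
    // be in the middle of creating it), so check again
    if (await exists()) return;
    // error condition! it doesn't exist and we couldn't create it
    throw createError;
  }
}
```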
And here's a flow that would cause the error we're seeing. If, after getting an error trying to create the resource, we ask whether it exists yet and it does NOT, it may be that it just hasn't finished being created. In this case, Kibana A is the Kibana that ends up creating the resource, and for Kibana B this becomes an error situation, since it thinks the resource does not exist and couldn't be created.
With the retries in place, the "error condition!" above will cause the retry logic to kick in, so there will be a delay, then Kibana B will start the process all over again for this resource. Presumably the resource will have been created after the retry delays, so Kibana B will end up seeing the resource as existing, and then it's done.
Note: whether or not this is what's actually happening in the cases we've seen kinda doesn't matter - it's something you can certainly imagine happening, and so we should assume it can. Over time we'll find out whether this "fixes" the existing cases, if we stop seeing the errors. If we continue to see them, then something else is broken.
Since the event log resource initialization is designed to create resources as needed, retrying the resource creation should allow it to eventually succeed. It should at least lessen the number of times this error occurs; we may need to tune the retry count and delays if it's still not quite enough.