High availability while rolling out pods doesn't seem to work #10141
Comments
Created some more reproducible scenarios (I used the operator just because it was easier to integrate the tests). Logs in the good scenario, with the custom image and sleep: ✅
When using default image without sleep:
In both cases logs from the collector are fine:
It is likely that this is happening because our healthcheck extension needs to be improved: open-telemetry/opentelemetry-collector-contrib#26661
@TylerHelmuth should I try with v2, or do you mean that even with v2 it still needs improvements?
I mean the current version of the healthcheck extension.
Got it, thanks for the details. I can try to test with it later and add the results here. So far the only workaround would be a custom image with `sleep`.
I gave healthcheckv2 a try (open-telemetry/opentelemetry-operator@ca88a7e), but no luck :-/ I know it is in progress, but I am still not sure whether the issue would be in the healthcheck 🤔
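For context, here is a minimal sketch of how the current (v1) health_check extension is typically enabled in a collector config; the OTLP receiver, debug exporter, and endpoint value are illustrative assumptions, not the configuration used in this issue:

```yaml
# Minimal collector config sketch with the health_check extension enabled.
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # default health check port

receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  debug:

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```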
Describe the bug
We noticed that while we roll out new changes on our collector, the applications report some warnings/errors about connection refused.
Our collector is managed by the Operator, and we already noticed that it doesn't have a readinessProbe; we will share a fix for that in open-telemetry/opentelemetry-operator#2943.
But even with the readiness probe, while simulating a rollout under many requests, some of them get dropped.
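For illustration, a hedged sketch of what such a readinessProbe could look like against the collector's health_check extension (default port 13133, path "/"); the container name, image tag, and timings are assumptions, not what the operator PR actually ships:

```yaml
# Pod spec fragment: readiness probe pointed at the health_check extension.
containers:
  - name: otc-container                                  # assumed container name
    image: otel/opentelemetry-collector-contrib:0.98.0   # version from this issue
    readinessProbe:
      httpGet:
        path: /          # health_check extension default path
        port: 13133      # health_check extension default port
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 3
```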
Steps to reproduce
Our scenario to reproduce it:
We use siege to make lots of requests to the collector while we roll it out.
While siege is making requests, we start a rollout in parallel.
Pods get replaced, but after siege concludes we see that some requests were dropped.
To fix that we have added a preStop lifecycle hook (sketched below). With a `lifecycle` of a simple `sleep 10s`, the results are much better, with 100% availability. 🕺 🙌 But I didn't want to use a custom image just to have the `sleep` command, and I wonder if something on the otel-collector side could be done to make it work as expected. It seems that during the graceful shutdown something goes wrong and requests are not answered.
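For reference, a minimal sketch of the kind of preStop hook described above; the shell invocation and the 10-second value come from the description, but the exact form is an assumption rather than the manifest used in the tests. The default collector image doesn't ship a shell or `sleep`, which is why a custom image was needed.

```yaml
# Sketch of the preStop workaround (assumes a custom image that
# provides /bin/sh and sleep).
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 10"]
```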
What did you expect to see?
While rolling out Pods with replicas >= 2, no requests should be lost.
What did you see instead?
Requests are lost when rolling out new collector pods, and it is not possible to work around it without the `sleep` command.
What version did you use?
0.98.0
What config did you use?
Environment
EKS 1.26 and kind 1.29
Additional context