Opentelemetry Operator auto instrumentation inject failed during leader election #1797

Open
max-f-z opened this issue Jun 5, 2023 · 3 comments
Labels
area:auto-instrumentation Issues for auto-instrumentation

max-f-z commented Jun 5, 2023

Hi everyone,

Thanks for the effort put into this project. We've been using the O11y Operator in production for the past year and a half and have quite enjoyed it. Big kudos to the team.

Recently we ran into the problem described below, and we hope to get some help from the community.

In our GKE cluster, the O11y Operator is installed with 2 replicas:

opentelemetry-operator-controller-manager-6b48cbb96d-w9pbv on node gke-platform-prod-node-1
opentelemetry-operator-controller-manager-6b48cbb96d-2n5wv on node gke-platform-prod-node-2

During a GKE maintenance window (node pool upgrade), we observed the following events in the Kubernetes event log:

05/06/2023 03:25:39.000	Created pod: opentelemetry-operator-controller-manager-6b48cbb96d-slx9j 
05/06/2023 03:25:39.000	Successfully assigned infra/opentelemetry-operator-controller-manager-6b48cbb96d-slx9j to gke-platform-prod-node-3
05/06/2023 03:28:39.000	opentelemetry-operator-controller-manager-6b48cbb96d-2n5wv_bbdb544e-495c-469f-ba3e-d5fbe79d322d became leader
05/06/2023 04:20:17.000	Created pod: opentelemetry-operator-controller-manager-6b48cbb96d-spxvp
05/06/2023 04:20:17.000	Successfully assigned infra/opentelemetry-operator-controller-manager-6b48cbb96d-spxvp to gke-platform-prod-node-4
05/06/2023 04:22:59.000	opentelemetry-operator-controller-manager-6b48cbb96d-slx9j_965e3e33-6261-4752-9af7-b68a7244e821 became leader

It looks like gke-platform-prod-node-1 was drained at 03:25:39.000.
opentelemetry-operator-controller-manager-6b48cbb96d-w9pbv was terminated when node 1 was deleted, and a new pod, opentelemetry-operator-controller-manager-6b48cbb96d-slx9j, was scheduled on a new node to satisfy our 2-replica request.

Pod opentelemetry-operator-controller-manager-6b48cbb96d-2n5wv became the leader at 03:28:39.000.

Approximately one hour later, gke-platform-prod-node-2 was drained for the upgrade at 04:20:17.000.
opentelemetry-operator-controller-manager-6b48cbb96d-2n5wv was terminated when node 2 was deleted, and a new pod, opentelemetry-operator-controller-manager-6b48cbb96d-spxvp, was scheduled on a new node to satisfy our 2-replica request.

Pod opentelemetry-operator-controller-manager-6b48cbb96d-slx9j became the leader at 04:22:59.000.

During the two leadership transitions (03:25:39.000 to 03:28:39.000 and 04:20:17.000 to 04:22:59.000), none of the pods that should have been mutated to add the instrumentation were mutated. These pods started without instrumentation configured.
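
As a quick check on the two windows above, the gap lengths can be computed from the timestamps. This is only a minimal sketch: it assumes the event-log timestamps are DD/MM/YYYY (consistent with the issue date of Jun 5, 2023) and, like the text, takes the creation time of the replacement pod as the start of each window.

```python
from datetime import datetime

# Assumption: the event log uses DD/MM/YYYY, matching the issue date (Jun 5, 2023).
FMT = "%d/%m/%Y %H:%M:%S.%f"

# (replacement pod created, new leader elected), copied from the event log.
windows = [
    ("05/06/2023 03:25:39.000", "05/06/2023 03:28:39.000"),
    ("05/06/2023 04:20:17.000", "05/06/2023 04:22:59.000"),
]


def gap_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two event-log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


gaps = [gap_seconds(start, end) for start, end in windows]
for (start, end), gap in zip(windows, gaps):
    print(f"{start} -> {end}: {gap:.0f} s")
```

Under these assumptions the two windows come out at 180 s and 162 s.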

So our questions are:

  1. Is running 2 replicas of the O11y Operator a correct setup for leader election? If not, what replica count do you suggest?
  2. Leader election is taking longer than we expected. How can we shorten the gap during a leadership transition?

I also browsed the open issues and found #1073, which might be relevant. It would be nice if someone could confirm.

Any help on this is appreciated. Thanks in advance.

@pavolloffay
Member

This is also related to #742

Would changing some of these parameters help you to resolve the issue?

@max-f-z
Author

max-f-z commented Jun 5, 2023

@pavolloffay thanks for shedding light on this. I appreciate the prompt response.

This does look like the right path. As far as I can see, making the lease time configurable is not supported yet.

I'll talk to the team and probably run some internal tests with customized images for now.

Thanks again for the quick response. I will update here if there is any progress.
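
For tests like these, a rough model of how the lease settings bound the election gap may help when picking values to try. This is a sketch under assumptions, not the operator's code. The field names follow the LeaseDuration, RenewDeadline, and RetryPeriod settings of client-go leader election, and the checks in `validate` are the constraints client-go is understood to enforce. The 15 s / 10 s / 2 s values are illustrative. The bound ignores jitter, clock skew, and the time a replacement pod needs to be scheduled and start.

```python
from dataclasses import dataclass

# Assumption: mirrors the jitter factor client-go applies to the retry period.
JITTER_FACTOR = 1.2


@dataclass(frozen=True)
class LeaseSettings:
    """Hypothetical container for leader-election timings, in seconds."""

    lease_duration: float
    renew_deadline: float
    retry_period: float

    def validate(self) -> None:
        # Constraints assumed to match what client-go checks at startup.
        if self.lease_duration <= self.renew_deadline:
            raise ValueError("lease_duration must be greater than renew_deadline")
        if self.renew_deadline <= self.retry_period * JITTER_FACTOR:
            raise ValueError(
                "renew_deadline must be greater than retry_period * JITTER_FACTOR"
            )

    def worst_case_takeover(self) -> float:
        # Simplified bound: a standby waits for the lease to expire, then
        # acquires it on its next retry. Applies when the old leader goes
        # away without releasing the lease.
        return self.lease_duration + self.retry_period


settings = LeaseSettings(lease_duration=15.0, renew_deadline=10.0, retry_period=2.0)
settings.validate()
print(f"approximate worst-case takeover: {settings.worst_case_takeover():.0f} s")
```

Shorter values reduce the bound but make the leader renew more often, so a slow API server is more likely to cause a spurious leadership change.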

pavolloffay added the area:auto-instrumentation label and removed the area:controller label on Jan 30, 2024
@jaronoff97
Contributor

I think this is a flavor of #1329
