Opentelemetry Operator auto instrumentation inject failed during leader election #1797

max-f-z · 2023-06-05T09:14:34Z

Hi everyone,

Thanks for the effort put into this project. We've been using the O11y Operator in production for the past year and a half and have quite enjoyed it. Big kudos to the team.

Recently we ran into the problem described below, and we're hoping to get some help from the community.

In our GKE cluster, we have the O11y Operator installed with 2 replicas:

opentelemetry-operator-controller-manager-6b48cbb96d-w9pbv on node gke-platform-prod-node-1
opentelemetry-operator-controller-manager-6b48cbb96d-2n5wv on node gke-platform-prod-node-2

During a GKE maintenance window (node pool upgrade), we observed the following events in the Kubernetes event log:

05/06/2023 03:25:39.000	Created pod: opentelemetry-operator-controller-manager-6b48cbb96d-slx9j 
05/06/2023 03:25:39.000	Successfully assigned infra/opentelemetry-operator-controller-manager-6b48cbb96d-slx9j to gke-platform-prod-node-3
05/06/2023 03:28:39.000	opentelemetry-operator-controller-manager-6b48cbb96d-2n5wv_bbdb544e-495c-469f-ba3e-d5fbe79d322d became leader
05/06/2023 04:20:17.000	Created pod: opentelemetry-operator-controller-manager-6b48cbb96d-spxvp
05/06/2023 04:20:17.000	Successfully assigned infra/opentelemetry-operator-controller-manager-6b48cbb96d-spxvp to gke-platform-prod-node-4
05/06/2023 04:22:59.000	opentelemetry-operator-controller-manager-6b48cbb96d-slx9j_965e3e33-6261-4752-9af7-b68a7244e821 became leader

It looks like gke-platform-prod-node-1 was drained at 03:25:39.000.
Pod opentelemetry-operator-controller-manager-6b48cbb96d-w9pbv was terminated when node 1 was deleted, and a new pod, opentelemetry-operator-controller-manager-6b48cbb96d-slx9j, was scheduled on a new node to satisfy our 2-replica setting.

Pod opentelemetry-operator-controller-manager-6b48cbb96d-2n5wv became the leader at 03:28:39.000.

Approximately one hour later, gke-platform-prod-node-2 was drained for the upgrade at 04:20:17.000.
Pod opentelemetry-operator-controller-manager-6b48cbb96d-2n5wv was terminated when node 2 was deleted, and a new pod, opentelemetry-operator-controller-manager-6b48cbb96d-spxvp, was scheduled on a new node to satisfy our 2-replica setting.

Pod opentelemetry-operator-controller-manager-6b48cbb96d-slx9j became the leader at 04:22:59.000.

During the two leadership transitions (03:25:39.000 to 03:28:39.000 and 04:20:17.000 to 04:22:59.000), we found that every pod that should have been mutated to add the instrumentation was not. These pods started without instrumentation configured.
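To put numbers on those two windows, here is a small Python sketch (not part of the operator; the helper name `gap_seconds` is made up for illustration) that computes the time between each replacement pod being created and the next "became leader" event, using only the timestamps from the event log above.

```python
from datetime import datetime

# Timestamp layout used in the event log above (day/month/year).
FMT = "%d/%m/%Y %H:%M:%S.%f"


def gap_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two event-log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# (replacement pod created, next "became leader" event)
windows = [
    ("05/06/2023 03:25:39.000", "05/06/2023 03:28:39.000"),
    ("05/06/2023 04:20:17.000", "05/06/2023 04:22:59.000"),
]

gaps = [gap_seconds(start, end) for start, end in windows]

for (start, end), gap in zip(windows, gaps):
    print(f"{start} -> {end}: {gap:.0f}s")
```

That works out to 180 seconds for the first window and 162 seconds for the second, so each gap was roughly three minutes.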

So our questions are:

Is running 2 replicas of the O11y Operator a correct setup for leader election? If not, what is the suggested replica count?
Leader election is taking longer than expected. How can we shorten the gap during a leadership change?

We also browsed the open issues and found #1073, which might be relevant. It would be nice if someone could confirm.

Any help on this is appreciated. Thanks in advance.

pavolloffay · 2023-06-05T10:06:29Z

This is also related to #742

Would changing some of these parameters help you to resolve the issue?

max-f-z · 2023-06-05T11:19:06Z

@pavolloffay thanks for shedding light on this. I appreciate the prompt response.

This sure looks like the right path. I see that the feature for making the lease time configurable is not available yet.

I'll talk to the team and probably do some internal tests with customized images for now.

Thanks again for the quick response. I will update here if there is any progress.

jaronoff97 · 2024-06-03T16:31:20Z

I think this is a flavor of #1329

pavolloffay added the area:controller label Jun 5, 2023

pavolloffay added the area:auto-instrumentation label and removed the area:controller label Jan 30, 2024
