Hi everyone,

Thanks for putting effort into this project. We've been using the OpenTelemetry Operator in production for the past year and a half and have quite enjoyed it, big kudos to the team!

Recently we encountered a problem, and hopefully we can get some help from the community.
In our GKE cluster, we have the operator installed with 2 replicas:

```
opentelemetry-operator-controller-manager-6b48cbb96d-w9pbv  on node gke-platform-prod-node-1
opentelemetry-operator-controller-manager-6b48cbb96d-2n5wv  on node gke-platform-prod-node-2
```
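For reference, a sketch of how this setup can be reproduced with plain kubectl (the deployment name is inferred from the pod names above; adjust if your install names it differently):

```sh
# Scale the operator deployment to 2 replicas (namespace taken from the
# events below) and confirm which nodes the pods land on.
kubectl -n infra scale deployment opentelemetry-operator-controller-manager --replicas=2
kubectl -n infra get pods -o wide
```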
During a GKE maintenance window (node pool upgrade), we observed the following events in the Kubernetes event log:
```
05/06/2023 03:25:39.000 Created pod: opentelemetry-operator-controller-manager-6b48cbb96d-slx9j
05/06/2023 03:25:39.000 Successfully assigned infra/opentelemetry-operator-controller-manager-6b48cbb96d-slx9j to gke-platform-prod-node-3
05/06/2023 03:28:39.000 opentelemetry-operator-controller-manager-6b48cbb96d-2n5wv_bbdb544e-495c-469f-ba3e-d5fbe79d322d became leader
05/06/2023 04:20:17.000 Created pod: opentelemetry-operator-controller-manager-6b48cbb96d-spxvp
05/06/2023 04:20:17.000 Successfully assigned infra/opentelemetry-operator-controller-manager-6b48cbb96d-spxvp to gke-platform-prod-node-4
05/06/2023 04:22:59.000 opentelemetry-operator-controller-manager-6b48cbb96d-slx9j_965e3e33-6261-4752-9af7-b68a7244e821 became leader
```
It looks like gke-platform-prod-node-1 was drained at 03:25:39.000: opentelemetry-operator-controller-manager-6b48cbb96d-w9pbv was terminated with the deletion of node 1, and the new pod opentelemetry-operator-controller-manager-6b48cbb96d-slx9j was scheduled on a new node to fulfill our 2-replica request. Pod opentelemetry-operator-controller-manager-6b48cbb96d-2n5wv became the leader at 03:28:39.000.
After approximately one hour, gke-platform-prod-node-2 was drained for the upgrade at 04:20:17.000: opentelemetry-operator-controller-manager-6b48cbb96d-2n5wv was terminated with the deletion of node 2, and the new pod opentelemetry-operator-controller-manager-6b48cbb96d-spxvp was scheduled on a new node to fulfill our 2-replica request. Pod opentelemetry-operator-controller-manager-6b48cbb96d-slx9j became the leader at 04:22:59.000.
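For anyone reproducing this, the current leader can be read from the leader-election lock in the operator's namespace. A sketch (the lock is a coordination.k8s.io Lease in recent controller-runtime versions, though older versions may use a ConfigMap; the lease name below is a placeholder, so list the leases first):

```sh
# List coordination leases in the operator's namespace, then read the
# current holder identity from the operator's lock.
kubectl -n infra get leases
kubectl -n infra get lease <operator-lease-name> -o jsonpath='{.spec.holderIdentity}'
```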
During the leadership transitions, from 03:25:39.000 to 03:28:39.000 and from 04:20:17.000 to 04:22:59.000, we found that mutation failed for every pod that should have had instrumentation injected: these pods started without the instrumentation configured.
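Our working theory (an assumption on our side) is that while no webhook backend was reachable during the drain, the API server simply admitted pods unmutated, which matches Kubernetes' documented behavior when a webhook's failurePolicy is Ignore. A sketch of how to check this (the configuration name is a guess; list the configurations first if it differs):

```sh
# Find the operator's mutating webhook configuration.
kubectl get mutatingwebhookconfigurations
# Print each webhook's name and failurePolicy; with "Ignore", admission
# proceeds without mutation whenever the webhook cannot be reached.
kubectl get mutatingwebhookconfiguration opentelemetry-operator-mutating-webhook-configuration \
  -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}'
```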
So our questions are:

1. Is running 2 operator replicas a correct setup for leader election? If not, what is the suggested replica count?
2. Leader election is taking longer than expected; how should we shorten the gap during a leadership shift? (See the sketch after this list for the settings we believe are involved.)
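From reading the controller-runtime docs, this is a minimal sketch of the leader-election settings we believe govern that gap. It is illustrative only, not the operator's actual wiring; the LeaderElectionID is hypothetical, and whether the operator exposes these settings as flags is part of what we're asking:

```go
// Minimal sketch of controller-runtime's leader-election knobs
// (values shown are the library defaults).
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// A standby can only take over once the existing lease expires, i.e.
	// up to LeaseDuration after the last renewal; the acting leader retries
	// renewal every RetryPeriod and gives up after RenewDeadline.
	leaseDuration := 15 * time.Second
	renewDeadline := 10 * time.Second
	retryPeriod := 2 * time.Second

	_, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "opentelemetry-operator-lock", // hypothetical ID
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
		RetryPeriod:      &retryPeriod,
		// Release the lease on graceful shutdown so a standby replica can
		// acquire leadership immediately instead of waiting out LeaseDuration.
		LeaderElectionReleaseOnCancel: true,
	})
	if err != nil {
		panic(err)
	}
}
```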
We also browsed the open issues and found that #1073 might be relevant; it would be nice if someone could help confirm.
Any help on this is appreciated. Thanks in advance.