
[🐛 Bug]: Chart 0.36.0 Distributor keeps restarting #2407

Closed · piotrlaczykowski opened this issue Sep 23, 2024 · 14 comments · Fixed by #2408
@piotrlaczykowski

What happened?

After updating to chart 0.36.0 and selenium/distributor:4.25.0-20240922, the pod keeps restarting:

(combined from similar events): Liveness probe failed: 13:49:47.762 DEBUG [Probe.Liveness] - Session Queue Size: 49, Session Count: 0, Max Session: 0 13:49:47.763 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.

Command used to start Selenium Grid with Docker (or Kubernetes)

.

Relevant log output

.

Operating System

Kubernetes

Docker Selenium version (image tag)

4.25.0-20240922

Selenium Grid chart version (chart version)

0.36.0


@piotrlaczykowski, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then the I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@VietND96
Member

I believe this will be fixed by #2408.
Root cause: basic auth was removed from the GraphQL endpoint, which is not supported in the current stable KEDA core, so the scaler could not scale Nodes due to 401 responses.
Once chart 0.36.1 is out, setting basicAuth.embeddedUrl: true will fix this.
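As a sketch, once 0.36.1 is available the upgrade could look like the following. The release name `selenium-grid`, the namespace `cicd` (taken from the logs later in this thread), and the repo alias are assumptions, not from the thread:

```shell
# Hypothetical upgrade to chart 0.36.1 with embedded basic auth in the
# scaler URL; release name, namespace, and repo alias are assumptions.
helm repo update
helm upgrade selenium-grid docker-selenium/selenium-grid \
  --version 0.36.1 \
  --namespace cicd \
  --set basicAuth.embeddedUrl=true
```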

@piotrlaczykowski
Author

> I believe this will be fixed by #2408. Root cause: basic auth was removed from the GraphQL endpoint, which is not supported in the current stable KEDA core, so the scaler could not scale Nodes due to 401 responses. Once chart 0.36.1 is out, setting basicAuth.embeddedUrl: true will fix this.

It didn't help for us.

@VietND96
Member

Can you share kubectl logs for the keda-operator pod? Are there any errors visible there?
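One way to pull those logs, assuming KEDA is installed in the default `keda` namespace (an assumption, adjust if installed elsewhere):

```shell
# Fetch recent keda-operator logs and surface errors, including the
# 401 responses mentioned as the suspected root cause.
kubectl logs deployment/keda-operator -n keda --tail=200 | grep -iE 'error|401'
```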

@VietND96
Member

Also, did you enable SE_REJECT_UNSUPPORTED_CAPS in hub/router? When autoscaling with min replicas = 0, this should not be enabled.
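A quick way to verify whether that variable is set on the router, as a sketch; the deployment name `selenium-router` and namespace `cicd` are assumptions and depend on your release name:

```shell
# Print the router container's env vars and look for the flag;
# deployment name and namespace are assumptions.
kubectl get deployment selenium-router -n cicd \
  -o jsonpath='{.spec.template.spec.containers[0].env}' \
  | grep SE_REJECT_UNSUPPORTED_CAPS
```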

@piotrlaczykowski
Author

keda-operator.log

> Also, did you enable SE_REJECT_UNSUPPORTED_CAPS in hub/router? When autoscaling with min replicas = 0, this should not be enabled.

I have minimum replicas set to 0 and I don't have SE_REJECT_UNSUPPORTED_CAPS set.

So what should I do?

@VietND96
Member

I saw the Edge node scaled and running properly, but the Chrome node stayed at 0 for a long time in the logs:

2024-09-23T09:46:22Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-edge-node", "scaledJob.Namespace": "cicd", "Number of running Jobs": 24}
2024-09-23T09:46:22Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-edge-node", "scaledJob.Namespace": "cicd", "Number of pending Jobs ": 1}
2024-09-23T09:46:22Z INFO scaleexecutor Creating jobs {"scaledJob.Name": "selenium-edge-node", "scaledJob.Namespace": "cicd", "Effective number of max jobs": 0}
2024-09-23T09:46:22Z INFO scaleexecutor Creating jobs {"scaledJob.Name": "selenium-edge-node", "scaledJob.Namespace": "cicd", "Number of jobs": 0}
2024-09-23T09:46:22Z INFO scaleexecutor Created jobs {"scaledJob.Name": "selenium-edge-node", "scaledJob.Namespace": "cicd", "Number of jobs": 0}
2024-09-23T09:46:23Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "cicd", "Number of running Jobs": 0}
2024-09-23T09:46:23Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "cicd", "Number of pending Jobs ": 0}

@piotrlaczykowski
Author

> I saw the Edge node scaled and running properly, but the Chrome node stayed at 0 for a long time in the logs:
>
> 2024-09-23T09:46:22Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-edge-node", "scaledJob.Namespace": "cicd", "Number of running Jobs": 24}
> 2024-09-23T09:46:22Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-edge-node", "scaledJob.Namespace": "cicd", "Number of pending Jobs ": 1}
> 2024-09-23T09:46:22Z INFO scaleexecutor Creating jobs {"scaledJob.Name": "selenium-edge-node", "scaledJob.Namespace": "cicd", "Effective number of max jobs": 0}
> 2024-09-23T09:46:22Z INFO scaleexecutor Creating jobs {"scaledJob.Name": "selenium-edge-node", "scaledJob.Namespace": "cicd", "Number of jobs": 0}
> 2024-09-23T09:46:22Z INFO scaleexecutor Created jobs {"scaledJob.Name": "selenium-edge-node", "scaledJob.Namespace": "cicd", "Number of jobs": 0}
> 2024-09-23T09:46:23Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "cicd", "Number of running Jobs": 0}
> 2024-09-23T09:46:23Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "cicd", "Number of pending Jobs ": 0}

That's because we don't use Chrome nodes. I should actually delete it.

@VietND96
Member

Ok, so did any tests pass in the run with "Number of running Jobs": 24?
If no test is able to run, can you disable the liveness probe in hub/router to see how long nodes take to register to the hub?
Also, if you deploy via a Helm command, can you do a dry run with helm template and attach the full resources YAML output?
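A minimal dry-run sketch: helm template renders every resource locally without touching the cluster. The release name, repo alias, and values file name are assumptions; the exact values key for disabling the liveness probe varies, so check the chart's values.yaml for it rather than guessing:

```shell
# Render all chart resources locally (no install) so they can be
# attached to the issue; chart reference and values file are assumptions.
helm template selenium-grid docker-selenium/selenium-grid \
  --version 0.36.0 \
  --namespace cicd \
  -f values.yaml > rendered-resources.yaml
```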

@piotrlaczykowski
Author

No tests can run.
Can you give me the commands I should type? It would be much easier with ArtifactHub regarding versioning and naming.
I rolled back to 0.35.2.
Also, we use Kubernetes 1.21 and KEDA 2.8.2 :)

@VietND96
Member

Ok, let me see if anyone else is facing the same issue. In CI, tests cover the K8s version range 1.25 to 1.31 and the latest stable KEDA, 2.15.1.
A major change in this version is that the scaler param url is now fetched from the TriggerAuthentication resource, which has only been available since KEDA core >= 2.9, as the docs mention: https://keda.sh/docs/2.9/scalers/selenium-grid-scaler/
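Given that the reporter is on KEDA 2.8.2, which predates that mechanism, a quick version check may be worthwhile. A sketch, assuming KEDA lives in the `keda` namespace and the chart release is in `cicd` (both assumptions):

```shell
# Show the installed KEDA operator image tag; the chart's
# URL-from-TriggerAuthentication mechanism requires KEDA core >= 2.9.
kubectl get deployment keda-operator -n keda \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# List the TriggerAuthentication resources the chart created.
kubectl get triggerauthentication -n cicd
```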

@VietND96
Member

> It would be much easier with ArtifactHub regarding versioning and naming

I think you can check https://artifacthub.io/packages/helm/selenium-grid/selenium-grid. We are not the owner; however, I saw it is up to date.

@piotrlaczykowski
Author

> Ok, let me see if anyone else is facing the same issue. In CI, tests cover the K8s version range 1.25 to 1.31 and the latest stable KEDA, 2.15.1. A major change in this version is that the scaler param url is now fetched from the TriggerAuthentication resource, which has only been available since KEDA core >= 2.9, as the docs mention: https://keda.sh/docs/2.9/scalers/selenium-grid-scaler/

Yep, it still doesn't work :/

@VietND96
Member

What if you keep using chart 0.35.2 but replace the image tag with the new 4.25.0-20240922?
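That suggestion might be sketched like this; the release name, namespace, repo alias, and the `global.seleniumGrid.imageTag` values key are assumptions, so verify the key against the 0.35.2 chart's values.yaml before running:

```shell
# Pin the older chart while pulling the newer images; release name,
# namespace, and the values key are assumptions, not confirmed in the thread.
helm upgrade selenium-grid docker-selenium/selenium-grid \
  --version 0.35.2 \
  --namespace cicd \
  --set global.seleniumGrid.imageTag=4.25.0-20240922
```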
