
Selenium Trigger schedules two Jobs #4833

Closed
maxnitze opened this issue Jul 31, 2023 · 14 comments
Labels
bug Something isn't working

Comments

@maxnitze

KEDA starts a second job for Selenium tests once the Pod of the first job reports ready. Then one of the jobs takes over the test and completes afterwards. The other keeps running indefinitely, doing nothing.

Expected Behavior

When I start a Selenium test, I expect only one job to be started.

Actual Behavior

A second job is started once the first reports ready.

Steps to Reproduce the Problem

  1. Define a ScaledJob with the selenium-grid trigger.
  2. Start a Selenium test using the Selenium Grid.

Logs from KEDA operator

2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Effective number of max jobs": 1}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Created jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Effective number of max jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Created jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 2}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 2}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}

Not sure if this is enough, but this is the log I see when starting a job.

KEDA Version

2.11.2

Kubernetes Version

1.23

Platform

Other

Scaler Details

Selenium

Anything else?

It seems to be inconsistent which of the jobs gets the task assigned. Most of the time the second Pod got it, but today I had a case where the image was not yet available on the node of the second job, so it started pulling (which took some time). In the meantime the Selenium test was executed in the Pod of the first job.

@maxnitze maxnitze added the bug Something isn't working label Jul 31, 2023
@JorTurFer
Member

Hi
Could you share your ScaledJob?

@maxnitze
Author

It is generated from this template in the docker-selenium chart with these values:

selenium-grid:
  ingress:
    enabled: true
    [ ... ]

  hub:
    [ ... ]

  autoscaling:
    enableWithExistingKEDA: true
    scalingType: job

  chromeNode:
    enabled: true
    maxReplicaCount: 16
    extraEnvironmentVariables:
      - name: TZ
        value: Europe/Berlin
      - name: SCREEN_WIDTH
        value: "1920"
      - name: SCREEN_HEIGHT
        value: "1080"

Here's the deployed ScaledJob:

---
# Source: selenium-grid/charts/selenium-grid/templates/chrome-node-scaledjobs.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: selenium-chrome-node
  namespace: selenium-grid
  annotations:
    helm.sh/hook: post-install,post-upgrade
  labels:
    app: selenium-chrome-node
    app.kubernetes.io/name: selenium-chrome-node
    app.kubernetes.io/managed-by: helm
    app.kubernetes.io/instance: selenium-grid
    app.kubernetes.io/version: 4.10.0-20230607
    app.kubernetes.io/component: selenium-grid-4.10.0-20230607
    helm.sh/chart: selenium-grid-0.19.0
spec:
  maxReplicaCount: 16
  pollingInterval: 10
  scalingStrategy:
    strategy: accurate
  triggers:
    - type: selenium-grid
      metadata:
        browserName: chrome
        unsafeSsl: "true"
        url: 'http://selenium-hub.selenium-grid:4444/graphql'
  jobTargetRef:
    parallelism: 1
    completions: 1
    backoffLimit: 0
    template:
      metadata:
        labels:
          app: selenium-chrome-node
          app.kubernetes.io/name: selenium-chrome-node
          app.kubernetes.io/managed-by: helm
          app.kubernetes.io/instance: selenium-grid
          app.kubernetes.io/version: 4.10.0-20230607
          app.kubernetes.io/component: selenium-grid-4.10.0-20230607
          helm.sh/chart: selenium-grid-0.19.0
        annotations:
          checksum/event-bus-configmap: 067216946d8fd5d28d5536ce6c29523a20ad868f23c81cacef3edade6508cf01
      spec:
        restartPolicy: Never
        containers:
          - name: selenium-chrome-node
            image: selenium/node-chrome:4.10.0-20230607
            imagePullPolicy: IfNotPresent
            env:
              - name: TZ
                value: Europe/Berlin
              - name: SCREEN_WIDTH
                value: "1920"
              - name: SCREEN_HEIGHT
                value: "1080"
            envFrom:
              - configMapRef:
                  name: selenium-event-bus-config
              - configMapRef:
                  name: selenium-node-config
            ports:
              - containerPort: 5555
                protocol: TCP
            volumeMounts:
              - name: dshm
                mountPath: /dev/shm
            resources:
              limits:
                cpu: "1"
                memory: 1Gi
              requests:
                cpu: "1"
                memory: 1Gi
            
            
        terminationGracePeriodSeconds: 30
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 1Gi

@JorTurFer
Member

Could you try deploying the chart with this value set to default?

The accurate strategy does some calculations that can yield values like 1.xxx, which results in 2 jobs being deployed in some cases.

If that doesn't solve the issue, please enable debug logs in the KEDA operator pod.
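
In the rendered ScaledJob above, that boils down to changing the scalingStrategy (a minimal sketch of the spec change; the chart may expose this through its own values, so check how the template sets it):

spec:
  scalingStrategy:
    strategy: default # instead of accurate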

@maxnitze
Author

Can confirm, this seems to work.

Could you explain what those cases are? Or is there somewhere I can read up on them? What is the downside of setting the strategy to default instead of accurate?

@maxnitze
Author

I read through the strategy part in here, but I did not quite get what it means, tbh.

Are there any downsides for my use case (scheduling Chrome and Firefox pods for Selenium tests)? If not, would it make sense to create a PR in the Selenium repo to fix the default there?

@JorTurFer
Member

I don't think that you will have any trouble with the change. TBH, IDK why they set accurate. We suggest using accurate only when you know that the job is completed just at the end and not in the meantime. The docs explain how they work (a bit below), but the main difference is how both strategies take the current jobs into account.

@JorTurFer
Member

JorTurFer commented Jul 31, 2023

Accurate uses the pending job count and default uses the running job count to calculate how to scale, but in general I always use default. I don't know if opening a PR to change it in the Selenium repo is worth it, but I'd definitely open an issue asking about this topic; maybe they have a good reason that I don't see (I don't know much more about Selenium than the minimum required for the scaler).
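
Roughly, as the scaling-jobs docs describe it (a simplified sketch, not the actual KEDA source; maxScale here is min(maxReplicaCount, ceil(queueLength / targetAverageValue))):

func defaultScale(maxScale, runningJobCount int64) int64 {
	// "default": subtract the jobs that are still running
	return maxScale - runningJobCount
}

func accurateScale(maxScale, runningJobCount, pendingJobCount, maxReplicaCount int64) int64 {
	// "accurate": subtract pending jobs instead, unless the replica limit would be exceeded
	if maxScale+runningJobCount > maxReplicaCount {
		return maxReplicaCount - runningJobCount
	}
	return maxScale - pendingJobCount
}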

@JorTurFer
Member

As this isn't a KEDA issue, I close it.
Feel free to reopen it if you think that it's something in KEDA

@maxnitze
Author

maxnitze commented Aug 8, 2023

Unfortunately this is not the fix. The Selenium Grid does not include the already running sessions in its queue anymore, so the default strategy does not work for us (see #4865, where I started a discussion about the calculation in the default strategy).
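
To make that concrete (illustrative numbers, assuming a targetAverageValue of 1): with one test already running and a second one queued, the grid now reports queueLength = 1, so maxScale = 1 and the default strategy returns maxScale - runningJobCount = 1 - 1 = 0, i.e. no job gets created for the queued test.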

I tried to come up with a custom strategy to get this working, but I don't think it is possible with the config values given.

Could you explain the scale calculation of the accurate strategy to me?

maxValue[sic] = min(scaledJob.MaxReplicaCount(), divideWithCeil(queueLength, targetAverageValue))

(I assume the maxValue in the docs should be maxScale)

if (maxScale + runningJobCount) > maxReplicaCount {
	return maxReplicaCount - runningJobCount
}
return maxScale - pendingJobCount

see https://keda.sh/docs/2.11/concepts/scaling-jobs/

Could you elaborate on where exactly the issue with the additional job comes from? Why does the scale calculation only include the pendingJobCount in the case where there are enough "free slots" for all sessions? Is that maybe the reason?

@maxnitze
Author

maxnitze commented Aug 8, 2023

Hey @JorTurFer,

unfortunately I cannot reopen this issue, and I'm still not sure how to set the strategy in my case.

@JorTurFer JorTurFer reopened this Aug 8, 2023
@JorTurFer
Member

I have reopened the issue, but I'm on vacation until the 15th. Maybe someone can help; if not, I'll check it after coming back.

@stale

stale bot commented Oct 7, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Oct 7, 2023
@JorTurFer
Member

F**k me, I didn't answer it. I couldn't reproduce it. If you enable debug logs, you will see the queue value and the desired number of jobs; could you share that info? And sorry, I thought I had answered :(

@stale stale bot removed the stale All issues that are marked as stale due to inactivity label Oct 7, 2023
@maxnitze
Author

maxnitze commented Nov 4, 2023

After we tested this in August, we decided to go live with it as-is, i.e. with the accurate strategy. We regularly checked whether there were additional jobs waiting, but we do not seem to have the issue anymore. Maybe it is just something that happens when there is only a very, very limited number of jobs? I don't know.

I still don't understand the scale calculation as well as I would like. But since this seems to be a non-issue when working at scale (at least for us), we did not follow up on it anymore.

Thanks anyways :)

@maxnitze maxnitze closed this as completed Nov 4, 2023