
ScaledJob Scaling Issue with secondary matching batch #3782

Closed
Eldarrin opened this issue Oct 28, 2022 · 20 comments
Labels: bug (Something isn't working), stale (All issues that are marked as stale due to inactivity)

Comments

@Eldarrin
Contributor

Eldarrin commented Oct 28, 2022

Report

This is a bit esoteric, but when using a ScaledJob, if I scale 20 (n) jobs as a first pass and then scale another 20 (n-x) jobs as a second pass (after the first n are initiated and running), the second batch does not scale. Eventually it will catch up and run the jobs, but only once the previous pods have terminated. A first n greater than the second n-x shows it best.

This is only true when using demands-based, parent-style scalers.

Current main branch tested

Expected Behavior

When a job is demanded, it gets scaled regardless of any batches already running

Actual Behavior

The secondary batch of jobs does not scale until the previous batch is at least partially finished

Steps to Reproduce the Problem

  1. Run a ScaledJob instance that runs n jobs
  2. Repeat the run while the first is still active, using n-x jobs
  3. The second batch does not get activated

Logs from KEDA operator

Cannot provide logs at this time; if you need them I can make them available.

KEDA Version

2.8.1

Kubernetes Version

1.23

Platform

Microsoft Azure

Scaler Details

azure-pipelines

Anything else?

I believe it relates to parent-keda-templates; will fix.

I think the counter is counting pods still running from the previous job against the jobs demanded by the new run, so it decreases the pending counter when it shouldn't.

@Eldarrin Eldarrin added the bug Something isn't working label Oct 28, 2022
@Eldarrin
Contributor Author

Eldarrin commented Oct 28, 2022

Just a thought, should my queuelen also be reporting running jobs? Is it that simple? :)

OK, it's my code. I'll keep this issue open to track the resolution. Only true with parent style.

@Eldarrin
Contributor Author

Eldarrin commented Nov 7, 2022

I've fixed the issue, but I'm going to test more before I re-open the commit. Basically, if you queue up a lot of jobs and the cluster doesn't spin up the agents and register them with AzDO before the next KEDA cycle, it queues another n jobs. I've added an active-job register so unique jobs won't be re-added.

@JorTurFer
Member

Hmm, that doesn't make sense. It isn't important whether the agents are registered in AzDO or not; the scaler takes the number of pending jobs, and KEDA will try to schedule one k8s job per AzDO job.
For example, if you have 8 pending AzDO jobs, KEDA will try to create 8 k8s jobs, but not 8 per execution, 8 in total. At least, that should be the behaviour. And that is totally independent of the scaler: the scaler should return the active jobs in AzDO, including pending and in progress, and KEDA will try to fulfil that number of jobs. Obviously, active jobs are already fulfilled, and KEDA will spawn extra jobs only for the pending tasks.
If you have 5 active AzDO jobs (with 5 k8s jobs) and you enqueue another AzDO job, you will have 6 AzDO jobs and KEDA will evaluate 6 needed - 5 already there = 1 extra job to create.
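
As a rough illustration of that arithmetic (hypothetical names, not KEDA's actual code): the scaler reports the total number of AzDO jobs (pending plus in progress), and the k8s jobs already created for them are subtracted to get the number of extra jobs to spawn.

    package main

    import "fmt"

    // jobsToCreate is an illustrative sketch of the accounting described above,
    // not KEDA's real implementation: total AzDO jobs (pending + in progress)
    // minus the k8s jobs already created for them.
    func jobsToCreate(totalAzdoJobs, existingK8sJobs int64) int64 {
        extra := totalAzdoJobs - existingK8sJobs
        if extra < 0 {
            return 0
        }
        return extra
    }

    func main() {
        // 5 active AzDO jobs already backed by 5 k8s jobs, 1 new job enqueued:
        fmt.Println(jobsToCreate(6, 5)) // prints 1 -> one extra k8s job to create
    }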

@Eldarrin
Contributor Author

Eldarrin commented Nov 11, 2022

So the issue occurs in more extreme cases. For instance, if you queue 1 job, the container spins up and registers with AzDO, then AzDO assigns the job to the agent, and KEDA won't match another agent as the job is already matched.
However, if you spin up 10+ jobs simultaneously, you may get 3 that spin up in time and are registered and matched in AzDO, BUT 7 jobs are in a k8s "Pending" or "ContainerCreating" state and so have not yet registered with AzDO.
Then the KEDA timer comes around again (say 30 secs), and as far as AzDO is concerned it still has 7 unmatched jobs, so KEDA spins up another 7 jobs. Now you have 10 jobs in AzDO but 17 pods running. Eventually (if they all got assigned on this run) your 10 jobs finish and you are left with 7 idle agents that won't match until new AzDO jobs are added.

This fix ensures that you don't get a pod spun up for a job that is already awaiting a pod, regardless of how quickly your cluster can spin up new jobs.
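
To make the race concrete, here is a small sketch (hypothetical names, not the actual scaler code) of the two accountings: the naive one that re-counts jobs whose pods are still Pending/ContainerCreating, and a deduplicated one along the lines of the fix described above, which also deducts pods that are already on their way.

    package main

    import "fmt"

    // Illustrative sketch of the race described above; names are hypothetical.
    // unmatchedAzdoJobs: jobs AzDO still reports as waiting for an agent.
    // creatingPods:      agent pods still Pending/ContainerCreating, so not yet
    //                    registered with AzDO.

    // naiveToCreate re-counts jobs whose pods simply haven't registered yet.
    func naiveToCreate(unmatchedAzdoJobs, creatingPods int64) int64 {
        return unmatchedAzdoJobs
    }

    // dedupedToCreate deducts pods that are already on their way, so a job that
    // is merely waiting for its pod to start does not get a second pod.
    func dedupedToCreate(unmatchedAzdoJobs, creatingPods int64) int64 {
        extra := unmatchedAzdoJobs - creatingPods
        if extra < 0 {
            return 0
        }
        return extra
    }

    func main() {
        // 10 jobs queued, 3 agents registered in time, 7 pods still creating:
        fmt.Println(naiveToCreate(7, 7))   // 7 -> 17 pods in total, 7 end up idle
        fmt.Println(dedupedToCreate(7, 7)) // 0 -> no duplicate pods
    }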

@JorTurFer
Member

JorTurFer commented Nov 13, 2022

The problem is that this approach requires holding state, and scalers shouldn't have state because they can be recreated at any time, you could have more than one instance, etc.
I'm trying to reproduce your error because the behaviour you are describing looks weird.

@Eldarrin
Contributor Author

Eldarrin commented Nov 13, 2022

Create an AzDO job with strategy parallel 20; that's what I was using for testing. I'll try to reproduce it with the other AzDO scaler types. It certainly happens with the parent style, as that uses "matching agent".

I'll look for a stateless method too

@Eldarrin
Contributor Author

Eldarrin commented Dec 2, 2022

Set the PR to draft; you were right @JorTurFer, after a long run the state gets it upset.

@JorTurFer
Member

I have to apologize, my life has been complicated these last weeks and I haven't checked this; it's still on my TODO list before the release @Eldarrin

@Eldarrin
Contributor Author

Eldarrin commented Dec 2, 2022

No worries, it worked out well. I've been doing a long-running simulation and it gets itself out of kilter. The worst part is that it's fixed with a reboot, but each test takes about 7-10 days to simulate the real world, lol.

@stale

stale bot commented Jan 31, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Jan 31, 2023
@stale

stale bot commented Feb 8, 2023

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Feb 8, 2023
@JorTurFer JorTurFer reopened this Feb 8, 2023
@JorTurFer
Member

I still need to check this.

@stale stale bot removed the stale All issues that are marked as stale due to inactivity label Feb 8, 2023
@stale

stale bot commented Apr 9, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Apr 9, 2023
@stale

stale bot commented Apr 16, 2023

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Apr 16, 2023
@JorTurFer
Member

@Eldarrin
Are you using a custom scaling strategy? Could you share your ScaledJob?

@JorTurFer JorTurFer reopened this Apr 17, 2023
@stale stale bot removed the stale All issues that are marked as stale due to inactivity label Apr 17, 2023
@Eldarrin
Contributor Author

@JorTurFer I'm not using anything custom; the samples in my repo are basically the type I use: https://github.com/Eldarrin/keda-azdo-example
TBH I'm not even sure this is still occurring.

@JorTurFer
Member

I ask because we faced a similar behavior when one squad set:

scalingStrategy:
    strategy: "custom"
    customScalingQueueLengthDeduction: 1
    customScalingRunningJobPercentage: "0.5"

Basically, they copied and pasted the ScaledJob spec from the docs, and once they set the default behavior, everything worked properly again xD
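
For context, a quick sketch of why the strategy choice changes the numbers. The formulas below are a rough recollection of the documented default and custom strategies and should be treated as assumptions rather than the actual source; the point is only that, for the same queue length and running-job count, the custom parameters above yield a different target than the default.

    package main

    import "fmt"

    // Assumed (not verified against the source) formulas for the two strategies:
    // default: queueLength - runningJobCount
    // custom:  queueLength - customScalingQueueLengthDeduction
    //          - runningJobCount * customScalingRunningJobPercentage
    func defaultTarget(queueLength, runningJobCount int64) int64 {
        return queueLength - runningJobCount
    }

    func customTarget(queueLength, runningJobCount, deduction int64, runningPct float64) int64 {
        return queueLength - deduction - int64(float64(runningJobCount)*runningPct)
    }

    func main() {
        // 6 AzDO jobs in total, 5 already running, values from the snippet above:
        fmt.Println(defaultTarget(6, 5))        // 1
        fmt.Println(customTarget(6, 5, 1, 0.5)) // 3 -> a different number of k8s jobs
    }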

@Eldarrin
Contributor Author

Eldarrin commented Apr 17, 2023

Made it do it :) I can't share it as I had to use an AzDO organization that supports parallel jobs.
Using this pipeline: https://github.com/Eldarrin/keda-azdo-example/blob/main/overload-pipeline.yaml

Run it about 3 times and you'll be left with agents still running.

I don't think it cares whether it's demands, parent, or basic style.

@stale

stale bot commented Jun 16, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Jun 16, 2023
@stale

stale bot commented Jun 23, 2023

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Jun 23, 2023