Make prevention rule set configurable #4067

tomkerkhove · 2023-01-03T08:29:11Z

Proposal

Make prevention rule set configurable by giving end-users the nobs on a per-rule level so that they can manage what validation should be applied and what not.

This is something we should do by mounting a YAML configuration file with the various rules and provide a reference page with the available rules.

Use-Case

Give control to the end-user.

Anything else?

No response

tomkerkhove · 2023-01-03T08:29:14Z

Let's make sure not to use environment variables as this will not scale over time, imo

JorTurFer · 2023-01-03T11:41:04Z

I don't get what do you want to achieve, is this something for the admission controller?

tomkerkhove · 2023-01-03T11:42:45Z

Yes, if we add all these rules then end-users might want to control which ones should be on and which ones not.

JorTurFer · 2023-01-03T11:52:36Z

Are they configurable in the real life? I mean, CPU/Memory without requests won't work, multiple HPA scaling the same workload won't work either.
I don't see what user should be able to configure because those rules are mandatory, otherwise there is a misconfiguration.

But I get the idea, maybe there are other rules in the future that can be configured

tomkerkhove · 2023-01-03T11:57:07Z

But I get the idea, maybe there are other rules in the future that can be configured

This indeed!

But also, based on your argument above why are we making it an opt-in then? To give people the flexibility to choose how aggressive they want to be. Maybe a SO is scaling something that has an HPA on top of it as well because of a transition period; who knows.

It's more about giving control over enforcement/enabling everything or nothing.

JorTurFer · 2023-01-03T12:02:39Z

Maybe a SO is scaling something that has an HPA on top of it as well because of a transition period; who knows.

This doesn't work. The HPA controller will conflict them, this scenario is worse than not having autoscaling because every HPA controller cycle will change the replicas twice if there are different values

tomkerkhove · 2023-01-03T12:35:44Z

I know it does not work and it's an example but the point is that we allow end-users to mix and match rules. Today, the only option would be to enable all validation or none while there can be scenarios to turn off certain validation.

stale · 2023-03-04T15:00:01Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

ppanyukov · 2023-03-21T16:07:02Z

This admission hook gives me a lot of headache where we want to transition from HPA to KEDA for existing production workloads, and specifically preventing deployment when there is already another HPA. From docs:

The scaled workload (scaledobject.spec.scaleTargetRef) is already autoscaled by another other sources (other ScaledObject or HPA).

It makes it very difficult to switch from vanilla HPA to KEDA-managed HPA. Consider this scenario for production:

We have this configured in Helm chart:

deployment:
  replicaCount: 3

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20

And the HPA currently set desiredReplicaCount = 15.

Now imagine wa want to transition to KEDA.

We'd have Helm configured like this:

useKeda: true

The rendered templates will remove HPA and add KEDA ScaledObject. But we can't deploy this because Helm will want to add ScaledObject while HPA is still there and so deployment will fail.

With this hard validation migration is complex. We'd need to delete our HPA first. Then add KEDA. But what happens in this case? Since the HPA is gone, k8s is going to set replicaCount = 3 instead of desired 15, therefore pretty much killing our app due to load.

The workaround is way too complex: need to set replicaCount equal to the current desiredReplicaCount as reported by HPA, then deploy. Then delete HPA. Then deploy KEDA. Then wait a bit. Then set replicaCount back to 3.

This can be easily avoided if we could temporarily disable this validation, and temporarily have both HPA and KEDA (with exactly same CPU profile) for a brief period of time.

zroubalik · 2023-03-21T16:12:23Z

This admission hook gives me a lot of headache where we want to transition from HPA to KEDA for existing production workloads, and specifically preventing deployment when there is already another HPA. From docs:

The scaled workload (scaledobject.spec.scaleTargetRef) is already autoscaled by another other sources (other ScaledObject or HPA).

It makes it very difficult to switch from vanilla HPA to KEDA-managed HPA. Consider this scenario for production:

We have this configured in Helm chart:
deployment:
  replicaCount: 3

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
And the HPA currently set desiredReplicaCount = 15.

Now imagine wa want to transition to KEDA.

We'd have Helm configured like this:
useKeda: true
The rendered templates will remove HPA and add KEDA ScaledObject. But we can't deploy this because Helm will want to add ScaledObject while HPA is still there and so deployment will fail.

With this hard validation migration is complex. We'd need to delete our HPA first. Then add KEDA. But what happens in this case? Since the HPA is gone, k8s is going to set replicaCount = 3 instead of desired 15, therefore pretty much killing our app due to load.

The workaround is way too complex: need to set replicaCount equal to the current desiredReplicaCount as reported by HPA, then deploy. Then delete HPA. Then deploy KEDA. Then wait a bit. Then set replicaCount back to 3.

This can be easily avoided if we could temporarily disable this validation, and temporarily have both HPA and KEDA (with exactly same CPU profile) for a brief period of time.

@ppanyukov as a workaround, you can disable (scale the webhook controller to 0) after the migration you can enable it again.

ppanyukov · 2023-03-21T16:13:22Z

And speaking of more trouble with admission webhook and Helm. I've hit another issue when we cannot uninstall any Helm releases which use KEDA for auto-scaler because the admission webhook errors out and finilisers don't complete and we are left with "hung" and broken deployments.

Here is what I get from the logs:

keda-operator-77d5b46d4-fbjzf keda-operator 2023-03-21T15: 45: 25Z	

ERROR	Failed to update ScaledObject after removing a finalizer	{
    "controller": "scaledobject",
    "controllerGroup": "keda.sh",
    "controllerKind": "ScaledObject",
    "ScaledObject": {
        "name": "my-app",
        "namespace": "ops"
    },
    "namespace": "ops",
    "name": "my-app",
    "reconcileID": "e996beff-a348-4b99-b479-ccd7eead01dc",
    "finalizer": "finalizer.keda.sh",
    "error": "admission webhook \"vscaledobject.kb.io\" denied the request: Deployment.apps \"my-app\" not found"
}

github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).finalizeScaledObject
	/workspace/controllers/keda/scaledobject_finalizer.go: 78
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).Reconcile
	/workspace/controllers/keda/scaledobject_controller.go: 157
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go: 122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go: 323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go: 274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go: 235

See? And the reason this happens because Helm removes Deployment first. Here is the order of uninstall.

client.go:477: [debug] Starting delete for "my-app" Ingress
client.go:477: [debug] Starting delete for "my-app" Service
client.go:477: [debug] Starting delete for "my-app" Deployment
client.go:477: [debug] Starting delete for "my-app" ConfigMap
client.go:477: [debug] Starting delete for "my-app-1" Mapping
client.go:477: [debug] Starting delete for "my-app-2" Mapping
client.go:477: [debug] Starting delete for "my-app" ScaledObject
uninstall.go:148: [debug] purge requested for my-app
release "my-app" uninstalled

ppanyukov · 2023-03-21T16:19:07Z

@ppanyukov as a workaround, you can disable (scale the webhook controller to 0) after the migration you can enable it again.

Well, maybe. It's a good suggestion to be fair. But it's yet another "mysterious dependency" in a million-piece puzzle that we'd be grateful if we didn't have to think about.

Any suggestions for the "Helm uninstall" workaround? Scaling admission hook to zero in this case if very easy to forget to do, and once Helm release gets into a botched state, it's not so easy to fix things. Although actually I wonder if the finalisers will actually complete if I disable the admission webhook post-uninstall? Need to test this.

EDIT: it does look like the reconciler eventually runs a cycle and does clean up things when the Admission webhook is disabled after Helm uninstall. It's still not great. Should I report this "uninstall" as a separate issue? I would say this validation that deployment should exist is wrong as it depends on the order. And we should be able to deploy things in any order, first because we cannot always control it, and sometimes we may want to configure ScaleObjects separately before there are any deployments.

JorTurFer · 2023-03-21T18:59:00Z

This can be easily avoided if we could temporarily disable this validation, and temporarily have both HPA and KEDA (with exactly same CPU profile) for a brief period of time.

Be careful because this maybe won't happen as you expect, 2 HPAs (the old one and the new one generated by KEDA) can have unexpected behaviors, The HPA controller doesn't scale the value instantly in the first cycle AFAIK, so in the first cycle it will potentially scale in your workload, reducing your active instances. You can try this just adding another HPA and checking if this happens or not, but I'd ensure it

sometimes we may want to configure ScaleObjects separately before there are any deployments

This is something unsupported at this moment, the only option for this is totally disabling the webhooks

About the deletion itself, definitively the webhook shouldn't block the deletion but maybe there is any error or any unexpected behavior. I need to check it in depth because I have an idea about what can be happening

JorTurFer · 2023-03-21T21:11:55Z

sometimes we may want to configure ScaleObjects separately before there are any deployments

I have reviewed the code and you can deploy ScaledObjects before any deployment if you aren't using cpu/memory scalers, if you are using them, you can't because the resources are required for validating them

tomkerkhove · 2023-03-22T08:27:04Z

I think it's good that we have fixed the issue but the main point remains End-users should have an easy way to disable one/multiple/all rules.

While scaling in the webhooks component/opting-out in Helm is a mitigation, I believe we should make the rules configurable.

ppanyukov · 2023-03-22T12:08:49Z

Yes @JorTurFer saw your PR, thanks for that, it will make our lives much easier :)

Hopefully some consensus will be reached on other validation issues as well.

tomkerkhove · 2023-05-23T05:30:55Z

Any thoughts on this @kedacore/keda-maintainers? I think having them configurable through a YAML is still valuable and trivial to do.

tomkerkhove · 2023-07-10T05:55:00Z

@JorTurFer @zroubalik After giving another thought and discussion with @howang-ms - Why not introduce a CRD to have proper resource validation/etc rather than mounting a YAML?

I think this would be a nice addition to have typed rules in a ClusterPreventionRules resource that has every entry from https://keda.sh/docs/2.11/concepts/admission-webhooks/ listed.

Allows us to also use that to offer to end-users for proper tooling by just doing a kubectl get -o wide to see what is applied

JorTurFer · 2023-07-10T07:57:30Z

I think that it'd be a nice addition, maybe it's a bit overkill at this moment because we only have 3-5 checks, but if we want to extend it, it'd be nice. I'm curious about if there is any tool to generate scripts that we can inject there. I'm thinking in passing the check code using the CRD instead of having to code it inside the webhooks, that'd bring a total flexibility for extending the checks to platform providers

tomkerkhove · 2023-07-10T09:42:28Z

I think that it'd be a nice addition, maybe it's a bit overkill at this moment because we only have 3-5 checks, but if we want to extend it, it'd be nice.

This is to ease the usage for long-term indeed, but also we've recently had people that were blocked to migrate to KEDA because of HPA already being around. Although we have the feature now, they could have migrated faster if they could have disabled that one rule already.

So scaling the rules it not one driver, being able to customize what is there today already is another one.

I'm curious about if there is any tool to generate scripts that we can inject there. I'm thinking in passing the check code using the CRD instead of having to code it inside the webhooks, that'd bring a total flexibility for extending the checks to platform providers

Maybe, but that is out of scope here

tomkerkhove · 2024-06-19T08:06:39Z

@JorTurFer @zroubalik Any objection if we start making this configurable?

zroubalik · 2024-06-25T21:00:42Z

No objections from my side

tomkerkhove · 2024-07-01T07:14:59Z

@JorTurFer Do you have any?

JorTurFer · 2024-07-24T10:52:23Z

No, go ahead :)

tomkerkhove added needs-discussion feature-request All issues for new features that have not been committed to prevention labels Jan 3, 2023

tomkerkhove added this to Roadmap - KEDA Core Jan 3, 2023

github-project-automation bot moved this to Proposed in Roadmap - KEDA Core Jan 3, 2023

stale bot added the stale All issues that are marked as stale due to inactivity label Mar 4, 2023

tomkerkhove added stale-bot-ignore All issues that should not be automatically closed by our stale bot stale All issues that are marked as stale due to inactivity and removed stale All issues that are marked as stale due to inactivity labels Mar 6, 2023

stale bot removed the stale All issues that are marked as stale due to inactivity label Mar 6, 2023

JorTurFer mentioned this issue Mar 21, 2023

Admission webhook blocks ScaledObject deletion blocking the finalizer removal #4396

Closed

tomkerkhove mentioned this issue Apr 17, 2023

Support ScaledObject taking over existing HPAs with the same name while they are not managed by other ScaledObject #4457

Closed

tomkerkhove removed the feature-request All issues for new features that have not been committed to label May 23, 2023

tomkerkhove added needs-discussion planning:microsoft-engineering and removed needs-discussion planning:microsoft-engineering labels May 23, 2023

tomkerkhove moved this from Proposed to To Do in Roadmap - KEDA Core Jul 10, 2023

tomkerkhove added feature All issues for new features that have been committed to and removed needs-discussion labels Jul 10, 2023

tomkerkhove assigned SpiritZhou Aug 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make prevention rule set configurable #4067

Make prevention rule set configurable #4067

tomkerkhove commented Jan 3, 2023

tomkerkhove commented Jan 3, 2023

JorTurFer commented Jan 3, 2023

tomkerkhove commented Jan 3, 2023

JorTurFer commented Jan 3, 2023

tomkerkhove commented Jan 3, 2023

JorTurFer commented Jan 3, 2023

tomkerkhove commented Jan 3, 2023

stale bot commented Mar 4, 2023

ppanyukov commented Mar 21, 2023

zroubalik commented Mar 21, 2023

ppanyukov commented Mar 21, 2023

ppanyukov commented Mar 21, 2023 •

edited

Loading

JorTurFer commented Mar 21, 2023

JorTurFer commented Mar 21, 2023

tomkerkhove commented Mar 22, 2023

ppanyukov commented Mar 22, 2023

tomkerkhove commented May 23, 2023

tomkerkhove commented Jul 10, 2023 •

edited

Loading

JorTurFer commented Jul 10, 2023 •

edited

Loading

tomkerkhove commented Jul 10, 2023

tomkerkhove commented Jun 19, 2024

zroubalik commented Jun 25, 2024

tomkerkhove commented Jul 1, 2024

JorTurFer commented Jul 24, 2024

Make prevention rule set configurable #4067

Make prevention rule set configurable #4067

Comments

tomkerkhove commented Jan 3, 2023

Proposal

Use-Case

Anything else?

tomkerkhove commented Jan 3, 2023

JorTurFer commented Jan 3, 2023

tomkerkhove commented Jan 3, 2023

JorTurFer commented Jan 3, 2023

tomkerkhove commented Jan 3, 2023

JorTurFer commented Jan 3, 2023

tomkerkhove commented Jan 3, 2023

stale bot commented Mar 4, 2023

ppanyukov commented Mar 21, 2023

zroubalik commented Mar 21, 2023

ppanyukov commented Mar 21, 2023

ppanyukov commented Mar 21, 2023 • edited Loading

JorTurFer commented Mar 21, 2023

JorTurFer commented Mar 21, 2023

tomkerkhove commented Mar 22, 2023

ppanyukov commented Mar 22, 2023

tomkerkhove commented May 23, 2023

tomkerkhove commented Jul 10, 2023 • edited Loading

JorTurFer commented Jul 10, 2023 • edited Loading

tomkerkhove commented Jul 10, 2023

tomkerkhove commented Jun 19, 2024

zroubalik commented Jun 25, 2024

tomkerkhove commented Jul 1, 2024

JorTurFer commented Jul 24, 2024

ppanyukov commented Mar 21, 2023 •

edited

Loading

tomkerkhove commented Jul 10, 2023 •

edited

Loading

JorTurFer commented Jul 10, 2023 •

edited

Loading